Chapter 7 Summarizing Data: Descriptive Statistics

R has built-in functions for a large number of summary statistics. To illustrate the main ones, I will use the mpg dataset from the ggplot2 package. R has a great many packages, and some of them ship with interesting datasets that we can use. ggplot2 is one of them: it was developed for making high-quality plots. It should already be installed, so we only need to load it into the current session:

library(ggplot2)

Let us first see what kind of variables mpg contains by using the summary function:

summary(mpg)
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##                                                                      
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##                                                        NA's   :1      
##       hwy             fl               class              mpg_sum     
##  Min.   :12.00   Length:234         Length:234         Min.   :21.00  
##  1st Qu.:18.00   Class :character   Class :character   1st Qu.:31.00  
##  Median :24.00   Mode  :character   Mode  :character   Median :41.00  
##  Mean   :23.44                                         Mean   :40.29  
##  3rd Qu.:27.00                                         3rd Qu.:47.00  
##  Max.   :44.00                                         Max.   :79.00  
##                                                        NA's   :1      
##    cty_cat              cyl_2     
##  Length:234         Min.   :1.00  
##  Class :character   1st Qu.:1.00  
##  Mode  :character   Median :3.00  
##                     Mean   :2.59  
##                     3rd Qu.:4.00  
##                     Max.   :4.00  
## 

Next, let's look at the first few rows with head:

head(mpg)
## # A tibble: 6 x 14
##   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <dbl> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
## 2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
## 3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
## 4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
## 5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
## 6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
## # ... with 3 more variables: mpg_sum <dbl>, cty_cat <chr>, cyl_2 <dbl>

The type of descriptive statistics we use depends on whether the data are numeric (continuous) or categorical, so we will look at each case separately.

7.1 Descriptive Statistics: Numeric Data

Recall that for numeric variables, we are usually interested in measuring central tendency and spread to get a sense of the data. Suppose that we are interested in the hwy column, which records highway fuel economy in miles per gallon. From the summary(mpg) output above we know that this variable is numeric, so we can measure its central tendency and spread as in the next two tables, respectively:

Central Tendency

Measure              R Code             Output
-------------------  -----------------  ----------
Mean                 mean(mpg$hwy)      23.4401709
Median               median(mpg$hwy)    24

Spread

Measure              R Code             Output
-------------------  -----------------  ----------
Minimum              min(mpg$hwy)       12
Maximum              max(mpg$hwy)       44
Range                range(mpg$hwy)     12, 44
IQR                  IQR(mpg$hwy)       9
Variance             var(mpg$hwy)       35.4577785
Standard Deviation   sd(mpg$hwy)        5.9546434
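
The spread measures are related to one another, and a small example may make that concrete. The sketch below uses a short made-up vector (not the mpg data) to show that range returns the minimum and maximum rather than a single number, that the standard deviation is the square root of the variance, and that the IQR is the gap between the third and first quartiles.

```r
# Hypothetical mileage values, used only for illustration
x <- c(12, 18, 24, 27, 44)

# range() returns c(min, max); diff() turns that into a single width
rng <- range(x)
width <- diff(rng)

# The standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))

# IQR() is the distance between the third and first quartiles
q <- quantile(x, c(0.25, 0.75), names = FALSE)
all.equal(IQR(x), q[2] - q[1])
```

If you want a one-number "range", diff(range(x)) is one way to get it.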

All of these functions have optional arguments to address various complications that your data might have. For example, if your data includes some NAs, then instead of using mean(mpg$hwy) you should use mean(mpg$hwy, na.rm = TRUE), which tells R to ignore NAs in the data.
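
To see the effect of na.rm, here is a minimal sketch with a made-up vector containing one missing value. The vector name and its values are assumptions for illustration only.

```r
# Hypothetical city-mileage values with one missing observation
cty_example <- c(18, 21, NA, 20)

# Without na.rm, a single NA makes the result NA
mean(cty_example)

# With na.rm = TRUE, the statistic is computed from the observed values only
mean(cty_example, na.rm = TRUE)
sd(cty_example, na.rm = TRUE)

# Count the missing values before deciding how to handle them
n_missing <- sum(is.na(cty_example))
```

Checking how many values are missing first is a reasonable habit, since dropping NAs silently changes the sample the statistic is based on.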

7.2 Descriptive Statistics: Categorical Data

For categorical variables, counts and percentages can be used to summarize data:

table(mpg$trans)
## 
##   auto(av)   auto(l3)   auto(l4)   auto(l5)   auto(l6)   auto(s4) 
##          5          2         83         39          6          3 
##   auto(s5)   auto(s6) manual(m5) manual(m6) 
##          3         16         58         19
table(mpg$trans)/nrow(mpg)
## 
##    auto(av)    auto(l3)    auto(l4)    auto(l5)    auto(l6)    auto(s4) 
## 0.021367521 0.008547009 0.354700855 0.166666667 0.025641026 0.012820513 
##    auto(s5)    auto(s6)  manual(m5)  manual(m6) 
## 0.012820513 0.068376068 0.247863248 0.081196581
prop.table(table(mpg$trans))
## 
##    auto(av)    auto(l3)    auto(l4)    auto(l5)    auto(l6)    auto(s4) 
## 0.021367521 0.008547009 0.354700855 0.166666667 0.025641026 0.012820513 
##    auto(s5)    auto(s6)  manual(m5)  manual(m6) 
## 0.012820513 0.068376068 0.247863248 0.081196581
table(mpg$trans, mpg$class)  
##             
##              2seater compact midsize minivan pickup subcompact suv
##   auto(av)         0       2       3       0      0          0   0
##   auto(l3)         0       1       0       1      0          0   0
##   auto(l4)         1       8      14       8     12         11  29
##   auto(l5)         0       4       5       0      8          4  18
##   auto(l6)         0       0       0       2      0          0   4
##   auto(s4)         0       2       1       0      0          0   0
##   auto(s5)         0       2       0       0      0          0   1
##   auto(s6)         1       5       6       0      0          1   3
##   manual(m5)       0      18       9       0      8         16   7
##   manual(m6)       3       5       3       0      5          3   0
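
The proportions above can be turned into percentages by multiplying by 100 and rounding. The sketch below does this for a small made-up vector of transmission labels; the vector is an assumption for illustration, not the mpg data.

```r
# Hypothetical transmission labels, used only for illustration
trans_example <- c("auto", "manual", "auto", "auto",
                   "manual", "auto", "auto", "manual")

# Counts per category
counts <- table(trans_example)

# Percentages, rounded to one decimal place
pct <- round(100 * prop.table(counts), 1)
pct
```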

The margin.table function recovers the row totals (margin 1) and the column totals (margin 2) from a two-way table:

margin.table(table(mpg$trans, mpg$class), 1)
## 
##   auto(av)   auto(l3)   auto(l4)   auto(l5)   auto(l6)   auto(s4) 
##          5          2         83         39          6          3 
##   auto(s5)   auto(s6) manual(m5) manual(m6) 
##          3         16         58         19
margin.table(table(mpg$trans, mpg$class), 2)
## 
##    2seater    compact    midsize    minivan     pickup subcompact 
##          5         47         41         11         33         35 
##        suv 
##         62

Question: how can we find the row and column relative frequencies?

prop.table(margin.table(table(mpg$trans, mpg$class), 1))
## 
##    auto(av)    auto(l3)    auto(l4)    auto(l5)    auto(l6)    auto(s4) 
## 0.021367521 0.008547009 0.354700855 0.166666667 0.025641026 0.012820513 
##    auto(s5)    auto(s6)  manual(m5)  manual(m6) 
## 0.012820513 0.068376068 0.247863248 0.081196581
prop.table(margin.table(table(mpg$trans, mpg$class), 2))
## 
##    2seater    compact    midsize    minivan     pickup subcompact 
## 0.02136752 0.20085470 0.17521368 0.04700855 0.14102564 0.14957265 
##        suv 
## 0.26495726
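
A related question is how each row (or column) is distributed across the other variable. prop.table accepts a margin argument for this: 1 divides each cell by its row total, 2 by its column total. The sketch below uses a small made-up two-way table; the variable names and values are assumptions for illustration.

```r
# Hypothetical data: transmission type and vehicle class for six cars
trans_example <- c("auto", "auto", "manual", "manual", "auto", "manual")
class_example <- c("suv", "compact", "compact", "suv", "suv", "compact")
tab <- table(trans_example, class_example)

# Proportions within each row: every row sums to 1
row_prop <- prop.table(tab, 1)

# Proportions within each column: every column sums to 1
col_prop <- prop.table(tab, 2)

row_prop
col_prop
```

Here row_prop answers "of the automatic cars, what share are SUVs?", while col_prop answers "of the SUVs, what share are automatic?".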

7.3 Correlation

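The cor function computes pairwise correlations between numeric columns. Here we select displ, cty, and hwy (columns 3, 8, and 9):

cor(mpg[c(3, 8, 9)])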
##          displ cty      hwy
## displ  1.00000  NA -0.76602
## cty         NA   1       NA
## hwy   -0.76602  NA  1.00000
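
The NA entries in this matrix appear because cty contains a missing value, and cor by default returns NA for any pair involving a column with NAs. The use argument of cor controls this. The sketch below illustrates it on a small made-up data frame (the values are assumptions, not taken from mpg).

```r
# Hypothetical values; the cty column has one missing observation
d <- data.frame(
  displ = c(1.8, 2.0, 2.8, 3.1, 4.2),
  cty   = c(21, NA, 16, 18, 14),
  hwy   = c(29, 31, 26, 27, 23)
)

# Default behavior: NA for every pair that involves cty
m_default <- cor(d)

# Drop every row that contains an NA, then correlate
m_complete <- cor(d, use = "complete.obs")

# For each pair of columns, use all rows where both values are present
m_pairwise <- cor(d, use = "pairwise.complete.obs")
```

With "complete.obs" every entry is based on the same rows, whereas "pairwise.complete.obs" keeps more data but may base different entries on different rows.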

A scatterplot matrix of the same three columns shows these relationships graphically:

plot(mpg[c(3, 8, 9)])