Chapter 7 Summarizing Data: Descriptive Statistics
R has built-in functions for a large number of summary statistics. To illustrate the main ones, I will use the mpg dataset from the ggplot2 package. R has a large number of packages, and some of them ship with interesting datasets that we can use. ggplot2 is one of them: it was developed for making high-quality plots, and it should already be installed, so we just need to load it into the current session:
library(ggplot2)
Let us first see what kind of objects are included in mpg by using the summary function:
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
##
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## NA's :1
## hwy fl class mpg_sum
## Min. :12.00 Length:234 Length:234 Min. :21.00
## 1st Qu.:18.00 Class :character Class :character 1st Qu.:31.00
## Median :24.00 Mode :character Mode :character Median :41.00
## Mean :23.44 Mean :40.29
## 3rd Qu.:27.00 3rd Qu.:47.00
## Max. :44.00 Max. :79.00
## NA's :1
## cty_cat cyl_2
## Length:234 Min. :1.00
## Class :character 1st Qu.:1.00
## Mode :character Median :3.00
## Mean :2.59
## 3rd Qu.:4.00
## Max. :4.00
##
Now let's see what they look like:
head(mpg)
## # A tibble: 6 x 14
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <dbl> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## # ... with 3 more variables: mpg_sum <dbl>, cty_cat <chr>, cyl_2 <dbl>
The type of descriptive statistics we use depends on whether the data are numeric (continuous) or categorical, so we will look at each case separately next.
7.1 Descriptive Statistics: Numeric Data
Recall that for numeric variables, we are usually interested in measuring central tendency and spread to get a sense of the data. Suppose that we are interested in the hwy column, which records highway fuel economy in miles per gallon. From the summary(mpg) output above we know that this variable is numeric, so we can measure its central tendency and spread as in the next two tables, respectively:
Central Tendency

| Measure | R Code | Output |
|---|---|---|
| Mean | mean(mpg$hwy) | 23.4401709 |
| Median | median(mpg$hwy) | 24 |
Spread

| Measure | R Code | Output |
|---|---|---|
| Minimum | min(mpg$hwy) | 12 |
| Maximum | max(mpg$hwy) | 44 |
| Range | range(mpg$hwy) | 12, 44 |
| IQR | IQR(mpg$hwy) | 9 |
| Variance | var(mpg$hwy) | 35.4577785 |
| Standard Deviation | sd(mpg$hwy) | 5.9546434 |
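The spread measures are related to one another: the standard deviation is the square root of the variance, the IQR is the distance between the first and third quartiles, and range() returns the minimum and maximum rather than their difference. The sketch below checks these relationships on a small made-up vector (the values are hypothetical and are not the mpg data).

```r
# Hypothetical values chosen for illustration (not the mpg dataset)
x <- c(12, 18, 24, 27, 44)

# Standard deviation is the square root of the variance
sd_from_var <- sqrt(var(x))

# IQR is the third quartile minus the first quartile
q <- quantile(x, c(0.25, 0.75))
iqr_from_quartiles <- unname(q[2] - q[1])

# range() returns c(min, max); diff() turns that into a width
range_width <- diff(range(x))

print(c(sd = sd(x), sd_from_var = sd_from_var))
print(c(IQR = IQR(x), iqr_from_quartiles = iqr_from_quartiles))
print(range_width)
```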
All of these functions have optional arguments to address various complications that your data might have. For example, if your data includes some NAs, then instead of using mean(mpg$hwy) you should use mean(mpg$hwy, na.rm = TRUE), which tells R to ignore NAs in the data.
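As a small illustration of na.rm, the sketch below uses a made-up vector with one missing value (hypothetical numbers, not the mpg data). The same argument is also accepted by median, var, sd, min, max, range, and IQR.

```r
# Hypothetical values with one missing entry (not taken from mpg)
x <- c(18, 21, NA, 20, 16)

# Without na.rm, a single missing value makes the result NA
mean_with_na <- mean(x)

# With na.rm = TRUE, the statistic is computed on the observed values only
mean_observed   <- mean(x, na.rm = TRUE)
median_observed <- median(x, na.rm = TRUE)
sd_observed     <- sd(x, na.rm = TRUE)

# Count the missing values before deciding to drop them
n_missing <- sum(is.na(x))

print(mean_with_na)
print(mean_observed)
print(n_missing)
```

Checking how many values are missing first is a useful habit, because na.rm = TRUE silently reduces the sample size.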
7.2 Descriptive Statistics: Categorical Data
For categorical variables, counts and percentages can be used to summarize data:
table(mpg$trans)
##
## auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4)
## 5 2 83 39 6 3
## auto(s5) auto(s6) manual(m5) manual(m6)
## 3 16 58 19
table(mpg$trans)/nrow(mpg)
##
## auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4)
## 0.021367521 0.008547009 0.354700855 0.166666667 0.025641026 0.012820513
## auto(s5) auto(s6) manual(m5) manual(m6)
## 0.012820513 0.068376068 0.247863248 0.081196581
prop.table(table(mpg$trans))
##
## auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4)
## 0.021367521 0.008547009 0.354700855 0.166666667 0.025641026 0.012820513
## auto(s5) auto(s6) manual(m5) manual(m6)
## 0.012820513 0.068376068 0.247863248 0.081196581
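prop.table returns proportions, so multiplying by 100 gives percentages. When two categorical variables are cross-tabulated, its margin argument normalizes within rows (margin = 1) or within columns (margin = 2). The following is a minimal sketch on made-up data; the vectors trans and car_class are hypothetical stand-ins, not the mpg columns.

```r
# Hypothetical toy data (not the mpg dataset)
trans     <- c("auto", "auto", "manual", "manual", "auto")
car_class <- c("suv", "compact", "compact", "suv", "suv")

# One-way table expressed as percentages instead of proportions
pct <- round(100 * prop.table(table(trans)), 1)

# Two-way table of counts
tab <- table(trans, car_class)

# margin = 1: each row sums to 1; margin = 2: each column sums to 1
row_props <- prop.table(tab, margin = 1)
col_props <- prop.table(tab, margin = 2)

print(pct)
print(row_props)
print(col_props)
```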
table(mpg$trans, mpg$class)
##
## 2seater compact midsize minivan pickup subcompact suv
## auto(av) 0 2 3 0 0 0 0
## auto(l3) 0 1 0 1 0 0 0
## auto(l4) 1 8 14 8 12 11 29
## auto(l5) 0 4 5 0 8 4 18
## auto(l6) 0 0 0 2 0 0 4
## auto(s4) 0 2 1 0 0 0 0
## auto(s5) 0 2 0 0 0 0 1
## auto(s6) 1 5 6 0 0 1 3
## manual(m5) 0 18 9 0 8 16 7
## manual(m6) 3 5 3 0 5 3 0
Try it with margin.table as well:
margin.table(table(mpg$trans, mpg$class), 1)
##
## auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4)
## 5 2 83 39 6 3
## auto(s5) auto(s6) manual(m5) manual(m6)
## 3 16 58 19
margin.table(table(mpg$trans, mpg$class), 2)
##
## 2seater compact midsize minivan pickup subcompact
## 5 47 41 11 33 35
## suv
## 62
Question: how do we find the row and column relative frequencies?
prop.table(margin.table(table(mpg$trans, mpg$class), 1))
##
## auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4)
## 0.021367521 0.008547009 0.354700855 0.166666667 0.025641026 0.012820513
## auto(s5) auto(s6) manual(m5) manual(m6)
## 0.012820513 0.068376068 0.247863248 0.081196581
prop.table(margin.table(table(mpg$trans, mpg$class), 2))
##
## 2seater compact midsize minivan pickup subcompact
## 0.02136752 0.20085470 0.17521368 0.04700855 0.14102564 0.14957265
## suv
## 0.26495726
7.3 Correlation
cor(mpg[c(3, 8, 9)])
## displ cty hwy
## displ 1.00000 NA -0.76602
## cty NA 1 NA
## hwy -0.76602 NA 1.00000
plot(mpg[c(3, 8, 9)])
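The NA entries in the correlation matrix arise because cty contains a missing value (summary(mpg) reports one NA for cty), and by default cor returns NA for any pair that involves a missing observation. The use argument of cor controls this behavior. The sketch below uses a small made-up data frame d with hypothetical values, not the mpg data.

```r
# Hypothetical data frame with one missing cty value (not the mpg dataset)
d <- data.frame(
  displ = c(1.8, 2.0, 2.8, 3.1, 4.2),
  cty   = c(21, NA, 16, 18, 14),
  hwy   = c(29, 31, 26, 27, 23)
)

# Default (use = "everything"): pairs involving cty come back as NA
cor_default <- cor(d)

# Drop every row that has any missing value, then correlate
cor_complete <- cor(d, use = "complete.obs")

# Use all rows available for each pair of variables separately
cor_pairwise <- cor(d, use = "pairwise.complete.obs")

print(cor_default)
print(cor_complete)
print(cor_pairwise)
```

With "pairwise.complete.obs" the displ and hwy correlation still uses all rows, whereas "complete.obs" computes every entry from the same reduced set of rows. Selecting columns by name, for example mpg[c("displ", "cty", "hwy")], is also less fragile than selecting them by position.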