class: center, middle, inverse, title-slide

# Summarizing Data Part 1
## DATA 606 - Statistics & Probability for Data Analytics
### Jason Bryer, Ph.D., Angela Lui, Ph.D., and George Hagstrom, Ph.D.
### September 11, 2024

---
# One Minute Paper Results

.pull-left[
**What was the most important thing you learned during this class?**
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]
.pull-right[
**What important question remains unanswered for you?**
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---
# Workflow

.center[
<img src='images/data-science-wrangle.png' alt = 'Data Science Workflow' width='1000' />
]

.font80[Source: [Wickham & Grolemund, 2017](https://r4ds.had.co.nz)]

---
# Tidy Data

.center[
<img src='images/tidydata_1.jpg' height='500' />
]

See Wickham (2014) [Tidy data](https://vita.had.co.nz/papers/tidy-data.html).

---
# Types of Data

.pull-left[
* Numerical (quantitative)
    * Continuous
    * Discrete
]
.pull-right[
* Categorical (qualitative)
    * Regular categorical
    * Ordinal
]

.center[
<img src='images/continuous_discrete.png' height='400' />
]

---
# Data Types in R

<img src="images/DataTypesConceptModel.png" width="1000" style="display: block; margin: auto;" />

---
# Data Types / Descriptives / Visualizations

Data Type | Descriptive Stats | Visualization
-------------|-----------------------------------------------|-------------------|
Continuous | mean, median, mode, standard deviation, IQR | histogram, density, box plot
Discrete | contingency table, proportional table, median | bar plot
Categorical | contingency table, proportional table | bar plot
Ordinal | contingency table, proportional table, median | bar plot
Two quantitative | correlation | scatter plot
Two qualitative | contingency table, chi-squared | mosaic plot, bar plot
Quantitative & Qualitative | grouped summaries, ANOVA, t-test | box plot

---
# Statistics
.pull-left[
When describing a quantitative variable we are often interested in two things:

1. A measure of center
2. A measure of spread

The most common measures of center we will use in this class are the mean and median. The formulas below define the mean and the variance, a measure of spread:

$$ \bar{x} = \frac{\Sigma(x_i)}{n} $$

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$
]
.pull-right[
Note that in the numerator for the variance calculation we square the differences (also known as deviations). Squaring is a common practice in statistics that serves two purposes:

1. It makes all the values positive.
2. It gives more weight to observations that are farther from the center.
]

---
# Variance

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

Consider a dataset with five values (black points in the figure). For the largest value, the deviation is represented by the blue line ( `\(x_i - \bar{x}\)` ).

See also:

https://shiny.rit.albany.edu/stat/visualizess/

https://github.com/jbryer/VisualStats/
]
.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

In the numerator, we square each of these deviations. We can conceptualize this as a square. Here, we add the deviation in the *y* direction.
]
.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

We end up with a square.
]
.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

We can plot the squared deviation for all the data points. That is, each term in the numerator is the area of one of these squares.
]
.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]

---
# Variance (cont.)

.pull-left[
Population Variance:

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

The variance is therefore the average area of all these squares, here represented by the orange square.
]
.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]

---
# Population versus Sample Variance

.pull-left[
Typically we want the sample variance. The difference is that we divide by `\(n - 1\)` to calculate the sample variance. This results in a slightly larger area (variance) than if we divide by `\(n\)`.

Population Variance (yellow):

$$ S^2 = \frac{\Sigma (x_i - \bar{x})^2}{N}$$

Sample Variance (green):

$$ s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}$$
]
.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]

---
# Robust Statistics

Consider the following data randomly drawn from a normal distribution:

.pull-left[
``` r
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)
mean(x); sd(x)
```

```
## [1] 103.1934
```

```
## [1] 16.8945
```

``` r
median(x); IQR(x)
```

```
## [1] 103.9947
```

```
## [1] 25.68004
```
]
.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />
]

---
# Robust Statistics

<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---
# Robust Statistics

<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

Let's add an extreme value:

``` r
x <- c(x, 1000)
```

--

<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---
# Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD.
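
A minimal sketch of this claim, reusing the seed and sample from the earlier slide. It compares how far each statistic moves when the extreme value 1000 is appended. The helper `summarize_x` is a hypothetical name introduced here for illustration; it is not part of the original slides.

``` r
# Same sample as on the earlier slide
set.seed(41)
x <- rnorm(30, mean = 100, sd = 15)

# Hypothetical helper (not from the slides): center and spread in one vector
summarize_x <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
}

before <- summarize_x(x)
after  <- summarize_x(c(x, 1000))  # append the single extreme value

# Absolute change in each statistic caused by the one outlier
round(abs(after - before), 2)
```

In this sketch the mean and SD shift far more than the median and IQR, even though only one of the 31 values is extreme.
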
Therefore,

* for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
* for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread

---
class: font80
# About `legosets` <img src="images/hex/brickset.png" class="title-hex">

To install the `brickset` package:

``` r
remotes::install_github('jbryer/brickset')
```

To load the `legosets` dataset:

``` r
data('legosets', package = 'brickset')
```

The `legosets` data has 19409 observations of 36 variables.

.code70[
``` r
names(legosets)
```

```
##  [1] "setID"                 "number"                "numberVariant"
##  [4] "name"                  "year"                  "theme"
##  [7] "themeGroup"            "subtheme"              "category"
## [10] "released"              "pieces"                "minifigs"
## [13] "bricksetURL"           "rating"                "reviewCount"
## [16] "packagingType"         "availability"          "agerange_min"
## [19] "thumbnailURL"          "imageURL"              "US_retailPrice"
## [22] "US_dateFirstAvailable" "US_dateLastAvailable"  "UK_retailPrice"
## [25] "UK_dateFirstAvailable" "UK_dateLastAvailable"  "CA_retailPrice"
## [28] "CA_dateFirstAvailable" "CA_dateLastAvailable"  "DE_retailPrice"
## [31] "DE_dateFirstAvailable" "DE_dateLastAvailable"  "height"
## [34] "width"                 "depth"                 "weight"
```
]

---
# Structure (`str`) <img src="images/hex/brickset.png" class="title-hex">

.code50[
``` r
str(legosets)
```

```
## 'data.frame': 19409 obs. of 36 variables:
## $ setID                : int 7693 7695 7697 7698 25534 7418 7419 6020 22704 7421 ...
## $ number               : chr "1" "2" "3" "4" ...
## $ numberVariant        : int 8 8 6 4 6 1 1 1 3 4 ...
## $ name                 : chr "Small house set" "Medium house set" "Medium house set" "Large house set" ...
## $ year                 : int 1970 1970 1970 1970 1970 1970 1970 1970 1970 1970 ...
## $ theme                : chr "Minitalia" "Minitalia" "Minitalia" "Minitalia" ...
## $ themeGroup           : chr "Vintage" "Vintage" "Vintage" "Vintage" ...
## $ subtheme             : chr NA NA NA NA ...
## $ category             : chr "Normal" "Normal" "Normal" "Normal" ...
## $ released             : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ pieces               : int 67 109 158 233 NA 1 1 60 65 NA ...
## $ minifigs             : int NA NA NA NA NA NA NA NA NA NA ...
## $ bricksetURL          : chr "https://brickset.com/sets/1-8" "https://brickset.com/sets/2-8" "https://brickset.com/sets/3-6" "https://brickset.com/sets/4-4" ...
## $ rating               : num 0 0 0 0 0 0 0 0 0 0 ...
## $ reviewCount          : int 0 0 1 0 0 0 0 0 0 0 ...
## $ packagingType        : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
## $ availability         : chr "{Not specified}" "{Not specified}" "{Not specified}" "{Not specified}" ...
## $ agerange_min         : int NA NA NA NA NA NA NA NA NA NA ...
## $ thumbnailURL         : chr "https://images.brickset.com/sets/small/1-8.jpg" "https://images.brickset.com/sets/small/2-8.jpg" "https://images.brickset.com/sets/small/3-6.jpg" "https://images.brickset.com/sets/small/4-4.jpg" ...
## $ imageURL             : chr "https://images.brickset.com/sets/images/1-8.jpg" "https://images.brickset.com/sets/images/2-8.jpg" "https://images.brickset.com/sets/images/3-6.jpg" "https://images.brickset.com/sets/images/4-4.jpg" ...
## $ US_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
## $ US_dateFirstAvailable: Date, format: NA NA ...
## $ US_dateLastAvailable : Date, format: NA NA ...
## $ UK_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
## $ UK_dateFirstAvailable: Date, format: NA NA ...
## $ UK_dateLastAvailable : Date, format: NA NA ...
## $ CA_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
## $ CA_dateFirstAvailable: Date, format: NA NA ...
## $ CA_dateLastAvailable : Date, format: NA NA ...
## $ DE_retailPrice       : num NA NA NA NA NA NA NA NA NA NA ...
## $ DE_dateFirstAvailable: Date, format: NA NA ...
## $ DE_dateLastAvailable : Date, format: NA NA ...
## $ height               : num NA NA NA NA NA ...
## $ width                : num NA NA NA NA NA ...
## $ depth                : num NA NA NA NA NA NA NA NA 5.08 NA ...
## $ weight               : num NA NA NA NA NA NA NA NA NA NA ...
```
]

---
# RStudio Environment tab can help <img src="images/hex/rstudio.png" class="title-hex">

<img src="images/legosets_rstudio_environment.png" width="500" style="display: block; margin: auto;" />

---
class: hide-logo
# Table View

.font60[
]

---
# Data Wrangling Cheat Sheet <img src="images/hex/dplyr.png" class="title-hex">

.center[
<a href='https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf' target='_new'><img src='images/data-transformation.png' width='700' /></a>
]

---
# Tidyverse vs Base R <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/pipe.png" class="title-hex">

.center[
<a href='images/R_Syntax_Comparison.jpeg' target='_new'><img src="images/R_Syntax_Comparison.jpeg" width='700' /></a>
]

---
class: font90
# Pipes `%>%` and `|>` <img src="images/hex/magrittr.png" class="title-hex">

<img src='images/magrittr_pipe.jpg' align='right' width='200' />

.font90[
The pipe operator (`%>%`), introduced with the `magrittr` R package, allows for the chaining of R operations. As of version 4.1, R also has a native pipe operator (`|>`). Both take the output of the left-hand side and pass it as the first argument to the function on the right-hand side.
]

.pull-left[
You can do this in two steps:

``` r
tab_out <- table(legosets$category)
prop.table(tab_out)
```

Or as nested function calls:
``` r
prop.table(table(legosets$category))
```
]
.pull-right[
Using the pipe (`|>`) operator we can chain these calls in what is arguably a more readable format:

``` r
table(legosets$category) |> prop.table()
```
]

<hr />

```
##
##        Book  Collection    Extended        Gear      Normal       Other
## 0.034468546 0.031377196 0.028749549 0.154515946 0.684682364 0.062599825
##      Random
## 0.003606574
```

---
# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_filter_sm.png' width='800' />
]

---
# Logical Operators

* `!a` - TRUE if a is FALSE
* `a == b` - TRUE if a and b are equal
* `a != b` - TRUE if a and b are not equal
* `a > b` - TRUE if a is strictly larger than b
* `a >= b` - TRUE if a is larger than or equal to b
* `a < b` - TRUE if a is strictly smaller than b
* `a <= b` - TRUE if a is smaller than or equal to b
* `a %in% b` - TRUE if a is in b where b is a vector

``` r
which( letters %in% c('a','e','i','o','u') )
```

```
## [1]  1  5  9 15 21
```

* `a | b` - TRUE if a *or* b is TRUE
* `a & b` - TRUE if both a *and* b are TRUE
* `isTRUE(a)` - TRUE if a is TRUE

---
# Filter <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
mylego <- legosets %>% filter(themeGroup == 'Educational' & year > 2015)
```

### Base R

``` r
mylego <- legosets[legosets$themeGroup == 'Educational' & legosets$year > 2015,]
```

<hr />

``` r
nrow(mylego)
```

```
## [1] 99
```

---
# Select <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
mylego <- mylego %>% select(setID, pieces, theme, availability, US_retailPrice, minifigs)
```

### Base R

``` r
mylego <- mylego[,c('setID', 'pieces', 'theme', 'availability', 'US_retailPrice', 'minifigs')]
```

<hr />

``` r
head(mylego, n = 4)
```

```
##   setID pieces     theme    availability US_retailPrice minifigs
## 1 26803    103 Education {Not specified}
NA 6
## 2 26689    142 Education {Not specified}             NA        4
## 3 26804     98 Education {Not specified}             NA        6
## 4 26277    188 Education     Educational          94.95       NA
```

---
# Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_relocate.png' width='800' />
]

---
# Relocate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
mylego %>%
    relocate(where(is.numeric), .after = where(is.character)) %>%
    head(n = 3)
```

```
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
```

### Base R

``` r
mylego2 <- mylego[,c('theme', 'availability', 'setID', 'pieces', 'US_retailPrice', 'minifigs')]
head(mylego2, n = 3)
```

```
##       theme    availability setID pieces US_retailPrice minifigs
## 1 Education {Not specified} 26803    103             NA        6
## 2 Education {Not specified} 26689    142             NA        4
## 3 Education {Not specified} 26804     98             NA        6
```

---
# Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/rename_sm.jpg' width='1000' />
]

---
# Rename <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
mylego %>%
    dplyr::rename(USD = US_retailPrice) %>%
    head(n = 3)
```

```
##   setID pieces     theme    availability USD minifigs
## 1 26803    103 Education {Not specified}  NA        6
## 2 26689    142 Education {Not specified}  NA        4
## 3 26804     98 Education {Not specified}  NA        6
```

### Base R

``` r
names(mylego2)[5] <- 'USD'
head(mylego2, n = 3)
```

```
##       theme    availability setID pieces USD minifigs
## 1 Education {Not specified} 26803    103  NA        6
## 2 Education {Not specified} 26689    142  NA        4
## 3 Education {Not specified} 26804     98  NA        6
```

---
# Mutate <img src="images/hex/tidyverse.png"
class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.center[
<img src='images/dplyr_mutate.png' width='700' />
]

---
# Mutate <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

### `dplyr`

``` r
mylego %>%
    filter(!is.na(pieces) & !is.na(US_retailPrice)) %>%
    mutate(Price_per_piece = US_retailPrice / pieces) %>%
    head(n = 3)
```

```
##   setID pieces     theme availability US_retailPrice minifigs Price_per_piece
## 1 26277    188 Education  Educational          94.95       NA       0.5050532
## 2 25949    280 Education  Educational         224.95       NA       0.8033929
## 3 25954      1 Education  Educational          14.95       NA      14.9500000
```

### Base R

``` r
# Keep rows where both pieces and price are present, then compute price per piece
mylego2 <- mylego[!is.na(mylego$pieces) & !is.na(mylego$US_retailPrice),]
mylego2$Price_per_piece <- mylego2$US_retailPrice / mylego2$pieces
head(mylego2, n = 3)
```

---
# Group By and Summarize <img src="images/hex/tidyverse.png" class="title-hex"><img src="images/hex/dplyr.png" class="title-hex">

.code80[
``` r
legosets %>%
    group_by(themeGroup) %>%
    summarize(mean_price = mean(US_retailPrice, na.rm = TRUE),
              sd_price = sd(US_retailPrice, na.rm = TRUE),
              median_price = median(US_retailPrice, na.rm = TRUE),
              n = n(),
              missing = sum(is.na(US_retailPrice)))
```

```
## # A tibble: 17 × 6
##    themeGroup       mean_price sd_price median_price     n missing
##    <chr>                 <dbl>    <dbl>        <dbl> <int>   <int>
##  1 Action/Adventure       40.2     38.9         30.0  1474     779
##  2 Art and crafts         34.9     47.7         17.5    97       9
##  3 Basic                  21.6     19.2         15.0   873     733
##  4 Constraction           16.4     12.4         13.0   502     284
##  5 Educational           182.     188.         130.
503 465
##  6 Girls                  35.8     24.0         23.0   240     227
##  7 Historical             34.2     32.4         20.0   473     400
##  8 Junior                 22.0     10.1         20.0   228     165
##  9 Licensed               53.3     71.7         30.0  2775    1066
## 10 Miscellaneous          20.7     29.2         13.0  6253    3961
## 11 Model making           74.3     92.1         40.0   771     384
## 12 Modern day             38.2     35.6         30.0  2469    1535
## 13 Pre-school             30.8     22.7         25.0  1562    1103
## 14 Racing                 26.8     26.5         15.0   270     176
## 15 Technical              82.8     95.3         50.0   607     327
## 16 Vintage               NaN       NA           NA     306     306
## 17 <NA>                  NaN       NA           NA       6       6
```
]

---
# Describe and Describe By

``` r
library(psych)
describe(legosets$US_retailPrice)
```

```
##    vars    n  mean   sd median trimmed   mad  min    max range skew kurtosis
## X1    1 7483 38.96 56.5  19.99    27.7 17.79 1.49 849.99 848.5 5.32    44.74
##      se
## X1 0.65
```

``` r
describeBy(legosets$US_retailPrice, group = legosets$availability, mat = TRUE, skew = FALSE)
```

```
##      item                group1 vars    n      mean        sd median   min    max range         se
## X11     1       {Not specified}    1 1831  26.84733  39.96747  19.99  1.49 789.99 788.5  0.9340335
## X12     2           Educational    1   12 212.86667 105.88283 222.45 14.95 399.95 385.0 30.5657410
## X13     3        LEGO exclusive    1 1039  57.21203 106.63125  12.99  1.99 849.99 848.0  3.3080857
## X14     4    LEGOLAND exclusive    1    2   4.99000   0.00000   4.99  4.99   4.99   0.0  0.0000000
## X15     5              Not sold    1    1  12.99000        NA  12.99 12.99  12.99   0.0         NA
## X16     6           Promotional    1    5   4.79000   0.83666   4.99  3.99   5.99   2.0  0.3741657
## X17     7 Promotional (Airline)    1    0       NaN        NA     NA   Inf   -Inf  -Inf         NA
## X18     8                Retail    1 4290  37.55889  38.44918  24.99  1.99 699.99 698.0  0.5870275
## X19     9      Retail - limited    1  302  63.54381  70.91908  39.99  2.49 449.99 447.5  4.0809343
## X110   10               Unknown    1    1   3.99000        NA   3.99  3.99   3.99   0.0         NA
```

---
# Additional Resources

For data wrangling:

* `dplyr` website: https://dplyr.tidyverse.org
* R for Data Science book: https://r4ds.had.co.nz/wrangle-intro.html
* Wrangling penguins tutorial: https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome
* Data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

---
class: left,
font140
# One Minute Paper

.pull-left[
1. What was the most important thing you learned during this class?
2. What important question remains unanswered for you?
]
.pull-right[
<img src="02-Summarizing_Data1_files/figure-html/unnamed-chunk-44-1.png" style="display: block; margin: auto;" />
]

https://forms.gle/ESBAdHRhzT65fW6c6
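
---
# Variance in R

A minimal sketch connecting the population and sample variance formulas from the earlier slides to R. The five values below are made up for illustration (an assumption; they are not the points plotted in the figures). R's built-in `var()` uses the `n - 1` denominator, so it returns the sample variance; the population version is computed by hand here.

``` r
# Illustrative data (an assumption; not the values shown in the slides)
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Sum of squared deviations from the mean: the numerator in both formulas
ss <- sum((x - mean(x))^2)

pop_var    <- ss / n        # population variance: divide by n
sample_var <- ss / (n - 1)  # sample variance: divide by n - 1

# var() divides by n - 1, so it should agree with sample_var
all.equal(var(x), sample_var)

# The sample variance is larger by a factor of n / (n - 1)
sample_var / pop_var
```

The ratio of the two depends only on `n`, which is why the difference between them shrinks as the sample grows.
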