class: center, middle, inverse, title-slide # Linear Regression Part 2 ## DATA 606 - Statistics & Probability for Data Analytics ### Jason Bryer, Ph.D., Angela Lui, Ph.D., and George Hagstrom, Ph.D. ### November 6, 2024 --- # One Minute Paper Results .pull-left[ **What was the most important thing you learned during this class?** <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> ] .pull-right[ **What important question remains unanswered for you?** <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> ] --- # NYS Report Card NYS publishes data for each school in the state. We will look at the grade 8 math scores for 2012 and 2013. 2013 was the first year the tests were aligned with the Common Core Standards. There was a lot of press about how the passing rates for most schools dropped. Two questions we wish to answer: 1. Did the passing rates drop in a predictable manner? 2. Were the drops different for charter and public schools? ``` r load('../course_data/NYSReportCard-Grade7Math.Rda') names(reportCard) ``` ``` ## [1] "BEDSCODE" "School" "NumTested2012" "Mean2012" ## [5] "Pass2012" "Charter" "GradeSubject" "County" ## [9] "BOCES" "NumTested2013" "Mean2013" "Pass2013" ``` --- # `reportCard` Data Frame .font70[
] --- # Descriptive Statistics ``` r summary(reportCard$Pass2012) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 46.00 65.00 61.73 80.00 100.00 ``` ``` r summary(reportCard$Pass2013) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 7.00 20.00 22.83 33.00 99.00 ``` --- # Histograms ``` r melted <- melt(reportCard[,c('Pass2012', 'Pass2013')]) ggplot(melted, aes(x=value)) + geom_histogram() + facet_wrap(~ variable, ncol=1) ``` <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- # Log Transformation Since the distribution of the 2013 passing rates is skewed, we can log transfor that variable to get a more reasonably normal distribution. ``` r reportCard$LogPass2013 <- log(reportCard$Pass2013 + 1) ggplot(reportCard, aes(x=LogPass2013)) + geom_histogram(binwidth=0.5) ``` <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- # Scatter Plot ``` r ggplot(reportCard, aes(x=Pass2012, y=Pass2013, color=Charter)) + geom_point(alpha=0.5) + coord_equal() + ylim(c(0,100)) + xlim(c(0,100)) ``` <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- # Scatter Plot (log transform) ``` r ggplot(reportCard, aes(x=Pass2012, y=LogPass2013, color=Charter)) + geom_point(alpha=0.5) + xlim(c(0,100)) + ylim(c(0, log(101))) ``` <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- # Correlation ``` r cor.test(reportCard$Pass2012, reportCard$Pass2013) ``` ``` ## ## Pearson's product-moment correlation ## ## data: reportCard$Pass2012 and reportCard$Pass2013 ## t = 47.166, df = 1360, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.7667526 0.8071276 ## sample estimates: ## cor ## 0.7877848 ``` --- # Correlation (log transform) ``` r cor.test(reportCard$Pass2012, reportCard$LogPass2013) ``` ``` ## ## Pearson's product-moment correlation ## ## data: reportCard$Pass2012 and reportCard$LogPass2013 ## t = 56.499, df = 1360, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.8207912 0.8525925 ## sample estimates: ## cor ## 0.8373991 ``` --- # Linear Regression .code80[ ``` r lm.out <- lm(Pass2013 ~ Pass2012, data=reportCard) summary(lm.out) ``` ``` ## ## Call: ## lm(formula = Pass2013 ~ Pass2012, data = reportCard) ## ## Residuals: ## Min 1Q Median 3Q Max ## -35.484 -6.878 -0.478 5.965 51.675 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -16.68965 0.89378 -18.67 <2e-16 *** ## Pass2012 0.64014 0.01357 47.17 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 11.49 on 1360 degrees of freedom ## Multiple R-squared: 0.6206, Adjusted R-squared: 0.6203 ## F-statistic: 2225 on 1 and 1360 DF, p-value: < 2.2e-16 ``` ] --- # Linear Regression (log transform) .code80[ ``` r lm.log.out <- lm(LogPass2013 ~ Pass2012, data=reportCard) summary(lm.log.out) ``` ``` ## ## Call: ## lm(formula = LogPass2013 ~ Pass2012, data = reportCard) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.3880 -0.2531 0.0776 0.3461 2.7368 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.307692 0.046030 6.685 3.37e-11 *** ## Pass2012 0.039491 0.000699 56.499 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.5915 on 1360 degrees of freedom ## Multiple R-squared: 0.7012, Adjusted R-squared: 0.701 ## F-statistic: 3192 on 1 and 1360 DF, p-value: < 2.2e-16 ``` ] --- # Did the passing rates drop in a predictable manner? Yes! Whether we log tranform the data or not, the correlations are statistically significant with regression models with `\(R^2\)` creater than 62%. To answer the second question, whether the drops were different for public and charter schools, we'll look at the residuals. ``` r reportCard$residuals <- resid(lm.out) reportCard$residualsLog <- resid(lm.log.out) ``` --- # Distribution of Residuals ``` r ggplot(reportCard, aes(x=residuals, color=Charter)) + geom_density() ``` <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- # Distribution of Residuals ``` r ggplot(reportCard, aes(x=residualsLog, color=Charter)) + geom_density() ``` <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> --- # Null Hypothesis Testing `\(H_0\)`: There is no difference in the residuals between charter and public schools. `\(H_A\)`: There is a difference in the residuals between charter and public schools. ``` r t.test(residuals ~ Charter, data=reportCard) ``` ``` ## ## Welch Two Sample t-test ## ## data: residuals by Charter ## t = 6.5751, df = 77.633, p-value = 5.091e-09 ## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0 ## 95 percent confidence interval: ## 6.411064 11.980002 ## sample estimates: ## mean in group FALSE mean in group TRUE ## 0.479356 -8.716177 ``` --- # Null Hypothesis Testing (log transform) ``` r t.test(residualsLog ~ Charter, data=reportCard) ``` ``` ## ## Welch Two Sample t-test ## ## data: residualsLog by Charter ## t = 4.7957, df = 74.136, p-value = 8.161e-06 ## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0 ## 95 percent confidence interval: ## 0.2642811 0.6399761 ## sample estimates: ## mean in group FALSE mean in group TRUE ## 0.02356911 -0.42855946 ``` --- # Polynomial Models (e.g. Quadratic) It is possible to fit quatric models fairly easily in R, say of the following form: $$ y = b_1 x^2 + b_2 x + b_0 $$ ``` r quad.out <- lm(Pass2013 ~ I(Pass2012^2) + Pass2012, data=reportCard) summary(quad.out)$r.squared ``` ``` ## [1] 0.7065206 ``` ``` r summary(lm.out)$r.squared ``` ``` ## [1] 0.6206049 ``` --- # Quadratic Model .code80[ ``` r summary(quad.out) ``` ``` ## ## Call: ## lm(formula = Pass2013 ~ I(Pass2012^2) + Pass2012, data = reportCard) ## ## Residuals: ## Min 1Q Median 3Q Max ## -46.258 -4.906 -0.507 5.430 43.509 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.0466153 1.4263773 4.940 8.77e-07 *** ## I(Pass2012^2) 0.0092937 0.0004659 19.946 < 2e-16 *** ## Pass2012 -0.3972481 0.0533631 -7.444 1.72e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 10.11 on 1359 degrees of freedom ## Multiple R-squared: 0.7065, Adjusted R-squared: 0.7061 ## F-statistic: 1636 on 2 and 1359 DF, p-value: < 2.2e-16 ``` ] --- # Scatter Plot ``` r ggplot(reportCard, aes(x=Pass2012, y=Pass2013)) + geom_point(alpha=0.2) + geom_smooth(method='lm', formula=y ~ x, size=2, se=FALSE) + geom_smooth(method='lm', formula=y ~ I(x^2) + x, size=2, se=FALSE) + coord_equal() + ylim(c(0,100)) + xlim(c(0,100)) ``` <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> ``` r reportCard$resid_linear <- resid(lm.out) ggplot(reportCard, aes(x = resid_linear)) + geom_histogram() reportCard$resid_quad <- resid(quad.out) ggplot(reportCard, aes(x = resid_quad)) + geom_histogram() ggplot(reportCard) + geom_density(aes(x = resid_linear), color = 'blue') + geom_density(aes(x = resid_quad), color = 'maroon') sum(reportCard$resid_linear^2) /sum(reportCard$resid_quad^2) ``` --- # Let's go crazy, cubic! ``` r cube.out <- lm(Pass2013 ~ I(Pass2012^3) + I(Pass2012^2) + Pass2012, data=reportCard) summary(cube.out)$r.squared ``` ``` ## [1] 0.7168206 ``` <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> --- # Be careful of overfitting... .pull-left[ ![](images/Overfitting_print.png) ] .pull-right[ ![](images/Overfit_Vs_Underfit_print.png) ] .font60[Source: Chris Albon [@chrisalbon](https://twitter.com/chrisalbon) [MachineLearningFlashCards.com](https://machinelearningflashcards.com)] --- # Loess Regression ``` r ggplot(reportCard, aes(x=Pass2012, y=Pass2013)) + geom_point(alpha=0.2) + geom_smooth(method='lm', formula=y~poly(x,2,raw=TRUE), size=2, se=FALSE) + geom_smooth(method='loess', formula = y ~ x, size=2, se=FALSE, color = 'purple') + coord_equal() + ylim(c(0,100)) + xlim(c(0,100)) ``` .pull-left[ <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ``` r library('VisualStats') library('ShinyDemo') shiny_demo('loess', package = 'VisualStats') ``` See this site for more info: https://jbryer.github.io/VisualStats/articles/loess.html ] --- # Shiny App ``` r shiny::runGitHub('NYSchools','jbryer',subdir='NYSReportCard') ``` See also the Github repository for more information: https://github.com/jbryer/NYSchools --- class: left, font140 # One Minute Paper .pull-left[ 1. What was the most important thing you learned during this class? 2. What important question remains unanswered for you? ] .pull-right[ <img src="08-Linear_Regression2_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" /> ] https://forms.gle/ESBAdHRhzT65fW6c6