Required libraries:
library(dplyr)
Read the SAT data, needed for multiple problems:
sat <- read.csv('../data/sat.csv')
colnames(sat)
## [1] "ï..sex" "SATV" "SATM" "SATSum" "HSGPA" "FYGPA"
colnames(sat)[1] <- 'sex'
n <- dim(sat)[1]
sat %>% filter(SATM > SATV) %>% summarize(meanMathHigher = mean(SATSum))
## meanMathHigher
## 1 103.6573
sat %>% filter(SATM < SATV) %>% summarize(meanVerbalHigher = mean(SATSum))
## meanVerbalHigher
## 1 102.1535
Students who do better on math do slightly better on average, though there’s not much difference practically speaking.
Two ways to do this: filtering and regression. I’ll show both. Filtering:
t.test(sat %>% filter(sex == 1) %>% pull(SATSum),
sat %>% filter(sex == 2) %>% pull(SATSum))
##
## Welch Two Sample t-test
##
## data: sat %>% filter(sex == 1) %>% pull(SATSum) and sat %>% filter(sex == 2) %>% pull(SATSum)
## t = 5.5064, df = 990.8, p-value = 4.666e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.160173 6.659771
## sample estimates:
## mean of x mean of y
## 105.7054 100.7955
t.test(sat %>% filter(sex == 1) %>% pull(HSGPA),
sat %>% filter(sex == 2) %>% pull(HSGPA))
##
## Welch Two Sample t-test
##
## data: sat %>% filter(sex == 1) %>% pull(HSGPA) and sat %>% filter(sex == 2) %>% pull(HSGPA)
## t = -4.508, df = 997.97, p-value = 7.321e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.21925006 -0.08625958
## sample estimates:
## mean of x mean of y
## 3.124167 3.276921
In these data, men do better on the SAT on average and women get better grades in high school on average. Both differences are statistically significant with p << .05. Harder to say if they’re practically/scientifically significant.
Can also use regression:
SATmod <- lm(SATSum ~ sex, data = sat)
summary(SATmod)
##
## Call:
## lm(formula = SATSum ~ sex, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.705 -9.705 0.205 9.295 43.205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 110.6154 1.3953 79.28 < 2e-16 ***
## sex -4.9100 0.8911 -5.51 4.56e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.08 on 998 degrees of freedom
## Multiple R-squared: 0.02952, Adjusted R-squared: 0.02855
## F-statistic: 30.36 on 1 and 998 DF, p-value: 4.563e-08
GPAmod <- lm(HSGPA ~ sex, data = sat)
summary(GPAmod)
##
## Call:
## lm(formula = HSGPA ~ sex, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32417 -0.42417 -0.02417 0.42308 1.37583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.97141 0.05316 55.9 < 2e-16 ***
## sex 0.15275 0.03395 4.5 7.61e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5365 on 998 degrees of freedom
## Multiple R-squared: 0.01988, Adjusted R-squared: 0.0189
## F-statistic: 20.25 on 1 and 998 DF, p-value: 7.61e-06
The coefficient estimate for “sex” is the same as the difference in means(!) and the p-value (in the 4th column of the coefficient table) is the same as the t test p-value. Coincidence???
plot(sat$HSGPA, sat$SATM,
col = 'red', xlab = 'High School GPA', ylab = 'SAT Score',
main = 'HS GPA vs. SAT Subject Scores')
points(sat$HSGPA, sat$SATV, col = 'blue')
legend(4.02, 70, c('SAT Math', 'SAT Verbal'), fill = c('blue', 'red'))
satMc <- (sat$SATM %in% 20:40) + 2*(sat$SATM %in% 41:60) + 3*(sat$SATM %in% 61:80)
satVc <- (sat$SATV %in% 20:40) + 2*(sat$SATV %in% 41:60) + 3*(sat$SATV %in% 61:80)
unique(satMc)
## [1] 3 2 1
unique(satVc)
## [1] 3 2 1
crossTab <- matrix(0, 3, 3)
for (i in 1:n) {
crossTab[satMc[i], satVc[i]] <- crossTab[satMc[i], satVc[i]] + 1
}
crossTab
## [,1] [,2] [,3]
## [1,] 31 29 0
## [2,] 107 567 44
## [3,] 9 162 51
table(satMc, satVc)
## satVc
## satMc 1 2 3
## 1 31 29 0
## 2 107 567 44
## 3 9 162 51
n - sum(diag(crossTab))
## [1] 351
About 1/3 of students fell in different categories for the verbal and math sections, which makes sense, as this will include both students who scored quite differently on the two sections and those that scored close but on either side of a cutoff.
The answer was at the end of the lecture8.RMD code file!
f <- function(b) sum((sat$FYGPA - (b[1] + b[2]*sat$SATSum))^2)
optim(c(0,0), f)$par
## [1] 0.00195403 0.02386409
lm(FYGPA ~ SATSum, data = sat)
##
## Call:
## lm(formula = FYGPA ~ SATSum, data = sat)
##
## Coefficients:
## (Intercept) SATSum
## 0.001927 0.023866
myLm <- function(y, X, parInit) {
f <- function(b) sum((y - cbind(1, X)%*%b)^2)
optim(parInit, f)$par
}
myLm(sat$FYGPA, cbind(sat$SATM,
sat$SATV,
sat$HSGPA),
rep(0,4))
## [1] -0.85937180 0.01206681 0.01684094 0.57683140
lm(FYGPA ~ SATM + SATV + HSGPA, data = sat)
##
## Call:
## lm(formula = FYGPA ~ SATM + SATV + HSGPA, data = sat)
##
## Coefficients:
## (Intercept) SATM SATV HSGPA
## -0.86693 0.01240 0.01646 0.58007
Unfortunately, this question became confused by 1) the estimated intercept is extremely close to zero and 2) in the previous question, being “close enough” was good enough. So we’re giving credit for it if you attempted it. But here’s the “actual” solution…
lm(FYGPA ~ SATSum - 1, data = sat)
##
## Call:
## lm(formula = FYGPA ~ SATSum - 1, data = sat)
##
## Coefficients:
## SATSum
## 0.02388
summary(lm(FYGPA ~ SATSum, data = sat))
##
## Call:
## lm(formula = FYGPA ~ SATSum, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1976 -0.4495 0.0315 0.4557 1.6115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.001927 0.151991 0.013 0.99
## SATSum 0.023866 0.001457 16.379 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.658 on 998 degrees of freedom
## Multiple R-squared: 0.2119, Adjusted R-squared: 0.2111
## F-statistic: 268.3 on 1 and 998 DF, p-value: < 2.2e-16
FYGPAc <- sat$FYGPA - mean(sat$FYGPA)
SATc <- sat$SATSum - mean(sat$SATSum)
summary(lm(FYGPAc ~ SATc - 1))
##
## Call:
## lm(formula = FYGPAc ~ SATc - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1976 -0.4495 0.0315 0.4557 1.6115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## SATc 0.023866 0.001456 16.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6577 on 999 degrees of freedom
## Multiple R-squared: 0.2119, Adjusted R-squared: 0.2111
## F-statistic: 268.5 on 1 and 999 DF, p-value: < 2.2e-16
If you don’t estimate an intercept, you assume it’s zero, so your \(\hat{\beta}\) will be different (in this case, by a small amount). Once you center the data, the coefficient estimates will be the same regardless of whether you estimate an intercept or not, because the estimated intercept must pass through the mean of the data, which is \((0,0)\) when the data are centered. However, if you look at the t values for the centered model with no intercept vs. the original model with intercept, they’re a bit different. So inference and hypothesis testing will be different, even though the coefficient estimate is the same!
triMod <- lm(FYGPA ~ SATM + SATV + HSGPA, data = sat)
coef(triMod)
## (Intercept) SATM SATV HSGPA
## -0.86692662 0.01239756 0.01645881 0.58006828
Holding SAT Matb and Verbal scores fixed, an increase of 0.1 points in HS GPA is associated with an increase of 0.058 points in FY College GPA.
When we don’t hold SAT scores fixed, we need to think about how they are expected to change with a change in HS GPA. We can do this with regression!
satMincrease <- lm(SATM ~ HSGPA, data = sat)$coef[2]*0.1
satVincrease <- lm(SATV ~ HSGPA, data = sat)$coef[2]*0.1
Now we include these expected increases with the 0.1 HS GPA increase in the original tri-variate model:
coef(triMod)[2:4] %*% c(satMincrease, satVincrease, 0.1)
## [,1]
## [1,] 0.07431385
coef(lm(FYGPA ~ HSGPA, data = sat))[2]*0.1
## HSGPA
## 0.07431385
It’s the same as if we just regressed on HSGPA!