Required libraries:

library(dplyr)

Read the SAT data, needed for multiple problems:

sat <- read.csv('../data/sat.csv')
colnames(sat)
## [1] "ï..sex" "SATV"   "SATM"   "SATSum" "HSGPA"  "FYGPA"
colnames(sat)[1] <- 'sex'
n <- dim(sat)[1]

Question 1

Part (a)

sat %>% filter(SATM > SATV) %>% summarize(meanMathHigher = mean(SATSum))
##   meanMathHigher
## 1       103.6573
sat %>% filter(SATM < SATV) %>% summarize(meanVerbalHigher = mean(SATSum))
##   meanVerbalHigher
## 1         102.1535

Students who do better on math do slightly better on average, though there’s not much difference practically speaking.

Part (b)

Two ways to do this: filtering and regression. I’ll show both. Filtering:

t.test(sat %>% filter(sex == 1) %>% pull(SATSum),
       sat %>% filter(sex == 2) %>% pull(SATSum))
## 
##  Welch Two Sample t-test
## 
## data:  sat %>% filter(sex == 1) %>% pull(SATSum) and sat %>% filter(sex == 2) %>% pull(SATSum)
## t = 5.5064, df = 990.8, p-value = 4.666e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.160173 6.659771
## sample estimates:
## mean of x mean of y 
##  105.7054  100.7955
t.test(sat %>% filter(sex == 1) %>% pull(HSGPA),
       sat %>% filter(sex == 2) %>% pull(HSGPA))
## 
##  Welch Two Sample t-test
## 
## data:  sat %>% filter(sex == 1) %>% pull(HSGPA) and sat %>% filter(sex == 2) %>% pull(HSGPA)
## t = -4.508, df = 997.97, p-value = 7.321e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.21925006 -0.08625958
## sample estimates:
## mean of x mean of y 
##  3.124167  3.276921

In these data, men do better on the SAT on average and women get better grades in high school on average. Both differences are statistically significant with p << .05. Harder to say if they’re practically/scientifically significant.

Can also use regression:

SATmod <- lm(SATSum ~ sex, data = sat)
summary(SATmod)
## 
## Call:
## lm(formula = SATSum ~ sex, data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.705  -9.705   0.205   9.295  43.205 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 110.6154     1.3953   79.28  < 2e-16 ***
## sex          -4.9100     0.8911   -5.51 4.56e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.08 on 998 degrees of freedom
## Multiple R-squared:  0.02952,    Adjusted R-squared:  0.02855 
## F-statistic: 30.36 on 1 and 998 DF,  p-value: 4.563e-08
GPAmod <- lm(HSGPA ~ sex, data = sat)
summary(GPAmod)
## 
## Call:
## lm(formula = HSGPA ~ sex, data = sat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32417 -0.42417 -0.02417  0.42308  1.37583 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.97141    0.05316    55.9  < 2e-16 ***
## sex          0.15275    0.03395     4.5 7.61e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5365 on 998 degrees of freedom
## Multiple R-squared:  0.01988,    Adjusted R-squared:  0.0189 
## F-statistic: 20.25 on 1 and 998 DF,  p-value: 7.61e-06

The coefficient estimate for “sex” is the same as the difference in means(!) and the p-value (in the 4th column of the coefficient table) is the same as the t test p-value. Coincidence???

Part (c)

plot(sat$HSGPA, sat$SATM, 
     col = 'red', xlab = 'High School GPA', ylab = 'SAT Score',
     main = 'HS GPA vs. SAT Subject Scores')
points(sat$HSGPA, sat$SATV, col = 'blue')
legend(4.02, 70, c('SAT Math', 'SAT Verbal'), fill = c('blue', 'red'))

Part (d)

satMc <- (sat$SATM %in% 20:40) + 2*(sat$SATM %in% 41:60) + 3*(sat$SATM %in% 61:80)
satVc <- (sat$SATV %in% 20:40) + 2*(sat$SATV %in% 41:60) + 3*(sat$SATV %in% 61:80)
unique(satMc)
## [1] 3 2 1
unique(satVc)
## [1] 3 2 1
crossTab <- matrix(0, 3, 3)
for (i in 1:n) {
  crossTab[satMc[i], satVc[i]] <- crossTab[satMc[i], satVc[i]] + 1
}
crossTab
##      [,1] [,2] [,3]
## [1,]   31   29    0
## [2,]  107  567   44
## [3,]    9  162   51
table(satMc, satVc)
##      satVc
## satMc   1   2   3
##     1  31  29   0
##     2 107 567  44
##     3   9 162  51
n - sum(diag(crossTab))
## [1] 351

About 1/3 of students fell in different categories for the verbal and math sections, which makes sense, as this will include both students who scored quite differently on the two sections and those that scored close but on either side of a cutoff.

Question 2

Part (a)

The answer was at the end of the lecture8.RMD code file!

f <- function(b) sum((sat$FYGPA - (b[1] + b[2]*sat$SATSum))^2)
optim(c(0,0), f)$par
## [1] 0.00195403 0.02386409
lm(FYGPA ~ SATSum, data = sat)
## 
## Call:
## lm(formula = FYGPA ~ SATSum, data = sat)
## 
## Coefficients:
## (Intercept)       SATSum  
##    0.001927     0.023866

Part (b)

myLm <- function(y, X, parInit) {
  f <- function(b) sum((y - cbind(1, X)%*%b)^2)
  optim(parInit, f)$par
}
myLm(sat$FYGPA, cbind(sat$SATM,
                      sat$SATV,
                      sat$HSGPA),
     rep(0,4))
## [1] -0.85937180  0.01206681  0.01684094  0.57683140
lm(FYGPA ~ SATM + SATV + HSGPA, data = sat)
## 
## Call:
## lm(formula = FYGPA ~ SATM + SATV + HSGPA, data = sat)
## 
## Coefficients:
## (Intercept)         SATM         SATV        HSGPA  
##    -0.86693      0.01240      0.01646      0.58007

Question 3

Part (a)

Unfortunately, this question became confused by 1) the estimated intercept is extremely close to zero and 2) in the previous question, being “close enough” was good enough. So we’re giving credit for it if you attempted it. But here’s the “actual” solution…

lm(FYGPA ~ SATSum - 1, data = sat)
## 
## Call:
## lm(formula = FYGPA ~ SATSum - 1, data = sat)
## 
## Coefficients:
##  SATSum  
## 0.02388
summary(lm(FYGPA ~ SATSum, data = sat))
## 
## Call:
## lm(formula = FYGPA ~ SATSum, data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1976 -0.4495  0.0315  0.4557  1.6115 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.001927   0.151991   0.013     0.99    
## SATSum      0.023866   0.001457  16.379   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.658 on 998 degrees of freedom
## Multiple R-squared:  0.2119, Adjusted R-squared:  0.2111 
## F-statistic: 268.3 on 1 and 998 DF,  p-value: < 2.2e-16
FYGPAc <- sat$FYGPA - mean(sat$FYGPA)
SATc <- sat$SATSum - mean(sat$SATSum)
summary(lm(FYGPAc ~ SATc - 1))
## 
## Call:
## lm(formula = FYGPAc ~ SATc - 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1976 -0.4495  0.0315  0.4557  1.6115 
## 
## Coefficients:
##      Estimate Std. Error t value Pr(>|t|)    
## SATc 0.023866   0.001456   16.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6577 on 999 degrees of freedom
## Multiple R-squared:  0.2119, Adjusted R-squared:  0.2111 
## F-statistic: 268.5 on 1 and 999 DF,  p-value: < 2.2e-16

If you don’t estimate an intercept, you assume it’s zero, so your \(\hat{\beta}\) will be different (in this case, by a small amount). Once you center the data, the coefficient estimates will be the same regardless of whether you estimate an intercept or not, because the estimated intercept must pass through the mean of the data, which is \((0,0)\) when the data are centered. However, if you look at the t values for the centered model with no intercept vs. the original model with intercept, they’re a bit different. So inference and hypothesis testing will be different, even though the coefficient estimate is the same!

Part (b)

triMod <- lm(FYGPA ~ SATM + SATV + HSGPA, data = sat)
coef(triMod)
## (Intercept)        SATM        SATV       HSGPA 
## -0.86692662  0.01239756  0.01645881  0.58006828

Holding SAT Matb and Verbal scores fixed, an increase of 0.1 points in HS GPA is associated with an increase of 0.058 points in FY College GPA.

When we don’t hold SAT scores fixed, we need to think about how they are expected to change with a change in HS GPA. We can do this with regression!

satMincrease <- lm(SATM ~ HSGPA, data = sat)$coef[2]*0.1
satVincrease <- lm(SATV ~ HSGPA, data = sat)$coef[2]*0.1

Now we include these expected increases with the 0.1 HS GPA increase in the original tri-variate model:

coef(triMod)[2:4] %*% c(satMincrease, satVincrease, 0.1)
##            [,1]
## [1,] 0.07431385
coef(lm(FYGPA ~ HSGPA, data = sat))[2]*0.1
##      HSGPA 
## 0.07431385

It’s the same as if we just regressed on HSGPA!