Collaborators: me, myself, and I

Required libraries:

library(dplyr)

Read the SAT and coronavirus data:

sat <- read.csv('../data/sat.csv')
colnames(sat)[1] <- 'sex'
colnames(sat)
## [1] "sex"    "SATV"   "SATM"   "SATSum" "HSGPA"  "FYGPA"
sat$sex <- sat$sex - 1  # 0/1 encoding

covid <- read.csv("../data/coronavirus.csv")
colnames(covid)[1] <- 'Case_Type'
colnames(covid)
##  [1] "Case_Type"         "Cases"             "Difference"       
##  [4] "Date"              "Country_Region"    "Province_State"   
##  [7] "Admin2"            "Combined_Key"      "FIPS"             
## [10] "Lat"               "Long"              "Table_Names"      
## [13] "Prep_Flow_Runtime"
covid$Date <- as.Date(covid$Date, format = '%m/%d/%Y')

Question 1

Part (a)

greet <- function(greeting) {
  # greet only if the input is already all uppercase; otherwise throw an error
  if (toupper(greeting) == greeting) {
    return(paste('Hello', greeting))
  } else {
    stop('Not uppercase')
  }
}
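
A quick demonstration (a sketch; the error message comes from the stop() call above):

greet('WORLD')            # returns "Hello WORLD"
tryCatch(greet('world'),  # not uppercase, so greet() throws an error
         error = function(e) conditionMessage(e))  # "Not uppercase"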

Part (b)

satSex <- lm(SATSum ~ sex, data = sat)
HSgpaSex <- lm(HSGPA ~ sex, data = sat)
summary(satSex)
## 
## Call:
## lm(formula = SATSum ~ sex, data = sat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.705  -9.705   0.205   9.295  43.205 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 105.7054     0.6199  170.51  < 2e-16 ***
## sex          -4.9100     0.8911   -5.51 4.56e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.08 on 998 degrees of freedom
## Multiple R-squared:  0.02952,    Adjusted R-squared:  0.02855 
## F-statistic: 30.36 on 1 and 998 DF,  p-value: 4.563e-08
summary(HSgpaSex)
## 
## Call:
## lm(formula = HSGPA ~ sex, data = sat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32417 -0.42417 -0.02417  0.42308  1.37583 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.12417    0.02362   132.3  < 2e-16 ***
## sex          0.15275    0.03395     4.5 7.61e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5365 on 998 degrees of freedom
## Multiple R-squared:  0.01988,    Adjusted R-squared:  0.0189 
## F-statistic: 20.25 on 1 and 998 DF,  p-value: 7.61e-06
coef(summary(satSex))[2,4]
## [1] 4.562573e-08
coef(summary(HSgpaSex))[2,4]
## [1] 7.610412e-06

In both cases, \(\hat\beta\) equals the difference in means between the two groups, and the p-values for \(\beta\) match those from the t-test (up to the small numerical precision errors that, as you have seen, are common in optimization problems). That’s because \(\beta\) is the difference in expected outcome associated with a one-unit difference in sex, which under our 0/1 encoding is exactly the difference between male and female.
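
A quick check of both claims (a sketch; OLS with a binary regressor reproduces the pooled, equal-variance t-test, hence var.equal = TRUE):

# the slope equals the difference in group means
with(sat, mean(SATSum[sex == 1]) - mean(SATSum[sex == 0]))
coef(satSex)[2]

# the slope's p-value equals the equal-variance two-sample t-test's
t.test(SATSum ~ sex, data = sat, var.equal = TRUE)$p.value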

Part (c)

x <- rnorm(10^5)                     # x ~ N(0, 1)
z <- rnorm(10^5, x)                  # z is centered at x, so x and z are correlated
y <- rnorm(10^5, x + 2*z - 10*x*z)   # the true model includes an x:z interaction

mod1 <- lm(y ~ x + z)
mod2 <- lm(y ~ x*z)
confint(mod1, c(2,3))
##       2.5 %   97.5 %
## x 0.8680927 1.172473
## z 1.9909442 2.206031
confint(mod2, c(2,3))
##       2.5 %   97.5 %
## x 0.9933997 1.010954
## z 1.9941859 2.006591

Both intervals contain the truth (1 for x and 2 for z) in both models, misspecified and correctly specified, but the intervals are shorter in the correct model. You will be graded on whether you wrote the correct code, not on whether you got exactly these outputs, because they are random!
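
One way to quantify “shorter” (a sketch; like everything in this simulation, the widths vary from run to run):

# width of each 95% CI: misspecified model first, then the correct one
apply(confint(mod1, c(2, 3)), 1, diff)
apply(confint(mod2, c(2, 3)), 1, diff)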

Question 2

Part (a)

mod2a <- lm(FYGPA ~ SATSum + HSGPA, data = sat)
betas2a <- coef(mod2a)
a2a <- c(1, 100, 3)  # intercept, SATSum = 100, HSGPA = 3
se2a <- sqrt(a2a %*% vcov(mod2a) %*% a2a)  # SE of the linear combination a'beta
cat('90% CI for FYGPA of average student:',
    c(a2a%*%betas2a + qnorm(.05)*se2a, a2a%*%betas2a + qnorm(.95)*se2a),
    '\nwith expectation', a2a%*%betas2a, '\n\n')
## 90% CI for FYGPA of average student: 2.272201 2.338219 
## with expectation 2.30521
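
For comparison (an alternative, not required here): predict() with interval = 'confidence' gives essentially the same interval for the same covariate values, differing only in that it uses t rather than normal quantiles:

# t-based 90% CI for mean FYGPA at SATSum = 100, HSGPA = 3
predict(mod2a, newdata = data.frame(SATSum = 100, HSGPA = 3),
        interval = 'confidence', level = 0.90)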

Part (b)

mod2b <- lm(FYGPA ~ HSGPA*sex, data = sat)
betas2b <- coef(mod2b)
plot(sat %>% filter(sex == 0) %>% pull(HSGPA),
     sat %>% filter(sex == 0) %>% pull(FYGPA),
     col = 'red', main = 'HS vs. College GPA by Sex',
     xlab = 'HS GPA', ylab = 'College GPA')
points(sat %>% filter(sex == 1) %>% pull(HSGPA),
     sat %>% filter(sex == 1) %>% pull(FYGPA),
     col = 'blue')
abline(betas2b[1], betas2b[2], col = 'red')
abline(betas2b[1]+betas2b[3], betas2b[2]+betas2b[4], col = 'blue')
legend(4.025, 1.2, legend = c('Male', 'Female'), fill = c('red', 'blue'))

Part (c)

a2c <- c(0, 1, 0, 1)  # beta_HSGPA + beta_HSGPA:sex, the HSGPA slope when sex == 1
se2c <- sqrt(a2c %*% vcov(mod2b) %*% a2c)
c(betas2b %*% a2c + qnorm(0.05)*se2c, betas2b %*% a2c + qnorm(0.95)*se2c)
## [1] 0.7827302 0.9609019
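
As an optional sanity check (a sketch, not part of the assignment): refitting on the sex == 1 subset alone should give a similar interval for the HSGPA slope, though not an identical one, since it estimates a separate residual variance and uses t rather than normal quantiles.

confint(lm(FYGPA ~ HSGPA, data = sat[sat$sex == 1, ]), 'HSGPA', level = 0.90)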

Part (d)

infer <- function(lmod, a, conf = 0.95) {
  # normal-approximation CI for the linear combination a'beta of a fitted lm
  betas <- coef(lmod)
  se <- sqrt(a %*% vcov(lmod) %*% a)  # SE of a'betahat
  return(c(betas %*% a + qnorm((1-conf)/2)*se,
           betas %*% a + qnorm(conf + (1-conf)/2)*se))  # qnorm((1+conf)/2)
}
infer(mod2a, a2a, 0.9)
## [1] 2.272201 2.338219
infer(mod2b, a2c, 0.9)
## [1] 0.7827302 0.9609019

Question 3

Part (a)

covid %>%
  filter(Admin2 == 'San Francisco',
         Case_Type == 'Confirmed',
         Cases > 0,
         Date < '2020-04-01',
         Date >= '2020-03-01') %>%
  select(Cases, Date) %>%
  arrange(Date) ->
  sf
head(sf)
##   Cases       Date
## 1     2 2020-03-05
## 2     2 2020-03-06
## 3     9 2020-03-07
## 4     9 2020-03-08
## 5     9 2020-03-09
## 6    14 2020-03-10

It’s okay if they weren’t ordered; we’re checking for both (apologies to the grader: I should have asked for ordering by Date).

Part (b)

plot(sf$Date, sf$Cases, type = 'l',
     xlab = 'Date', ylab = 'Confirmed Cases',
     main = 'SF COVID-19 Cases in March')

plot(sf$Date, log(sf$Cases), type = 'l',
     xlab = 'Date', ylab = 'log(Confirmed Cases)',
     main = 'SF COVID-19 Cases in March (log scale)')

Except for the little kink in the first part of the month, linearity appears pretty reasonable for the log of the confirmed cases. Linearity is clearly inappropriate on the original scale because the slope is continuously and rapidly increasing.
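
To see why the log scale linearizes the trend: if cumulative cases grow exponentially, \(C(t) = C_0 e^{rt}\), then

\[
\log C(t) = \log C_0 + rt,
\]

which is linear in \(t\) with slope equal to the daily growth rate \(r\).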

Part (c)

sf$logCases <- log(sf$Cases)
sfMod <- lm(logCases ~ Date, data = sf)
summary(sfMod)
## 
## Call:
## lm(formula = logCases ~ Date, data = sf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91271 -0.01323  0.05552  0.16770  0.40233 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.463e+03  1.295e+02  -26.73   <2e-16 ***
## Date         1.890e-01  7.063e-03   26.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2859 on 25 degrees of freedom
## Multiple R-squared:  0.9663, Adjusted R-squared:  0.9649 
## F-statistic: 716.4 on 1 and 25 DF,  p-value: < 2.2e-16

The expected increase in log cumulative case count associated with a one-day step forward in time is about 0.19.
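
Because the model is linear on the log scale, this slope corresponds to a multiplicative daily growth factor of \(e^{0.189} \approx 1.21\), i.e. roughly 21% growth per day. A quick check:

exp(coef(sfMod)['Date'])  # estimated daily multiplicative growth factor, about 1.21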

plot(sfMod, which = 1)

If we ignore the first two data points, the residuals appear to bear little relationship to the fitted values (and hence to Date). However, we should not ignore those two points, especially with only 27 observations in total! There is some concern that they artificially steepened the regression line.

plot(sfMod, which = 2)

The three to five smallest residuals make the observed distribution substantially non-normal, so the standard error estimates for the model coefficients should be treated with caution.

# fitted value on the last observed day (March 31) plus one more day of slope
apr1log <- predict(sfMod)[nrow(sf)] + coef(sfMod)[2]
apr1 <- exp(apr1log)  # back-transform to the original case-count scale
apr1log
##       27 
## 6.520888
apr1
##       27 
## 679.1813
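
Equivalently (a sketch; this gives the same point prediction, since Date enters the model linearly), we can ask predict() for April 1 directly:

exp(predict(sfMod, newdata = data.frame(Date = as.Date('2020-04-01'))))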
covid %>% 
  filter(Admin2 == 'San Francisco',
         Date == '2020-04-01',
         Case_Type == 'Confirmed') %>%
  select(Cases)
##   Cases
## 1   434

We predict 679 cases on April 1, far more than the 434 that were actually observed. Let’s investigate: our earlier examination of the residuals suggested that the regression line’s slope might be too high. We plot the predicted cases (on the original scale) against the true number of cases, along with the line \(y=x\):

plot(sf$Cases, exp(predict(sfMod)),
     xlab = 'Observed Cases', ylab = 'Predicted Cases')
abline(0, 1)  # the line y = x

Indeed, we see that our model begins to overpredict towards the end of March. And because we regressed on the log scale, even modest errors there become quite large once exponentiated back to the original scale. Modeling pandemics ain’t easy!