Collaborators: me, myself, and I
Required libraries:
library(dplyr)
Read the SAT and coronavirus data:
sat <- read.csv('../data/sat.csv')
colnames(sat)[1] <- 'sex'
colnames(sat)
## [1] "sex" "SATV" "SATM" "SATSum" "HSGPA" "FYGPA"
sat$sex <- sat$sex - 1 # 0/1 encoding
covid <- read.csv("../data/coronavirus.csv")
colnames(covid)[1] <- 'Case_Type'
colnames(covid)
## [1] "Case_Type" "Cases" "Difference"
## [4] "Date" "Country_Region" "Province_State"
## [7] "Admin2" "Combined_Key" "FIPS"
## [10] "Lat" "Long" "Table_Names"
## [13] "Prep_Flow_Runtime"
covid$Date <- as.Date(covid$Date, format = '%m/%d/%Y')
RMSE helper function (used throughout) and helpers for the quantile-skewness simulation:
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
skew <- function(x) mean(((x - mean(x))/sd(x))^3)     # sample skewness
getMat <- function(n) matrix(runif(1000*n), n, 1000)  # 1000 Uniform(0,1) samples of size n, one per column
qSkew <- function(q, x) skew(apply(x, 2, function(y) quantile(y, q)))  # skewness of the q-th sample quantile across columns
ns <- c(5, 20, 100, 500, 2000)  # sample sizes
qs <- seq(0, 1, 0.01)           # quantiles
weks <- t(sapply(qs, function(q) sapply(lapply(ns, getMat), function(x) qSkew(q, x))))  # rows index q, columns index n
invisible(lapply(seq_along(ns), function(i) plot(qs, weks[, i], main = paste('n =', ns[i]))))
(Five plots appear here: the skewness of the sample q-quantile plotted against q, one panel per sample size n.)
For every \(n\), the skewness is decreasing in the quantile: low quantiles are right-skewed and high quantiles are left-skewed. For smaller sample sizes the decrease is relatively smooth across \(q\), whereas for larger sample sizes the curve flattens in the middle, to the point that for large samples most quantiles are nearly symmetric and only the quantiles near the min and max remain noticeably skewed.
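As a quick spot-check of this description (new code, reusing the helpers above; bigMat is a name introduced here), look at the skewness of a handful of quantiles for the largest sample size:
bigMat <- getMat(2000)  # 1000 fresh Uniform(0,1) samples of size 2000
sapply(c(0, 0.25, 0.5, 0.75, 1), function(q) qSkew(q, bigMat))  # expect clear positive skew at q = 0, clear negative skew at q = 1, near zero in between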
conf <- 0.999
x <- rnorm(10^4)
z <- rnorm(10^4, x)  # z is correlated with x
ys <- lapply(c(1, 5, 20, 100), function(a) rnorm(10^4, x + 2*z - a*x*z))  # true coefficient on x is 1; a scales the interaction term
lapply(ys, function(y) confint(lm(y ~ x + z), 2, conf))  # misspecified model: interaction omitted
## [[1]]
## 0.05 % 99.95 %
## x 0.9173887 1.103285
##
## [[2]]
## 0.05 % 99.95 %
## x 0.7036813 1.511723
##
## [[3]]
## 0.05 % 99.95 %
## x -0.03135821 3.180746
##
## [[4]]
## 0.05 % 99.95 %
## x -4.119277 11.94281
lapply(ys, function(y) confint(lm(y ~ x*z), 2, conf))  # correctly specified: x*z expands to x + z + x:z
## [[1]]
## 0.05 % 99.95 %
## x 0.9341326 1.028446
##
## [[2]]
## 0.05 % 99.95 %
## x 0.9153703 1.008981
##
## [[3]]
## 0.05 % 99.95 %
## x 0.9455026 1.039556
##
## [[4]]
## 0.05 % 99.95 %
## x 0.9527297 1.046192
All the intervals contain the true \(\beta = 1\) (it turns out the confidence interval coverage rate isn’t bad even for the wrong model), but for the misspecified model (no interaction term) the stronger the misspecification, the wider the interval, which is bad: we want to be confident about a short interval. For the correct model, the intervals are all approximately the same width.
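To back up the coverage remark, here is a small simulation sketch (new code; the coverage helper, the number of replications, and the smaller sample size are choices made here, not part of the original analysis). It refits the misspecified model on fresh data and records how often the 99.9% interval for the x coefficient covers the true value 1:
coverage <- function(a, reps = 200, nObs = 10^3, level = 0.999) {
  mean(replicate(reps, {
    x <- rnorm(nObs); z <- rnorm(nObs, x)
    y <- rnorm(nObs, x + 2*z - a*x*z)
    ci <- confint(lm(y ~ x + z), 2, level)   # interval for the x coefficient
    ci[1] <= 1 && 1 <= ci[2]                 # does it cover the true coefficient?
  }))
}
# sapply(c(1, 5, 20, 100), coverage)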
satMod1 <- lm(FYGPA ~ SATSum, data = sat %>% filter(sex == 0))
satMod2 <- lm(FYGPA ~ HSGPA, data = sat %>% filter(sex == 0))
sigma(satMod1)
## [1] 0.6285975
sigma(satMod2)
## [1] 0.6278814
satFemale <- sat %>% filter(sex == 1)
rmse(satFemale$FYGPA,predict(satMod1, satFemale))
## [1] 0.7132287
rmse(satFemale$FYGPA,predict(satMod2, satFemale))
## [1] 0.6237994
The two models have a very similar in-sample quality of fit (residual standard errors of about 0.63), but the hold-out RMSE on the female students is notably lower (better) for the model that predicts with HS GPA. Thus, the relationship between HSGPA and college GPA generalizes better than the one between SAT score and college GPA.
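One way to see the generalization gap directly (new code; satMale is a new name mirroring satFemale above) is to compare each model’s RMSE on the male students it was fit to with its RMSE on the female students:
satMale <- sat %>% filter(sex == 0)
sapply(list(SATSum = satMod1, HSGPA = satMod2), function(mod)
  c(maleInSample  = rmse(satMale$FYGPA, predict(mod, satMale)),
    femaleHoldout = rmse(satFemale$FYGPA, predict(mod, satFemale))))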
n <- dim(sat)[1]
# indices <- sample(n)   # alternative: a random train/test split
indices <- 1:n           # deterministic split: first half train, second half test
nTrain <- floor(1/2*n)
train <- sat[indices[1:nTrain], ]
test <- sat[indices[(nTrain+1):n], ]
satMod1t <- lm(FYGPA ~ SATSum, data = train)
satMod2t <- lm(FYGPA ~ HSGPA, data = train)
rmse(test$FYGPA,predict(satMod1t, test))
## [1] 0.6426578
rmse(test$FYGPA,predict(satMod2t, test))
## [1] 0.6106624
It’s generally a bit lower. This is because we’re no longer extrapolating (fitting the model to a relationship in one population, male students, and testing it in another); instead, we’re training and testing within the same population (all students).
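To unpack “generally,” here is a sketch (new code; the seed and the 20 repetitions are arbitrary choices) that repeats a random half/half split several times and averages the test RMSEs for the two predictors:
set.seed(1)
splitRMSEs <- replicate(20, {
  idx <- sample(n)
  tr <- sat[idx[1:nTrain], ]
  te <- sat[idx[(nTrain + 1):n], ]
  c(SATSum = rmse(te$FYGPA, predict(lm(FYGPA ~ SATSum, data = tr), te)),
    HSGPA  = rmse(te$FYGPA, predict(lm(FYGPA ~ HSGPA, data = tr), te)))
})
rowMeans(splitRMSEs)   # average test RMSE for each predictor across splits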
covid %>%
  filter(Admin2 == 'San Francisco',
         Case_Type == 'Confirmed',
         Cases > 0,
         Date < '2020-04-01',
         Date >= '2020-03-01') %>%
  select(Cases, Date) %>%
  arrange(Date) ->
  sf
covid %>%
  filter(Admin2 == 'San Francisco',
         Case_Type == 'Confirmed',
         Cases > 0,
         Date >= '2020-04-01',
         Date <= '2020-04-07') %>%
  select(Cases, Date) %>%
  arrange(Date) ->
  sfApr
sf$logCases <- log(sf$Cases)
sfApr$logCases <- log(sfApr$Cases)
nSF <- dim(sf)[1]
sfMods <- lapply(1:5, function(p) lm(logCases ~ poly(Date, p), data = sf))  # polynomial trends of degree 1 through 5 in the date, fit on the log scale
sapply(sfMods, sigma)
## [1] 0.2858542 0.2312058 0.2162774 0.1800250 0.1786009
sapply(1:5, function(i) sqrt(1/(nSF - (1+i))*sum((sfApr$Cases - predict(sfMods[[i]], sfApr))^2)))
## [1] 276.7602 282.8668 288.5014 296.3326 302.2102
sapply(sfMods, function(mod) rmse(sfApr$logCases, predict(mod, sfApr)))
## [1] 0.8683922 0.1194757 0.9764040 1.6823836 0.5316415
sapply(sfMods, function(mod) rmse(sfApr$Cases, exp(predict(mod, sfApr))))
## [1] 864.85521 73.31791 1178.58433 387.39381 529.85537
The residual standard error is decreasing in \(p\) on the log scale but increasing in \(p\) on the original scale (!!!), so the best in-sample fit is \(p=5\) on the log scale but \(p=1\) on the original scale. For prediction into April, however, the quadratic model (\(p=2\)) is best on both scales. By this point, you shouldn’t be surprised: complex models can be terrible at extrapolation, but the simplest model is often too simple and could be improved.
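To visualize the extrapolation behaviour discussed above (new code; allDays is a name introduced here), plot the March data and the April hold-out on the log scale together with each polynomial’s fit extended through April 7:
allDays <- rbind(sf[, c('Date', 'logCases')], sfApr[, c('Date', 'logCases')])
plot(allDays$Date, allDays$logCases, xlab = 'Date', ylab = 'log confirmed cases')
for (p in 1:5)
  lines(allDays$Date, predict(sfMods[[p]], allDays), lty = p)   # fitted curve, extrapolated into April
legend('topleft', legend = paste('p =', 1:5), lty = 1:5)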
squaredLoss <- function(y, yhat) (y-yhat)^2
sigma(sfMods[[3]])
## [1] 0.2162774
sqrt(1/(nSF-4)*sum(squaredLoss(sf$logCases, predict(sfMods[[3]]))))
## [1] 0.2162774
asymmetricLoss <- function(y, yhat) (y > yhat)*4*(y - yhat)^2 + (y < yhat)*(y - yhat)^2  # under-predictions (y > yhat) cost four times as much
lapply(sfMods, function(mod) sqrt(mean(asymmetricLoss(sfApr$logCases,
                                                      predict(mod, sfApr)))))
## [[1]]
## [1] 0.8683922
##
## [[2]]
## [1] 0.1194757
##
## [[3]]
## [1] 0.976404
##
## [[4]]
## [1] 3.364767
##
## [[5]]
## [1] 0.5316415
lapply(sfMods, function(mod) sqrt(mean(asymmetricLoss(sfApr$Cases,
                                                      exp(predict(mod, sfApr))))))
## [[1]]
## [1] 864.8552
##
## [[2]]
## [1] 73.31791
##
## [[3]]
## [1] 1178.584
##
## [[4]]
## [1] 774.7876
##
## [[5]]
## [1] 529.8554
The best model is the same as in part (a): the quadratic, on both scales. Interestingly, only model 4 ever under-predicts; for every other model the asymmetric loss equals the ordinary RMSE, which can only happen if that model never under-predicts on the April data.
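A quick way to check the under-prediction claim (new code):
sapply(sfMods, function(mod) any(sfApr$logCases > predict(mod, sfApr)))  # TRUE for any model that under-predicts on at least one April day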
You’d add an argument lossFunc, and then the objective being minimized would no longer be the sum of squared residuals, sum((y - yhat)^2), but rather sum(lossFunc(y, yhat)); see the sketch below.
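A minimal sketch of what that could look like (new code; fitWithLoss and its arguments are hypothetical names, and optim() stands in for lm()’s least-squares machinery):
fitWithLoss <- function(x, y, degree = 1, lossFunc = squaredLoss) {
  X <- cbind(1, poly(x, degree))                       # design matrix with intercept
  obj <- function(beta) sum(lossFunc(y, X %*% beta))   # total loss as a function of the coefficients
  optim(rep(0, ncol(X)), obj, method = 'BFGS')$par     # numerically minimized coefficients
}
# e.g. fitWithLoss(as.numeric(sf$Date), sf$logCases, degree = 2, lossFunc = asymmetricLoss)
With lossFunc = squaredLoss this should essentially reproduce the least-squares fit; with asymmetricLoss it should shift the fitted curve upward to avoid under-predictions.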
Consider two estimates of the true quantity y: yhatU (an underestimate) and yhatO (an overestimate), with y - yhatU = q and y - yhatO = -q for some q > 0. Then (exp(y) - exp(yhatU))^2 < (exp(y) - exp(yhatO))^2, because exp is convex (its second derivative is positive), so the negative residual gets “blown up” by exponentiation. In other words, a larger-magnitude negative residual (an overestimate on the original scale) yields the same loss on the log scale as a smaller-magnitude positive residual (an underestimate on the original scale).
Thus underestimates on the original scale are penalized more. However, this means overestimates are penalized more on the log scale, where the optimization actually happens, and that is inappropriate here: underestimating COVID-19 cases is worse than overestimating them (better to overstate the seriousness of the situation than to understate it).
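A small numeric illustration of this asymmetry (new code; the particular numbers are arbitrary):
y <- log(100); q <- 0.5
yhatU <- y - q; yhatO <- y + q                                     # equal-sized under- and over-estimates on the log scale
c(logLossU = (y - yhatU)^2, logLossO = (y - yhatO)^2)              # identical losses on the log scale
c(origErrU = exp(y) - exp(yhatU), origErrO = exp(y) - exp(yhatO))  # but the overestimate's error is larger in magnitude on the original scale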
(This was not an easy question. It’s okay if this is confusing.)