The data comes from surveys conducted on students in a math class and a Portuguese class. Each observation represents a student in the class. The relevant variables are each student’s quality of family relationships, frequency of going out with friends, amount of free time after school, current health status, weekday alcohol consumption, and grades in the courses. I studied the different factors that affect student alcohol consumption, and how alcohol consumption, in turn, affects students’ grades.
mat <- read_csv("student-mat.csv")
por <- read_csv("student-por.csv")
mat <- mat %>% select(famrel, freetime, goout, Dalc, health, G1, G2, G3)
por <- por %>% select(famrel, freetime, goout, Dalc, health, G1, G2, G3)
joint <- rbind(mat, por)
head(mat)
## # A tibble: 6 x 8
## famrel freetime goout Dalc health G1 G2 G3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 3 4 1 3 5 6 6
## 2 5 3 3 1 3 5 5 6
## 3 4 3 2 2 3 7 8 10
## 4 3 2 2 1 5 15 14 15
## 5 4 3 2 1 5 6 10 10
## 6 5 4 2 1 5 15 15 15
head(por)
## # A tibble: 6 x 8
## famrel freetime goout Dalc health G1 G2 G3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 3 4 1 3 0 11 11
## 2 5 3 3 1 3 9 11 11
## 3 4 3 2 2 3 12 13 12
## 4 3 2 2 1 5 14 14 14
## 5 4 3 2 1 5 11 13 13
## 6 5 4 2 1 5 12 12 13
Since level of alcohol consumption is on an index of 1 to 5, and the original data set does not describe these levels, I will assume the following: Dalc=1 means very little, Dalc=2 means little, Dalc=3 means moderate, Dalc=4 means high, and Dalc=5 means very high alcohol consumption.
Variables to be compared: quality of family relationships, frequency of going out with friends, amount of free time after school, current health status. The students answered each of these variables on a scale of 1 to 5.
surveyMod <- lm(Dalc ~ goout + famrel + freetime + health, data = joint)
coef(surveyMod)
## (Intercept) goout famrel freetime health
## 0.95006929 0.18849893 -0.11138409 0.06843537 0.04757770
The frequency that students go out is best at explaining weekday alcohol consumption. Holding all other variables constant, an increase of 1 unit in level of going out leads to an increase of about 0.188 in weekday alcohol consumption. While frequency of going out, amount of free time, and health are positively correlated with weekday alcohol consumption holding other variables constant, quality of family relationships is negatively correlated with the response variable. The more the students went out, the more alcohol they tend to consume (similarly for amount of free time and health), but the better family relationships were, the less alcohol they consumed.
summary(surveyMod)
##
## Call:
## lm(formula = Dalc ~ goout + famrel + freetime + health, data = joint)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2499 -0.5132 -0.2858 0.2298 3.6544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.95007 0.15305 6.208 7.76e-10 ***
## goout 0.18850 0.02485 7.585 7.38e-14 ***
## famrel -0.11138 0.02942 -3.786 0.000162 ***
## freetime 0.06844 0.02800 2.444 0.014687 *
## health 0.04758 0.01916 2.483 0.013170 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8737 on 1039 degrees of freedom
## Multiple R-squared: 0.08514, Adjusted R-squared: 0.08162
## F-statistic: 24.17 on 4 and 1039 DF, p-value: < 2.2e-16
All of the p-values are less than 0.05, which means that for each variable, we can reject the null hypothesis that the coefficient is zero. All four of the variables are not completely unrelated to weekday alcohol consumption.
gooutMod <- lm(Dalc ~ goout, data = joint) # going out model
famrelMod <- lm(Dalc ~ famrel, data = joint) # family relationships model
freetimeMod <- lm(Dalc ~ freetime, data = joint) # free time model
healthMod <- lm(Dalc ~ health, data = joint) # health model
gooutPoints <- joint %>% group_by(Dalc) %>% count(goout)
ggplot(gooutPoints, mapping = aes(x = goout, y = Dalc)) +
geom_point(aes(size = n)) +
geom_abline(slope = coef(gooutMod)[2], intercept = coef(gooutMod)[1]) +
ggtitle('Weekday Alcohol Consumption vs. Going Out') +
xlab('Frequency of Going Out') + ylab('Weekday Alcohol Consumption')
famrelPoints <- joint %>% group_by(Dalc) %>% count(famrel)
ggplot(famrelPoints, mapping = aes(x = famrel, y = Dalc)) +
geom_point(aes(size = n)) +
geom_abline(slope = coef(famrelMod)[2], intercept = coef(famrelMod)[1]) +
ggtitle('Weekday Alcohol Consumption vs. Family Relationships') +
xlab('Quality of Family Relationships') + ylab('Weekday Alcohol Consumption')
freetimePoints <- joint %>% group_by(Dalc) %>% count(freetime)
ggplot(freetimePoints, mapping = aes(x = freetime, y = Dalc)) +
geom_point(aes(size = n)) +
geom_abline(slope = coef(freetimeMod)[2], intercept = coef(freetimeMod)[1]) +
ggtitle('Weekday Alcohol Consumption vs. Free Time') +
xlab('Amount of Free Time') + ylab('Weekday Alcohol Consumption')
healthPoints <- joint %>% group_by(Dalc) %>% count(health)
ggplot(healthPoints, mapping = aes(x = health, y = Dalc)) +
geom_point(aes(size = n)) +
geom_abline(slope = coef(healthMod)[2], intercept = coef(healthMod)[1]) +
ggtitle('Weekday Alcohol Consumption vs. Health') +
xlab('Current Health Status') + ylab('Weekday Alcohol Consumption')
Next, we investigate the predictive power of these four variables.
n <- nrow(joint)
ind <- sample(n)
trainInd <- ind[1:(n/2)]
testInd <- ind[(n/2+1):n]
train <- joint[trainInd,]
test <- joint[testInd,]
gooutTrainMod <- lm(Dalc ~ goout, data = train)
famrelTrainMod <- lm(Dalc ~ famrel, data = train)
freetimeTrainMod <- lm(Dalc ~ freetime, data = train)
healthTrainMod <- lm(Dalc ~ health, data = train)
# rmse: returns the root mean squared error, given the measured data and predicted data
rmse <- function(y,yhat) sqrt(mean((y-yhat)^2))
rmse(test$Dalc, predict(gooutTrainMod, test))
## [1] 0.8633891
rmse(test$Dalc, predict(famrelTrainMod, test))
## [1] 0.8925197
rmse(test$Dalc, predict(freetimeTrainMod, test))
## [1] 0.8866679
rmse(test$Dalc, predict(healthTrainMod, test))
## [1] 0.8939638
The RMSE for all four models are pretty close. The ‘going out’ model has the lowest RMSE, which means that the predicted points based on the ‘going out’ model are closest to the observed data points in the test data. The frequency that students go out is best at predicting weekday alcohol consumption.
We first look at students in the math course.
ggplot(data = mat, mapping = aes(x = as.factor(Dalc), y = G3)) +
geom_boxplot() +
ggtitle('Math Course Grades vs. Weekday Alcohol Consumption') +
xlab('Weekday Alcohol Consumption') +
ylab('Final Grades in Math Course')
The graph itself is not enough to determine whether the grades in each group are different.
aovModMat <- aov(G3 ~ as.factor(Dalc), data = mat)
summary(aovModMat)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Dalc) 4 132 33.04 1.584 0.178
## Residuals 390 8138 20.87
The p-value given by the ANOVA model regression is greater than 0.05. There is not enough evidence to say that there is a statistically significant difference between the final math grades of the groups of students (grouped by level of alcohol consumption).
Next, we examine the Portuguese class.
ggplot(data = por, mapping = aes(x = as.factor(Dalc), y = G3)) +
geom_boxplot() +
ggtitle('Portuguese Course Grades vs. Weekday Alcohol Consumption') +
xlab('Weekday Alcohol Consumption') +
ylab('Final Grades in Portuguese Course')
Again, the graph is not enough to see if the grades are different in each group. However, compared to the previous graphs from the math course, there seems to be more of a downward trend of grades as alcohol consumption rises.
aovModPor <- aov(G3 ~ as.factor(Dalc), data = por)
summary(aovModPor)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Dalc) 4 328 81.88 8.193 1.9e-06 ***
## Residuals 644 6436 9.99
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the Portuguese course, the p-value is much less than 0.05. There is a statistically significant difference between the final grades in the groups of students (grouped by alcohol consumption).
TukeyHSD(aovModPor)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = G3 ~ as.factor(Dalc), data = por)
##
## $`as.factor(Dalc)`
## diff lwr upr p adj
## 2-1 -0.9356984 -1.821046 -0.05035048 0.0322947
## 3-1 -1.1597999 -2.539987 0.22038700 0.1465582
## 4-1 -3.3581583 -5.494679 -1.22163752 0.0001914
## 5-1 -2.0640407 -4.200562 0.07248013 0.0640388
## 3-2 -0.2241015 -1.759397 1.31119391 0.9946373
## 4-2 -2.4224599 -4.662312 -0.18260784 0.0264927
## 5-2 -1.1283422 -3.368194 1.11150981 0.6420152
## 4-3 -2.1983584 -4.675860 0.27914278 0.1093649
## 5-3 -0.9042408 -3.381742 1.57326043 0.8560823
## 5-4 1.2941176 -1.671994 4.26022915 0.7551583
There are only statistically significant differences in the grades between Groups 1 and 2, Groups 1 and 4, and Groups 2 and 4 (“Group #” means the group of students with weekday alcohol consumption equal to #). There is a statistically significant difference in Portuguese grades between students who drink very little alcohol and students who drink some alcohol. The mean Portuguese course grade of students who drink very little alcohol is 0.94 points higher than students who drink little alcohol and 3.36 points higher than students who drink high amounts of alcohol. The mean Portuguese course grade for students who drink little alcohol is 2.42 points higher than students who drink high amounts of alcohol.
Students’ grades in the math course did not vary much with the level of alcohol consumption, while students’ grades in the Portuguese course varied with the level of alcohol consumption more. In general, students who drink very little alcohol on weekdays have higher Portuguese course grades than many of the other groups (little alcohol and high alcohol).