Data obtained from: https://www.kaggle.com/harlfoxem/housesalesprediction

Exploratory Analysis

The dataset contains 21 variables in total. We removed two repeated-measure variables that were not used in our analysis (sqft_living15, sqft_lot15), leaving 19 variables.
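The cleanup step can be sketched as follows (a minimal sketch assuming the tidyverse; the file name kc_house_data.csv matches the Kaggle download):

```r
library(readr)
library(dplyr)

# Read the Kaggle CSV; keep id as character
house <- read_csv("kc_house_data.csv",
                  col_types = cols(id = col_character()))

house <- house %>%
  mutate(zipcode = factor(zipcode)) %>%          # zipcode is categorical
  select(-sqft_living15, -sqft_lot15)            # drop unused repeated measures

str(house)
```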

## tibble [21,613 Ă— 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id           : chr [1:21613] "7129300520" "6414100192" "5631500400" "2487200875" ...
##  $ date         : POSIXct[1:21613], format: "2014-10-13" "2014-12-09" ...
##  $ price        : num [1:21613] 221900 538000 180000 604000 510000 ...
##  $ bedrooms     : num [1:21613] 3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num [1:21613] 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : num [1:21613] 1180 2570 770 1960 1680 ...
##  $ sqft_lot     : num [1:21613] 5650 7242 10000 5000 8080 ...
##  $ floors       : num [1:21613] 1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : num [1:21613] 3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : num [1:21613] 7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : num [1:21613] 1180 2170 770 1050 1680 ...
##  $ sqft_basement: num [1:21613] 0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : num [1:21613] 1955 1951 1933 1965 1987 ...
##  $ yr_renovated : num [1:21613] 0 1991 0 0 0 ...
##  $ zipcode      : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
##  $ lat          : num [1:21613] 47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num [1:21613] -122 -122 -122 -122 -122 ...

Variable of Interest

Price plays a big role in the decision to purchase a home. This analysis will use techniques such as multiple linear regression and random forest to study the relationship between the price and other variables present in the dataset, with the hope of helping potential home buyers understand the King County housing market and estimate a price for a home they desire.

Correlation Plot

Price has a strong correlation with sqft_living, which suggests price per sqft would be interesting to examine, and with grade, the government's assessment of house quality (higher is better).
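A correlation plot of this kind can be produced with, for example, the corrplot package (a sketch; only the numeric columns are included):

```r
library(corrplot)

# Correlation matrix over the numeric columns only (id, date, zipcode excluded)
num_vars <- house[sapply(house, is.numeric)]
corrplot(cor(num_vars), method = "circle", type = "upper", tl.cex = 0.7)
```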

New Price by Sqft Variables Introduced and Another Correlation Plot

There appears to be some correlation between latitude and longitude and the new price variables, which suggests price may be related to location.
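The two new variables can be added with a mutate (a sketch; the names price_sqft_living and price_sqft_lot follow the text):

```r
library(dplyr)

# Normalize price by living space and by lot size
house <- house %>%
  mutate(price_sqft_living = price / sqft_living,
         price_sqft_lot    = price / sqft_lot)
```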

Map of House Price

Map of House Price Adjusted by Sqft

In both maps, there is a clustering of high-priced properties around Mercer Island, Bellevue, and Capitol Hill. Redmond and Sammamish also have expensive properties, but their cost is driven by larger living spaces.

So location is not the only variable influencing house price. We use multiple regression below to find which predictors are most influential on house price.

Choosing Multiple Linear Regression model

Looking at the residual Q-Q plot of the multiple linear regression, notice that the assumption that the error terms are i.i.d. normal with mean 0 is violated.

After a log transformation, however, the linear model seems reasonable.
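The two Q-Q plots can be compared side by side (a sketch; house1 denotes the data with id and date dropped):

```r
house1 <- subset(house, select = -c(id, date))

par(mfrow = c(1, 2))
# Residual Q-Q plot on the original price scale
qqnorm(resid(lm(price ~ ., data = house1)), main = "Price")
qqline(resid(lm(price ~ ., data = house1)))
# Residual Q-Q plot after the log transformation
qqnorm(resid(lm(log(price) ~ ., data = house1)), main = "Log price")
qqline(resid(lm(log(price) ~ ., data = house1)))
```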

Multiple Linear Regression of Log Price on 16 Other Variables

After comparing the normal Q-Q plots, we choose to build a multiple linear regression of log-transformed price on the other 16 variables (excluding id and date). We examine its performance using 5-fold cross-validation.
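The cross-validated fit can be reproduced with caret (a sketch; the seed is arbitrary, and house1 denotes the data without id and date):

```r
library(caret)

set.seed(1)
fit_lm <- train(log(price) ~ ., data = house1,
                method = "lm",
                trControl = trainControl(method = "cv", number = 5))
fit_lm
```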

## Linear Regression 
## 
## 21613 samples
##    15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 17290, 17291, 17289, 17292, 17290 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   0.257517  0.7610001  0.1980651
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

The 16 other variables explain approximately 76% of the variation in log price.

Multiple Linear Regression of Log Price Per Sqft on 16 Other Variables

Then we change the outcome to price per sqft of living space (price_sqft_living) and price per sqft of lot (price_sqft_lot).

## [1] "Price per Sqft Living:"
## Linear Regression 
## 
## 21613 samples
##    15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 17290, 17290, 17291, 17291, 17290 
## Resampling results:
## 
##   RMSE       Rsquared  MAE      
##   0.2698961  0.532751  0.2070588
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "Price per Sqft Lot:"
## Linear Regression 
## 
## 21613 samples
##    15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 17291, 17291, 17290, 17289, 17291 
## Resampling results:
## 
##   RMSE     Rsquared   MAE      
##   0.60063  0.6264597  0.4576578
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Notice the Rsquared statistic is substantially lower when we adjust price by sqft of living space or lot. Judging by the variables present in the dataset, this is likely because most of them are strongly correlated with property size; by normalizing price with a size variable, we remove some of that correlation.

Relative Variable Importance

Which variables have a large influence on the house price? We will first look at the magnitudes of the standardized coefficients of the multiple linear regression. It is important that we standardize our covariates so that the magnitudes of their coefficients can be meaningfully compared.
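One way to obtain standardized coefficients is to scale the numeric covariates before fitting (a sketch; house1 denotes the data without id and date):

```r
# Scale numeric covariates (but not the outcome) so coefficients are comparable
num_cols <- sapply(house1, is.numeric) & names(house1) != "price"
house_std <- house1
house_std[num_cols] <- scale(house_std[num_cols])

fit_std <- lm(log(price) ~ ., data = house_std)
# Rank covariates by absolute standardized coefficient
sort(abs(coef(fit_std)[-1]), decreasing = TRUE)
```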

Sqft of living appears to have the largest standardized coefficient in the linear model, which suggests it has the largest influence on price. It is followed by grade and lat. Grade is previously explained, and latitude can be seen from the maps.

Of course, there are other ways to determine the importance of the covariates, for example, by how much the addition of a variable increases the R-squared statistic. The ranking below is generated by a function from an R package using a different method, to show that different methods produce slightly different rankings. Sometimes these methods produce contradictory rankings, so we also need to use our subject-area knowledge.
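For instance, caret's varImp, which for linear models ranks predictors by the absolute t-statistic, gives one such alternative ranking (a sketch; this may not be the exact function used to produce the ranking shown):

```r
library(caret)

# Importance ranking by absolute t-statistic of each coefficient
varImp(lm(log(price) ~ ., data = house1))
```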

Best Subset Selection

What is the best subset of variables to predict house price? To find the best subset, we will do an exhaustive search of subsets of different sizes up to 14 variables.
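The exhaustive search uses leaps::regsubsets, matching the call shown in the output:

```r
library(leaps)

# Exhaustive best subset search over subsets of up to 14 variables
best_sub <- regsubsets(log(price) ~ ., data = house1, nvmax = 14)
summary(best_sub)
```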

## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 1 linear dependencies found
## Reordering variables and trying again:
## Subset selection object
## Call: regsubsets.formula(log(price) ~ ., data = house1, nvmax = 14)
## 15 Variables  (and intercept)
##               Forced in Forced out
## bedrooms          FALSE      FALSE
## bathrooms         FALSE      FALSE
## sqft_living       FALSE      FALSE
## sqft_lot          FALSE      FALSE
## floors            FALSE      FALSE
## waterfront        FALSE      FALSE
## view              FALSE      FALSE
## condition         FALSE      FALSE
## grade             FALSE      FALSE
## sqft_above        FALSE      FALSE
## yr_built          FALSE      FALSE
## yr_renovated      FALSE      FALSE
## lat               FALSE      FALSE
## long              FALSE      FALSE
## sqft_basement     FALSE      FALSE
## 1 subsets of each size up to 14
## Selection Algorithm: exhaustive
##           bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1  ( 1 )  " "      " "       " "         " "      " "    " "        " " 
## 2  ( 1 )  " "      " "       "*"         " "      " "    " "        " " 
## 3  ( 1 )  " "      " "       "*"         " "      " "    " "        " " 
## 4  ( 1 )  " "      " "       "*"         " "      " "    " "        " " 
## 5  ( 1 )  " "      " "       "*"         " "      " "    " "        "*" 
## 6  ( 1 )  " "      "*"       "*"         " "      " "    " "        "*" 
## 7  ( 1 )  " "      "*"       "*"         " "      " "    " "        "*" 
## 8  ( 1 )  " "      "*"       "*"         " "      " "    "*"        "*" 
## 9  ( 1 )  " "      "*"       "*"         " "      "*"    "*"        "*" 
## 10  ( 1 ) " "      "*"       "*"         "*"      "*"    "*"        "*" 
## 11  ( 1 ) " "      "*"       "*"         "*"      "*"    "*"        "*" 
## 12  ( 1 ) "*"      "*"       "*"         "*"      "*"    "*"        "*" 
## 13  ( 1 ) "*"      "*"       "*"         "*"      "*"    "*"        "*" 
## 14  ( 1 ) "*"      "*"       "*"         "*"      "*"    "*"        "*" 
##           condition grade sqft_above sqft_basement yr_built yr_renovated lat
## 1  ( 1 )  " "       "*"   " "        " "           " "      " "          " "
## 2  ( 1 )  " "       " "   " "        " "           " "      " "          "*"
## 3  ( 1 )  " "       "*"   " "        " "           " "      " "          "*"
## 4  ( 1 )  " "       "*"   " "        " "           "*"      " "          "*"
## 5  ( 1 )  " "       "*"   " "        " "           "*"      " "          "*"
## 6  ( 1 )  " "       "*"   " "        " "           "*"      " "          "*"
## 7  ( 1 )  "*"       "*"   " "        " "           "*"      " "          "*"
## 8  ( 1 )  "*"       "*"   " "        " "           "*"      " "          "*"
## 9  ( 1 )  "*"       "*"   " "        " "           "*"      " "          "*"
## 10  ( 1 ) "*"       "*"   " "        " "           "*"      " "          "*"
## 11  ( 1 ) "*"       "*"   " "        " "           "*"      "*"          "*"
## 12  ( 1 ) "*"       "*"   " "        " "           "*"      "*"          "*"
## 13  ( 1 ) "*"       "*"   " "        " "           "*"      "*"          "*"
## 14  ( 1 ) "*"       "*"   "*"        " "           "*"      "*"          "*"
##           long
## 1  ( 1 )  " " 
## 2  ( 1 )  " " 
## 3  ( 1 )  " " 
## 4  ( 1 )  " " 
## 5  ( 1 )  " " 
## 6  ( 1 )  " " 
## 7  ( 1 )  " " 
## 8  ( 1 )  " " 
## 9  ( 1 )  " " 
## 10  ( 1 ) " " 
## 11  ( 1 ) " " 
## 12  ( 1 ) " " 
## 13  ( 1 ) "*" 
## 14  ( 1 ) "*"

lat (latitude), grade, yr_built, and sqft_living are identified as the best variables for a subset of size 4.

We also want to determine how many variables the subset should contain so that we obtain a simple model with good performance. Therefore, we ran 5-fold cross-validation on each of the 14 best subset models and calculated the cross-validation error (zipcode was omitted to avoid interpreting results for a factor variable):
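The cross-validation error per subset size can be computed by refitting each selected subset on the training folds (a sketch; leaps provides no predict method for regsubsets objects, so the prediction is written by hand):

```r
library(leaps)

# Data without id, date, and zipcode, per the note above
house2 <- subset(house, select = -c(id, date, zipcode))

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(house2)))
cv_err <- matrix(NA, k, 14)

for (j in 1:k) {
  fit <- regsubsets(log(price) ~ ., data = house2[folds != j, ], nvmax = 14)
  test_x <- model.matrix(log(price) ~ ., data = house2[folds == j, ])
  for (i in 1:14) {
    coefi <- coef(fit, id = i)                    # coefficients of the size-i subset
    pred  <- test_x[, names(coefi)] %*% coefi
    cv_err[j, i] <- mean((log(house2$price[folds == j]) - pred)^2)
  }
}

plot(colMeans(cv_err), type = "b",
     xlab = "Number of variables", ylab = "CV error")
```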

A lower cross-validation error is desired, but achieving it means higher model complexity. The plot above shows that 5 variables seem to be ideal; beyond that, the CV error doesn't decrease by much.

Interaction Between sqft_living and grade:

Looking back at the correlation plot, there is a strong correlation between sqft_living and grade. Since both variables were determined to be important predictors of price, we examine their interaction.

## The correlation between price and sqft_living:  0.7020351
## The correlation between price and grade:  0.6674343
## The correlation between sqft_living and grade:  0.7627045

Our null hypothesis is that there is no interaction between these two variables, tested at \(\alpha = 0.001\).

We first regress log price on these two variables without the interaction:

## 
## Call:
## lm(formula = log(price) ~ sqft_living + grade, data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70741 -0.25607  0.00049  0.23536  1.42320 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.117e+01  1.865e-02  599.26   <2e-16 ***
## sqft_living 2.175e-04  4.022e-06   54.08   <2e-16 ***
## grade       1.856e-01  3.143e-03   59.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3512 on 21610 degrees of freedom
## Multiple R-squared:  0.5553, Adjusted R-squared:  0.5553 
## F-statistic: 1.349e+04 on 2 and 21610 DF,  p-value: < 2.2e-16

We can see that, as shown before, sqft_living and grade are both significant in the model. Then we fit log price on these two variables with the interaction:

## 
## Call:
## lm(formula = log(price) ~ sqft_living * grade, data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43418 -0.25380  0.00004  0.23488  1.40708 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.104e+01  3.244e-02 340.298  < 2e-16 ***
## sqft_living        2.826e-04  1.338e-05  21.115  < 2e-16 ***
## grade              2.020e-01  4.491e-03  44.979  < 2e-16 ***
## sqft_living:grade -7.479e-06  1.467e-06  -5.097 3.48e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.351 on 21609 degrees of freedom
## Multiple R-squared:  0.5558, Adjusted R-squared:  0.5558 
## F-statistic:  9014 on 3 and 21609 DF,  p-value: < 2.2e-16

We can see that the p-value for the interaction term is far below 0.001. Therefore, we reject the null hypothesis and conclude that there is an interaction between sqft_living and grade.
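The same conclusion follows from an F-test comparing the two nested models (a sketch using the formulas shown above):

```r
fit_main <- lm(log(price) ~ sqft_living + grade, data = house)
fit_int  <- lm(log(price) ~ sqft_living * grade, data = house)

# F-test of the nested models; a small p-value favors the interaction model
anova(fit_main, fit_int)
```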

Comparing Multiple Regression Model with Regression Tree

Now, instead of looking at the general King County region, our next multiple regression model focuses on the areas surrounding the university.

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4188 -0.1658  0.0009  0.1645  1.3313 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.683e+01  6.812e-01  -68.74   <2e-16 ***
## view                 8.738e-02  2.499e-03   34.96   <2e-16 ***
## sqft_living          3.906e-04  1.018e-05   38.38   <2e-16 ***
## grade                2.371e-01  3.654e-03   64.87   <2e-16 ***
## yr_built            -3.220e-03  7.251e-05  -44.41   <2e-16 ***
## lat                  1.344e+00  1.343e-02  100.05   <2e-16 ***
## `sqft_living:grade` -2.048e-05  1.122e-06  -18.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2637 on 21606 degrees of freedom
## Multiple R-squared:  0.7493, Adjusted R-squared:  0.7493 
## F-statistic: 1.077e+04 on 6 and 21606 DF,  p-value: < 2.2e-16
## RMSE for test data:  222079.9

As the model summary shows, every variable is significant. Holding the others fixed, a unit increase in view multiplies price by a factor of \(e^{0.08738}\); a unit increase in sqft_living by \(e^{0.0004}\); a unit increase in grade by \(e^{0.23}\); a unit increase in yr_built by \(e^{-0.00322}\); a unit increase in lat by \(e^{1.344}\); and a unit increase in the sqft_living:grade interaction term by \(e^{-0.00002}\). The RMSE on the test data is 222079.9.

So far, we’ve looked at multiple regression models, but a multiple regression model is difficult to visualize. To leave the reader with something more digestible, we introduce the regression tree, which requires neither normalization nor scaling of the data and is very easy to explain and graph.

10-fold cross-validation is performed in the background to help us identify the number of terminal nodes that uses the tree most efficiently. The y-axis is the cross-validation error, the lower x-axis is the cost-complexity parameter, and the top x-axis is the number of terminal nodes. Notice the dotted line crosses the curve where the number of terminal nodes equals 4. Thus, we can use a tree with 4 terminal nodes and expect accuracy similar to a tree with more nodes.
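A plot of this kind comes from rpart's built-in cross-validation (a sketch assuming the rpart package; the predictors follow the best subset of size 4 identified earlier):

```r
library(rpart)

set.seed(1)
# Grow a deliberately large tree; rpart cross-validates each cp value internally
tree_fit <- rpart(log(price) ~ lat + grade + yr_built + sqft_living,
                  data = house, cp = 0.001)
plotcp(tree_fit)   # y: CV error, bottom x: cost complexity, top x: tree size

# Prune back to four terminal nodes (i.e., three splits)
cp4 <- tree_fit$cptable[tree_fit$cptable[, "nsplit"] == 3, "CP"]
tree_pruned <- prune(tree_fit, cp = cp4)
plot(tree_pruned); text(tree_pruned)
```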

Example: Buying a House Near the University

## linear regression prediction: 1161709
## regression tree prediction: 1350393

We took this example from redfin: https://www.redfin.com/WA/Seattle/4423-Latona-Ave-NE-98105/home/118624

In the end, we provide the reader with two different models/tools to estimate the price of a desired home from its latitude, grade, yr_built, and sqft_living. We took a random house currently on sale near UW and used our subject knowledge to estimate the corresponding predictor values. The linear regression model gives a closer result than the regression tree, given the values we provided for these variables.
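Predictions of this kind come from predict() on a one-row data frame (a sketch; fit_lm and fit_tree stand for the fitted linear model and pruned tree, and the predictor values below are hypothetical placeholders, not the values we estimated for this listing):

```r
# Hypothetical example house; replace with values for the home of interest
new_house <- data.frame(view = 0, lat = 47.66, grade = 8,
                        yr_built = 1926, sqft_living = 2070)

# Both models were fit on log(price), so back-transform with exp()
exp(predict(fit_lm,   newdata = new_house))   # linear model estimate
exp(predict(fit_tree, newdata = new_house))   # regression tree estimate
```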

Of course, this one example doesn’t demonstrate the accuracy of our models, and the tree is especially problematic because its form can change significantly with a slight change in the training sample. Therefore, for further interest, we may look at the random forest, which reduces the variability in the estimates due to changes in the training sample.