R statistical modeling assignment: building a model to predict house prices in the Cincinnati and Oxford areas.

**Prices of Houses in Oxford-Cincinnati Area**

# Abstract

To model house prices in the Cincinnati and Oxford areas, we collected listing data from Realtor.com. Using multiple regression, we developed a model that predicts the price of a house in either location from location, square footage, number of bathrooms, lot size, and the interaction between location and lot size. We started with a full model that included all predictors and all possible interaction terms, then used the AIC criterion to find a simpler model that still represented our data well. Based on the model we developed, we determined that living in Oxford was cheaper than living in Cincinnati.

# Introduction

For this project, we set out to compare the price of houses in Cincinnati with the price of houses in Oxford, based on square footage, lot size (acreage), and the number of bedrooms, bathrooms, and floors. The objective was to develop a multiple regression model to predict the price of a house in the Oxford or Cincinnati area from its characteristics. The data were gathered from online listings of houses available in Oxford and Cincinnati. By using current listings, we kept our data up to date and as relevant as possible for a consumer. With the data, we ran various tests to find the best-fitting model and interpreted our final results.

This research is relevant to individuals searching for a house in the general area. With the results from the project, homebuyers can make a better-informed financial decision when choosing between Oxford and Cincinnati. This information would also be useful for prospective college students using housing prices to decide between attending Miami University and the University of Cincinnati, or for people deciding between job offers in Cincinnati and Oxford.

# Methods

We started by plotting each predictor against the response variable to check whether a linear relationship exists between price and each variable. We then fit a full model with all predictors and all interaction terms and used backward elimination to arrive at the best model based on the minimum AIC value. We also plotted the adjusted R-squared of the candidate models (see Figure 5); because adjusted R-squared allows models with different numbers of predictors to be compared, the model with the highest value was taken as the best fit. Our final model is defined by the following equation:

$$
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_1 x_4 + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),
$$

where $Y$ is price, $x_1$ is the dummy variable for location, $x_2$ is size, $x_3$ is the number of bathrooms, and $x_4$ is lot size. The two-way interaction term in this model, $x_1 x_4$, represents location × lot size.

Location was a qualitative variable with two levels, coded 0 for Oxford and 1 for Cincinnati. All coefficient estimates were statistically significant; the results are shown in Table 1.

Several diagnostic plots were examined to determine whether the model assumptions were met. If the linearity assumption is violated, a quadratic or cubic model should be tested. If the normality assumption is violated, a transformation of the response should be tried. Because location is a dummy variable with two levels, the fitted line differs by location: coding Oxford as 0 and Cincinnati as 1 affects both the intercept and the slope of the two lines. To compare the two lines, we produced an enhanced scatterplot (see Figure 4).
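To make the effect of the dummy coding concrete, here is a short worked derivation under the final model stated above, using the same symbols ($x_1$ for location, $x_4$ for lot size). It is a sketch of the algebra only, not output from the fitted model.

$$
\begin{aligned}
\text{Oxford } (x_1 = 0):\quad & E[Y] = \beta_0 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 \\
\text{Cincinnati } (x_1 = 1):\quad & E[Y] = (\beta_0 + \beta_1) + \beta_2 x_2 + \beta_3 x_3 + (\beta_4 + \beta_5)\, x_4
\end{aligned}
$$

In this model, $\beta_1$ shifts the intercept and $\beta_5$ shifts the lot-size slope for Cincinnati relative to Oxford, while the coefficients on size and bathrooms are shared by both locations.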

Checking for multicollinearity is an important step in multiple regression. For the best-fitting model, the VIF values of every individual predictor and of the interaction term are less than 10, indicating that the model does not have multicollinearity issues (Table 2).

# Analysis

The pairwise plots did not give a clear answer as to whether we should include interaction terms or revise our simple linear model (see Figure 1 in the Appendix). The trend lines on the plots appeared somewhat flat.

By running multiple regressions, we found a significant linear relationship between the price of a house and location, square footage, lot size, and the number of bedrooms, bathrooms, and floors: the p-value from the F-test was very close to zero.

To come to a final decision about our model, we used backward elimination to examine other models. We started with the model including all interaction terms and gradually removed any interaction term with a p-value greater than 0.05. After that, we ran backward elimination once again with interactions restricted to sets of two predictors, rather than five, and found a model with a lower AIC than the simple linear model; we therefore decided to use this equation to describe our data. Then, the adjusted R-squared values of the different models were compared (see Figure 5). The adjusted R-squared of the simple linear model was 0.7065. This was a relatively good value, indicating that about 70.65% of the sample variability in price could be explained by the simple linear model, so we could consider that first model useful. However, after backward elimination, the final model we selected had an even higher adjusted R-squared of 0.77, meaning it could explain about 77% of the variability in price. Therefore, the model with an adjusted R-squared of 0.77 was taken as our best-fitting model.
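The two criteria used above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: the function names `adjusted_r2` and `aic_gaussian` are hypothetical, the AIC is written in the common form for a Gaussian linear model (up to an additive constant shared by all models fitted to the same data), and the numbers passed in are made up because the report does not give the sample size or residual sums of squares.

```python
import math


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors (intercept not
    counted) fitted to n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def aic_gaussian(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant that is
    identical for every model fitted to the same n observations.
    k counts the estimated regression coefficients, intercept included."""
    return n * math.log(rss / n) + 2 * k


# Hypothetical inputs, NOT values from the report.
n = 100
print(adjusted_r2(0.78, n, p=5))

# A model with more terms is preferred only if its drop in RSS
# outweighs the 2-per-parameter penalty.
print(aic_gaussian(rss=50.0, n=n, k=6) < aic_gaussian(rss=80.0, n=n, k=4))
```

Both criteria penalize extra terms, which is why a model with interactions can still be preferred over the simple linear model when the improvement in fit is large enough.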

We also checked for multicollinearity. Size, location, number of bathrooms, lot size, and location:lot size all had VIFs less than 10, indicating that multicollinearity was not an issue for the chosen model. The Pearson correlation matrix showed that the pairwise correlation between size and number of bathrooms was slightly greater than 0.7. However, since the p-values for each individual predictor and for the interaction term were significant, and the VIFs were all less than 10, we did not remove either of these two variables, because we wanted our model to be more informative.
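For readers unfamiliar with the VIF cutoff, the sketch below shows the standard definition, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on all the other predictors. The function name `vif_from_r2` is hypothetical and the inputs are illustrative rather than values from Table 2.

```python
def vif_from_r2(r2_j):
    """VIF of predictor j, given the R-squared obtained by regressing
    predictor j on all the other predictors in the model."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("r2_j must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)


# The cutoff of 10 used in the text corresponds to R_j^2 = 0.9.
print(vif_from_r2(0.9))

# A pairwise correlation of 0.7 between two predictors implies
# R_j^2 >= 0.7**2 = 0.49, so the VIF is at least about 1.96.
print(vif_from_r2(0.7 ** 2))
```

This is consistent with the reasoning above: a pairwise correlation slightly above 0.7 does not by itself push a VIF anywhere near 10.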

In addition, the diagnostic plots were checked (see Figure 2 in the Appendix). They showed that the normality assumption was violated. Some outliers were observed in the Normal Q-Q plot, and the linearity assumption also appeared to be violated. We therefore tried a transformation and found the best value for λ, which was 0.5 (see Figure 3). However, the normality assumption was still violated for the transformed model (see Figure 3), and its adjusted R-squared, 0.71, was lower than that of the untransformed model. Thus, the untransformed model was kept. Since the adjusted R-squared of the untransformed model was already high and the predictor variables were all significant, fitting a quadratic model was not worth the added complexity.
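The text does not name the transformation. The sketch below assumes it is the Box-Cox power transformation, the usual family indexed by λ, and shows what λ = 0.5 does to a single price. The function names and the example price are hypothetical.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transform of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    """Map a transformed value back to the original (dollar) scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


# Hypothetical listing price, not a value from the data set.
price = 250000.0
z = box_cox(price, 0.5)          # (sqrt(250000) - 1) / 0.5
print(z)
print(inverse_box_cox(z, 0.5))   # back on the dollar scale
```

With λ = 0.5 the transform is essentially a square root, which compresses very high prices. Predictions from a model fitted on the transformed scale would need the inverse transform to be read in dollars.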

Because the untransformed model had several outliers, we reviewed the sample data and found that several houses had very high prices compared with the others. We believe these high prices explain the violation of the normality assumption. However, these data points were not deleted, so that the data would include prices from all ranges. Including only houses with similar prices would make the model less informative, given the variability seen in real-world pricing.

In the enhanced scatterplot (see Figure 4 in the Appendix), Oxford was coded as zero and Cincinnati as one. The Oxford data follow a linear trend, with most points concentrated below a price of $500,000 and a size of 4. The Cincinnati line had a much steeper slope than the Oxford line, but the Cincinnati data did not follow a linear trend as closely as the Oxford data. The Cincinnati data were concentrated in the same region as the Oxford data but had a few more extreme values. Based on the plot of adjusted R-squared for all models (see Figure 5), model 6 had the highest adjusted R-squared, so model 6 was selected as the best fit.

Overall, the equation for the selected model would be:

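$$
\widehat{\text{price}} = -62166 - 124887\,(\text{location}) + 120463\,(\text{size}) + 41964\,(\text{bathrooms}) - 84751\,(\text{lot size}) + 138724\,(\text{location} \times \text{lot size}).
$$

Then the individual line for Oxford would be:

$$
\widehat{\text{price}} = -62166 + 120463\,(\text{size}) + 41964\,(\text{bathrooms}) - 84751\,(\text{lot size}).
$$

Furthermore, the individual line for Cincinnati would be:

$$
\widehat{\text{price}} = -187053 + 120463\,(\text{size}) + 41964\,(\text{bathrooms}) + 53973\,(\text{lot size}).
$$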
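The fitted equation can be evaluated directly. The sketch below hard-codes the coefficients reported above. The function names and the example house are hypothetical, and the units are assumptions: the report does not state them, so the inputs are taken to be in whatever units the data used (the scatterplot discussion suggests size is on a scale where 4 is large, and lot size is presumably acreage).

```python
# Coefficients copied from the selected model in the text.
COEF = {
    "intercept": -62166,
    "location": -124887,          # 0 = Oxford, 1 = Cincinnati
    "size": 120463,
    "bathrooms": 41964,
    "lot_size": -84751,
    "location_x_lot_size": 138724,
}


def predict_price(location, size, bathrooms, lot_size):
    """Predicted price from the selected model."""
    return (COEF["intercept"]
            + COEF["location"] * location
            + COEF["size"] * size
            + COEF["bathrooms"] * bathrooms
            + COEF["lot_size"] * lot_size
            + COEF["location_x_lot_size"] * location * lot_size)


def break_even_lot_size():
    """Lot size at which the two locations have the same predicted
    price, for any fixed size and number of bathrooms."""
    return -COEF["location"] / COEF["location_x_lot_size"]


# Hypothetical house: size 2, 2 bathrooms, lot size 0.5.
print(predict_price(0, 2, 2, 0.5))   # Oxford     -> 220312.5
print(predict_price(1, 2, 2, 0.5))   # Cincinnati -> 164787.5
print(break_even_lot_size())         # roughly 0.90
```

Setting location to 0 or 1 reproduces the two individual lines. Because of the interaction term, the gap between the two locations depends on lot size: under these coefficients the Cincinnati prediction exceeds the Oxford prediction only for lot sizes above roughly 0.90.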

# Conclusion

Overall, based on our data and statistical analysis, we concluded that purchasing a house in Oxford was cheaper, on average, than purchasing a house in Cincinnati. For each location, the predicted price increases with additional square footage when all other variables are held constant, and likewise with each additional bathroom. Our data set deliberately included multiple outliers for both Cincinnati and Oxford. Rather than restricting the data to a small portion of the real estate market, we wanted a broader model that would be informative for anyone looking for a house in either location, not only for people looking in a specific price range.

For a person looking to buy a house in Oxford, the individual model for that location suggests that the way to minimize price is to limit the number of bathrooms and the total square footage; because the lot-size coefficient for Oxford is negative, a larger lot is associated with a lower predicted price there. For a person looking to buy a house in Cincinnati, the way to minimize price is to decrease the size, the number of bathrooms, and the lot size. Finding a house that meets these specific recommendations is not entirely practical, and the best option may be to focus on the size of the house relative to the location.

The interaction between lot size and location was found to be the most significant term in our model and will have the largest effect when determining a price.

# Appendix

**References**

Realtor.com. Real estate information. Available at: http://www.realtor.com/realestateandhomes-detail/9093-Eldora-Dr_Cincinnati_OH_45236_M43142-82121
