Posted July 15th, 2007 at 6:12 PM in the Essays category

1.0 EXECUTIVE SUMMARY

  • This report presents the statistical findings of a linear regression on the variables most likely to affect the number of wins of an NFL team. The initial independent variables are quarterback salary, average ticket price, rush yards per game, first downs per game, and dummy variables for geographic location. The pooled (panel) data covers all 31 NFL teams over three seasons: 1999, 2000, and 2001. The dependent variable is the number of wins.
  • The initial model included financial measures such as average ticket price and quarterback salary. Statistically, these economic factors seem to have little effect on the number of wins. In addition, the initial model did not include enough relevant offensive and defensive statistics, and it produced poor regression results with an adjusted R-squared of 0.305.
  • The final model dropped the insignificant financial measures and incorporated additional variables for defensive and offensive team statistics. The model has an adjusted R-squared of 0.798, meaning that it explains roughly 80% of the variation in the number of wins. This percentage is satisfactory because football is a game that needs to be played out on the field; no model can predict exactly how many games a team will win given a limited number of factors. The p-value for the F-statistic is 0.0000, which means that the overall model is significant. The final independent variables (first downs per game, quarterback rating, opposition points per game, interceptions per game, and rush yards per game) all have t-statistic p-values that meet the 10% level of significance. The final model identifies the statistically significant variables that contribute to a team’s overall number of wins.

2.0 INTRODUCTION

In professional sports there is a mystique surrounding the variables that influence the performance of a winning team. Sports fans would like to know, with some statistical backing, whether their team is going to win. With so much data available, a solid model to predict the number of wins would be informative. The objective of this project is to determine, with reasonable statistical significance, which variables affect the number of games an NFL team will win in any given season.

3.0 DATA

The data covers the NFL’s 1999, 2000, and 2001 football seasons. The sample is not a time series; instead, it is pooled data (or panel data). There are 93 observations: 31 teams in each of the three seasons. Data for 2002 was not used because of a newly added team and because the season is not yet complete. The dependent variable is the number of wins for an NFL football team. The initial independent variables for each NFL team are:

  • quarterback salary
  • average ticket price
  • rush yards per game
  • first downs per game
  • north, south, and west (dummy variables for geographic location)

The main data sources for these variables are nflarchives.com, teammarketing.com, and usatoday.com. NFL Archives provides team-relevant statistics. USA Today provides the salaries for quarterbacks. Team Marketing is a research firm that evaluates the average ticket prices for all NFL stadiums. The location of the teams was determined by examining a map of the U.S. and separating teams into four (roughly) equal-sized areas. A map of the partitioned geographic areas is in Appendix H.
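The location dummy variables described above can be sketched as follows. This is a minimal illustration, not the procedure used in EViews: the team names and their region assignments are hypothetical placeholders (the actual partition is defined by the map in Appendix H), and treating the "mid" region as the omitted baseline is inferred from the fact that only north, south, and west enter the regression.

```python
# Four regions, three dummy columns. The fourth region ("mid") is the omitted
# baseline, which avoids perfect collinearity with the intercept.
REGIONS = ("north", "south", "west", "mid")
DUMMY_COLUMNS = ("north", "south", "west")


def encode_region(region: str) -> dict[str, int]:
    """Return 0/1 indicators for north, south, and west."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    return {name: int(region == name) for name in DUMMY_COLUMNS}


# Hypothetical teams and assignments, for illustration only.
example_teams = {"Team A": "north", "Team B": "mid", "Team C": "west"}
encoded = {team: encode_region(region) for team, region in example_teams.items()}

for team, dummies in encoded.items():
    print(team, dummies)
```

A team in the baseline region has all three dummies equal to zero, so each dummy coefficient is read as a difference in wins relative to that region.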

4.0 REGRESSION ESTIMATIONS

Our initial model was estimated using the least squares method of linear regression. The results for the initial and final models are shown in Table 1 and Table 2. There is one intermediate model. All results were generated with the statistical package EViews 3.1 Student Version, unless stated otherwise. All regression models are located in Appendix A.

The first model generated has a very low adjusted R-squared value of 0.305. The F-statistic is highly significant, with a p-value of effectively zero. The individual t-statistic p-values are far above the 10% level of significance, except for rush yards per game and first downs per game, with 0.0011 and 0.0018, respectively. The estimated coefficients are all positive except for WEST, SALARYQB, and the constant term. The constant, which represents the base number of team wins, should not be negative.

The final model has a much better adjusted R-squared value of 0.7986. The F-statistic again has a p-value of effectively zero. The individual t-statistic p-values have greatly improved, and the estimated coefficients have logical signs. Naturally, the fewer points per game the opponents score, the more wins for the team; thus, OPPPPG has a negative coefficient. The variable interceptions per game represents defensive turnovers created by interceptions. The intercept is positive and is a reasonable minimum number of wins for an NFL team.
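For readers without EViews, the statistics discussed above (coefficients, adjusted R-squared, and the overall F-statistic) can be reproduced with a short least squares routine. The sketch below assumes NumPy is available and runs on synthetic data: the coefficients and noise level are made up, and only the sample size of 93 and the count of five regressors mirror this study.

```python
import numpy as np


def fit_ols(X, y):
    """Least squares with an intercept; returns coefficients and fit statistics."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])  # first column is the constant
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - rss / tss
    # Adjusted R-squared penalizes each additional regressor.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    # Overall F-statistic: are all slope coefficients jointly zero?
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    return {"beta": beta, "r2": r2, "adj_r2": adj_r2, "f_stat": f_stat}


# Synthetic stand-in for the pooled sample (NOT the NFL data).
rng = np.random.default_rng(0)
n_obs = 93
X = rng.normal(size=(n_obs, 5))
made_up_beta = np.array([0.5, 0.3, -0.8, 0.4, 0.2])
y = 8.0 + X @ made_up_beta + rng.normal(scale=1.0, size=n_obs)

result = fit_ols(X, y)
print("coefficients:", np.round(result["beta"], 3))
print("adjusted R-squared:", round(result["adj_r2"], 3))
print("F-statistic:", round(result["f_stat"], 2))
```

The adjusted R-squared is always below the plain R-squared when the fit is imperfect, which is why it is the fairer measure when comparing models with different numbers of variables.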

4.1 Refining the Model

4.1.1 Functional Form and Omitted Variables

The initial regression (Model A) yielded disappointing results. The dependent variable was plotted against each independent variable to look for non-linear relationships; these plots are located in Appendix B. Because of the low adjusted R-squared value, the specification was modified and no further tests were run on the original model. To boost the adjusted R-squared value and give a more balanced view of offense and defense, the following variables were added to Model A to form Model B: quarterback rating, opposition points per game, total defensive yards per game, interceptions per game, and passing yards per game.

4.1.2 Multicollinearity and Variance Inflation Factor (VIF)

Because Model A omitted relevant variables, Model B is the first model for which VIF values were calculated. The results indicate that pass yards per game should be dropped due to a high VIF value of 10.9762, which is above the common rule-of-thumb threshold of 10. The correlation matrix shows that total defense per game has a high rho value (almost 0.8). The t-statistic p-value of total defense per game is also high, indicating that it is a candidate to be dropped. Correlation matrices are located in Appendix C; VIF values are in Appendix D.
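The VIF for a regressor is 1 / (1 - R-squared), where the R-squared comes from regressing that variable on all the other regressors. A VIF of 10.9762 therefore corresponds to an auxiliary R-squared of about 0.91. The sketch below is one way to compute VIF values, assuming NumPy; the data is synthetic, with one column deliberately built as a near-combination of two others.

```python
import numpy as np


def r_squared(X, y):
    """R-squared from a least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the remaining columns."""
    values = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        values.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return values


# Synthetic regressors (NOT the NFL data).
rng = np.random.default_rng(1)
a = rng.normal(size=93)
b = rng.normal(size=93)
c = a + b + rng.normal(scale=0.1, size=93)  # nearly a + b: strongly collinear
d = rng.normal(size=93)                     # unrelated to the others
X = np.column_stack([a, b, c, d])

vifs = vif(X)
for name, value in zip("abcd", vifs):
    print(name, round(value, 2))
```

In this setup the collinear column shows a very large VIF while the unrelated column stays close to 1, which is the pattern that would flag a variable such as pass yards per game.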

4.1.3 Heteroskedasticity and Serial Correlation

The sample data is neither time series nor cross-sectional; it is panel data. The data for this research covers only a three-year time period, so serial correlation was not tested. However, the White Test for heteroskedasticity using cross products was used to check for non-constant error variance. The tests indicate that neither Model B nor Model C suffers from heteroskedasticity. The high F-statistic p-values (Model B: 0.636; Model C: 0.608) mean the null hypothesis of constant variance cannot be rejected, so no corrections for heteroskedasticity are necessary. The White Tests for heteroskedasticity are in Appendix E.
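The White Test with cross products regresses the squared residuals on the original regressors, their squares, and their pairwise products. A rough sketch of that auxiliary regression follows, assuming NumPy and synthetic data with continuous regressors (squares of 0/1 dummy variables would duplicate the dummies themselves and need to be dropped). It returns the test statistics only; the p-values reported in Appendix E would additionally require a chi-square or F distribution, which is left out here to avoid an extra dependency.

```python
from itertools import combinations

import numpy as np


def ols_residuals(X, y):
    """Residuals from a least squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta


def white_test(X, resid):
    """White test with cross products; returns (LM statistic, F-statistic, m)."""
    n, k = X.shape
    terms = [X[:, j] for j in range(k)]                   # levels
    terms += [X[:, j] ** 2 for j in range(k)]             # squares
    terms += [X[:, i] * X[:, j] for i, j in combinations(range(k), 2)]  # cross products
    aux = np.column_stack(terms)
    m = aux.shape[1]                                      # auxiliary regressors
    squared = resid ** 2
    aux_resid = ols_residuals(aux, squared)
    r2 = 1.0 - float(aux_resid @ aux_resid) / float(((squared - squared.mean()) ** 2).sum())
    lm_stat = n * r2                                      # compare with chi-square(m)
    f_stat = (r2 / m) / ((1.0 - r2) / (n - m - 1))        # compare with F(m, n - m - 1)
    return lm_stat, f_stat, m


# Synthetic data with constant error variance (NOT the NFL data).
rng = np.random.default_rng(2)
n_obs = 93
X = rng.normal(size=(n_obs, 3))
y = 8.0 + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n_obs)

lm_stat, f_stat, n_terms = white_test(X, ols_residuals(X, y))
print("auxiliary regressors:", n_terms)
print("LM statistic:", round(lm_stat, 3))
print("F-statistic:", round(f_stat, 3))
```

A small statistic (and hence a large p-value, as in Models B and C) means the squared residuals are not explained by the regressors, which is what constant variance looks like.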

4.1.4 Wald Test for Joint Significance

The t-tests and the Wald test for joint significance reveal that first downs per game and rush yards per game, in Model B, are jointly significant. Therefore, the null hypothesis that both variables have zero coefficients is rejected and the two variables are kept. The other Wald Tests for Model B did not reveal any joint significance that could adversely affect the final model.
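For linear zero restrictions like these, the joint test can be written in its F form by comparing the residual sum of squares of the full model with that of a restricted model that omits the tested variables. The sketch below assumes NumPy and synthetic data in which the first two columns matter and the last two do not; it is an illustration of the technique rather than a reproduction of the EViews output in Appendix F.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares from a least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def joint_f_statistic(X, y, tested):
    """F-statistic for H0: the coefficients on the `tested` columns are all zero."""
    n, k = X.shape
    kept = [j for j in range(k) if j not in tested]
    rss_unrestricted = rss(X, y)
    rss_restricted = rss(X[:, kept], y)
    q = len(tested)  # number of restrictions
    return ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / (n - k - 1))


# Synthetic data (NOT the NFL data): columns 0 and 1 affect y, columns 2 and 3 do not.
rng = np.random.default_rng(3)
n_obs = 93
X = rng.normal(size=(n_obs, 4))
y = 8.0 + 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n_obs)

f_relevant = joint_f_statistic(X, y, tested=[0, 1])
f_irrelevant = joint_f_statistic(X, y, tested=[2, 3])
print("F for columns 0 and 1:", round(f_relevant, 2))
print("F for columns 2 and 3:", round(f_irrelevant, 2))
```

A large F-statistic leads to rejecting the null hypothesis, which is the situation described for first downs per game and rush yards per game.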

4.1.5 Dropped Variables

Based on the previous statistical tests and the individual t-statistics, the independent variables for Model C were selected. The high t-statistic p-value for average ticket price is a good indication to drop the variable; likewise for the location dummy variables (north, south, and west). Quarterback salary has a high t-statistic p-value; in addition, some investigation revealed that NFL teams have salary caps. Knowing this, the variable was dropped. The high VIF and high t-statistic p-value for pass yards per game were a good indication to drop that variable. Total defensive yards per game, flagged in Section 4.1.2 for its high correlation and high t-statistic p-value, does not appear in the final model either.

5.0 FINAL MODEL EVALUATION

Model C, the final model, has been tested using the same criteria as Model B, the intermediate model. The constant coefficient indicates a baseline of about 2.8 wins for each NFL team. Each additional first down per game contributes 0.179 wins, while each additional interception per game contributes 0.77 wins. Rush yards per game and quarterback rating also add to the number of wins. As the opposition’s points per game increase, the team’s expected number of wins decreases, hence the negative coefficient. Contrary to the original assumption, financial measures do not significantly influence the wins of an NFL team. Location does not significantly affect the number of wins, since a team does not play all 16 games in one geographic area. The final model indicates that a balance of offensive and defensive statistics is necessary to explain the number of games that a team will win.
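As a small worked example of how to read these coefficients, the snippet below computes the change in expected wins from a change in first downs or interceptions per game, holding every other variable fixed. Only the two coefficients quoted above are used; the remaining Model C coefficients are not reported in this section, so no full prediction is attempted. The function name is a hypothetical helper.

```python
# Coefficients quoted in the text for Model C.
FIRST_DOWNS_COEF = 0.179    # wins per additional first down per game
INTERCEPTIONS_COEF = 0.77   # wins per additional interception per game


def change_in_expected_wins(delta_first_downs=0.0, delta_interceptions=0.0):
    """Change in expected season wins, all other variables held constant."""
    return (FIRST_DOWNS_COEF * delta_first_downs
            + INTERCEPTIONS_COEF * delta_interceptions)


# Two more first downs per game: 2 * 0.179
print(round(change_in_expected_wins(delta_first_downs=2.0), 3))    # 0.358
# Half an interception more per game: 0.5 * 0.77
print(round(change_in_expected_wins(delta_interceptions=0.5), 3))  # 0.385
```

These are marginal effects from a linear model, so they describe average tendencies across the 93 observations rather than a guarantee for any single team.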

6.0 SUMMARY

The final model dropped the insignificant financial measures and incorporated additional variables for defensive and offensive team statistics. It has an adjusted R-squared of 0.798, meaning that the model explains roughly 80% of the variation in the number of wins. This percentage is satisfactory because football is a game that needs to be played out on the field; no model can predict exactly how many games a team will win given a limited number of factors. The p-value for the F-statistic is 0.0000, which means that the overall model is significant. The final independent variables (first downs per game, quarterback rating, opposition points per game, interceptions per game, and rush yards per game) all have t-statistic p-values that meet the 10% level of significance. The final model identifies the statistically significant variables that contribute to a team’s overall number of wins.

APPENDIX A: REGRESSION OUTPUT

Table 3: Least Squares Regression for Model A
Table 4: Least Squares Regression for Model B
Table 5: Least Squares Regression for Model C

APPENDIX B: CHECK FOR NON-LINEAR RELATIONSHIPS

Omitted from the online version of this essay.

APPENDIX C: CORRELATION MATRICES

Table 6: Correlation Matrix for Model B
Table 7: Correlation Matrix for Model C

APPENDIX D: VARIANCE INFLATION FACTOR

Table 8: VIF Values for Model B
Table 9: VIF Values for Model C

APPENDIX E: HETEROSKEDASTICITY

Table 10: White Test for Heteroskedasticity for Model B
Table 11: White Test for Heteroskedasticity for Model C

APPENDIX F: WALD TESTS

The following tables have results from the Wald Test for joint significance. In Model B, first downs per game and rush yards per game are jointly significant. They are included in the final model, Model C.

Table 12: Wald Test for Model B
Table 13: Wald Test for Model C

APPENDIX G: HISTOGRAM

Graph 1 is a histogram of the dependent variable, number of wins. Its approximately normal (Gaussian) shape suggests that the sample is adequate, both in size and in its consistency with the central limit theorem.

APPENDIX H: DUMMY VARIABLE

Figure 1 shows how the NFL teams were partitioned into discrete geographic areas. The final model, Model C, does not use dummy variables for location; Models A and B, however, incorporated them. The team locations were separated into four areas, three of which (north, south, and west) were used as dummy variables in the linear regression. The west area has seven teams, and the mid, south, and north areas each have eight teams.
