A Modern Approach to Regression with R
The book is aimed at first-year graduate students in statistics. It could also be used for a senior undergraduate class. I am grateful to the students who took these courses.
Charles Lindsey wrote the Stata code that appears in the Stata primer that accompanies the book. Brad Barney and Andrew Redd contributed some of the R code used in the book.
Readers of this book will find that the work of Cook and Weisberg has had a profound influence on my thinking about regression. In particular, this book contains many references to the books by Cook and Weisberg and by Weisberg. The content of the book has also been influenced by a number of people.
Robert Kohn and Geoff Eagleson, my colleagues for more than 10 years at the University of New South Wales, taught me a lot about regression but more importantly about the importance of thoroughness when it comes to scholarship.
My long-time collaborators on nonparametric statistics, Tom Hettmansperger and Joe McKean, have helped me enormously, both professionally and personally, for more than 20 years. Lively discussions with Mike Speed about valid models and residual plots led to dramatic changes to the examples and the discussion of this subject in Chapter 6.
A number of reviewers provided valuable comments and suggestions. Finally, I am grateful to Jennifer South, who painstakingly proofread the whole manuscript.
The web site that accompanies the book, which contains R, SAS and Stata code and primers along with all the data sets from the book, can be found at www.
Also available at the book web site are online tutorials on matrices, R and SAS. We shall see that a key step in any regression analysis is assessing the validity of the given model.
When weaknesses in the model are identified the next step is to address each of these weaknesses. Plots will be an important tool for both building regression models and assessing their validity.
In other words, any conclusion is only as sound as the model on which it is based. The following four examples give an indication of what is to come.
In order to examine the claim, we consider data on the 19 NFL field goal kickers who made at least ten field goal attempts in each of the seasons considered, and at the completion of games on Sunday, November 12, in the most recent season.
The data are available on the book web site, in the file FieldGoalsto. It can be shown that the resulting correlation in Figure 1. is based on an invalid model, since it ignores the potentially different abilities of the 19 kickers. To take account of these differences we use linear regression to analyze the data in Figure 1.; in particular, a separate regression line can be fit for each of the 19 kickers. Details on how to perform these calculations are provided in Chapter 5.
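As a preview, the following is a minimal sketch (not the book's own code) of fitting kicker-specific lines in R. The file name FieldGoals.csv and the column names Name, CurrentYear and PriorYear are assumptions; substitute the names used in the actual data file from the book web site.

fg <- read.csv("FieldGoals.csv")

# Different intercept for each kicker, common slope
common_slope <- lm(CurrentYear ~ PriorYear + factor(Name), data = fg)

# Fully separate regression line (intercept and slope) for each kicker
separate_lines <- lm(CurrentYear ~ factor(Name) * PriorYear, data = fg)

summary(common_slope)
anova(common_slope, separate_lines)   # compare the two specifications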
Thus, a valid way of summarizing the data in Figure 1. is via the slope from this kicker-by-kicker analysis; this slope is estimated to be -0. Imagine that the company that publishes a weekday newspaper in a mid-size American city has asked for your assistance in an investigation into the feasibility of introducing a Sunday edition of the paper.
Interest centers on developing a regression model that enables you to predict the Sunday circulation of a newspaper from its weekday circulation. Actual circulation data from September 30 are available for 89 US newspapers that publish both weekday and Sunday editions. The first 15 rows of the data are given in Table 1. The data are available on the book web site, in the file circulation. The situation is further complicated by the fact that in some cities there is more than one newspaper.
The last column in Table 1. is a dummy variable: it takes value 1 when the newspaper is a tabloid with a serious competitor in the same city and value 0 otherwise. Figure 1. gives scatter plots of the data, both on the original scale and after taking logs; taking logs has made the variability much more constant.
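A minimal sketch of these two plots in R, assuming the file circulation.csv has columns named Sunday and Weekday (both the file name and column names are assumptions):

circ <- read.csv("circulation.csv")
par(mfrow = c(1, 2))
# Scatter plot on the original scale
plot(Sunday ~ Weekday, data = circ,
     xlab = "Weekday circulation", ylab = "Sunday circulation")
# Same plot after taking logs of both variables
plot(log(Sunday) ~ log(Weekday), data = circ,
     xlab = "log(Weekday circulation)", ylab = "log(Sunday circulation)")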
We shall return to this example in Chapter 6. The data in Table 1. will be discussed in detail in Chapters 5 and 6. Imagine that you have been asked to join the team supporting a young New York City chef who plans to create a new Italian restaurant in Manhattan.
The creation and the initial operation of the restaurant will be the basis of a reality TV show for the US and international markets including Australia. You have been told that the restaurant is going to be located no further south than the Flatiron District and it will be either east or west of Fifth Avenue.
You have been asked to determine the pricing of the restaurant's dinner menu such that it is competitively positioned with other high-end Italian restaurants in the target area. In particular, your role in the team is to analyze the pricing data that have been collected in order to produce a regression model to predict the price of dinner.
Actual data from surveys of customers of Italian restaurants in the target area are available. Whilst the situation described above is imaginary, the data are real ratings from New York City diners.
The data are given on the book web site in the file nyc. According to www., the survey was an immediate success, and the Zagats have produced a guide to New York City restaurants each year since. In less than 30 years, Zagat Survey has expanded to cover restaurants in more than 85 cities worldwide, as well as other activities including travel, nightlife, shopping, golf, theater, movies and music. Is this effect also the most statistically significant? We shall return to this example in Chapters 5 and 6.
Background information on each appears below. The most influential critic in the world today happens to be a critic of wine. Only rarely has Parker given a wine a perfect score of 100: seventy-six times out of all the wines he has tasted.
Prior to his career as an author, Coates spent twenty years as a professional wine merchant. Clive Coates is very serious and well respected, but in terms of commercial impact his influence is zero.
The books by Parker and Coates each contain numerical ratings and reviews of the wines of Bordeaux. In particular, we consider the prices for 72 wines from a single vintage in Bordeaux. The prices are taken from Coates, Appendix One. Robert Parker uses a 100-point rating system, with wines given a whole number score between 50 and 100.
This variable is included as a potential predictor in view of the comment by Hardy Rodenstock. First Growth is the highest classification given to a wine from Bordeaux. Thus, first-growth wines are expected to achieve higher prices than other wines.
Cult wines such as Le Pin have limited availability, and as such demand far outstrips supply. Cult wines are therefore among the most expensive wines of Bordeaux. According to Parker: The smallest of the great red wine districts of Bordeaux, Pomerol produces some of the most expensive, exhilarating, and glamorous wines in the world.
Superstar status is awarded by Robert Parker to a few wines in certain vintages. For example, Robert Parker describes the La Mission Haut-Brion as follows: "A superstar of the vintage, the La Mission Haut-Brion is as profound as such recent superstars as …". Using the regression model developed in part 1, decide which of the predictor variables ParkerPoints and CoatesPoints has the largest estimated percentage effect on Price.
Using your regression model developed in part 1, comment on the following claim from Eric Samazeuilh: "Parker is the wine writer who matters." Using your regression model developed in part 1, decide whether there is a statistically significant extra price premium paid for Bordeaux wines from the vintage with a Parker score of 95 and above.
Identify the wines in the data set which, given the values of the predictor variables, are: (i) unusually highly priced; (ii) unusually lowly priced. In Chapters 3 and 6, we shall see that a log transformation will enable us to estimate percentage effects.
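As a hypothetical preview of the kind of model this leads to (the file name Bordeaux.csv and the column names Price, ParkerPoints, CoatesPoints, P95andAbove, FirstGrowth, CultWine, Pomerol and VintageSuperstar are all assumptions, and whether the point scores should enter on the log scale is itself part of the modeling exercise):

wine <- read.csv("Bordeaux.csv")
# Log-transformed response, so coefficients act (approximately) as percentage effects
m1 <- lm(log(Price) ~ log(ParkerPoints) + log(CoatesPoints) + P95andAbove +
           FirstGrowth + CultWine + Pomerol + VintageSuperstar, data = wine)
summary(m1)
# A coefficient b on log(ParkerPoints) means that a 1% increase in the Parker score
# is associated with roughly a b% change in price, other predictors held fixed.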
An important component of this understanding will come from the mathematical properties of regression procedures. In this chapter we consider problems involving modeling the relationship between two variables. These problems are commonly referred to as simple linear regression or straight-line regression. In later chapters we shall consider problems involving modeling the relationship between three or more variables.
In particular we next consider problems involving modeling the relationship between two variables as a straight line, that is, when Y is modeled as a linear function of X.
Example: A regression model for the timing of production runs. We shall consider the following example, taken from Foster, Stine and Waterman, throughout this chapter. The original data are in the form of the time taken (in minutes) for a production run, Y, and the number of items produced, X, for 20 randomly selected orders as supervised by three managers. At this stage we shall only consider the data for one of the managers (see Table 2.). We wish to develop an equation to model the relationship between Y, the run time, and X, the run size.
A scatter plot of the data, like that given in Figure 2., is an important first step. The X-variable is called the explanatory or predictor variable, while the Y-variable is called the response variable or the dependent variable. The X-variable often has a different status to the Y-variable. The regression of Y on X is linear if E(Y | X = x) = β0 + β1x, that is, if the conditional mean of Y given X = x is a linear function of x. Suppose that Y1, Y2, …, Yn are independent realizations of the random variable Y that are observed at the values x1, x2, …, xn of a random variable X.
The random error term is there since there will almost certainly be some variation in Y due strictly to random phenomena that cannot be predicted or explained.
In other words, all unexplained variation is called random error. Thus, the random error term does not depend on x, nor does it contain any information about Y (otherwise it would be a systematic error). In practice, we usually have a sample of data instead of the whole population. The population slope β1 and intercept β0 are unknown. Thus, we wish to use the given data to estimate the slope and the intercept.
A very popular method of choosing b0 and b1 is called the method of least squares: choose the values that minimize the residual sum of squares. Setting the partial derivatives of the residual sum of squares with respect to b0 and b1 equal to zero gives two equations, called the normal equations, whose solution yields the least squares estimates. Let us look at the results that we have obtained from the least squares line of best fit in Figure 2. From the slope of the line, we say that each additional unit to be produced is predicted to add 0. minutes to the run time.
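For reference, the normal equations and the resulting least squares estimates can be written out in the standard form (this is a reconstruction of the standard formulas, using SXX and SXY for the sums of squares and cross-products about the means, not a quotation from the book):

\[
\sum_{i=1}^{n}\bigl(y_i - b_0 - b_1 x_i\bigr) = 0,
\qquad
\sum_{i=1}^{n} x_i \bigl(y_i - b_0 - b_1 x_i\bigr) = 0
\]
\[
\hat{b}_1 = \frac{SXY}{SXX} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}},
\qquad
\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}
\]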
The intercept in the model has the following interpretation: for any production run, the average set-up time is the value of the intercept (in minutes). We next consider estimating the variance of the random error term. Consider the linear regression model with constant variance given by 2. The residuals from the fitted line, êi = yi − ŷi, can be used to estimate σ2.
Two points to note are: (1) the divisor in S2 is n − 2, since we have estimated two parameters, namely b0 and b1; and (2) the errors e1, e2, …, en are assumed to be independent with constant variance. In addition, since the regression model is conditional on X, we can assume that the values of the predictor variable, x1, x2, …, xn, are known fixed constants.
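As a small illustration of the divisor n − 2, here is a minimal R sketch assuming the production run data are in a file production.txt with columns RunTime and RunSize (the file and column names are assumptions, not the book's code):

prod <- read.table("production.txt", header = TRUE)
fit <- lm(RunTime ~ RunSize, data = prod)
n <- nrow(prod)
rss <- sum(residuals(fit)^2)          # residual sum of squares
s2 <- rss / (n - 2)                   # divisor n - 2: two estimated parameters
c(s2, summary(fit)$sigma^2)           # agrees with R's reported residual variance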
Under the above assumptions, we shall show in Section 2. that the variance of the least squares slope estimate decreases as the spread in the x-values increases. This is an important fact to note if the experimenter has control over the choice of the values of the X variable. Standardizing the estimates leads to t-based confidence intervals and tests; in this case we are estimating two such parameters, namely b0 and b1.
Important notes: (1) A confidence interval is always reported for a parameter (e.g., the slope, or the mean of Y at a given x), whereas a prediction interval is reported for an actual value of Y. (2) Ninety-five percent confidence intervals for the population regression line (i.e., for the conditional mean of Y at given values of x) can be read from the regression output in R.
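In R, both kinds of interval can be obtained from predict(). A minimal sketch, again assuming the hypothetical production.txt file and column names used above:

prod <- read.table("production.txt", header = TRUE)
fit <- lm(RunTime ~ RunSize, data = prod)
new_x <- data.frame(RunSize = c(50, 100, 200))   # illustrative run sizes (assumed)
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)  # mean of Y
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)  # new value of Y
confint(fit)   # 95% confidence intervals for the intercept and slope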
First, we introduce some terminology. We next look at the hypothetical situation in Figure 2. Under the assumption that the errors e1, e2, …, en are independent and normally distributed, a t-test of whether the slope is zero can be performed, and all statistical packages report the corresponding p-value. The p-value is arguably one of the most commonly misused statistics. We shall see in Chapter 5 that Analysis of Variance overcomes the problems associated with multiple t-tests which occur when there are many predictor variables.
We next consider so-called dummy variable regression, which in its simplest form is used when a predictor is categorical with two levels. The resulting regression models allow us to test for the difference between the means of two groups.
We shall see in a later topic that the concept of a dummy variable can be extended to include problems involving more than two groups. Using dummy variable regression to compare new and old methods: we shall consider the following example throughout this section. It is taken from Foster, Stine and Waterman. In this example, we consider a large food processing center that needs to be able to switch from one type of package to another quickly to react to changes in order patterns.
Consultants have developed a new method for changing the production line and used it to produce a sample of 48 change-over times (in minutes). Also available is an independent sample of 72 change-over times (in minutes) for the existing method. The first three and the last three rows of the data from this file are reproduced below in Table 2. Plots of the data appear in Figure 2. We wish to develop an equation to model the relationship between Y, the change-over time, and X, the dummy variable corresponding to the new method, and hence test whether the mean change-over time is reduced using the new method.
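A minimal sketch of this dummy variable regression in R, assuming the data are in a file changeover_times.txt with columns Changeover and New (the file and column names are assumptions):

ct <- read.table("changeover_times.txt", header = TRUE)
# New = 1 for the new method, 0 for the existing method
fit <- lm(Changeover ~ New, data = ct)
summary(fit)
# The coefficient of New estimates the difference in mean change-over time
# (new minus existing); its t-test is equivalent to a two-sample t-test
# based on a pooled estimate of variance.
t.test(Changeover ~ New, data = ct, var.equal = TRUE)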
In this case it is sufficient to look at the scatter plots in Figure 3. However, when we consider situations in which there is more than one predictor variable, we shall need some additional tools in order to check the appropriateness of the fitted model. These plots will enable us to assess visually whether an appropriate model has been fit to the data, no matter how many predictor variables are used. For example, in Figure 3. there is no discernible pattern in the plot of the residuals from data set 1 against x1.
We shall see that a plot of residuals against X that produces a random pattern indicates an appropriate model has been fit to the data. Additionally, we shall see that a plot of residuals against X that produces a discernible pattern indicates an incorrect model has been fit to the data.
If no pattern is found then this indicates that the model provides an adequate summary of the data, i. If a pattern is found then the shape of the pattern provides information on the function of x that is missing from the model. If the residuals vary with x then this indicates that an incorrect model has been fit.
In Chapter 6 we will study the properties of least squares residuals more carefully. If the true relationship between Y and X is quadratic, then the residuals from the straight-line fit of Y on X will have a quadratic pattern. Hence, we can conclude that there is a need for a quadratic term to be added to the original straight-line regression model. As expected, a clear quadratic pattern is evident in the residuals in Figure 3.
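The following simulated sketch (not data from the book) illustrates how a missing quadratic term shows up in the residual plot and disappears once the term is added:

set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 1.5 * x - 0.3 * x^2 + rnorm(100)   # true relationship is quadratic
straight <- lm(y ~ x)
plot(x, residuals(straight))    # clear quadratic pattern in the residuals
quadratic <- lm(y ~ x + I(x^2))
plot(x, residuals(quadratic))   # pattern disappears once the quadratic term is added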
When fitting a regression model we will discover that it is important to:
1. Determine whether the proposed regression model is a valid model (i.e., whether it provides an adequate fit to the data). The main tools we shall use to validate regression assumptions are plots of standardized residuals.
2. Determine which (if any) of the data points have x-values that have an unusually large effect on the estimated regression model (such points are called leverage points).
3. Determine which (if any) of the data points are outliers, that is, points which do not follow the pattern set by the bulk of the data, when one takes into account the given model.
4. If leverage points exist, determine whether each is a bad leverage point. If a bad leverage point exists, assess its influence on the fitted model.
5. Examine whether the assumption of constant variance of the errors is reasonable. If not, look at how to overcome this problem.
6. If the data are collected over time, examine whether the data are correlated over time.
7. If the sample size is small or prediction intervals are of interest, examine whether the assumption that the errors are normally distributed is reasonable.
We begin by looking at the second item of the above list, leverage points, as these will be needed in the explanation of standardized residuals.
The applet randomly generates 20 points from a known straight-line regression model. It produces a plot like that shown in Figure 3. One of the 20 points has an x-value which makes it distant from the other points on the x-axis. We shall see that this point, which is marked on the plot, is a good leverage point. Next we use the applet to drag one of the points away from the true population regression line. In particular, we focus on the point with the largest x-value.
Dragging this point vertically down so that its x-value stays the same produces the results shown in Figure 3. Notice how much the least squares regression line has changed. (Standardized residuals will be defined later in this section.) The least squares regression line has been levered down by a single point. Hence we call this point a leverage point. It is a bad leverage point since its Y-value does not follow the pattern set by the other 19 points.
In summary, a leverage point is a point whose x-value is distant from the other x-values. A point is a bad leverage point if its Y-value does not follow the pattern set by the other data points.
In other words, a bad leverage point is a leverage point which is also an outlier. Returning to Figure 3., a good leverage point is a leverage point which is NOT also an outlier. We use the applet to drag one of these points away from the true population regression line.
In particular, we focus on the point with the 11th largest x-value. Dragging this point vertically up so that its x-value stays the same produces the results shown in Figure 3. Notice how the least squares regression line has changed relatively little in response to changing the Y-value of a point with a centrally located x-value. This point is said to be an outlier that is not a leverage point.
The data are given in Table 3. Notice that the values of x in Table 3. are the same for both data sets. Regression output from R for the straight-line fits to the two data sets is given below. Next, recall that the only difference between the data in the two plots in Figure 3. is a single Y-value. Comparing the plots in Figure 3., we see that this change in Y has produced dramatic changes in the equation of the least squares line. For example, looking at the regression output from R above, the slope of the regression for YGood is -0.
In addition, this change in a single Y value has had a dramatic effect on the value of R2. Our aim is to arrive at a numerical rule that will identify xi as a leverage point, i.e., a point whose x-value is distant from the other x-values. For simple linear regression the leverage of the ith point is hii = 1/n + (xi − x̄)2 / Σj(xj − x̄)2. Consider, for a moment, this formula for leverage hii: it depends only on the x-values, so the leverage values are the same for both data sets. Recall that a point is a bad leverage point if its Y-value does not follow the pattern set by the other data points.
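In R, the leverage values are returned by hatvalues(). The cut-off used below (twice the average leverage, i.e., 2(p + 1)/n) is a common rule of thumb rather than a quotation from the book. A small simulated sketch:

set.seed(2)
x <- c(rnorm(19), 8)        # one point with an x-value far from the rest
y <- 1 + 2 * x + rnorm(20)
fit <- lm(y ~ x)
h <- hatvalues(fit)                  # leverage values h_ii
cutoff <- 2 * (1 + 1) / length(h)    # twice the average leverage (p = 1 predictor)
which(h > cutoff)                    # flags the point with the distant x-value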
Remove invalid data points. Question the validity of the data points corresponding to bad leverage points, that is: are these data points unusual or different in some way from the rest of the data? If so, consider removing these points and refitting the model without them. For example, later in this chapter we will model the price of Treasury bonds. We will discover three leverage points, and a reasonable strategy is to remove these cases from the data and refit the model without them.
Fit a different regression model. Question the validity of the regression model that has been fitted, that is: has an incorrect model been fitted to the data? If so, consider trying a different model by including extra predictor variables (e.g., a quadratic term). See Figure 3. and Table 3. In practice, there is a large gray area between leverage points which do not follow the pattern suggested by the rest of the data (i.e., bad leverage points) and those which do. However, as we shall next show, there is a complication that we need to consider, namely, that residuals do not have the same variance.
The problem of the residuals having different variances can be overcome by standardizing each residual by dividing it by an estimate of its standard deviation. When points of high leverage do not exist, there is generally little difference in the patterns seen in plots of residuals when compared with those in plots of standardized residuals. The other advantage of standardized residuals is that they immediately tell us how many estimated standard deviations any point is away from the fitted regression model.
For example, suppose that the 6th point has a standardized residual of 4. If the errors are normally distributed, then observing a point 4 or more estimated standard deviations from the fitted model is extremely unlikely. Such a point would commonly be referred to as an outlier and as such it should be investigated. We shall follow the common practice of labelling points as outliers in small- to moderate-size data sets if the standardized residual for the point falls outside the interval from -2 to 2. In very large data sets, we shall change this rule to -4 to 4.
Otherwise, many points will be flagged as potential outliers. Identification and examination of any outliers is a key part of regression analysis. In summary, an outlier is a point whose standardized residual falls outside the interval from -2 to 2. Recall that a bad leverage point is a leverage point which is also an outlier. Thus, a bad leverage point is a leverage point whose standardized residual falls outside the interval from -2 to 2.
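Standardized residuals are returned by rstandard() in R; combining them with the leverage values gives a simple way to flag bad leverage points. A simulated sketch using the rules of thumb above:

set.seed(3)
x <- c(rnorm(19), 8)
y <- 1 + 2 * x + rnorm(20)
y[20] <- y[20] - 10                      # make the high-leverage point an outlier too
fit <- lm(y ~ x)
r <- rstandard(fit)                      # standardized residuals
h <- hatvalues(fit)
outliers <- abs(r) > 2                   # outlier rule for small/moderate data sets
leverage <- h > 2 * (1 + 1) / length(h)  # rule of thumb for leverage points
which(outliers & leverage)               # bad leverage points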
On the other hand, a good leverage point is a leverage point whose standardized residual falls inside the interval from -2 to 2. There is a small amount of correlation present in standardized residuals, even if the errors are independent. (The derivation of the variance of the ith residual and fitted value is given later in the chapter.) We shall look at the effect of removing outliers and refitting the model, which produces dramatically different point estimates and confidence intervals. The example is from Siegel.
According to Siegel: US Treasury bonds are among the least risky investments, in terms of the likelihood of your receiving the promised payments.
In addition to the primary market auctions by the Treasury, there is an active secondary market in which all outstanding issues can be traded. You would expect to see an increasing relationship between the coupon of the bond, which indicates the size of its periodic payment twice a year , and the current selling price.
Half of the coupon rate is paid every six months. The data are given in Table 3. They can be found on the book web site in the file bonds. We wish to model the relationship Table 3. Regression output from R is given below. For the bonds data, cases 4, 5, 13 and 35 have leverage values n greater than 0. Cases 4, 13 and 35 correspond to the three left-most points in Figure 3. Recall that we classify points as outliers if their standardized residuals have absolute value greater than 2.
Cases 13, 34 and 35 have standardized residuals with absolute value greater than 2, while case 4 has a standardized residual equal to 1. We next decide whether any of the leverage points are outliers, that is, whether any so-called bad leverage points exist.
Cases 13 and 35, and to a lesser extent case 4, are points of high leverage that are also outliers, i.e., bad leverage points. There is a clear non-random pattern evident in this plot. The three points marked in the top left-hand corner of Figure 3. stand out.
These three points are not well fitted by the model, and should be investigated to see if there is any reason why they do not follow the overall pattern set by the rest of the data. An analyst for the auto industry has asked for your help in modeling data on the prices of new cars. Interest centers on modeling suggested retail price as a function of the cost to the dealer for new cars. The data set is available on the book website in the file cars. Provide a detailed critique of this conclusion.
For each shortcoming, describe the steps needed to overcome the shortcoming. If so, please describe all the ways in which it is an improvement. Example: Newspaper circulation. Recall from Chapter 1 that the company that publishes a weekday newspaper in a mid-size American city has asked for your assistance in an investigation into the feasibility of introducing a Sunday edition of the paper. Interest focuses on developing a regression model that enables you to predict the Sunday circulation of a newspaper from its weekday circulation. Circulation data from September 30 are available for 89 US newspapers that publish both weekday and Sunday editions.
The data are available on the book website, in the file circulation. As such, the data contain a dummy variable, which takes value 1 when the newspaper is a tabloid with a serious competitor in the same city and value 0 otherwise. On the basis of Figure 6., each of the plots is consistent with model 6. The straight-line fit to this plot provides a reasonable fit, which provides further evidence in favor of model 6. The box plots of log Sunday circulation for the two values of the tabloid dummy variable further confirm that model 6. is appropriate.
The dashed vertical line in the bottom right-hand plot of Figure 6. marks the leverage cut-off; the points with the largest leverage correspond to the cases where the dummy variable is 1. The output from R associated with fitting model 6. is given below. Because of the log transformation, the coefficients in model 6. can be interpreted as (approximate) percentage effects. The fact that both predictor variables are highly statistically significant is evident from the added variable plots.
Finally, we are now able to predict the Sunday circulation of a newspaper from a given weekday circulation. There are two cases to consider, corresponding to whether or not the newspaper is a tabloid with a serious competitor. Can you think of a way of improving model 6.?
One way to assess this is to compare the fit from model 6. with a nonparametric estimate of the regression function. We shall use a popular estimator called loess, which is based on local linear or locally quadratic regression fits. Further details on nonparametric regression in general, and loess in particular, can be found in Appendix A. If the loess fit and the fit under model 6. agree, we shall decide that model 6. is adequate. The data points can be found on the book web site in the file profsalary.
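As a sketch of this comparison (assuming the file profsalary.txt contains columns Salary and Experience; both names are assumptions), one can overlay a parametric fit and a loess fit:

ps <- read.table("profsalary.txt", header = TRUE)
quad <- lm(Salary ~ Experience + I(Experience^2), data = ps)   # candidate model
smooth <- loess(Salary ~ Experience, data = ps, degree = 2)    # loess fit
ord <- order(ps$Experience)
plot(Salary ~ Experience, data = ps)
lines(ps$Experience[ord], fitted(quad)[ord], lty = 1)    # parametric fit
lines(ps$Experience[ord], fitted(smooth)[ord], lty = 2)  # loess fit
# If the two curves are virtually indistinguishable, the parametric model
# provides an adequate description of the data.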
Where the two fits differ markedly, this indicates that the model is not a valid summary of the data; where the two fits are virtually indistinguishable, this implies that the model provides an adequate fit. (Figure 6. plots Salary against Years of Experience.) In what follows we shall describe the approach of marginal model plots proposed and developed by Cook and Weisberg. Consider the situation when there are just two predictors x1 and x2.
However, it is easy and informative to demonstrate the result in this special case. Utilizing 6., if the two nonparametric estimates agree then we conclude that x1 is modeled correctly by model M1. If not, then we conclude that x1 is not modeled correctly by model M1. Example: Modeling defective rates (cont.).
Recall from earlier in Chapter 6 that interest centers on developing a model for Y, Defective, based on the predictors x1, Temperature; x2, Density; and x3, Rate. The data can be found on the book web site in the file defects. The two curves in the right-hand plot of Figure 6. do not agree well. Thus, we decide that x1 is not modeled correctly by model 6. In general, it is difficult to compare curves in different plots.
Thus, following Cook and Weisberg, we shall from this point on include both nonparametric curves on the plot of Y against x1. It is once again clear that these two curves do not agree well; in particular, the nonparametric estimates in Figure 6. do not match the fits implied by the model. Thus, we again conclude that model 6. is not adequate. We found earlier that in this case both the inverse response plot and the Box-Cox transformation method point to using a square root transformation of Y. Thus, we next consider a multiple linear regression model for the square root of Y, that is, √Y = β0 + β1x1 + β2x2 + β3x3 + e.
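In current R, marginal model plots of this kind can be produced with mmps() from the car package. The use of car::mmps, the file name defects.txt and the column names below are assumptions about tooling and data, not the book's own code:

library(car)
defects <- read.table("defects.txt", header = TRUE)
m_sqrt <- lm(sqrt(Defective) ~ Temperature + Density + Rate, data = defects)
# Marginal model plots: for each predictor (and for the fitted values), compare a
# nonparametric smooth of the data (solid) with a smooth of the model's fitted
# values (dashed); good agreement supports the model.
mmps(m_sqrt)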
These plots again point to the conclusion that the transformed model 6. is adequate. We shall use the following example to illustrate these issues. Example: Bridge construction. The following example is adapted from Tryfos. According to Tryfos: Before construction begins, a bridge project goes through a number of stages of production, one of which is the design stage. This phase is composed of various activities, each of which contributes directly to the overall design time.
In short, predicting the design time is helpful for budgeting and internal as well as external scheduling purposes.
Information from 45 bridge projects was compiled for use in this study. The data are partially listed in Table 6. The response variable and a number of the predictor variables are highly skewed. There is also evidence of nonconstant variance in the top row of plots.
Thus, we need to consider transformations of the response and the five predictor variables. The multivariate version of the Box-Cox transformation method can be used to transform all variables simultaneously. Given below is the output from R using the bctrans command from the alr3 package. After transformation, the pairwise relationships in Figure 6. are much better behaved: there is no longer any evidence of nonconstant variance in the top row of plots, and the straight-line fit to this plot provides a reasonable fit.
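In more recent versions of R, the same multivariate Box-Cox method is available as powerTransform() in the car package; this, together with the file name bridge.txt and the column names, is an assumption about current tooling rather than the book's own code:

library(car)
bridge <- read.table("bridge.txt", header = TRUE)
# Estimate power transformations for the predictors jointly (multivariate Box-Cox)
pt <- powerTransform(cbind(DArea, CCost, Dwgs, Length, Spans) ~ 1, data = bridge)
summary(pt)   # estimated powers with standard errors and likelihood-ratio tests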
Thus, there is a bad leverage point, i.e., a leverage point that is also an outlier. The nonparametric estimates of each pairwise relationship are marked as solid curves, while the smooths of the fitted values are marked as dashed curves. There is some curvature present in the top three plots which is not present in the smooths of the fitted values. However, at this stage we shall continue under the assumption that model 6. is adequate. Another consequence of highly correlated predictor variables is that some of the coefficients in the regression model are of the opposite sign than expected.
The output from R below gives the correlations between the predictors in model 6. Notice how large most of the correlations are. Thus, correlation amongst the predictors increases the variance of the estimated regression coefficients.
The variance inflation factors for the bridge construction example are given below, and several of them, including those for log(DArea) and log(CCost), are large. We shall return to this example in Chapter 7.
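Variance inflation factors can be computed with vif() from the car package. The transformed bridge model below is a sketch with assumed file and variable names, not the book's code:

library(car)
bridge <- read.table("bridge.txt", header = TRUE)
m <- lm(log(Time) ~ log(DArea) + log(CCost) + log(Dwgs) + log(Length) + log(Spans),
        data = bridge)
vif(m)   # large values signal that collinearity is inflating coefficient variances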
In particular, we are interested in the effects of an American wine critic, Robert Parker, and an English wine critic, Clive Coates, on the London auction prices of Bordeaux wines from the vintage. The plots are in the form of scatter plots for real-valued predictors and box plots for predictors in the form of dummy variables.
Each of the scatter plots in Figure 6. shows no discernible pattern in the standardized residuals. In addition, the box plots show that the variability of the standardized residuals is relatively constant across both values of each dummy predictor variable. Case 67, Le Pin, is a bad leverage point. Notice that the nonparametric estimates of each pairwise relationship are marked as solid curves, while the smooths of the fitted values are marked as dashed curves.
The two curves in each plot match very well, thus providing further evidence that model 6. is a valid model. Given below is the output from R associated with fitting model 6.
Notice that the overall F-test for model 6. is highly statistically significant. Case 53, Pavie, appears to be highly influential in the added variable plot for log(CoatesPoints), and, as such, it should be investigated. Other outliers are evident from the added variable plots in Figure 6. We shall continue under the assumption that model 6. is valid. Since model 6. is obtained from model 6. by removing a single predictor, notice how similar the estimated regression coefficients are in the two models. Note that there is no real need to redo the diagnostic plots for model 6.
Alternatively, we could consider a partial F-test to compare models 6. and 6.; here the partial F-test gives the same conclusion as the t-test, due to the fact that only one predictor has been removed from 6. Part (b): Based on model 6., the predictor with the largest estimated percentage effect can be read directly from the coefficients. This effect is also the most statistically significant, since the corresponding t-value is the largest in magnitude (or, alternatively, the corresponding p-value is the smallest). In particular, Clive Coates' ratings have a statistically significant impact on price, even after adjusting for the influence of Robert Parker.
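A sketch of such a partial F-test in R, using anova() on nested models; the file name, column names, and the choice of which predictor to drop are all assumptions:

wine <- read.csv("Bordeaux.csv")
full    <- lm(log(Price) ~ log(ParkerPoints) + log(CoatesPoints) + P95andAbove +
                FirstGrowth + CultWine + Pomerol + VintageSuperstar, data = wine)
reduced <- update(full, . ~ . - P95andAbove)   # drop one predictor
anova(reduced, full)   # partial F-test; with one dropped predictor, F equals t^2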
Part (e): Based on the regression model in (a), there is no evidence of a statistically significant extra price premium paid for Bordeaux wines from the vintage that score 95 and above from Robert Parker, since the coefficient of 95andAbove in the regression model is not statistically significant.
These are given in Table 6. The only such wine is given in Table 6. An observational study is one in which outcomes are observed and no attempt is made to control or influence the variables of interest. As such there may be systematic differences that are not included in the regression model, which, we shall discover, raises the issue of omitted variables. According to Stigler (p. S89): … Pearson studied measurements of a large collection of skulls from the Paris Catacombs, with the goal of understanding the interrelationships among the measurements. For each skull, his assistant measured the length and the breadth, and computed … the correlation coefficient between these measures … The correlation … turned out to be significantly greater than zero … But … the discovery was deflated by his noticing that if the skulls were divided into male and female, the correlation disappeared.
Pearson recognized the general nature of this phenomenon and brought it to the attention of the world. When two measurements are correlated, this may be because they are both related to a third factor that has been omitted from the analysis. A classic illustration, discussed by Neyman and by Kronmal, involves a fictitious data set relating the number of storks, the number of women and the number of babies; this fictitious data set was reported by Kronmal. Fitting a straight-line regression model of the number of babies on the number of storks to these data produces the output shown below. It is apparent that there is a strong positive linear association between each of the three variables.
Notice that the estimated regression coefficient for the number of storks is zero to many decimal places. Thus, the correlation between the number of babies and the number of storks calculated from model 6. is entirely accounted for by a third variable. In other words, a predictor (the number of women) exists which is related to both the other predictor (the number of storks) and the outcome variable (the number of babies), and which accounts for all of the observed association between the latter two variables.
The number of women predictor variable is commonly called either an omitted variable or a confounding covariate.
We shall denote the omitted predictor variable by v and the predictor variable included in the one-predictor regression model by x. In the fictitious stork data, x corresponds to the number of storks and v corresponds to the number of women. We next consider two distinct cases, depending on whether or not the omitted variable v is related to the included predictor x. When they are related, the omitted variable has an effect on the regression model which includes just x as a predictor. For example, Y and x can be strongly linearly associated (i.e., correlated) even though the association is driven entirely by v.
This is exactly the situation in the fictitious stork data. We next look at two real examples, which exemplify the issues. The first example is based on a series of papers by Cochrane et al. The puzzle was the marked positive correlation between the prevalence of doctors and infant mortality: "Whatever way we looked at our data we could not make that association disappear."
Moreover, we could identify no plausible mechanism that would give rise to this association (Kronmal). Show that Var(Y) has the stated form. A chapter of the award-winning book on baseball by Keri makes extensive use of multiple regression. Ticket sales data for each team for each year in the period considered are used to develop the model. Describe in detail two major concerns that potentially threaten the validity of the model. The analyst was so impressed with your answers to Exercise 5 in Section 3.
Give reasons to support your answer. Describe what, if anything, can be learned about model 6. The multivariate version of the Box-Cox method was used to transform the predictors, while a log transformation was used for the response variable to improve interpretability. Perform a partial F-test to see if this is a sensible strategy. Describe how model 6. can be improved. The output from R for model 6. appears below.
Notice that both predictor variables are judged to be statistically significant in the two-variable model, while just one variable is judged to be statistically significant in the three-variable model.
Later in this chapter we shall see that the p-values obtained after variable selection are much smaller than their true values. In view of this, it seems that the three-variable model over-fits the data and as such the two-variable model is to be preferred. Arguably, the two most popular variations on this approach are backward elimination and forward selection.
Backward elimination starts with all potential predictor variables in the regression model. Then, at each step, it deletes the predictor variable such that the resulting model has the lowest value of an information criterion.
This amounts to deleting the predictor with the largest p-value each time. This process is continued until all variables have been deleted from the model or the information criterion increases.
Forward selection starts with no potential predictor variables in the regression equation. Then, at each step, it adds the predictor such that the resulting model has the lowest value of an information criterion.
This amounts to adding the predictor with the smallest p-value each time. This process is continued until all variables have been added to the model or the information criterion increases. Thus, backward elimination and forward selection do not necessarily find the model that minimizes the information criterion across all 2^m possible predictor subsets.
In addition, there is no guarantee that backward elimination and forward selection will produce the same final model. However, in practice they produce the same model in many different situations. Example: Bridge construction cont. Given below is the output from R associated with backward elimination based on AIC. It can be shown that backward elimination based on BIC chooses the model with the two predictors log Dwgs and log Spans.
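Both procedures are available in R through step(). A sketch for the bridge model, with AIC (k = 2) and BIC (k = log(n)) penalties; the file and column names are assumptions:

bridge <- read.table("bridge.txt", header = TRUE)
full <- lm(log(Time) ~ log(DArea) + log(CCost) + log(Dwgs) + log(Length) + log(Spans),
           data = bridge)
n <- nrow(bridge)

# Backward elimination based on AIC (k = 2) and on BIC (k = log(n))
back_aic <- step(full, direction = "backward", k = 2)
back_bic <- step(full, direction = "backward", k = log(n))

# Forward selection starting from the intercept-only model
null <- lm(log(Time) ~ 1, data = bridge)
fwd_aic <- step(null, scope = formula(full), direction = "forward", k = 2)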
We are again faced with a choice between the two-predictor and three-predictor models discussed earlier. The regression coefficients obtained after variable selection are biased. In addition, the p-values obtained after variable selection from F- and t-statistics are generally much smaller than their true values. These issues are well summarized in the following quote from Leeb and Pötscher (page 22): The aim of this paper is to point to some intricate aspects of data-driven model selection that do not seem to have been widely appreciated in the literature or that seem to be viewed too optimistically.
In particular, we demonstrate innate difficulties of data-driven model selection. Despite occasional claims to the contrary, no model selection procedure—implemented on a machine or not—is immune to these difficulties. The main points we want to make and that will be elaborated upon subsequently can be summarized as follows:
Regardless of sample size, the model selection step typically has a dramatic effect on the sampling properties of the estimators that can not be ignored. In particular, the sampling properties of post-model-selection estimators are typically significantly different from the nominal distributions that arise if a fixed model is supposed.
As a consequence, naive use of inference procedures that do not take into account the model selection step e. In practice, this is often achieved by randomly splitting the data into: 1. A training data set 2. A test data set The training data set is used to develop a number of regression models, while the test data set is used to evaluate the performance of these models. We illustrate these steps using the following example. According to Hastie, Tibshirani and Friedman: The goal is to predict the log-cancer volume lacavol from a number of measurements including log prostate weight lweight , age, log of benign prostatic hyperplasia lpbh , seminal vesicle invasion svi , log of capsular penetration lcp , Gleason score gleason , and percent of Gleason scores 4 or 5 pgg Hastie, Tibshirani and Friedman , p.
We first consider the training set. Looking at Figure 7., there is no evidence of nonlinearity amongst the eight predictor variables. Each of the plots in Figure 7. is consistent with model 7. being a valid model; apart from a hint of decreasing error variance, these plots further confirm that model 7. is reasonable. The dashed vertical line in the bottom right-hand plot of Figure 7. marks the leverage cut-off; thus, there are no bad leverage points. The two curves in each plot match quite well, providing further evidence that model 7. is valid. Below is the output from R associated with fitting model 7. Finally, we show in Figure 7. the added-variable plots. Case 45 appears to be highly influential in the added-variable plot for lweight, and, as such, it should be investigated.
We shall return to this issue later. For now we shall continue under the assumption that model 7. is valid. The variance inflation factors for the training data set, for the predictors lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45, are given below. We next consider variable selection in this example by identifying the subset of the predictors of a given size that maximizes adjusted R-squared (i.e., the best subset of each size).
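One way to carry out this search in R is with regsubsets() from the leaps package; this is an assumption about tooling rather than the book's own code, and the file and column names are also assumed:

library(leaps)
prostate <- read.table("prostate.txt", header = TRUE)
all_sub <- regsubsets(lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
                        gleason + pgg45, data = prostate, nvmax = 8)
s <- summary(all_sub)
s$adjr2                          # adjusted R-squared for the best model of each size
s$which[which.max(s$adjr2), ]    # predictors in the model maximizing adjusted R-squared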
Table 7. summarizes these best subsets. Given below is the output from R associated with fitting the best models with two, four and seven predictor variables to the training data. However, the p-values obtained after variable selection are much smaller than their true values.
In view of this, it seems that the four- and seven-variable models over-fit the data and as such the two-variable model seems to be preferred. Given below is the output from R associated with fitting the best models with two, four and seven predictor variables to the 30 cases in the test data. Thus, based on the test data, none of these models is very convincing. We discuss each of these issues in turn. Notice how the optimal two-, three- and five-variable models change with the omission of just case 45. Thus, case 45 has a dramatic effect on variable selection.
It goes without saying that case 45 in the training set should be thoroughly investigated. For details on the algorithm, see Montgomery, Peck and Vining. Figure 7. shows a scatter plot of lpsa against lweight, with different symbols used for the training and test data sets; the least squares regression line for each data set is also marked on the figure. While case 45 in the training data set does not stand out in Figure 7., case 9 in the test data does. In summary, case 45 in the training data and case 9 in the test data need to be thoroughly investigated before any further statistical analyses are performed.
This example once again illustrates the importance of carefully examining any regression fit in order to determine outliers and influential points. If cases 9 and 45 are found to be valid data points and not associated with special cases, then a possible way forward is to use variable selection techniques based on robust regression; see Maronna, Martin and Yohai, Chapter 5, for further details.
Using a Lagrange multiplier argument, it can be shown that the constrained form of the lasso in 7. is equivalent to a penalized least squares criterion. When the value of s in 7. is sufficiently large, the lasso estimates coincide with the least squares estimates. Alternatively, for small values of s (or, equivalently, large values of the penalty parameter λ) some of the resulting estimated regression coefficients are exactly zero, effectively omitting predictor variables from the fitted model.
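In R the entire lasso path can be computed efficiently, for example with the lars package; this is an assumption about tooling, and the file and column names below are also assumptions:

library(lars)
prostate <- read.table("prostate.txt", header = TRUE)
x <- as.matrix(prostate[, c("lcavol", "lweight", "age", "lbph",
                            "svi", "lcp", "gleason", "pgg45")])
y <- prostate$lpsa
fit <- lars(x, y, type = "lasso")        # whole lasso path
plot(fit)                                # coefficients shrink to exactly zero as s decreases
coef(fit, s = 0.5, mode = "fraction")    # coefficients at a chosen value of s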
LARS (least angle regression; Efron et al.) provides an efficient algorithm for computing the entire lasso path. In fact, Zhou, Hastie and Tibshirani show that it is possible to find the optimal lasso fit with computational effort equivalent to obtaining a single least squares fit. Finally, Figure 7. gives a flow chart for arriving at a final model, with decision points such as "Do the errors have constant variance?" and "Are the outliers and leverage points valid?", whose branches lead to using the bootstrap for inference, considering modifications to the model, or using a partial F-test to obtain the final model. The generated data set in this question is taken from Mantel. The data are given in Table 7.
A Modern Approach to Regression with R. Author: Simon Sheather. The book compares a number of new real data sets that enable students to learn how regression can be used in real life, provides the R code used in each example in the text along with the SAS code and Stata code needed to produce the equivalent output, and gives complete details for each example.
Contents: Front Matter; Simple Linear Regression; Diagnostics and Transformations for Simple Linear Regression; Weighted Least Squares; Multiple Linear Regression; Diagnostics and Transformations for Multiple Linear Regression; Variable Selection; Logistic Regression; Serially Correlated Errors; Mixed Models; Back Matter.
About this book: A Modern Approach to Regression with R focuses on tools and techniques for building regression models using real-world data and assessing their validity.
Keywords: SAS, correlated errors, linear regression, mixed models, regression analysis, regression diagnostics, regression modeling strategies. Author and affiliation: Simon Sheather.