*21*

A **lack of fit test** is used to determine whether or not a full regression model offers a significantly better fit to a dataset than some reduced version of the model.

For example, suppose we would like to use *number of hours studied* to predict *exam score* for students at a certain college. We may decide to fit the following two regression models:

**Full Model:** Score = Î²_{0} + B_{1}(hours) + B_{2}(hours)^{2}

**Reduced Model:** Score = Î²_{0} + B_{1}(hours)

The following step-by-step example shows how to perform a lack of fit test in R to determine if the full model offers a significantly better fit than the reduced model.

**Step 1: Create & Visualize a Dataset**

First, weâ€™ll use the following code to create a dataset that contains the number of hours studied and exam score received for 50 students:

#make this example reproducible set.seed(1) #create dataset df frame(hours = runif(50, 5, 15), score=50) df$score = df$score + df$hours^3/150 + df$hours*runif(50, 1, 2) #view first six rows of data head(df) hours score 1 7.655087 64.30191 2 8.721239 70.65430 3 10.728534 73.66114 4 14.082078 86.14630 5 7.016819 59.81595 6 13.983897 83.60510

Next, weâ€™ll create a scatterplot to visualize the relationship between hours and score:

#load ggplot2 visualization package library(ggplot2) #create scatterplot ggplot(df, aes(x=hours, y=score)) + geom_point()

**Step 2: Fit Two Different Models to the Dataset**

Next, weâ€™ll fit two different regression models to the dataset:

#fit full model full poly(hours,2), data=df) #fit reduced model reduced

**Step 3: Perform a Lack of Fit Test**

Next, weâ€™ll use the **anova()** command to perform a lack of fit test between the two models:

#lack of fit test anova(full, reduced) Analysis of Variance Table Model 1: score ~ poly(hours, 2) Model 2: score ~ hours Res.Df RSS Df Sum of Sq F Pr(>F) 1 47 368.48 2 48 451.22 -1 -82.744 10.554 0.002144 ** --- Signif. codes: 0 â€˜***â€™ 0.001 â€˜**â€™ 0.01 â€˜*â€™ 0.05 â€˜.â€™ 0.1 â€˜ â€™ 1

The F test-statistic turns out to be **10.554** and the corresponding p-value is **0.002144**. Since this p-value is less than .05, we can reject the null hypothesis of the test and conclude that the full model offers a statistically significantly better fit than the reduced model.

**Step 4: Visualize the Final Model**

Lastly, we can visualize the final model (the full model) relative to the original dataset:

**ggplot(df, aes(x=hours, y=score)) +
geom_point() +
stat_smooth(method='lm', formula = y ~ poly(x,2), size = 1) +
xlab('Hours Studied') +
ylab('Score')**

We can see that the curve of the model fits the data quite well.

**Additional Resources**

How to Perform Simple Linear Regression in R

How to Perform Multiple Linear Regression in R

How to Perform Polynomial Regression in R