
Regression analysis is a technique we can use to understand the relationship between one or more predictor variables and a response variable.

One way to assess how well a regression model fits a dataset is to calculate the **root mean square error**, a metric that summarizes the typical distance between the predicted values from the model and the actual values in the dataset.

The lower the RMSE, the better a given model is able to “fit” a dataset.

The formula to find the root mean square error, often abbreviated **RMSE**, is as follows:

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (P_i - O_i)^2}{n}}$$

where:

- Σ is a symbol that means “sum”
- $P_i$ is the predicted value for the $i^{\text{th}}$ observation in the dataset
- $O_i$ is the observed value for the $i^{\text{th}}$ observation in the dataset
- n is the sample size
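The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `rmse` and the three example values are assumptions made up for illustration and do not come from the dataset discussed in this article.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # (P_i - O_i)^2 for every observation
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    # Mean of the squared errors, then the square root
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, chosen only to illustrate the arithmetic
example_predicted = [3.0, 5.0, 7.0]
example_observed = [2.0, 5.0, 9.0]
example_rmse = rmse(example_predicted, example_observed)
print(f"RMSE: {example_rmse:.4f}")
```

Here the squared errors are 1, 0, and 4, so the function returns the square root of 5/3.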

The following example shows how to interpret RMSE for a given regression model.

**Example: How to Interpret RMSE for a Regression Model**

Suppose we would like to build a regression model that uses “hours studied” to predict the “exam score” of students on a particular college entrance exam.

We collect the following data for 15 students:

We then use statistical software (such as Excel, SPSS, R, or Python) to find the following fitted regression model:

Exam Score = 75.95 + 3.08*(Hours Studied)

We can then use this equation to predict the exam score of each student, based on how many hours they studied:
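As a small sketch of this prediction step, the fitted equation can be wrapped in a function. The function name `predict_exam_score` and the hours below are hypothetical; they are not the 15 students from the example.

```python
def predict_exam_score(hours_studied):
    """Predicted exam score from the fitted equation in the text."""
    return 75.95 + 3.08 * hours_studied

# Hypothetical hours, not the 15 students from the example
hypothetical_hours = [1, 2, 5]
predicted_scores = [predict_exam_score(h) for h in hypothetical_hours]

for hours, score in zip(hypothetical_hours, predicted_scores):
    print(f"{hours} hours studied -> predicted score {score:.2f}")
```

For instance, a student who studied 5 hours would have a predicted score of 75.95 + 3.08*(5) = 91.35.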

We can then calculate the squared difference between each predicted exam score and the actual exam score. Then we can take the square root of the mean of these squared differences:

The RMSE for this regression model turns out to be **5.681**.
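The calculation can be traced step by step in code. The sketch below applies the fitted equation to five hypothetical (hours studied, actual score) pairs that were invented for illustration, so its result is not expected to match the 5.681 reported for the actual dataset.

```python
import math

# Hypothetical (hours studied, actual exam score) pairs, invented for illustration
students = [(1, 78), (2, 84), (3, 83), (4, 90), (5, 92)]

squared_differences = []
for hours, actual in students:
    predicted = 75.95 + 3.08 * hours  # fitted equation from the text
    squared_differences.append((predicted - actual) ** 2)

# Mean of the squared differences, then the square root
mean_squared_difference = sum(squared_differences) / len(squared_differences)
rmse_value = math.sqrt(mean_squared_difference)

print(f"RMSE on the hypothetical data: {rmse_value:.3f}")
```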

Recall that the residuals of a regression model are the differences between the observed data values and the predicted values from the model.

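**Residual** = $P_i - O_i$

where:

- $P_i$ is the predicted value for the $i^{\text{th}}$ observation in the dataset
- $O_i$ is the observed value for the $i^{\text{th}}$ observation in the dataset

Some texts define the residual with the opposite sign, $O_i - P_i$. The choice does not affect the RMSE because each residual is squared.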

And recall that the RMSE of a regression model is calculated as:

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (P_i - O_i)^2}{n}}$$

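This means that **the RMSE is the square root of the mean of the squared residuals.** When the residuals average to zero, as they do for a least squares model with an intercept evaluated on the data it was fitted to, this equals the square root of the variance of the residuals (computed with n in the denominator).

This is a useful value to know because it gives us an idea of the typical distance between the observed data values and the predicted data values, expressed in the same units as the response variable.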
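One way to see how the RMSE relates to the spread of the residuals is to check it numerically. The sketch below fits a simple least squares line to hypothetical data (the `hours` and `scores` values are made up) and compares the RMSE with the population standard deviation of the residuals. It assumes a model with an intercept evaluated on its own training data, which is the case where the two quantities should agree.

```python
import math
import statistics

# Hypothetical data, invented for illustration only
hours = [1, 2, 3, 4, 5, 6]
scores = [77, 82, 83, 90, 91, 95]

# Ordinary least squares fit of score = intercept + slope * hours
mean_x = statistics.fmean(hours)
mean_y = statistics.fmean(scores)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
         / sum((x - mean_x) ** 2 for x in hours))
intercept = mean_y - slope * mean_x

# Residuals as defined in the text: predicted minus observed
residuals = [(intercept + slope * x) - y for x, y in zip(hours, scores)]

rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
population_sd = statistics.pstdev(residuals)  # divides by n, not n - 1

print(f"RMSE:                      {rmse:.6f}")
print(f"Std. dev. of residuals:    {population_sd:.6f}")
```

The two printed values agree up to floating point rounding because the residuals of this fit sum to zero. For a model without an intercept, or for predictions on new data, the mean residual is generally not zero and the two quantities can differ.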

This is in contrast to the R-squared of the model, which tells us the proportion of the variance in the response variable that can be explained by the predictor variable(s) in the model.
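The difference between the two metrics shows up when the response variable is rescaled. The sketch below uses hypothetical data and a hypothetical helper, `fit_and_score`, which fits a simple least squares line and returns both metrics. Multiplying every score by 10 should multiply the RMSE by 10 while leaving R-squared unchanged.

```python
import math

def fit_and_score(x, y):
    """Fit y = a + b*x by least squares; return (rmse, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals and total sum of squares
    ss_residual = sum((intercept + slope * xi - yi) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return math.sqrt(ss_residual / n), 1 - ss_residual / ss_total

# Hypothetical data, invented for illustration only
hours = [1, 2, 3, 4, 5, 6]
scores = [77, 82, 83, 90, 91, 95]
scores_times_ten = [10 * s for s in scores]

rmse_original, r2_original = fit_and_score(hours, scores)
rmse_rescaled, r2_rescaled = fit_and_score(hours, scores_times_ten)

print(f"original: RMSE = {rmse_original:.3f}, R-squared = {r2_original:.3f}")
print(f"rescaled: RMSE = {rmse_rescaled:.3f}, R-squared = {r2_rescaled:.3f}")
```

RMSE is an absolute measure in the units of the response variable, while R-squared is a unitless proportion, which is why only the former changes here.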

**Comparing RMSE Values from Different Models**

The RMSE is particularly useful for comparing the fit of different regression models, provided each model is evaluated on the same dataset and the same response variable.

For example, suppose we want to build a regression model to predict the exam score of students and we want to find the best possible model among several potential models.

Suppose we fit three different regression models and find their corresponding RMSE values:

- RMSE of Model 1: **14.5**
- RMSE of Model 2: **16.7**
- RMSE of Model 3: **9.8**

Model 3 has the lowest RMSE, which tells us that it’s able to fit the dataset the best out of the three potential models.
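This kind of comparison can be sketched in a few lines. The observed scores and the three sets of predictions below are hypothetical, so the resulting RMSE values differ from the 14.5, 16.7, and 9.8 listed above. The point is only that every model is scored against the same observations and the one with the lowest RMSE is selected.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    )

# Hypothetical observed exam scores and predictions from three candidate models
observed_scores = [80, 85, 90, 95]
model_predictions = {
    "Model 1": [84, 80, 96, 90],
    "Model 2": [72, 93, 85, 104],
    "Model 3": [81, 83, 92, 94],
}

# Score every model against the same observed values
rmse_by_model = {
    name: rmse(preds, observed_scores)
    for name, preds in model_predictions.items()
}
best_model = min(rmse_by_model, key=rmse_by_model.get)

for name, value in rmse_by_model.items():
    print(f"{name}: RMSE = {value:.2f}")
print(f"Lowest RMSE: {best_model}")
```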

**Additional Resources**

RMSE Calculator

How to Calculate RMSE in Excel

How to Calculate RMSE in R

How to Calculate RMSE in Python