
When the relationship between a set of predictor variables and a response variable is highly complex, we often use non-linear methods to model the relationship between them.

One such method is building a decision tree. However, the downside of using a single decision tree is that it tends to suffer from high variance. That is, if we split the dataset into two halves and apply the decision tree to both halves, the results could be quite different.

One method that we can use to reduce the variance of a single decision tree is to build a random forest model, which works as follows:

**1.** Take *b* bootstrapped samples from the original dataset.

**2.** Build a decision tree for each bootstrapped sample.

- When building the tree, each time a split is considered, only a random sample of *m* predictors is considered as split candidates from the full set of *p* predictors. Typically we choose *m* to be equal to √*p*.

**3.** Average the predictions of each tree to come up with a final model.
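The sampling in the steps above can be sketched in base R. This is only an illustration of the bootstrap draw, the random choice of split candidates, and the final averaging, not the internals of any package; the three per-tree predictions are hypothetical values chosen for the example.

```r
#sketch of the random draws behind a random forest (base R only)
set.seed(1)

aq <- datasets::airquality
predictors <- setdiff(names(aq), "Ozone")
p <- length(predictors)   #5 predictors
m <- floor(sqrt(p))       #2 split candidates per split

#step 1: one bootstrapped sample = rows drawn with replacement
boot_rows <- sample(nrow(aq), replace=TRUE)
boot_sample <- aq[boot_rows, ]

#step 2 (for a single split): draw m candidate predictors at random
split_candidates <- sample(predictors, m)

#step 3: the forest prediction is the mean of the per-tree predictions
tree_predictions <- c(30, 42, 36)   #hypothetical predictions from three trees
forest_prediction <- mean(tree_predictions)
forest_prediction   #36
```

Because rows are drawn with replacement, some rows appear more than once in each bootstrapped sample and others not at all; the rows left out are the "out-of-bag" observations for that tree.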

It turns out that random forests tend to produce much more accurate models compared to single decision trees and even bagged models.

This tutorial provides a step-by-step example of how to build a random forest model for a dataset in R.

**Step 1: Load the Necessary Packages**

First, we’ll load the necessary packages for this example. For this bare-bones example, we only need one package:

    library(randomForest)

**Step 2: Fit the Random Forest Model**

For this example, we’ll use a built-in R dataset called **airquality**, which contains air quality measurements in New York on 153 individual days.

    #view structure of airquality dataset
    str(airquality)

    'data.frame':	153 obs. of  6 variables:
     $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
     $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
     $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
     $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
     $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
     $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

    #find number of rows with missing values
    sum(!complete.cases(airquality))

    [1] 42

This dataset has 42 rows with missing values, so before we fit a random forest model we’ll fill in the missing values in each column with the column medians:

    #replace NAs with column medians
    for(i in 1:ncol(airquality)) {
      airquality[ , i][is.na(airquality[ , i])] <- median(airquality[ , i], na.rm=TRUE)
    }
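A quick way to confirm that the imputation worked is to count the incomplete rows again. The sketch below repeats the median fill on a copy of the dataset (the name `aq` is only used here, so the block runs on its own) and checks that no missing values remain.

```r
#work on a copy so the block is self-contained
aq <- datasets::airquality

#median of the observed Ozone values, used to fill its NAs
median(aq$Ozone, na.rm=TRUE)   #31.5

#replace NAs with column medians
for(i in 1:ncol(aq)) {
  aq[ , i][is.na(aq[ , i])] <- median(aq[ , i], na.rm=TRUE)
}

#number of rows that still contain a missing value
sum(!complete.cases(aq))   #0
```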

**Related:** How to Impute Missing Values in R

The following code shows how to fit a random forest model in R using the **randomForest()** function from the randomForest package.

    #make this example reproducible
    set.seed(1)

    #fit the random forest model
    model <- randomForest(formula = Ozone ~ ., data = airquality)

    #display fitted model
    model

    Call:
     randomForest(formula = Ozone ~ ., data = airquality)
                   Type of random forest: regression
                         Number of trees: 500
    No. of variables tried at each split: 1

              Mean of squared residuals: 327.0914
                        % Var explained: 61

    #find number of trees that produce lowest out-of-bag MSE
    which.min(model$mse)

    [1] 82

    #find RMSE of best model
    sqrt(model$mse[which.min(model$mse)])

    [1] 17.64392

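From the output we can see that the lowest mean squared error (MSE) was reached with **82** trees. The MSE values stored in the fitted model are estimated from out-of-bag (OOB) observations rather than from a separate test set.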

We can also see that the root mean squared error (RMSE) of that model was **17.64392**. We can think of this as the typical difference between the predicted value for Ozone and the actual observed value.
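To make the RMSE calculation concrete, here is a small sketch using hypothetical observed and predicted values (they are not taken from the fitted model). It also computes the mean absolute error for comparison, since RMSE weights large errors more heavily and is never smaller than the plain average of the absolute differences.

```r
#hypothetical observed and predicted Ozone values (not from the fitted model)
actual <- c(41, 36, 12, 18)
predicted <- c(38, 30, 20, 25)

#RMSE: square the errors, average them, take the square root
rmse <- sqrt(mean((actual - predicted)^2))

#mean absolute error, for comparison
mae <- mean(abs(actual - predicted))

rmse   #about 6.28
mae    #6
```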

We can also use the following code to produce a plot of the out-of-bag (OOB) MSE based on the number of trees used:

    #plot the OOB MSE by number of trees
    plot(model)

And we can use the **varImpPlot()** function to create a plot that displays the importance of each predictor variable in the final model:

    #produce variable importance plot
    varImpPlot(model)

The x-axis displays the average increase in node purity of the regression trees based on splitting on the various predictors displayed on the y-axis.

From the plot we can see that *Wind* is the most important predictor variable, followed closely by *Temp*.
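If the numbers behind the plot are needed, the randomForest package also provides an **importance()** function. The sketch below refits the model on a median-imputed copy of the data (the names `aq`, `fit` and `imp` are only used here) and sorts the scores from largest to smallest. It assumes the default regression fit, for which the importance matrix has a single `IncNodePurity` column.

```r
library(randomForest)

#median-imputed copy of the data, so the block runs on its own
aq <- datasets::airquality
for(i in 1:ncol(aq)) {
  aq[ , i][is.na(aq[ , i])] <- median(aq[ , i], na.rm=TRUE)
}

set.seed(1)
fit <- randomForest(Ozone ~ ., data=aq)

#numeric importance scores behind the plot, largest first
imp <- importance(fit)
imp[order(imp[ , "IncNodePurity"], decreasing=TRUE), , drop=FALSE]
```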

**Step 3: Tune the Model**

By default, the **randomForest()** function uses 500 trees and, for regression, (total predictors/3) randomly selected predictors, rounded down, as potential candidates at each split. We can search for a better number of predictors per split by using the **tuneRF()** function.

The following code shows how to search for the optimal number of predictors per split by using the following arguments:

- **ntreeTry:** The number of trees to build at each tuning step.
- **mtryStart:** The starting number of predictor variables to consider at each split.
- **stepFactor:** The factor by which the number of predictors is increased or decreased at each step of the search.
- **improve:** The relative amount by which the out-of-bag error needs to improve for the search to continue.

    model_tuned <- tuneRF(
                   x=airquality[,-1], #define predictor variables
                   y=airquality$Ozone, #define response variable
                   ntreeTry=500,
                   mtryStart=4,
                   stepFactor=1.5,
                   improve=0.01,
                   trace=FALSE #don't show real-time progress
                   )

This function produces the following plot, which displays the number of predictors used at each split when building the trees on the x-axis and the out-of-bag estimated error on the y-axis:

We can see that the lowest OOB error is achieved by using **2** randomly chosen predictors at each split when building the trees.
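The same information can be read from the object that **tuneRF()** returns instead of from the plot. The sketch below assumes the return value is a matrix with columns named `mtry` and `OOBError`, which is my understanding of the package; the names `aq`, `tuned` and `best_mtry` are only used here. The exact values tried and their errors depend on the random seed.

```r
library(randomForest)

#median-imputed copy of the data, so the block runs on its own
aq <- datasets::airquality
for(i in 1:ncol(aq)) {
  aq[ , i][is.na(aq[ , i])] <- median(aq[ , i], na.rm=TRUE)
}

set.seed(1)
tuned <- tuneRF(
  x=aq[ , -1],      #predictors (every column except Ozone)
  y=aq$Ozone,       #response
  ntreeTry=500,
  mtryStart=4,
  stepFactor=1.5,
  improve=0.01,
  trace=FALSE,
  plot=FALSE        #skip the plot; only the table is needed here
)

#one row per mtry value tried, with its OOB error
tuned

#mtry value with the lowest OOB error
best_mtry <- tuned[which.min(tuned[ , "OOBError"]), "mtry"]
best_mtry
```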

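This differs from the default used by the initial **randomForest()** function. The dataset has 6 columns but only 5 predictors once Ozone is set aside as the response, so the default is 5/3 rounded down, which is 1. This matches the "No. of variables tried at each split: 1" line in the earlier model output.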
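To act on the tuning result, the model can be refit with the **mtry** argument set explicitly. The following is a minimal sketch on a median-imputed copy of the data (the names `aq`, `fit_default` and `fit_mtry2` are only used here); it prints the number of predictors tried per split and the final OOB MSE for each fit so the two can be compared.

```r
library(randomForest)

#median-imputed copy of the data, so the block runs on its own
aq <- datasets::airquality
for(i in 1:ncol(aq)) {
  aq[ , i][is.na(aq[ , i])] <- median(aq[ , i], na.rm=TRUE)
}

set.seed(1)
fit_default <- randomForest(Ozone ~ ., data=aq)         #same call as the initial model
fit_mtry2 <- randomForest(Ozone ~ ., data=aq, mtry=2)   #2 predictors tried per split

#number of predictors tried at each split in each fit
c(default=fit_default$mtry, mtry2=fit_mtry2$mtry)

#OOB MSE after all 500 trees for each fit
c(default=tail(fit_default$mse, 1), mtry2=tail(fit_mtry2$mse, 1))
```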

**Step 4: Use the Final Model to Make Predictions**

Lastly, we can use the fitted random forest model to make predictions on new observations.

    #define new observation (one value for each of Solar.R, Wind, Temp, Month, Day)
    new <- data.frame(...)

    #use fitted random forest model to predict Ozone value of new observation
    predict(model, newdata=new)

    27.19442

Based on the values of the predictor variables, the fitted random forest model predicts that the Ozone value will be **27.19442** on this particular day.
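A prediction like this is the average of the predictions made by the individual trees. The sketch below illustrates that with the `predict.all=TRUE` argument of the package's predict method, which returns the per-tree predictions alongside the aggregate. To avoid inventing predictor values, it reuses the first row of a median-imputed copy of the data as the observation; the names `aq`, `fit`, `obs` and `pred` are only used here.

```r
library(randomForest)

#median-imputed copy of the data, so the block runs on its own
aq <- datasets::airquality
for(i in 1:ncol(aq)) {
  aq[ , i][is.na(aq[ , i])] <- median(aq[ , i], na.rm=TRUE)
}

set.seed(1)
fit <- randomForest(Ozone ~ ., data=aq)

#reuse the first row of the data as the observation to predict
obs <- aq[1, ]
pred <- predict(fit, newdata=obs, predict.all=TRUE)

length(pred$individual)   #one prediction per tree
mean(pred$individual)     #average of the per-tree predictions
pred$aggregate            #forest prediction reported by predict()
```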
