There are two ways to write the code that trains a random forest model in R with the randomForest package. Both are shown below.
The most common way of writing the command looks something like this.
rfModel <- randomForest(Survived~. , data = trainSample[, -c(6, 8, 9)])
Notice the ~ sign. We call this the formula way of writing.
Another way of writing the command to train the random forest model is shown below.
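predictors <- setdiff(names(trainSample)[-c(6, 8, 9)], "Survived")
rfModel <- randomForest(x = trainSample[, predictors], y = trainSample$Survived)
Here we pass the response (y) and the predictors (x) explicitly. Note that x must not contain the response column itself, and that no data argument is needed in this form.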
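To make the two forms concrete, here is a minimal sketch on the built-in iris data, which is not the dataset used in this post. The names x_demo, y_demo, rf_formula and rf_xy are made up for illustration, ntree = 100 is an arbitrary choice to keep it quick, and the fits are skipped if the randomForest package is not installed.

```r
# Illustrative sketch on the built-in iris data (an assumption; the post
# itself uses a different dataset). All object names here are hypothetical.
data(iris)

# x/y form: predictors and response are passed separately,
# so the response column has to be left out of x.
x_demo <- iris[, setdiff(names(iris), "Species")]
y_demo <- iris$Species

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Formula form: the response is named on the left-hand side of ~.
  set.seed(42)
  rf_formula <- randomForest::randomForest(Species ~ ., data = iris, ntree = 100)

  # x/y form: same data, handed over as two separate objects.
  set.seed(42)
  rf_xy <- randomForest::randomForest(x = x_demo, y = y_demo, ntree = 100)

  # Out-of-bag confusion matrices for both fits.
  print(rf_formula$confusion)
  print(rf_xy$confusion)
} else {
  message("randomForest is not installed; skipping the model fits")
}
```

Both calls describe the same model; only the way the data is handed to randomForest differs.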
Recently, I was working on a huge dataset where the task was to predict a variable based on some 12 independent variables. The dataset had close to 1.3 million rows. I tried to train the model using the first method, i.e., the formula way. Sadly, I had to kill the task because it was taking far too long.
It was then that I learned that training the model with the second form of the command runs relatively faster. On investigating further, I found the reason for the difference: the core random forest code is compiled C and Fortran, and when the formula interface is used, the x and y variables first have to be converted into the format that code expects. This conversion is what takes the extra time. ref
Even the help page of randomForest in R (?randomForest) says the same thing, in the note just before the 'Author(s)' section:
"For large data sets, especially those with large number of variables, calling randomForest via the formula interface is not advised: There may be too much overhead in handling the formula."
The time difference is not significant for smaller datasets. However, for larger datasets I would suggest using the second format.
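To get a feel for where the formula overhead comes from, here is a rough base-R sketch that times only the data preparation step, with no model fitting at all. It assumes, as far as I understand the package, that the formula method builds a model frame before handing the data to the fitting routine. The data frame df_demo and every other name in it are made up for illustration, and the timings will vary from machine to machine.

```r
# Synthetic data (hypothetical): one numeric response, two numeric
# predictors and one categorical predictor.
set.seed(1)
n <- 200000
df_demo <- data.frame(
  y  = rnorm(n),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = sample(letters[1:5], n, replace = TRUE),
  stringsAsFactors = TRUE
)

# Formula-style preparation: build a model frame, then split it into
# response and predictors.
t_formula <- system.time({
  mf  <- model.frame(y ~ ., data = df_demo)
  y_f <- model.response(mf)
  x_f <- mf[, -1, drop = FALSE]
})

# x/y-style preparation: plain column selection.
t_xy <- system.time({
  y_d <- df_demo$y
  x_d <- df_demo[, c("x1", "x2", "x3")]
})

cat("formula-style prep, elapsed seconds:", t_formula[["elapsed"]], "\n")
cat("x/y-style prep, elapsed seconds:    ", t_xy[["elapsed"]], "\n")
```

Both routes end up with the same predictors and response. The formula route simply does more work to get there, and that extra work grows with the number of rows and columns.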
To illustrate the time difference, I used a dataset with 223,874 rows and 6 columns (five predictors plus the response, DepDelay).
system.time( rfModel1 <- randomForest(DepDelay~. , data = subset(df, select = c(1,2,3,4,5, 13))) )
user system elapsed
94.582 11.356 109.311
system.time( rfModel2 <- randomForest(x = df[, c(1,2,3,4,5)], y = df$DepDelay) )
user system elapsed
93.499 10.248 106.205
This shows a difference of about 3 seconds in elapsed time (109.3 s versus 106.2 s). Note that all of these columns were numeric; if the data had categorical columns as well, I would expect the difference to be larger.
Also, this dataset is not huge. The gap should be easier to appreciate on larger datasets.
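Since a gap of a few seconds on a run of over a hundred seconds can partly be run-to-run noise, it may help to repeat each timing a few times and compare medians. Below is a small sketch of that idea; time_runs is a hypothetical helper, and work is just a stand-in workload so the snippet runs without any packages.

```r
# Hypothetical helper: call the zero-argument function `f` `reps` times
# and return the elapsed time of each call in seconds.
time_runs <- function(f, reps = 5) {
  vapply(
    seq_len(reps),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
}

# Stand-in workload (an assumption, not a random forest fit).
work <- function() sum(sort(runif(2e5)))

elapsed <- time_runs(work, reps = 5)
cat("median elapsed seconds:", median(elapsed), "\n")
cat("min / max elapsed seconds:", range(elapsed), "\n")
```

To compare the two interfaces, one would replace work with a function wrapping each randomForest call, for example function() randomForest(x = df[, 1:5], y = df$DepDelay), and compare the two medians.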
Did you find the article useful? If you did, share your thoughts on the topic in the comments.