The random forest algorithm has always fascinated me. I like how easily it can be explained to anyone. One example I use frequently to explain how random forests work is the way a company runs multiple rounds of interviews to hire a candidate. Let me elaborate.
Say you applied for the position of statistical analyst at WalmartLabs. As at most companies, you don't have just one round of interviews; you have several. Each interview is chaired by an independent panel, and each panel assesses the candidate separately. Generally, even the questions asked in these interviews differ from each other. Randomness is important here.
The other thing of utmost importance is diversity. The reason we have several interview panels is that we assume a committee of people generally makes a better decision than a single individual. But this committee is not just any collection of people. We make sure the interview panels are diversified in terms of the topics covered in each interview, the type of questions asked, and many other details. You don't go about asking the same question in each round of interviews.
After all the rounds of interviews, the final call on whether to select or reject the candidate is based on the majority decision across the panels. If 3 out of 5 panels recommend a hire and 2 recommend against, we tend to go ahead and select the candidate. I hope you get the gist.
If you have heard of decision trees, you are not far from understanding what random forests are. There are two keywords here: random and forest. Let us first understand what forest means. A random forest is a collection of many decision trees. Instead of relying on a single decision tree, you build many of them, say 100. And you know what a collection of trees is called: a forest. So now you understand why it is called a forest.
Why is it called random then?
Say our dataset has 1,000 rows and 30 columns.
There are two levels of randomness in this algorithm:
At row level: Each decision tree gets a random sample of the training data (say 10%), i.e. each tree is trained independently on 100 randomly chosen rows out of the 1,000 rows of data. Because every tree sees a different random set of rows, the trees differ from each other in their predictions. (In the standard algorithm each tree is trained on a bootstrap sample, drawn with replacement and usually as large as the training set itself; the 10% figure here just keeps the example simple.)
At column level: The second level of randomness comes at column level. Say we want to use only 10% of the columns, i.e. out of a total of 30 columns (from our example data), only 3 columns are randomly selected at each node of the decision tree being built. So, for the first node of the tree, maybe columns C1, C2, and C4 are chosen, and based on some metric (Gini impurity or another split criterion), one of these three columns is picked as the best split for that node.
This process repeats for the next node of the tree. Again, we randomly choose 3 columns, say C2, C5, and C6, and the best of them is chosen for this node as well.
NOTE: Many beginners and even experts mistakenly believe that the columns are randomly selected once per tree. The correct concept is that the columns are randomly selected at each node of each tree. I received an email from Prof. Adele Cutler about this a while back, and I have updated this post accordingly. Prof. Adele Cutler is the co-author of random forests and worked extensively with Prof. Breiman.
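The two levels of randomness can be sketched in a few lines of Python. This is only an illustration of the sampling steps, not a full random forest: the 10% fractions come from the example above, while the helper names (rows_for_tree, candidate_columns_for_node) and the fixed seed are my own assumptions. Rows are drawn without replacement here to match the example; a standard implementation would typically draw a bootstrap sample with replacement instead.

```python
import random

N_ROWS, N_COLS = 1000, 30   # dataset shape from the example
ROWS_PER_TREE = 100         # 10% of the rows per tree (illustrative choice)
COLS_PER_NODE = 3           # 10% of the columns per node (illustrative choice)

rng = random.Random(42)     # fixed seed, an assumption for repeatability
columns = [f"C{i}" for i in range(1, N_COLS + 1)]


def rows_for_tree():
    """Row-level randomness: choose the rows a single tree is trained on."""
    return rng.sample(range(N_ROWS), ROWS_PER_TREE)


def candidate_columns_for_node():
    """Column-level randomness: a fresh draw for every node, not once per tree."""
    return rng.sample(columns, COLS_PER_NODE)


tree_rows = rows_for_tree()
root_candidates = candidate_columns_for_node()
child_candidates = candidate_columns_for_node()

print(len(tree_rows), "rows sampled for this tree")
print("root node candidates: ", root_candidates)
print("child node candidates:", child_candidates)
```

The point to notice is that candidate_columns_for_node is called once per node, so two nodes of the same tree will usually see different candidate columns.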
Let me draw an analogy now.
Let us now see how the interview selection process resembles the random forest algorithm. Each panel in the interview process is actually a decision tree. Each panel decides whether the candidate passes or fails, and the majority of these results is declared final. Say there were 5 panels, 3 said yes and 2 said no. The final verdict will be yes.
Something similar happens in a random forest. The result from each tree is collected and the final result is declared accordingly: voting is used for classification and averaging for regression.
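Here is a minimal sketch of how the per-tree results could be combined. The function names forest_classify and forest_regress are made up for this illustration, and the inputs are hand-written stand-ins for what real trees would output.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Classification: return the label predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Regression: return the average of the trees' numeric predictions."""
    return mean(tree_predictions)


# Five interview panels, as in the analogy: 3 say yes, 2 say no.
panel_votes = ["yes", "yes", "no", "yes", "no"]
print(forest_classify(panel_votes))   # yes

# Five trees predicting a number (made-up values).
tree_outputs = [10.0, 12.0, 11.0, 9.0, 13.0]
print(forest_regress(tree_outputs))   # 11.0
```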
With huge computational power at our disposal, we hardly think for a second before applying random forests, and our predictions are made very conveniently. Let us try to understand other aspects of this algorithm.
When is a random forest a poor choice relative to other algorithms?
Random forests don't do well on small datasets, as they fail to pick up the underlying pattern. To simplify, say we know that 1 pen costs INR 1, 2 pens cost INR 2, and 3 pens cost INR 3. In this case linear regression will easily estimate the cost of 4 pens, but a random forest will fail to come up with a good estimate.
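The pen example can be worked through by hand. The sketch below assumes the prices follow a straight line (1, 2 and 3 pens costing INR 1, 2 and 3). It fits a least-squares line in plain Python and compares it with a stand-in for a fully grown one-feature regression tree. With splits at the midpoints between neighbouring training points, such a tree simply returns the price of the nearest training point, which is what tree_like_predict does. Both function names are assumptions for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def tree_like_predict(xs, ys, x_new):
    """Stand-in for a fully grown tree: return the target of the nearest training x."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return ys[nearest]


pens = [1, 2, 3]
cost = [1, 2, 3]   # assumed linear prices in INR

slope, intercept = fit_line(pens, cost)
line_pred = slope * 4 + intercept
tree_pred = tree_like_predict(pens, cost, 4)

print("linear regression, 4 pens:", line_pred)   # 4.0
print("tree-like model, 4 pens:  ", tree_pred)   # 3
```

The line continues the trend to INR 4, while the tree-like model can only repeat a price it has already seen. Averaging many such trees does not help, because an average of seen prices still cannot exceed the largest one.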
There is a problem of interpretability with random forests. You can't easily see or understand the relationship between the response and the independent variables. Understand that a random forest is a predictive tool and not a descriptive tool. You get variable importance, but this may not suffice in analyses where the objective is to see the relationship between the response and the independent features.
Training a random forest can sometimes take a long time, as you train multiple decision trees. Also, for categorical variables, the cost of finding a split can grow exponentially: for a categorical column with n levels, an exhaustive search has 2^(n-1) - 1 possible two-way splits to evaluate. However, with the power of H2O we can now train random forests pretty fast. You may want to read about H2O at H2O in R explained.
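To get a feel for how quickly the number of candidate splits grows, the sketch below enumerates every way of dividing the levels of a categorical column into two non-empty groups, counting each left/right grouping once. Counted this way, the total for n levels is 2^(n-1) - 1. The function name binary_partitions is an assumption for this illustration, and real libraries often use shortcuts that avoid this exhaustive search.

```python
from itertools import combinations


def binary_partitions(levels):
    """List every split of the levels into two non-empty groups."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    partitions = []
    # Keeping the first level on the left avoids counting mirror images twice.
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(levels) - left
            if right:   # skip the case where everything lands on the left
                partitions.append((left, right))
    return partitions


for n in (3, 4, 10):
    levels = [f"level_{i}" for i in range(n)]
    print(n, "levels ->", len(binary_partitions(levels)), "possible splits")
```

For 3, 4 and 10 levels this gives 3, 7 and 511 splits, so every extra level roughly doubles the work.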
In a regression problem, the range of values the predicted response can take is determined by the values already present in the training dataset. Unlike linear regression, decision trees, and hence random forests, can't predict values outside the range of the training responses.
What are the advantages of using random forest?
Since we are averaging multiple decision trees, the bias remains roughly the same as that of a single decision tree. However, the variance decreases, and thus we decrease the chances of overfitting. I have explained bias and variance intuitively at The curse of bias and variance.
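A small simulation shows why averaging reduces variance while leaving bias alone. It is an idealized sketch: each "tree" is modelled as the true value plus independent random noise, which is an assumption. Real trees are trained on overlapping data, so their errors are correlated and the actual reduction is smaller. The names and the seed are illustrative.

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)   # fixed seed, an assumption for repeatability
TRUE_VALUE = 5.0


def noisy_prediction():
    """One 'tree': right on average, but noisy (standard deviation 1)."""
    return TRUE_VALUE + rng.gauss(0, 1)


def averaged_prediction(n_trees):
    """One 'forest': the average of n_trees noisy predictions."""
    return mean([noisy_prediction() for _ in range(n_trees)])


single = [noisy_prediction() for _ in range(2000)]
forest = [averaged_prediction(100) for _ in range(2000)]

print("mean of single trees:", round(mean(single), 2))
print("mean of forests:     ", round(mean(forest), 2))
print("variance, single tree:", round(pvariance(single), 4))
print("variance, forest:     ", round(pvariance(forest), 4))
```

Both means stay close to the true value, so the bias is unchanged, while the spread of the averaged predictions is far smaller than that of a single noisy tree.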
When all you care about is the predictions and you want a quick and dirty way out, random forests come to the rescue. You don't have to worry much about the assumptions of the model or linearity in the dataset.
I will soon add R code snippets as well, to give an idea of how this is executed.
Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading it. Let's talk more about data science.