Random Forest Classifier is a supervised classification algorithm. It is very popular and often performs better than many other classification algorithms. It is a type of ensemble learning method known as bagging, or bootstrap aggregation. It produces good results even without feature engineering, and its popularity stems from its simplicity and efficiency.
Prerequisite: Knowledge of Decision Trees
Bagging or Bootstrap Aggregation
It is an ensemble learning method which combines the predictions from several individual learning models to produce a result that is more accurate than any of the individual models alone.
- Bootstrap step:
Let us consider a dataset having 'm' features. In this step, each decision tree is given a random sample of the training data, drawn with replacement (a bootstrap sample). In addition, at each split the tree considers only a random subset of 'n' of the features. In this fashion we create a large number of decision trees, each trained on a different bootstrap sample with different feature subsets. Every decision tree then predicts a result.
- Aggregation step:
The different decision trees produce a large number of predictions, which in a classification problem are simply class labels. The task now is to combine all these predictions into a single, more reliable one. This is done by a simple majority vote: the label predicted by the most trees is the final result.
The Random Forest Classifier first draws bootstrap samples of the dataset and gives them to different decision trees to build the learning model. Once the individual models are trained, test data can be passed through all of them, and the final prediction is the label chosen by the most trees. Bagging is the collective term for both the bootstrap and the aggregation step; a minimal hand-rolled sketch of the idea follows.
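To make the two steps concrete, here is a minimal sketch of bagging built by hand from scikit-learn decision trees. The Iris dataset, the number of trees, and the max_features="sqrt" setting are illustrative assumptions, not details from this article:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap step: draw a training sample of the same size, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random feature subset,
    # the extra randomization a random forest adds on top of plain bagging
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregation step: each tree votes, and the majority label wins
votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("ensemble training accuracy:", (majority == y).mean())
```

In practice you would simply use sklearn.ensemble.RandomForestClassifier, which performs both steps internally; the hand-rolled version is only meant to show where the bootstrap and the vote happen.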
[Figure: Bagging]
Implementation of Random Forest Classifier in Python
The dataset we will be using for the implementation is the Titanic dataset, which gives details of the people who were aboard the ship. We have to predict whether a person survived using this data. The data can be downloaded here.
I did mention that this algorithm works fine even without feature engineering or prior knowledge about the data. So, the following implementation will not have any extra steps other than dropping a few features.
The Python code can be downloaded here.
Import necessary libraries:
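The original code listing is linked rather than reproduced, so the snippets below are a sketch of one way to carry out each step with pandas and scikit-learn; any specifics not mentioned in the text (file names, split ratio, seeds) are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
```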
Read the data. The Sex and Embarked features have the labels {Male, Female} and {Q, S, C} respectively. We need to convert them into numbers for processing, and this is done with the help of label encoders.
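A possible version of this step, assuming the Kaggle Titanic training file is saved locally as train.csv (the file name is an assumption):

```python
# Read the Titanic data (file name assumed to be train.csv)
data = pd.read_csv('train.csv')

# Convert the categorical Sex and Embarked columns to integer codes.
# Embarked has a couple of missing entries, so cast to str first; the
# missing entries simply become their own 'nan' category.
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex'])
data['Embarked'] = le.fit_transform(data['Embarked'].astype(str))
```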
It is illogical to predict the survival of a person based on their name or passenger ID, so we drop those features. The Ticket, Cabin, and Age features have missing values; since we are not interested in those preprocessing steps here, we simply drop them as well. (The model works just fine even if you keep and impute them, try it!)
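Dropping the identifier columns and the columns with missing values might look like this:

```python
# PassengerId and Name carry no predictive signal; Ticket, Cabin and Age
# have missing values, and we are skipping imputation in this walkthrough
data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Age'], axis=1)
```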
Divide the loaded data into the feature variables and the target variable, and split it into training and testing sets. Fit the Random Forest Classifier to the training data and predict on the testing data.
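One way to write this split-fit-predict step; the 70/30 split, the tree count, and the random seed are arbitrary choices, not values from the original article:

```python
# Separate the features from the target
X = data.drop('Survived', axis=1)
y = data['Survived']

# Hold out 30% of the rows for testing (split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the Random Forest and evaluate on the held-out data
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
```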
The model created has an accuracy of about 82%, which by itself is quite remarkable considering that we have done very little pre-processing. It shows the strong out-of-the-box predictive power of the Random Forest Classifier.
For better results, it is always advisable to pre-process the data and engineer the features.
Applications of Random Forest Classifier
It can be used for all sorts of classification problems, such as image, video, text, and voice classification. It is applied in diverse fields like banking, medicine, the stock market, e-commerce, et cetera.
Advantages of Random Forest Classifier
- It is one of the most accurate learning algorithms available for classification.
- It runs efficiently on large datasets.
- Even with some missing data, it can perform well by estimating the missing values.
- It reduces the problem of over-fitting seen in individual decision trees.