Random Forest Classifier is a supervised classification algorithm. It is very popular and often performs better than many other classification algorithms. It is a type of ensemble learning method known as bagging, or bootstrap aggregation. It produces good results even without feature engineering, and its popularity stems from its simplicity and efficiency.
Prerequisite: Knowledge of Decision Trees
Bagging or Bootstrap Aggregation
It is an ensemble learning method which combines the predictions from several individual learning models to produce a result that is more accurate than any of the individual models alone.
- Bootstrap step:
Let us consider a dataset having 'm' features. In this step, each decision tree is given a random sample of the training data, drawn with replacement (a bootstrap sample). In addition, at each split the tree considers only a random subset of 'n' of the features. In this fashion we create a large number of decision trees, each trained on a different bootstrap sample with different feature subsets. Every decision tree then predicts a result.
- Aggregation step:
The different decision trees produce a large number of predictions, which in a classification problem are simply class labels. The task now is to combine all these predictions into a single, more reliable one. This is done by a simple majority vote: the label predicted by the most trees is the final result.
The Random Forest Classifier first draws bootstrap samples of the dataset and gives them to different decision trees to build the learning model. Once the individual models are trained, test data can be passed through all of them, and the final prediction is the label chosen by the most trees. Bagging is the collective term for both the bootstrap and the aggregation step; a minimal hand-rolled sketch of the idea follows.
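To make the two steps concrete, here is a minimal sketch of bagging built by hand from scikit-learn decision trees. The Iris dataset, the number of trees, and the max_features="sqrt" setting are illustrative assumptions, not details from this article:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap step: draw a training sample of the same size, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random feature subset,
    # the extra randomization a random forest adds on top of plain bagging
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregation step: each tree votes, and the majority label wins
votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("ensemble training accuracy:", (majority == y).mean())
```

In practice you would simply use sklearn.ensemble.RandomForestClassifier, which performs both steps internally; the hand-rolled version is only meant to show where the bootstrap and the vote happen.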
[Figure: Bagging]
Implementation of Random Forest Classifier in Python
The dataset we will be using for the implementation is the Titanic dataset, which gives details of the people who were aboard the ship. We have to predict whether a person survived using this data. The data can be downloaded here.
I did mention that this algorithm works fine even without feature engineering or prior knowledge about the data. So, the following implementation will not have any extra steps other than dropping a few features.
The Python code can be downloaded here.
Import necessary libraries:
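The original code listing is linked rather than reproduced, so the snippets below are a sketch of one way to carry out each step with pandas and scikit-learn; any specifics not mentioned in the text (file names, split ratio, seeds) are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
```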
Read the data. The Sex and Embarked features have the labels {Male, Female} and {Q, S, C} respectively. We need to convert them into numbers for processing, and this is done with the help of label encoders.
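A possible version of this step, assuming the Kaggle Titanic training file is saved locally as train.csv (the file name is an assumption):

```python
# Read the Titanic data (file name assumed to be train.csv)
data = pd.read_csv('train.csv')

# Convert the categorical Sex and Embarked columns to integer codes.
# Embarked has a couple of missing entries, so cast to str first; the
# missing entries simply become their own 'nan' category.
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex'])
data['Embarked'] = le.fit_transform(data['Embarked'].astype(str))
```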
It is illogical to predict the survival of a person based on their name or passenger ID, so we drop those features. The Ticket, Cabin, and Age features have missing values; since we are not interested in those preprocessing steps here, we simply drop them as well. (The model works just fine even if you keep and impute them, try it!)
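Dropping the identifier columns and the columns with missing values might look like this:

```python
# PassengerId and Name carry no predictive signal; Ticket, Cabin and Age
# have missing values, and we are skipping imputation in this walkthrough
data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Age'], axis=1)
```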
Divide the loaded data into the feature variables and the target variable, and split it into training and testing sets. Fit the Random Forest Classifier to the training data and predict on the testing data.
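One way to write this split-fit-predict step; the 70/30 split, the tree count, and the random seed are arbitrary choices, not values from the original article:

```python
# Separate the features from the target
X = data.drop('Survived', axis=1)
y = data['Survived']

# Hold out 30% of the rows for testing (split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the Random Forest and evaluate on the held-out data
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
```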
The model created has an accuracy of about 82%, which by itself is quite remarkable considering that we have done very little pre-processing. It shows the strong out-of-the-box predictive power of the Random Forest Classifier.
For better results, it is always advisable to pre-process the data and engineer the features.
Applications of Random Forest Classifier
It can be used for all sorts of classification problems, such as image, video, text, and voice classification. It is applied in diverse fields like banking, medicine, the stock market, e-commerce, et cetera.
Advantages of Random Forest Classifier
- It is one of the most accurate learning algorithms available for classification.
- It runs efficiently on large datasets.
- Even with some missing data, it can perform well by estimating the missing values.
- It reduces the problem of over-fitting seen in individual decision trees.