In my previous article I listed the techniques used for Feature Engineering. In this article I will be sharing techniques for Feature Selection. If you have not gone through my previous article, please use the link below to understand Feature Engineering before you start with Feature Selection —

Feature Selection is the process of automatically or manually selecting the most important variables from the input data for predictive analysis. In other words, it is the process of reducing or removing irrelevant variables from the input data so that the model achieves higher accuracy. Having irrelevant features in the input data may adversely affect the accuracy of the model.


Let us now look into the benefits of Feature Selection:

- Improves the accuracy of the model.

- Reduces the risk of overfitting.

- Enables the algorithm to train the model faster.

- Improves data visualization.

Next, we’ll discuss the various methodologies and techniques that help us select the right features for building a model:

1. Filter Method

2. Wrapper Method

3. Embedded Method

Note: In filter methods, the selection of features is independent of any Machine Learning algorithm; wrapper and embedded methods, by contrast, use a model as part of the selection process.

Filter Method:

Filter methods are generally used as a pre-processing step. They use statistical techniques to evaluate the relationship between each independent variable and the dependent variable. The choice of statistical technique depends on the data types of the variables involved. The method filters the dataset down to a subset containing only the variables relevant for building the model.

Filter Method

Pearson’s Correlation: A technique used to measure the linear relationship between two quantitative, continuous variables, say x and y. The coefficient ranges from -1 to +1.

Pearson Correlation formula: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

For example:

- Age of a person and blood pressure.

- Income and savings.

- Temperature, speed, length, weight, etc.
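As a quick sketch of the first example, Pearson’s r can be computed with NumPy (the age and blood-pressure numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical sample data: age (years) and systolic blood pressure (mmHg).
age = np.array([25, 35, 45, 55, 65])
bp = np.array([118, 121, 128, 135, 142])

# Pearson's r is the off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(age, bp)[0, 1]
print(round(r, 3))
```

A value close to +1, as here, indicates a strong positive linear relationship; for feature selection, variables with near-zero correlation to the target are candidates for removal.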

LDA (Linear Discriminant Analysis): A statistical technique used to find a linear combination of features that separates two or more classes of a categorical variable.

ANOVA: ANOVA stands for Analysis of Variance. One-way ANOVA tests whether there is a statistically significant difference between the means of several groups. It is used when we need to test the significance of the relationship between one or more independent categorical variables and one continuous dependent variable.
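A one-way ANOVA can be sketched with SciPy (the three groups of test scores below are invented for illustration):

```python
from scipy.stats import f_oneway

# Hypothetical exam scores under three teaching methods
# (one categorical factor, one continuous outcome).
group_a = [85, 88, 90, 92, 87]
group_b = [78, 75, 80, 77, 79]
group_c = [90, 94, 91, 95, 93]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
# A small p-value suggests at least one group mean differs.
print(f_stat, p_value)
```

For feature selection, a categorical feature whose groups show significantly different target means is worth keeping.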

Chi-Square: The most commonly used statistical test for finding the relationship between categorical variables, evaluating the likelihood of an association between them.

For example: qualification, gender, and marital status.
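A chi-square test of independence can be sketched with SciPy; the contingency table below (gender vs. qualification counts) is made up for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = qualification.
observed = [
    [30, 10],   # e.g. male:   graduate, non-graduate
    [20, 25],   # e.g. female: graduate, non-graduate
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
```

A small p-value indicates the two categorical variables are likely associated, so the feature may carry information about a categorical target.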

Wrapper Method: In this method we initially build and train the model using a subset of the features. Based on the inferences from that model, we add or remove features from the subset and rebuild the model. These methods are usually computationally very expensive, since the features for each new model are selected from the inferences of the previous one.

Forward Selection: An iterative method in which we start with no features in the model. In each iteration, we add the feature that most improves the performance of the model. We repeat this until adding a new variable no longer improves performance.
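Forward selection can be sketched with scikit-learn’s SequentialFeatureSelector (the built-in diabetes dataset and the target of 3 features are my choices for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)  # 10 candidate features

# Start with no features and greedily add the one that best improves
# cross-validated performance, stopping once 3 features are selected.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the chosen features
```

Passing direction="backward" to the same class gives backward selection, described next.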

Backward Selection: In this technique we build a model with all the features and, at each iteration, remove the least significant feature to improve the performance of the model. We repeat this until no improvement is observed from removing a feature.

Recursive Feature Elimination: A greedy optimization algorithm that aims to find the best-performing feature subset. At each iteration it creates a new model and sets aside the best- or worst-performing feature. It keeps creating models until no features are left. It can be combined with a wrapper approach using the Boruta package, which estimates the importance of a feature by creating copy/shadow features. In my next article I will share an implementation of the Boruta package.

Steps in Recursive Feature Elimination:

- Build the model using all independent variables.

- Calculate the importance of each variable.

- Rank each independent variable by its importance.

- Drop the least significant variable.

- Rebuild the model using the remaining variables and calculate model accuracy.

- Repeat the above steps until all variables are ranked according to the order in which they were dropped.
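The steps above can be sketched with scikit-learn’s RFE class (the breast-cancer dataset, logistic regression estimator, and target of 5 features are my choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 30 features
X = StandardScaler().fit_transform(X)       # scale so coefficients are comparable

# Repeatedly fit the model, rank features by importance (here, the
# magnitude of the logistic-regression coefficients), and drop the
# least significant one until 5 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the 5 kept features
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

The ranking_ attribute records the order in which features were dropped, matching the last step above.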

Let us now see which functions help with Recursive Feature Elimination (these are control functions for rfe in R’s caret package):

a. Linear Regression — lmFuncs

b. Random Forest — rfFuncs

c. Naive Bayes — nbFuncs

d. Bagged Trees — treebagFuncs

Embedded Method:

Embedded methods perform feature selection during model training. They are implemented by algorithms that have feature selection built in.

Example: LASSO Regression (L1 regularization, which can shrink coefficients exactly to zero) and Ridge Regression (L2 regularization, which shrinks coefficients without eliminating them).
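A minimal sketch of embedded selection with LASSO in scikit-learn (the diabetes dataset and the alpha value are my choices; a stronger alpha zeroes out more features):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# L1 regularization drives some coefficients exactly to zero,
# performing feature selection as a side effect of training.
lasso = Lasso(alpha=10.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of surviving features
print(len(selected), "of", X.shape[1], "features kept")
```

The features with zero coefficients have been eliminated by the model itself, with no separate selection step.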

Happy Learning!!

Thanks, Pavithra Jayashankar

MCA graduate - Aspiring data analyst