In real-world projects we always work with data coming from different sources. The input data comprises many features, which are usually represented as structured (tabular) data. Each column can be considered one feature.
Feature engineering is the process of using domain knowledge to create features that make machine learning algorithms work. It is the act of extracting important features from raw data and transforming them into formats suitable for a machine learning algorithm.
First, let's understand what a feature is in simple terms. A feature can also be described as an attribute or a property. For example, a laptop has features such as processor, operating system, hard disk capacity, RAM, display, battery, weight, colour, brand and model. Before we purchase a laptop, we look into its features.
Here is another simple example. Assume that you want to purchase a nice dress for a birthday party. What features of the dress will attract you? Maybe fabric, colour, brand, design, size or pattern. Before you purchase a dress, you think about which one suits you best. Similarly, it is crucial to select the appropriate features from the input data so that the machine learning model gives the best possible output.
Goals of Feature Engineering:
· Preparing the right input dataset, compatible with the requirements of the machine learning algorithm.
· Improving the performance of the machine learning model.
According to a survey, data scientists spend on average about 60% of their time on data preparation. In this article, let us understand the main techniques of feature engineering. If feature engineering is done correctly, the predictive power of the machine learning algorithm increases.
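Feature Selection: Not all features of the raw data are important for predictive analysis. It is very important to select the most relevant variables from the raw data for the machine learning algorithm. Methods of feature selection include the chi-squared test, correlation coefficient scores, LASSO and Ridge regression. Note that LASSO can shrink coefficients exactly to zero and so removes features, whereas Ridge only shrinks them and is better suited to ranking features.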
Feature Extraction: Analysing large datasets takes a lot of computational power and memory, so we need to reduce the dimensionality of such data. Deleting observations is not a good option, because it decreases model performance. Instead, we reduce the number of variables, for example with PCA (principal component analysis), which combines the original variables into a smaller set of new components.
Feature Transformation: Transforming the original features into functions of the original features. The most common forms of data transformation are scaling, binning and missing value imputation.
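To make the feature extraction idea concrete, here is a minimal sketch of PCA written directly with NumPy's SVD. The toy data, the variable names and the choice of keeping two components are assumptions made for illustration only; in practice a library implementation would normally be used.

```python
import numpy as np

# Hypothetical toy data: 6 observations, 3 features. The third feature is
# almost a linear combination of the first two, so it adds little information.
rng = np.random.default_rng(0)
a = rng.normal(size=6)
b = rng.normal(size=6)
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=6)])

# Center each column, then use SVD to obtain the principal directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Keep the first two components: 3 original features become 2 new ones,
# while every observation (row) is kept.
n_components = 2  # assumed choice for this sketch
X_reduced = X_centered @ Vt[:n_components].T

print(explained_variance_ratio)
print(X_reduced.shape)
```

Notice that the number of rows does not change. Only the number of columns is reduced, and the new columns are combinations of the old ones rather than a subset of them.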
Feature Engineering Techniques:
1. Imputation:
The raw dataset is not always complete. Missing values are one of the most common problems that data scientists come across when preparing data to build a machine learning model. The reasons for missing values can be human error, interruptions in the data flow, privacy concerns and so on. Missing values affect the performance of the machine learning model.
Many algorithms do not accept datasets with missing values and end up raising an error. Imputation is the technique used to fill in the missing values. A useful first step is to count the missing values in each column.
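A minimal sketch of that count with pandas is shown here. The DataFrame and its column names are hypothetical and only serve as an example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in every column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
    "salary": [50000, 62000, np.nan, 58000, 61000],
})

# Number of missing values in each column.
missing_counts = df.isnull().sum()

# Fraction of missing values in each column (0.0 to 1.0).
missing_ratio = df.isnull().mean()

print(missing_counts)
print(missing_ratio)
```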
The simplest solution for handling missing values is to drop any column in which more than 50% of the values are missing. However, there is no optimum threshold for dropping columns, so the cut-off should be chosen to suit the dataset.
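Dropping columns by their share of missing values could look like the sketch below. The 0.5 threshold mirrors the 50% mentioned above, and the DataFrame is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: "notes" is missing in 3 of 4 rows (75%).
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "notes": [np.nan, np.nan, np.nan, "vip"],
    "salary": [50000, 62000, 58000, 61000],
})

threshold = 0.5  # assumed cut-off; tune it for your own data

# Keep only the columns whose share of missing values is at most the threshold.
df_reduced = df.loc[:, df.isnull().mean() <= threshold]

print(list(df_reduced.columns))
```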
A column can contain numerical or categorical values. For numerical columns, a good way to impute is to use the median of the column. The mean is sensitive to outliers, so a few extreme values can distort it, whereas the median is much more robust. That is why we prefer the median for imputation.
For categorical columns, replacing the missing values with the most frequent value in the column is a good option. It can be done with the help of mode().
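Both imputations can be sketched in a few lines of pandas. The data is hypothetical, and the large salary is included on purpose to show how far the mean is pulled compared with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 500000 is an outlier in the salary column.
df = pd.DataFrame({
    "salary": [50000.0, 52000.0, np.nan, 51000.0, 500000.0],
    "city": ["Delhi", "Pune", "Delhi", None, "Delhi"],
})

# The mean is dragged upwards by the outlier; the median is not.
salary_mean = df["salary"].mean()
salary_median = df["salary"].median()

# Numerical column: fill the gaps with the median.
df["salary"] = df["salary"].fillna(salary_median)

# Categorical column: fill the gaps with the most frequent value.
# mode() returns a Series because ties are possible, so we take the first entry.
city_mode = df["city"].mode()[0]
df["city"] = df["city"].fillna(city_mode)

print(salary_mean, salary_median)
print(df)
```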
2. Handling Outliers
Any observation that lies too far from the rest of the observations is an outlier. The first step in handling outliers is to detect them. The best way to detect outliers is to visualise the data, which gives us the chance to make decisions with high precision. We can use a box plot, Z-score or Cook's distance to identify outliers.
Trimming outliers makes sense when we are dealing with a large number of observations. Statistical methods that help in detecting outliers are standard deviation and percentiles. With standard deviation, a value is flagged when it lies more than a chosen multiple of the standard deviation away from the mean. With percentiles, values above or below chosen percentile limits are flagged.
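The two statistical approaches can be sketched as follows. The data, the factor of 3 and the 5th/95th percentile limits are all assumptions for this example; there is no universally correct choice.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier (95).
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 11] * 2 + [95], name="x")

# Standard deviation approach: flag values more than `factor` standard
# deviations away from the mean.
factor = 3  # assumed; values between 2 and 4 are common choices
mean, std = values.mean(), values.std()
lower_sd, upper_sd = mean - factor * std, mean + factor * std
outliers_sd = values[(values < lower_sd) | (values > upper_sd)]

# Percentile approach: treat everything outside the 5th-95th percentile
# range as an outlier and trim it.
lower_pct, upper_pct = values.quantile(0.05), values.quantile(0.95)
trimmed = values[(values >= lower_pct) & (values <= upper_pct)]

print(outliers_sd.tolist())
print(len(values), len(trimmed))
```

Keep in mind that a single extreme value also inflates the standard deviation itself, so on very small samples the standard deviation approach can miss outliers that are obvious on a plot.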
3. Binning / Grouping operations:
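Binning can be applied to both numerical and categorical data. For example, ages can be grouped into ranges and countries into continents. The main motivation for binning is to make the model more robust and to avoid overfitting. This comes at some cost, because every binning step discards information.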
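Here is a small sketch of both kinds of binning with pandas. The bin edges, the labels and the country-to-continent mapping are hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical dataset.
df = pd.DataFrame({
    "age": [5, 17, 25, 42, 67, 80],
    "country": ["Spain", "Chile", "Italy", "Brazil", "Spain", "Peru"],
})

# Numerical binning: pd.cut assigns each value to an interval.
# By default the intervals are closed on the right: (0, 18], (18, 60], (60, 120].
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 60, 120],
    labels=["child", "adult", "senior"],
)

# Categorical binning: map many labels onto fewer, broader labels.
continent_map = {
    "Spain": "Europe",
    "Italy": "Europe",
    "Chile": "South America",
    "Brazil": "South America",
    "Peru": "South America",
}
df["continent"] = df["country"].map(continent_map)

print(df)
```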
4. Log Transform:
Logarithmic transformation helps to handle skewed data and makes the distribution closer to normal. It is helpful when the order of magnitude of the data changes within the range of the data. If we delete all outliers, we end up losing valuable information. A log transform reduces the effect of outliers by normalising differences in magnitude, and the model becomes more robust.
Keep in mind that the data must contain only positive values before we apply a log transform. We can also add 1 to the data before the transform, i.e. use log(x + 1). This handles zeros and ensures that the output is non-negative whenever the input is non-negative.
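A minimal sketch of the log(x + 1) variant, using NumPy's log1p function on a hypothetical, strongly right-skewed income column:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data that also contains a zero.
income = pd.Series([0, 9, 99, 999, 9999], name="income")

# log1p(x) computes log(x + 1), so zeros are handled and the
# result is non-negative for non-negative input.
log_income = np.log1p(income)

# Skewness before and after the transform.
skew_before = income.skew()
skew_after = log_income.skew()

print(log_income.round(3).tolist())
print(skew_before, skew_after)
```

The values span four orders of magnitude before the transform and are evenly spaced after it, which is exactly the magnitude normalisation described above.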
5. One-Hot Encoding:
One-hot encoding is one of the most common encoding methods in machine learning. This method spreads the values of a column across multiple flag columns and assigns 0 or 1 to them.
This method changes categorical data, which is hard for machine learning algorithms to work with, into a numerical format and enables us to group categorical data without losing information.
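The get_dummies function of the pandas library maps N distinct values in a column to N binary columns by default. With drop_first=True it produces N-1 columns, which is enough because the dropped category is implied when all the other flags are 0.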
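The difference between N and N-1 columns can be seen in this short sketch. The city column is a hypothetical example; dtype=int is passed only so that the flags print as 0 and 1.

```python
import pandas as pd

# Hypothetical column with N = 3 distinct values.
df = pd.DataFrame({"city": ["Delhi", "Pune", "Mumbai", "Pune"]})

# Default behaviour: one flag column per distinct value (N columns).
encoded_full = pd.get_dummies(df["city"], prefix="city", dtype=int)

# drop_first=True removes the first category (N-1 columns).
# A row with all zeros then stands for the dropped category.
encoded_reduced = pd.get_dummies(df["city"], prefix="city", drop_first=True, dtype=int)

print(encoded_full)
print(encoded_reduced)
```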
6. Grouping Operations
The key point of group-by operations is to decide the aggregation functions for the features. Let us look at the different ways of aggregating categorical columns:
i. Select the label/value with the highest frequency, i.e. the max operation for a categorical column. We need to use a lambda function for this purpose.
ii. The second option is to create a pivot table. Its cells hold aggregated values for each combination of the grouped column and the encoded column.
iii. The third option is to apply the groupby function after applying one-hot encoding.
Numerical columns are usually grouped using the sum and mean functions.
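The three options for categorical columns, and the numerical aggregation, can be sketched on a small hypothetical table of user visits. All column names and values are invented for this example.

```python
import pandas as pd

# Hypothetical data: several rows per user.
visits = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "city": ["Rome", "Rome", "Paris", "Oslo", "Oslo"],
    "spend": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# i. Most frequent label per group, using a lambda function.
top_city = visits.groupby("user")["city"].agg(lambda s: s.value_counts().index[0])

# ii. Pivot table: one row per user, one column per city,
#     cells hold the aggregated spend (0 where there is no visit).
pivot = visits.pivot_table(
    index="user", columns="city", values="spend", aggfunc="sum", fill_value=0
)

# iii. One-hot encode the city first, then group by user and sum the flags.
dummies = pd.get_dummies(visits["city"], dtype=int)
city_counts = dummies.groupby(visits["user"]).sum()

# Numerical column: sum and mean per group.
spend_stats = visits.groupby("user")["spend"].agg(["sum", "mean"])

print(top_city)
print(pivot)
print(city_counts)
print(spend_stats)
```

Option i keeps a single label per group, while options ii and iii keep one column per label, so they preserve more information at the cost of a wider table.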
7. Feature Split
In feature splitting, we split a single feature into two or more parts to extract the information we need while building a machine learning model.
For example, a Name column contains the first and last name of a person. The split() function is used for feature splitting.
If we are interested only in the first name, we split the name on the space character and keep the first part.
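With pandas, the split can be done through the .str accessor. The names in this sketch are made up, and the new column names are assumptions.

```python
import pandas as pd

# Hypothetical Name column holding first and last names.
df = pd.DataFrame({"Name": ["Luther N. Gonzalez", "Charles M. Young", "Terry Lawson"]})

# Split on the space character and keep the first part as the first name.
df["first_name"] = df["Name"].str.split(" ").str[0]

# The last part of the same split gives the last name.
df["last_name"] = df["Name"].str.split(" ").str[-1]

print(df)
```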
Feature splitting is a vital step in improving the performance of the model, and it makes other feature engineering techniques easier to apply.
8. Scaling
The numerical features of a dataset usually do not share a common range and differ from each other. After scaling, the continuous features become comparable in terms of range. Algorithms based on distance calculations, such as K-NN or K-means, need scaled continuous features as model input.
Types of scaling:
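Two widely used types, named here as an assumption because they are not spelled out in this section, are min-max normalization, which rescales each feature to the range 0 to 1, and standardization (z-score), which rescales each feature to mean 0 and standard deviation 1. A minimal pandas sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical features on very different ranges.
df = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "income": [20000, 35000, 50000, 80000, 120000],
})

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1].
normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): (x - mean) / std, result has mean 0 and std 1.
# ddof=0 uses the population standard deviation.
standardized = (df - df.mean()) / df.std(ddof=0)

print(normalized)
print(standardized)
```

Min-max normalization is sensitive to outliers because a single extreme value stretches the range, so it is usually a good idea to handle outliers before normalizing.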
Thanks — Pavithra Jayashankar