Column Transformer for Faster Feature Engineering in Machine Learning

Data pre-processing techniques

Amit Chauhan


Photo by Caspar Camille Rubin on Unsplash

Data is essential for predictive modeling, and it goes through various pre-processing steps before being fed into a machine learning model. Feature engineering is a vital part of data pre-processing: it handles missing values, scales data, encodes string categories as numbers, and applies other transformations.

Each column in the data can have a different problem, and each problem calls for its own pre-processing technique. The main difficulty in data pre-processing is handling every column appropriately.

Suppose we need one-hot encoding, imputation, ordinal encoding, and other techniques. Each of these processes produces a separate array per column, and all of these arrays then have to be concatenated into one big feature matrix. This approach is neither efficient nor fast.
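To see why, here is a minimal sketch of the manual approach on a toy DataFrame; the column names mirror the demo dataset used below, but the values are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# toy data: an ordered category, an unordered category,
# and a numeric column with a missing value
df = pd.DataFrame({
    'Pclass': [1, 3, 2, 3],
    'Sex': ['male', 'female', 'female', 'male'],
    'Score': [7.2, np.nan, 8.1, 6.5],
})

# every transformer is fitted separately and returns its own array...
pclass_arr = OrdinalEncoder().fit_transform(df[['Pclass']])
sex_arr = OneHotEncoder(drop='first').fit_transform(df[['Sex']]).toarray()
score_arr = SimpleImputer(strategy='mean').fit_transform(df[['Score']])

# ...and the arrays then have to be concatenated by hand
X = np.concatenate([pclass_arr, sex_arr, score_arr], axis=1)
X.shape  # (4, 3)

Three separate fit_transform calls plus a manual np.concatenate, and the bookkeeping only grows as the data gains more column types.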

The column transformer handles all of these troublesome columns in a single process: it applies a different transformer to each group of columns and stitches the results together for us.

Python Example:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# load the demo dataset and take a look at it
df1 = pd.read_csv('demo.csv')
df1

# check each column for missing values
df1.isnull().sum()

#output:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Score 16
dtype: int64
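
The Score column is the only one with missing values, so the first thing to fix is the 16 missing entries. A minimal sketch using the SimpleImputer imported above (mean imputation is an assumption here; any strategy would do):

# fill the 16 missing Score values with the column mean
imputer = SimpleImputer(strategy='mean')
df1['Score'] = imputer.fit_transform(df1[['Score']]).ravel()
df1.isnull().sum()  # Score now shows 0 missing values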

Let's say we have data whose columns need to be transformed before being fed to a machine learning model. The first phase will handle these columns with individual processes, and the second phase will use the column transformer technique, as sketched after the list below.

The columns can be grouped by the kind of problem each one has.

  1. Ordinal encoding: The categories in the Pclass column have a natural order, so the column needs ordinal encoding.
  2. One-hot encoding (OHE): The Sex column needs OHE because its categories have no order.
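
With the per-column problems identified, the second phase combines all of these transformers into a single ColumnTransformer. Here is a minimal sketch, assuming the df1 loaded above; the transformer names and the choice to drop PassengerId are illustrative, and including the Score imputer keeps the whole pipeline in one place even though Score was already imputed above:

from sklearn.compose import ColumnTransformer

# separate the target from the features first
X = df1.drop(columns=['Survived', 'PassengerId'])
y = df1['Survived']

# one object applies the right transformer to each column and
# concatenates the resulting arrays in a single step
ct = ColumnTransformer(transformers=[
    ('impute_score', SimpleImputer(strategy='mean'), ['Score']),
    ('ordinal_pclass', OrdinalEncoder(), ['Pclass']),
    ('ohe_sex', OneHotEncoder(drop='first'), ['Sex']),
])

X_transformed = ct.fit_transform(X)
X_transformed.shape  # one feature matrix, no manual concatenation

Each (name, transformer, columns) triple targets only its own columns, so the imputation, ordinal encoding, and one-hot encoding all happen in a single fit_transform call instead of three separate calls followed by a manual concatenation.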
