Converting Numerical Data to Categorical Data

In machine learning, converting numerical data to categorical data can noticeably affect model accuracy. In a recent project that aimed to identify the most popular YouTube channels, I observed that converting a numerical column to a categorical one improved accuracy. This transformation is particularly useful for skewed numeric distributions, such as subscriber counts, or age and salary columns in other datasets.

So, how does this numerical-to-categorical conversion work?

There are two primary processes for converting numerical data to categorical:

  1. Discretization (Binning)

  2. Binarization
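To make the difference between the two approaches concrete, here is a minimal sketch using scikit-learn's 'Binarizer' and 'KBinsDiscretizer'. The ages, the threshold of 40, and the choice of 3 uniform bins are assumptions made up for illustration; they do not come from the YouTube project.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# Hypothetical ages, used only for illustration
ages = np.array([[18.0], [25.0], [32.0], [47.0], [51.0], [68.0]])

# Binarization: values above the threshold become 1, the rest become 0
binary = Binarizer(threshold=40.0).fit_transform(ages)

# Binning: each value is replaced by the index of the interval it falls into
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
binned = binner.fit_transform(ages)

print(binary.ravel())  # two categories: 0 or 1
print(binned.ravel())  # three categories: 0, 1 or 2
```

Binarization always yields two categories, while binning yields as many categories as there are bins.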

Here we describe only discretization (binning).

For this approach, we use the 'KBinsDiscretizer' class from scikit-learn. Here's an example code snippet:

from sklearn.preprocessing import KBinsDiscretizer

# Define the number of bins, encoding method, and strategy
K_bins_example = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')

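Here, 'n_bins' is the number of bins to create for each column. The 'encode' parameter accepts 'ordinal', 'onehot' (the default, which returns a sparse matrix), or 'onehot-dense'. The last parameter, 'strategy', can be 'uniform' (equal-width bins), 'quantile' (bins holding roughly the same number of samples), or 'kmeans' (bins based on 1D k-means clustering), depending on your needs.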
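As a rough sketch of how the discretizer behaves once it is fitted, the example below applies the same settings to a synthetic, right-skewed column. The log-normal "subscriber counts", the random seed, and the variable names are assumptions chosen for illustration, not data from the project.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic, right-skewed "subscriber counts" (assumed data, illustration only)
rng = np.random.default_rng(0)
subscribers = rng.lognormal(mean=10, sigma=1.5, size=(1000, 1))

discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
labels = discretizer.fit_transform(subscribers)

# Learned bin boundaries for the single input column (n_bins + 1 edges)
edges = discretizer.bin_edges_[0]
print(edges)

# With the quantile strategy, each bin holds roughly the same number of rows
counts = np.bincount(labels.ravel().astype(int))
print(counts)
```

Because the bin edges follow the quantiles of the data, the long right tail ends up in a single top bin instead of stretching the scale for every other value. Recent scikit-learn releases may print a FutureWarning about changing defaults for the quantile strategy; it does not affect this sketch.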

We can also use a category encoder to do the same task.

How do you typically convert numerical data to categorical in your datasets?

Reference

[1] A. Géron, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow," O'Reilly Media, 2019