Extremely Wide One-Hot Encoding: The Curse of Dimensionality
This visualization dramatically demonstrates the "curse of dimensionality" — a critical problem
that occurs when one-hot encoding categorical variables with many unique values. Watch as a simple
dataset with just 4 columns expands to over 150 columns, creating an extremely sparse matrix.
The Curse of Dimensionality Problem
One-hot encoding is a standard technique for handling categorical variables in machine learning.
However, when a categorical feature has many possible values (like product categories, ZIP codes,
or user IDs), one-hot encoding creates an explosion in dimensionality:
A single categorical column with 100 unique values becomes 100 binary columns
The dataset becomes extremely wide and sparse (mostly zeros)
Models often perform poorly on extremely sparse data
Computational cost rises sharply with the number of columns
The risk of overfitting increases
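The width explosion above can be sketched with pandas. The data below is randomly generated for illustration only; the point is that a single categorical column with 100 unique values turns into 100 binary columns after encoding.

```python
# A minimal sketch of the one-hot width explosion, using invented data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
labels = [f"cat_{i}" for i in range(100)]  # 100 unique category values

df = pd.DataFrame({
    # A Categorical dtype guarantees all 100 categories get a column,
    # even if some never appear in the sample.
    "category": pd.Categorical(rng.choice(labels, size=1000), categories=labels),
    "purchase": rng.random(1000),
})

encoded = pd.get_dummies(df, columns=["category"])
print(df.shape)       # (1000, 2)
print(encoded.shape)  # (1000, 101): 1 numeric column + 100 one-hot columns
```

Two columns in, 101 columns out; with ZIP codes or user IDs the blow-up is far worse.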
How This Visualization Works
The animation shows a progression through increasingly extreme one-hot encoding:
Starting with a simple 4-column dataset (id, customer, category, purchase)
Moving through stages of encoding with more and more categorical columns
Ending with 150+ columns, most containing only zeros (an extremely sparse matrix)
Watch how the dataset grows dramatically wider while becoming increasingly sparse. This visually
demonstrates why high-cardinality categorical features often require techniques beyond simple
one-hot encoding.
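The sparsity the animation shows follows directly from the encoding itself: each row places a single 1 among k binary columns, so the fraction of zeros is (k - 1) / k. A small sketch (the function name is my own):

```python
# Sparsity of a one-hot block as a function of cardinality:
# each row holds exactly one 1 among n_categories columns.
def one_hot_sparsity(n_categories: int) -> float:
    """Fraction of zeros in the encoded columns."""
    return (n_categories - 1) / n_categories

for k in (4, 40, 150, 240):
    print(f"{k:>3} columns -> {one_hot_sparsity(k):.1%} zeros")
# At 240 columns, roughly 99.6% of the encoded values are zero.
```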
One-Hot Encoding: Extreme Width Visualization
[Interactive widget: the original 4-column dataset (id, customer, category, purchase) expands step by step as one-hot columns are added.]
With 240 categories, the dataset becomes extremely sparse: each row contains just one non-zero value across the 240 encoded columns.
Alternatives to One-Hot Encoding for High-Cardinality Features
When dealing with categorical features that have many unique values, consider these alternatives:
Feature Hashing: Maps categories to a fixed number of dimensions using a hash function
Target Encoding: Replaces categories with their target mean values
Embeddings: Learns dense vector representations of categories (common in deep learning)
Dimensionality Reduction: Apply PCA or t-SNE after one-hot encoding
Binary Encoding: Assigns each unique value an integer index, writes that index in binary, and creates one column per bit (so 100 categories need only 7 columns)
These techniques can significantly reduce dimensionality while preserving most of the information
in high-cardinality categorical features.
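Two of these alternatives can be sketched on a toy dataset. Feature hashing uses scikit-learn's real FeatureHasher; the target encoding is a hand-rolled groupby. The column names and data below are invented for illustration.

```python
# Sketch of two alternatives to one-hot encoding, on invented toy data.
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({
    "category": ["a", "b", "c", "a", "b", "a"],
    "target":   [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
})

# Feature hashing: output width is fixed at n_features (8 here),
# no matter how many unique categories appear.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[c] for c in df["category"]])
print(hashed.shape)  # (6, 8) -- sparse, but with bounded width

# Target encoding: a single column, each category replaced by its
# mean target value. (In practice, compute the means on training
# data only, or with cross-fitting, to avoid target leakage.)
means = df.groupby("category")["target"].mean()
df["category_encoded"] = df["category"].map(means)
```

Note the contrast in output width: hashing caps it at a chosen constant, while target encoding collapses the feature to one column at the cost of leaking label information if done carelessly.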