Feature Selection: Knowing What to Keep
- Aastha Thakker
- Nov 13, 2025
- 6 min read

So, you’ve got a dataset with 47 columns, and you’re pretty sure at least 30 of them are just… there. Taking up space. Feature selection is that moment when you stop pretending everything is important and start filtering out the stuff your model never asked for. Most people treat feature selection the way they treat their phone gallery: keep everything “just in case,” then wonder why the whole thing slows down. In machine learning, dumping every single column into a model and hoping it magically performs is exactly that energy.
Feature selection is the part where you stop hoarding useless data, pick only what actually matters, and give your model a fair chance to behave like it has some sense.
What is Feature Selection?
Feature selection is an important step in any machine learning workflow. The aim is to identify the features that genuinely help the model learn while filtering out those that add noise or unnecessary complexity. By reducing the feature set to what matters, you simplify the model, reduce overfitting, and speed up training, all while improving interpretability. It’s also important to distinguish this from feature engineering, which focuses on creating or modifying features to boost performance; the two steps work together to build efficient and accurate models.
What is Feature Engineering?
Feature engineering is the process of creating new features or transforming existing ones so that they capture the underlying patterns in the data more effectively. Instead of relying only on the raw columns, feature engineering helps highlight relationships, reduce noise, and give the model clearer signals to learn from. This can include combining features, extracting meaningful components, encoding categories, or scaling values: basically reshaping the data so the model doesn’t have to struggle to interpret it.
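To make that concrete, here is a minimal pandas sketch of the kinds of transformations described above. The column names (price, quantity, signup_date, city) are hypothetical and just stand in for whatever your raw data contains.

```python
import pandas as pd

# Hypothetical raw data: column names are made up for illustration.
df = pd.DataFrame({
    "price": [10.0, 25.0, 7.5],
    "quantity": [2, 1, 4],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-03-20", "2024-07-11"]),
    "city": ["Ahmedabad", "Mumbai", "Ahmedabad"],
})

# Combine features: total spend per row.
df["total_spend"] = df["price"] * df["quantity"]

# Extract a meaningful component from a timestamp.
df["signup_month"] = df["signup_date"].dt.month

# Encode a categorical column as one-hot indicators.
df = pd.get_dummies(df, columns=["city"], prefix="city")

# Scale a numeric column to zero mean and unit variance.
df["price_scaled"] = (df["price"] - df["price"].mean()) / df["price"].std()

print(df.head())
```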

Difference between Feature Selection and Feature Engineering
In short: feature selection decides which existing features to keep, while feature engineering creates or transforms features to expose better signals. Selection shrinks the feature set; engineering reshapes or expands it. Both serve the same goal of giving the model cleaner, more informative inputs.
Importance of Feature Selection
Speeds up model training: With fewer unnecessary features, algorithms train much faster and consume fewer resources.
Reduces model complexity: A smaller, cleaner feature set makes the model lighter, easier to debug, and simpler to maintain.
Improves prediction accuracy: Removing confusing or irrelevant inputs helps the model focus on the signals that actually matter.
Prevents overfitting: By filtering out noisy or redundant features, the model is less likely to memorize patterns that don’t generalize.
Enhances interpretability: With only the important features kept, understanding how and why the model predicts something becomes far more straightforward.
Cuts down on storage and computation: Fewer features mean less memory usage, faster data loading, and smoother experimentation.
Boosts generalization on unseen data: Cleaner inputs help the model maintain stable performance when tested outside the training environment.
Reduces multicollinearity issues: Removing highly correlated or repetitive features prevents the model from getting conflicting signals.
Supports better feature engineering: When you know what matters, it becomes easier to create new or improved features intentionally.
Improves overall workflow efficiency: Feature selection, done right after data cleaning, sets a strong foundation for modelling and ensures you’re not wasting time training on irrelevant data.
Categories of Feature Selection
When a dataset has d features, trying every possible mix of features to find the best subset would mean training (2ᵈ − 1) different models.
Why this formula?
Every feature has only two choices:
either include it in the subset
or exclude it from the subset
So, if you have d features, and each feature has 2 choices, the total combinations become:
2 × 2 × 2 × … (d times) = 2ᵈ
This represents all possible ways you can pick or not pick features.
But why do we subtract 1?
Because one of those combinations is “select nothing”, which is useless: you can’t train a model with zero features.
So, we remove that one case: Total useful combinations = 2ᵈ − 1
Even with just 20 features, this means over a million possible combinations. Testing each one by training a model every time would take forever, which is why this formula shows how quickly the problem becomes impossible to brute-force. Instead of trying all subsets manually, we use feature selection techniques that find the most important features efficiently. Broadly, supervised feature selection techniques fall into three categories: filter, wrapper, and embedded methods, plus hybrid approaches that combine them.
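If you want to see that growth for yourself, here is a quick check of the 2ᵈ − 1 formula in plain Python; the feature counts are just examples, including the 47-column dataset from the intro.

```python
# Number of non-empty feature subsets for d features: 2**d - 1
for d in (5, 10, 20, 30, 47):
    print(f"{d:>2} features -> {2**d - 1:,} possible subsets")

# 20 features already gives 1,048,575 subsets; 47 gives over 140 trillion.
```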

A) Filter Method
Filter methods evaluate features using statistical measures before any model training happens. They use metrics like correlation coefficients, chi-square tests, or mutual information to score each feature’s relevance to the target variable. Based on these scores, irrelevant or redundant features are filtered out.
The main advantage is speed. Since no model training is involved, filter methods can process large datasets quickly and efficiently. This makes them ideal for initial data exploration and dimensionality reduction.
However, filter methods have a key limitation: they evaluate features independently. This means they can miss important feature interactions — cases where two or more features together provide valuable information, even if individually they seem weak. Despite this, filter methods remain a standard first step in most feature selection pipelines, particularly when working with high-dimensional data where computational efficiency matters.
Common techniques: Pearson correlation, ANOVA F-test, Chi-square test, Mutual Information
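Here is a minimal scikit-learn sketch of a filter method. The synthetic dataset, the mutual information score, and k=10 are arbitrary placeholder choices, not the only way to do it.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data standing in for your real dataset.
X, y = make_classification(n_samples=500, n_features=25,
                           n_informative=8, random_state=42)

# Score every feature against the target, keep the 10 best.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print("Kept feature indices:", selector.get_support(indices=True))
print("Shape before/after:", X.shape, "->", X_selected.shape)
```

Note that no model is trained here; the scores come purely from the statistics of each feature against the target, which is exactly why this step is fast.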
B) Wrapper Method
Wrapper methods treat feature selection as a search problem. They evaluate different subsets of features by actually training and testing the model multiple times, then selecting the subset that produces the best performance.
The process works in three stages:
Subset generation — Create different combinations of features
Model training — Train the model on each subset
Performance evaluation — Measure model performance (accuracy, F1-score, etc.) and select the best-performing subset
Because wrapper methods involve the actual model in the selection process, they capture feature interactions and dependencies that filter methods miss. This typically leads to better model performance.
The tradeoff is computational cost. Training a model repeatedly on different feature subsets is time-consuming and resource-intensive, especially with large datasets or complex models. Wrapper methods work best when you have a smaller dataset or when model accuracy is the top priority.
Common techniques: Forward Selection, Backward Elimination, Recursive Feature Elimination (RFE)
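A short sketch of one wrapper technique, Recursive Feature Elimination, using scikit-learn. Logistic regression and the target of 10 features are placeholder choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=25,
                           n_informative=8, random_state=42)

# RFE repeatedly fits the model and drops the weakest feature each round.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=10, step=1)
rfe.fit(X, y)

print("Selected feature indices:", rfe.get_support(indices=True))
print("Feature ranking (1 = kept):", rfe.ranking_)
```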
C) Embedded Method
Embedded methods perform feature selection as an integrated part of the model training process. Instead of selecting features before or after training, these methods learn which features are important while the model is being built.
Tree-based models like Random Forest, XGBoost, CatBoost, and LightGBM are classic examples. As these models construct decision trees, they evaluate which features provide the best splits based on criteria like information gain or Gini impurity. Features that consistently improve the model’s performance receive higher importance scores, while less useful features naturally get sidelined.
Regularization-based models like Lasso regression also fall into this category. Lasso adds a penalty term that can shrink some feature coefficients to zero, effectively removing them from the model during training.
Embedded methods offer a practical middle ground: they’re faster than wrapper methods because they don’t require multiple training rounds, yet they’re smarter than filter methods because they consider feature interactions within the model’s context. They also provide interpretability through feature importance scores.
For most real-world applications, especially those using tree-based algorithms, embedded methods are sufficient and are widely adopted in production systems. They may not always be optimal for very high-dimensional datasets or simpler linear models, where additional feature selection might still help with training efficiency and generalization.
Common techniques: Random Forest feature importance, L1 regularization (Lasso), Tree-based feature importance (XGBoost, LightGBM)
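As a rough example of the embedded idea, here is Lasso shrinking coefficients to zero during training. The synthetic regression data and alpha=1.0 are placeholders; in practice you would tune the penalty.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=25,
                       n_informative=8, noise=10.0, random_state=42)

# Lasso's L1 penalty drives some coefficients exactly to zero during training.
lasso = Lasso(alpha=1.0).fit(X, y)

# Keep only the features whose coefficients survived the penalty.
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)

print("Non-zero coefficients:", (lasso.coef_ != 0).sum())
print("Shape before/after:", X.shape, "->", X_selected.shape)
```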
Hybrid Methods
Hybrid methods combine multiple approaches to leverage their respective strengths. The typical workflow involves two stages:
Pre-training stage: Apply a filter method to quickly eliminate clearly irrelevant or redundant features from the dataset
Post-training stage: Use a wrapper method to refine the selection and identify the optimal feature subset for model performance
This strategy addresses the weaknesses of individual methods. The filter step handles the computational burden of high-dimensional data, while the wrapper step ensures the final feature set is optimized for the specific model being used.
Hybrid methods are particularly effective for complex, large-scale problems where you need both efficiency and accuracy. They’re increasingly common in industrial applications where datasets are large but model performance is critical: think fraud detection, recommendation systems, or medical diagnosis.
The additional complexity is justified when neither filters alone (too simplistic) nor wrappers alone (too expensive) can meet your requirements.
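A sketch of the two-stage hybrid workflow as a scikit-learn pipeline: a cheap filter prunes the feature space first, then a wrapper refines the survivors. The cut-offs (keep 15, then 8) are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=40,
                           n_informative=10, random_state=42)

hybrid = Pipeline([
    # Stage 1 (filter): keep the 15 features with the highest ANOVA F-scores.
    ("filter", SelectKBest(score_func=f_classif, k=15)),
    # Stage 2 (wrapper): RFE searches for the best 8 of those survivors.
    ("wrapper", RFE(LogisticRegression(max_iter=1000), n_features_to_select=8)),
    ("model", LogisticRegression(max_iter=1000)),
])

hybrid.fit(X, y)
print("Filter + wrapper done; final model trained on 8 features.")
```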

When to Stop: Knowing When You’ve Selected Enough Features
Feature selection isn’t about finding the perfect subset — it’s about hitting diminishing returns before you waste time. Here’s how to know when to stop:
The 3 Stopping Signals
1. Performance Plateau
Track your validation metric as you add/remove features:
10 features → 82% accuracy
15 features → 87% accuracy
20 features → 88% accuracy ← You probably should stop here
25 features → 88.2% accuracy
30 features → 88.1% accuracy
When adding 5 more features only gains you 0.2%, you’re done. That extra complexity isn’t buying you anything.
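One way to see the plateau yourself is to evaluate the model on a growing number of top-ranked features. This sketch uses cross-validation on synthetic data; the random forest ranking and the feature counts are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=10, random_state=42)

# Rank features once by importance, then evaluate growing subsets.
ranking = np.argsort(RandomForestClassifier(random_state=42)
                     .fit(X, y).feature_importances_)[::-1]

for k in (5, 10, 15, 20, 25, 30):
    top_k = ranking[:k]
    score = cross_val_score(RandomForestClassifier(random_state=42),
                            X[:, top_k], y, cv=5).mean()
    print(f"{k:>2} features -> CV accuracy {score:.3f}")
```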
2. The 1% Rule
If a feature improves your metric by less than 1% and increases training time by more than 10%, drop it. Real example: A fraud detection model went from 94.2% to 94.7% F1-score by adding “user’s browser type”, but inference time jumped 40% because it required extra API calls. Not worth it.
3. Business Constraint Check
Sometimes the best features aren’t available in production:
Can you actually get this data in real-time? (Predicting loan defaults using “employment history” is great, but if HR systems take 3 days to respond, it’s useless)
Does this feature introduce bias? (ZIP code might be predictive for credit risk, but using it might violate fairness policies)
What’s the cost? (Calling an external API for credit scores costs $0.50 per prediction, does the accuracy gain justify it?)
Framework
START: All features
  ↓
Apply Filter Method → Keep top 30-40% by statistical score
  ↓
Does model accuracy meet baseline?
  NO  → Add back features using Wrapper/Embedded
  YES → Are you overfitting? (train-val gap > 5%)
          YES → Remove features using regularization
          NO  → Check training time
                  Too slow?   → Remove low-importance features
                  Acceptable? → STOP HERE ✓
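For completeness, here is a rough Python translation of that flow under simplifying assumptions: the 0.85 baseline and the 5% train-validation gap mirror the flowchart, while the dataset, the 40% filter cut-off, and the random forest are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=40,
                           n_informative=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

# Step 1 (filter): keep roughly the top 40% of features by ANOVA F-score.
selector = SelectKBest(f_classif, k=int(X.shape[1] * 0.4)).fit(X_tr, y_tr)
keep = selector.get_support(indices=True)

model = RandomForestClassifier(random_state=42).fit(X_tr[:, keep], y_tr)
train_acc = model.score(X_tr[:, keep], y_tr)
val_acc = model.score(X_val[:, keep], y_val)

BASELINE = 0.85  # placeholder business requirement

if val_acc < BASELINE:
    print("Below baseline: add features back using wrapper/embedded methods.")
elif train_acc - val_acc > 0.05:
    print("Overfitting: drop more features or add regularization.")
else:
    print(f"STOP: {val_acc:.3f} validation accuracy with {len(keep)} features.")
```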

