Feature Selection: Declutter your data

The lifestyle trend of decluttering, simplifying and organising our homes has many parallels in data science. With an abundance of data, we run the risk of adding noise and degrading the performance of our machine learning models. Many models perform better once unnecessary features are discarded, so it is vital that we include only the most important ones. The challenge is identifying which features we really need and getting rid of the ones that we don’t.

When a dataset contains thousands of features and we generate thousands more through feature engineering, working out what to keep and what to discard isn’t easy. This problem has been explored in depth in the literature, with many solutions offered. We found that the existing methods had some flaws: they were model-specific, did not consider interaction effects, and were not practical for large feature sets. Our blog on Evolutionary Feature Engineering explores one methodology we are trialling at Elula. Here, we will explore another method that we use to discard features that do not help our model performance – the Subset Feature Scorer.

The central idea of the Subset Feature Scorer method is experimentation. We first generate a small list of features, the “subset”, by randomly taking features from the full set. We then train a model on these features and record a score for each feature.

We score the feature by multiplying:

– the importance of that feature in the model by

– the performance of the model, such as the F1 score.

In this way, the score of each feature reflects both what the feature contributed to the model and how well that model actually performed.
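As a rough illustration, a single experiment could be sketched in Python as follows. This is a minimal sketch, not our production code: the names run_experiment, train_fn and toy_train_fn are hypothetical, and toy_train_fn is a stand-in that returns made-up importances and a fixed performance value instead of training a real model.

```python
import random
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical contract: given feature names, train a model and return
# (importance per feature, model performance such as an F1 score).
TrainFn = Callable[[Sequence[str]], Tuple[Dict[str, float], float]]


def run_experiment(
    all_features: Sequence[str],
    subset_size: int,
    train_fn: TrainFn,
    rng: random.Random,
) -> Dict[str, float]:
    """Run one experiment and return a score for each feature in the subset."""
    # Randomly take features from the full set to form the subset.
    subset = rng.sample(list(all_features), subset_size)
    # Train a model on the subset only.
    importances, performance = train_fn(subset)
    # Score = importance of the feature in the model * performance of the model.
    return {name: importances[name] * performance for name in subset}


def toy_train_fn(subset: Sequence[str]) -> Tuple[Dict[str, float], float]:
    """Stand-in for real training (assumption): equal importances, fixed score."""
    importances = {name: 1.0 / len(subset) for name in subset}
    performance = 0.8  # pretend F1 score
    return importances, performance


scores = run_experiment(["a", "b", "c", "d"], 2, toy_train_fn, random.Random(0))
```

In practice train_fn would wrap whichever model is being used, provided a per-feature importance can be extracted from it.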

We run many thousands of these experiments, with each feature appearing many times. The final score of a feature is the maximum of all the experiment scores for that feature, the best it ever performed. We choose the maximum rather than the average because we want to take into account interaction effects: if under the right circumstances the feature can perform extremely well, we want to keep it.
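The aggregation step can be sketched as below. The function name final_scores and the feature names and numbers are invented for illustration; each dictionary stands for the scores recorded in one experiment.

```python
from typing import Dict, Iterable


def final_scores(experiments: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Keep, for each feature, the best score it achieved in any experiment."""
    best: Dict[str, float] = {}
    for scores in experiments:
        for feature, score in scores.items():
            if score > best.get(feature, float("-inf")):
                best[feature] = score
    return best


# Illustrative data only: three experiments over three made-up features.
experiments = [
    {"age": 0.10, "tenure": 0.30},
    {"age": 0.45, "balance": 0.20},
    {"tenure": 0.25, "balance": 0.05},
]
result = final_scores(experiments)
# "age" scores poorly alongside "tenure" but well alongside "balance";
# taking the maximum keeps the 0.45 rather than averaging it away.
```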

It is important to choose the size of each subset based on the number of experiments we will perform and the number of features in our dataset.

The subset must be large enough that each individual feature will probably be in an experiment with all other features at some point, so we can see interaction effects. However, it must be small enough so that training time is low, and the score reflects strongly on each feature present in the experiment. A high scoring model with 10 features tells us much more about those 10 features than a high scoring model with 100 features.
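One way to reason about this trade-off is to estimate how likely it is that any given pair of features lands in the same experiment at least once. The sketch below assumes each subset is drawn uniformly at random without replacement and that experiments are independent; the function name pair_coverage and the example numbers are illustrative, not a rule we prescribe.

```python
def pair_coverage(n_features: int, subset_size: int, n_experiments: int) -> float:
    """Probability that a specific pair of features shares at least one experiment.

    Assumes uniform random subsets and independent experiments.
    """
    # Chance that both features of the pair fall into one random subset.
    p_single = (subset_size * (subset_size - 1)) / (n_features * (n_features - 1))
    # Complement of the pair never appearing together in any experiment.
    return 1.0 - (1.0 - p_single) ** n_experiments


# 1000 features, 30,000 experiments: compare subsets of 10 and 30 features.
coverage_10 = pair_coverage(1000, 10, 30_000)  # roughly 0.93
coverage_30 = pair_coverage(1000, 30, 30_000)  # very close to 1
```

Under these assumptions, subsets of 10 features already pair up most features at least once, while larger subsets raise coverage at the cost of slower training and a weaker signal per feature.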

After running all experiments and calculating all final scores for each feature, we are left with a score for each feature which is an intelligent measure of how important that feature is. This allows us to discard unnecessary features and keep only the small number of features that should be incorporated into our final model.

The Subset Feature Scorer is far more flexible than a walk-backwards algorithm. In the walk-backwards algorithm, to reduce a feature set from 1000 features to 100 by experimentally removing each feature in turn, we must run almost 500,000 experiments. The only way to reduce that number of experiments is to increase the number of features you keep at the end. With the Subset Feature Scorer, however, instead of saying “we need 500,000 model runs”, we can say “here is the best we can do with 30,000 model runs”, and the result will still contain an accurate measure of feature importance. We can achieve the maximum level of confidence for any given constraint, rather than requiring a certain amount of resources in order to achieve the goal.
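The figure of almost 500,000 can be checked with a short calculation. This sketch assumes the simplest form of walk-backwards: each round trains one model per candidate removal and then drops a single feature. The function name is hypothetical.

```python
def backward_elimination_runs(start: int, target: int) -> int:
    """Model runs needed to walk backwards from `start` to `target` features.

    Assumes a round with n features remaining trains n models
    (one per candidate removal) and then removes exactly one feature.
    """
    return sum(range(target + 1, start + 1))


runs = backward_elimination_runs(1000, 100)
print(runs)  # 495450
```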

The Subset Feature Scorer is therefore an intelligent way to solve a large problem in data science, which has only become possible due to the ability to extract a feature importance from any model, and the ability to run tens of thousands of models cheaply and simultaneously using cloud computing.

“If you use the right method and concentrate your efforts on eliminating clutter thoroughly and completely within a short span of time, you’ll see instant results that will empower you to keep your space in order ever after.” – Marie Kondo

Joshua
