Unbalanced Datasets & What To Do About Them

by Germán Lahera on Jan 22, 2019

Unbalanced datasets are prevalent in a multitude of fields and sectors, and financial services is no exception. From fraud to non-performing loans, data scientists come across them in many contexts. The challenge appears when machine learning algorithms try to identify these rare cases in rather big datasets. Because of the disparity between the classes, the algorithm tends to categorize observations into the class with more instances, the majority class, while at the same time giving the false sense of a highly accurate model. Both the inability to predict rare events (the minority class) and the misleading accuracy detract from the predictive models we build.

The class imbalance between the majority and minority classes is frustrating, but not unexpected. We will now discuss the main techniques and methods available for dealing with this type of data. At the end of this post, you will find common Python and R libraries and packages used to resolve this issue.

What Does An Unbalanced Dataset Mean?

In simple terms, an unbalanced dataset is one in which the target variable has many more observations in one class than in the others.


For example, let’s suppose that we have a dataset used to detect fraudulent transactions. If we have a binary target variable (2 classes) that is 1 when the transaction is fraudulent and 0 when it isn’t, it is common for less than 1% of the observations to belong to class 1 (fraud), with the rest in class 0 (not fraud). In this case, we have a highly unbalanced dataset.

Another example would be a target variable with three classes, where 70% of the observations belong to the 1st class and 17% and 13% to the 2nd and 3rd classes respectively.

You might think: “Okay, that sounds simple, so where does the problem lie?”

If your goal is to use techniques to cluster the sample into natural groups or to describe the relationship of the minority class with the features (independent variables), then this doesn’t pose a “huge” problem. It only becomes an issue when this “property” affects the performance of the algorithms or the models that you could obtain.

If the classes are separable using the available features, then the distribution of the classes between them is not problematic.

The problem is that models trained on unbalanced datasets often produce poor results when they have to generalize (predict a class or classify unseen observations). Whichever algorithm you choose, some models will be more susceptible to unbalanced data than others; ultimately, you risk not ending up with a good model.

The algorithm receives significantly more examples from one class, prompting it to be biased towards that particular class. It does not learn what makes the other class “different” and fails to understand the underlying patterns that allow us to distinguish classes.

The algorithm learns that a given class is more common, making it “natural” for there to be a greater tendency towards it. The algorithm is then prone to overfitting the majority class. Just by predicting the majority class, a model can appear to perform well on its training objective and evaluation metric. In these instances, the Accuracy Paradox appears.

When faced with this situation, what steps can we take to solve it?

1. Resample the dataset:

  • Undersampling: The idea is to reduce the number of instances in the majority class so the ratio between majority and minority classes is less skewed. You can randomly select observations in the desired ratio: 50/50 or 60/40 in a binary case, or 40/30/30 if you have three classes; here, taking a random sample without replacement is enough. You can also carry out informed undersampling by looking at the distribution of the data and choosing which observations to discard. In this case, first try a clustering technique or k-NN (the k-nearest neighbors algorithm) to obtain a downsampled dataset that includes observations from every natural group of data inside the majority class. This way, you retain the underlying information in the sample; purely random downsampling may select all of one type of observation, losing valuable information. While resampling, you can try different ratios, as each class does not have to end up with the same number of observations.
  • Oversampling: You can create synthetic observations of the minority class based on the available data. Algorithms that allow for this include Variational Autoencoders (VAE), SMOTE (Synthetic Minority Over-sampling Technique), and MSMOTE (Modified Synthetic Minority Over-sampling Technique).
  • VAE lets you explore variations of your current data, not just randomly but in a desired direction, which makes it a powerful method for this task. The theory is covered in many resources, but in summary, an autoencoder network is an encoder connected to a decoder. The encoder takes the input data and turns it into a smaller, dense representation; the decoder takes this representation and converts it back into the original input. Standard autoencoders learn to generate compact representations and reconstruct their inputs correctly, but the latent space where their encoded vectors lie may not be continuous or allow easy interpolation. VAE latent spaces, by design, are continuous, allowing easy random sampling and interpolation. They achieve this by making the encoder output not a single encoding vector but two vectors, μ and σ: the parameters of a Gaussian distribution from which you sample an encoding that is then passed to the decoder. This stochastic generation means that even for the same input, while the mean and standard deviation remain the same, the actual encoding will vary on every single pass (a sketch of this sampling step follows this list).
  • For SMOTE, you select a number of observations (20, 30, or 50; the number is configurable) and use a distance measure to synthetically generate new instances with the same “properties” for the available features. Working one feature vector at a time, SMOTE takes the difference between an observation and one of its nearest neighbors, multiplies that difference by a random number between zero and one, and adds the scaled difference to the original observation to obtain a new point. This way, SMOTE does not copy observations and instead creates new, synthetic ones (see the sketch after this list).
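
Below is a minimal NumPy sketch of just the VAE sampling step described above; the μ and log-variance values are made up for illustration, and the encoder and decoder networks of a real VAE are omitted:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for a single input observation (made-up values):
mu = np.array([0.3, -1.2])       # mean vector produced by the encoder
log_var = np.array([-0.5, 0.1])  # log-variance vector produced by the encoder

# Sample epsilon ~ N(0, I), then shift and scale it by the mean and standard deviation.
# Even for the same input (same mu, log_var), each pass yields a different encoding.
for _ in range(3):
    epsilon = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * epsilon
    print(z)  # a different latent sample every time; a decoder would map z to a synthetic observation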

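And a minimal NumPy sketch of SMOTE’s interpolation step, with made-up minority-class values; a production implementation, such as the one in imbalanced-learn shown later, handles neighbor search and sampling strategy for you:

import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class sample: five observations, two features (illustrative values).
X_minority = np.array([[1.0, 2.0],
                       [1.2, 1.9],
                       [0.9, 2.2],
                       [1.4, 2.1],
                       [1.1, 1.8]])

def smote_one(X, i, k=3):
    """Generate one synthetic observation from X[i], SMOTE-style."""
    d = np.linalg.norm(X - X[i], axis=1)  # distances to every observation
    neighbors = np.argsort(d)[1:k + 1]    # k nearest neighbors, excluding X[i] itself
    nn = X[rng.choice(neighbors)]         # pick one neighbor at random
    gap = rng.uniform(0.0, 1.0)           # random number between zero and one
    return X[i] + gap * (nn - X[i])       # interpolate between X[i] and the neighbor

synthetic = np.array([smote_one(X_minority, i) for i in range(len(X_minority))])
print(synthetic)  # new synthetic minority observations, not copies of the originals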

2. Collect more data from the minority class

This option appears trivial, but it solves the problem when it is applicable.

3. Use the “adequate” algorithm

Some algorithms are more robust than others. A mastery of the theory behind each algorithm will help you understand their strengths and weaknesses in various situations. Remember, machine learning algorithms are chosen based on the input data and learning task at hand.

4. Change your approach

Instead of building a classifier, sometimes it is beneficial to change your approach and scope; one option is to analyze your data from the ‘anomaly detection’ point of view. You can then apply algorithms ranging from ‘One-Class SVM’ to ‘Local Outlier Factor (LOF)’, as in the sketch below.
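
A minimal sketch, assuming scikit-learn and toy Gaussian data standing in for normal and anomalous transactions:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Toy data: 500 "normal" observations plus 5 outliers standing in for fraud.
X_normal = rng.normal(0, 1, size=(500, 2))
X = np.vstack([X_normal, rng.normal(6, 1, size=(5, 2))])

# One-Class SVM: learns the boundary of the "normal" region and flags what falls outside.
ocsvm = OneClassSVM(nu=0.01, kernel='rbf', gamma='scale').fit(X_normal)
print((ocsvm.predict(X) == -1).sum())  # -1 marks predicted anomalies

# Local Outlier Factor: flags points whose local density deviates from their neighbors'.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
print((lof.fit_predict(X) == -1).sum())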

5. Use penalized models

Many algorithms have their own penalized version. Usually, algorithms treat all misclassifications the same, so the idea is to penalize misclassifications of the minority class more heavily than those of the majority. Mistakes made during training carry an additional cost (which is why these are called cost-sensitive classifiers), and in theory these penalties push the model to pay more attention to the minority class. The penalties are sometimes called weights. Finding the right matrix of penalties can be difficult and sometimes does not improve results much, so try several schemas until you find the one that works best for your situation.
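
As a minimal scikit-learn sketch of the idea, on a made-up unbalanced problem; the {0: 1, 1: 10} weights are illustrative, not a recommendation:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Toy problem with a 5% minority class (made-up data).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=83)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=83)

# An unweighted model versus one that penalizes minority-class mistakes ten times more.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10}).fit(X_tr, y_tr)
# class_weight='balanced' would derive the weights from class frequencies instead.

print(recall_score(y_te, plain.predict(X_te)))     # recall on the minority class
print(recall_score(y_te, weighted.predict(X_te)))  # usually higher: penalties shift attention to the minority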

The Accuracy Paradox:

In unbalanced datasets, the “Accuracy Paradox” is common. It occurs when you use the “accuracy” metric to learn the best model.

Let’s continue with the example of fraud detection data. If 99% of the observations belong to the majority class (not fraud), the algorithm can simply assign every observation to that class and achieve 99% accuracy. It is true that most of the observations belong to the majority class, but in practice you do not have a good model; you have a model that assigns the same class to all observations and does not generalize at all.
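
A quick demonstration of the paradox, assuming scikit-learn and a made-up sample with 1% fraud:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# 1,000 transactions, only 10 of them fraudulent (class 1).
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # the features are irrelevant for this demonstration

# A "model" that always predicts the majority class...
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
print(accuracy_score(y, dummy.predict(X)))  # 0.99: high accuracy, yet it never detects fraud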

So, which metrics can we use when our data is unbalanced?

  • From the confusion matrix, we can calculate (just looking at the confusion matrix gives us insightful information):
    • Sensitivity and Specificity
    • Precision and Recall
  • F1-score (also F-score or F-measure): the harmonic average of precision and recall.
  • AUC (ROC curves)
  • Logloss
  • Normalized Gini Score (the Gini score is merely a reformulation of the AUC: Gini = 2×AUC−1)
  • Kappa: Classification accuracy normalized by the imbalance of the classes in the data.
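
As a minimal sketch of computing these metrics with scikit-learn; the labels and probabilities below are made up for illustration:

from sklearn.metrics import (cohen_kappa_score, confusion_matrix, f1_score,
                             log_loss, precision_score, recall_score,
                             roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_prob = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.8, 0.4]

print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN), a.k.a. sensitivity
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_prob))      # AUC; Gini = 2 * AUC - 1
print(log_loss(y_true, y_prob))
print(cohen_kappa_score(y_true, y_pred))  # accuracy normalized by class imbalance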

Common libraries and packages used to resolve this issue

In Python, one of the best options is the imbalanced-learn package:

It includes undersampling and oversampling methods; you can find all the options in its API documentation. Notice that it also has utilities for Keras and TensorFlow and includes functions to calculate some of the metrics discussed above.


import pandas as pd
import imblearn.under_sampling as under
from sklearn.datasets import load_breast_cancer

# Class counts in the original breast-cancer dataset.
breast_cancer = load_breast_cancer()
pd.Series(breast_cancer.target).value_counts()
# Out:
# 1    357
# 0    212

# Undersample the majority class (1) down to 300 observations with cluster centroids.
UnderSampling = under.ClusterCentroids(sampling_strategy={1: 300, 0: 212},
                                       random_state=83, voting='hard')
x_resampled, y_resampled = UnderSampling.fit_resample(breast_cancer.data,
                                                      breast_cancer.target)
pd.Series(y_resampled).value_counts()
# Out:
# 1    300
# 0    212

When using R's DMwR (Data Mining with R) package, you can find the SMOTE function to apply the method discussed above.

Another option is the ROSE function (the package has the same name). ROSE (Random Over-Sampling Examples) aids the task of binary classification in the presence of rare classes. It produces a synthetic and likely balanced sample of data simulated using a smoothed-bootstrap approach.

The caret package includes useful functions such as:

  • downSample: randomly sample a data set so that all the classes have the same frequency as the minority class.
  • confusionMatrix: includes metrics from the confusion matrix (sensitivity, specificity, kappa, precision, and more.)

library('DMwR')
data("iris")

# Class counts in the original iris data.
table(iris$Species)
# Out:
#     setosa versicolor  virginica
#         50         50         50

# perc.over controls how much the minority class is oversampled,
# perc.under how much the other classes are undersampled.
x_smote <- SMOTE(Species ~ ., iris, perc.over = 150, perc.under = 250, k = 5)
table(x_smote$Species)
# Out:
#     setosa versicolor  virginica
#        100         66         59

The original version of this post can be found on Strands Tech Corner on Medium.

About
Germán Lahera

Data scientist at Strands. Mathematician with a Master of Science in Probability and Statistics. He has been working on machine learning and artificial intelligence applications for the past several years.
