Unbalanced datasets are common in every field and sector, and banks are no exception. From fraud to non-performing loans, data scientists find them in many contexts.
Here we will discuss the principal techniques and methods available when dealing with this type of data. At the end of this blog post, you’ll find some of the common libraries and packages from Python and R that we use when working on this kind of data on a daily basis.
What We Understand by an Unbalanced Dataset
First, it’s important to clarify that this concept applies when the target variable (response feature / dependent variable) is categorical or ordinal. This type of variable has classes or levels. If the target variable is continuous, it would not make sense to talk about unbalanced datasets; the concept only applies to classification tasks. With this in mind, let’s continue.
In simple terms, an unbalanced dataset is one in which the target variable has many more observations in one specific class (or level) than the others.
For example, let’s suppose that we have a dataset for fraud detection. If we have a binary target variable (2 classes or levels) — 1 when the transaction is fraudulent and 0 when it isn’t — it’s normal for fewer than 1% of the observations to belong to class 1 (fraud), with the rest belonging to class 0 (not fraud).
In this case, we have a highly unbalanced dataset. We could also have a target variable with 3 classes where, for example, 70% of the observations belong to the 1st class and 17% and 13% to the 2nd and 3rd respectively.
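A quick first step is simply to inspect the relative class frequencies. A minimal sketch in Python, using made-up labels for illustration:

```python
from collections import Counter

# Hypothetical binary labels: 1 = fraud, 0 = not fraud
labels = [0] * 990 + [1] * 10

counts = Counter(labels)
total = len(labels)
for cls, n in sorted(counts.items()):
    print(f"class {cls}: {n} observations ({100 * n / total:.1f}%)")
```

Anything far from an even split on this kind of summary is worth a closer look before modelling.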
You might think: “Ok, that sounds simple, so where does the problem lie?”. Before going further, two caveats are worth noting:
- If your goal is to use techniques to cluster the sample into natural groups or to describe the relationship of the minority class with the rest of the features (independent variables) then this doesn’t pose a “huge” problem. It only becomes an issue when this “property” affects the performance of the algorithms or the models that you could obtain.
- If the classes are perfectly separable using the available features, then the distribution of the observations between them isn’t problematic.
Putting these cases to one side, the problem is that models trained on unbalanced datasets (whichever algorithm you choose, though some are more vulnerable than others to unbalanced data) perform poorly in almost all cases when they have to generalize, i.e. predict a class for unseen observations. This ultimately means you won’t end up with a good model.
What happens is that the algorithm receives significantly more examples from one class, causing a bias towards that particular class. It doesn’t learn what makes the other class “different”, and fails to understand the underlying patterns that allow us to differentiate one class from another based on the available features. The algorithm also learns that one class is more “common”, develops a tendency to predict it, and is prone to “overfitting” the majority class. Just by predicting the majority class, models can score well on accuracy-like metrics. In these cases, the “Accuracy Paradox” appears.
When faced with a situation like this, what steps can be taken to solve it?
1. Resample the dataset:
- Undersampling: The idea is to reduce the ratio of observations between the majority and minority classes. You can randomly select a number of majority-class observations to reach a ratio of, say, 50/50 or 60/40 (binary case) or 40/30/30 (if you have 3 classes). In this case, random sampling without replacement is enough. Alternatively, you can carry out an informed undersampling: first learn the distribution of the data, then selectively pick the observations to discard. For example, you can first apply a clustering technique or k-NN (the k-nearest neighbours algorithm), so that the downsampled dataset includes observations from every natural group within the majority class. This way you keep the underlying information hidden in the sample, whereas random downsampling may drop all the observations of one “type” and lose real information. During resampling you can try different ratios; the classes don’t have to end up with exactly the same number of observations.
- Oversampling: You can create synthetic observations of the minority class based on the available data. Algorithms that allow this include Variational Autoencoders (VAE), SMOTE (Synthetic Minority Over-sampling Technique) and MSMOTE (Modified Synthetic Minority Over-sampling Technique).
- VAE allows you to explore variations of your current data, not just at random but in a desired direction, which makes it a very powerful method for this task. The theory is covered in many resources, but in summary: an autoencoder is a pair of two connected networks, an encoder and a decoder. The encoder takes an input and converts it into a smaller, dense representation; the decoder converts that representation back to the original input. Standard autoencoders learn to generate compact representations and reconstruct their inputs correctly, but the latent space their encoded vectors lie in may not be continuous or allow easy interpolation. In VAEs, the latent space is continuous by design, allowing easy random sampling and interpolation. They achieve this by making the encoder output not a single encoding vector but two vectors, μ and σ, which form the parameters of a Gaussian distribution; you sample from this distribution to obtain the encoding that is passed to the decoder. This stochastic generation means that, even for the same input (and therefore the same mean and standard deviation), the actual encoding will vary on every single pass.
- SMOTE selects some neighbouring observations (20, 30 or 50, the number is a parameter) using a distance measure, and synthetically generates a new observation with the same “properties” for the available features, perturbing the observation one feature at a time by a random amount within the difference to the neighbouring observations; it doesn’t just copy existing observations.
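Both resampling strategies above can be sketched in a few lines. This is a simplified illustration on invented data, not a full implementation: real SMOTE also uses k-nearest neighbours to pick the partner observation, and libraries such as imbalanced-learn provide production-ready versions.

```python
import random

import numpy as np

random.seed(0)
rng = np.random.default_rng(0)

# Hypothetical indices: 990 majority-class and 10 minority-class observations
majority_idx = list(range(990))
minority_idx = list(range(990, 1000))

# Random undersampling (without replacement) towards a 60/40 ratio:
# keep all minority observations, draw the matching number of majority ones
n_keep = int(len(minority_idx) * 60 / 40)           # 15 majority observations
undersampled_idx = random.sample(majority_idx, n_keep) + minority_idx

# SMOTE-style oversampling: interpolate between a minority observation
# and one of its neighbours, one random step per feature
def synthetic_sample(x, neighbour):
    gap = rng.random(len(x))                        # random fraction per feature
    return x + gap * (neighbour - x)

x = np.array([1.0, 5.0, 0.2])
neighbour = np.array([1.4, 4.0, 0.5])
new_obs = synthetic_sample(x, neighbour)            # lies between x and neighbour
```

The synthetic observation always falls, feature by feature, between the original observation and its neighbour, which is what keeps it plausible.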
2. Collect more data from the minority class.
This option could appear trivial, but it really solves the problem when it is applicable.
3. Use an appropriate algorithm
Some algorithms are more robust to unbalanced data than others. A solid command of the theory behind each algorithm will let you understand its strengths and weaknesses in every situation. Remember that machine learning algorithms are chosen on the basis of the input data and the learning task at hand.
4. Change your approach
Sometimes, instead of building a classifier, it is more beneficial to change your approach and scope — one option is to analyse your data from an ‘anomaly detection’ point of view, for example. You can then apply algorithms ranging from One-Class SVMs to Local Outlier Factor (LOF).
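As an illustration of the anomaly-detection framing, here is a sketch using scikit-learn’s LocalOutlierFactor on made-up data (the cluster and contamination parameters are invented for the example):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical data: a dense cluster of "normal" observations plus 3 planted outliers
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[6.0, 6.0], [-7.0, 5.0], [8.0, -6.0]])
X = np.vstack([normal, outliers])

# LOF flags points whose local density is much lower than their neighbours'
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
pred = lof.fit_predict(X)   # -1 = anomaly, 1 = inlier
```

Here the rare class is never modelled directly; anything that sits far from the bulk of the data is flagged, which is often exactly what you want for fraud-like problems.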
5. Use penalized models
Many algorithms have a penalized version. Normally an algorithm treats all misclassifications the same way, so the idea is to penalize misclassifications of the minority class more heavily than those of the majority. This imposes an additional cost for such mistakes during training (which is why these are sometimes called cost-sensitive classifiers), and in theory these penalties push the model to pay more attention to the minority class. In some cases, the penalties are called weights. Finding the right matrix of penalties can be difficult and sometimes doesn’t do much to improve results, so try several schemas until you find the one that works best for your specific situation.
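For example, scikit-learn exposes such weights through the class_weight parameter. A minimal sketch on made-up data (the sample sizes and distributions are invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical unbalanced sample: 950 majority (class 0), 50 minority (class 1)
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 950 + [1] * 50)

# 'balanced' weighs each class inversely to its frequency, so a minority
# misclassification costs roughly 19x more during training
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Explicit penalty schemas are also possible, e.g. class_weight={0: 1, 1: 10}
```

Trying a few weight schemas and comparing a minority-sensitive metric (recall, F1) is usually more informative than comparing accuracy.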
In unbalanced datasets, the “Accuracy Paradox” is common. This happens when you use ‘Accuracy’ as the metric to learn the best model.
Let’s continue with the fraud detection example. In these cases, the algorithm finds it easiest to assign all the observations to the majority class (99% of the observations), because then the accuracy of the model will be around 99%. It’s true that almost all the observations belong to the majority class, but in practice you don’t have a good model: you have a model that assigns the same class to every observation and doesn’t generalize well.
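The paradox is easy to reproduce. A toy sketch, with made-up counts:

```python
# 990 legitimate transactions (class 0) and 10 fraudulent ones (class 1)
y_true = [0] * 990 + [1] * 10

# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fraud_recall = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / 10

print(accuracy)      # 0.99, looks excellent
print(fraud_recall)  # 0.0, not a single fraud caught
```

The 99% accuracy hides the fact that the model is useless for the class you actually care about.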
So the question arises: which metric can I use when my data is unbalanced?
Metrics for unbalanced data:
- From the confusion matrix we can calculate the following (the confusion matrix itself already gives insightful information):
- Sensitivity and Specificity
- Precision and Recall
- F1-score (also F-score or F-measure): the harmonic mean of precision and recall.
- AUC (ROC curves)
- Normalized Gini Score (the Gini score is merely a reformulation of the AUC: gini = 2×AUC−1)
- Kappa: Classification accuracy normalised by the imbalance of the classes in the data.
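Most of these follow directly from the four cells of the confusion matrix. A small sketch with invented counts:

```python
# Hypothetical confusion matrix for a fraud classifier:
#                  predicted 0   predicted 1
# actual 0:            980            10       (tn, fp)
# actual 1:              4             6       (fn, tp)
tn, fp, fn, tp = 980, 10, 4, 6

sensitivity = tp / (tp + fn)          # recall of the positive class
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
recall = sensitivity
f1 = 2 * precision * recall / (precision + recall)

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Note how a classifier with 98.6% accuracy on these counts still only catches 60% of the frauds, which is exactly what these metrics expose.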
In Python one of the best options is the imbalanced-learn package:
It includes under-sampling and over-sampling methods; you can find all the options in its API documentation. Notice that it has utilities for Keras and TensorFlow, and it also includes functions to calculate some of the metrics discussed above.
In the DMwR (Data Mining with R) package you can find the SMOTE function to apply the method discussed above.
Another option is the ROSE function (the package has the same name). ROSE (Random Over-Sampling Examples) aids the task of binary classification in the presence of rare classes. It produces a synthetic, possibly balanced, sample of data simulated according to a smoothed-bootstrap approach.
The caret package includes very useful functions such as:
- downSample: randomly samples a dataset so that all classes have the same frequency as the minority class.
- confusionMatrix: also computes all the metrics you can obtain from the confusion matrix: sensitivity, specificity, kappa, precision, etc.
The original version of this post can be found on Strands Tech Corner on Medium