Overview
Machine learning is a branch of artificial intelligence that involves building algorithms and statistical models that enable computers to learn and make predictions or decisions without being explicitly programmed. To put it differently, machine learning algorithms enable computers to gather knowledge from data and enhance their capabilities progressively.
There are several types of machine learning algorithms, including:
- Supervised Learning
- Unsupervised Learning
- Semi-Supervised Learning
- Reinforcement Learning
To keep the scope focused, we will discuss only Supervised Learning.
Supervised Machine Learning is where a model is trained using labeled data, in which each data point has a known output or target value. Such models learn the mapping between input and output during training and then use it to make predictions on unseen data.
Supervised Machine Learning is further divided into:
- Regression: Regression algorithms are used when the output is continuous, such as predicting the demand for rental bikes at any hour of the day.
- Classification: Classification algorithms are used when the output variable is categorical, such as predicting whether a person might suffer from a deadly disease in the coming years based on current health-related features. The algorithm learns a function that maps the input variables to a discrete output variable, which may be binary or multi-class.
Now that we have covered machine learning and its subfields, it's time to build a foundation for the topic we will discuss today.
Binary classification problems are common, and it is rare to find a dataset in which both classes are equally represented. For example, consider a binary classification problem in which we have to predict whether a patient will suffer from a deadly disease in the coming 10 years, based on current habits and health-related factors such as blood pressure and hemoglobin. If we have data for 1 lakh (100,000) patients, it is unlikely that the two classes will appear in equal or even comparable proportions.
Since we are dealing with supervised learning, our model will learn from the training data, which is skewed toward one class (say, 95,000 patients are not at risk of the deadly disease in the coming 10 years, and only 5,000 are).
Why is this class imbalance an issue? First, it makes accuracy misleading when evaluating the model, because a model that mostly predicts the majority class can still score well. Second, the model learns from biased data. Predictions from such a model are unreliable in use cases with serious consequences, such as those in the healthcare industry.
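To make the accuracy problem concrete, here is a minimal sketch using the 95,000 / 5,000 split from the example above. The labels are made up for illustration, and the "model" is a hypothetical one that always predicts the majority class:

```python
# Hypothetical labels mirroring the 95,000 / 5,000 split described above.
y_true = [0] * 95_000 + [1] * 5_000

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: the share of at-risk patients the model catches.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")       # accuracy = 0.95
print(f"minority recall = {recall:.2f}")  # minority recall = 0.00
```

The accuracy looks high even though not a single at-risk patient is identified, which is why accuracy alone can be misleading on imbalanced data.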
Introduction
SMOTE (Synthetic Minority Over-sampling Technique) is a method employed to tackle the issue of imbalanced classes in datasets. It is an oversampling technique wherein synthetic samples are generated for the class with rare or fewer occurrences in our dataset. Its primary emphasis lies on the feature space, utilizing interpolation between closely located positive instances to generate new instances.
For each minority-class sample, its nearest minority-class neighbors are found using a distance metric such as Euclidean Distance or Manhattan Distance, and the difference between the sample's feature vector and that of one of its neighbors is calculated. The difference is multiplied by a random value between 0 and 1 (excluding 0) and added to the original feature vector to generate a synthetic data point.
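The interpolation step can be sketched in a few lines of plain Python. This is only an illustration of the idea under some assumptions, not the imbalanced-learn implementation: the function name `smote_sample`, the toy points, and the choice of Euclidean distance are all hypothetical. Note also that `random()` draws from [0, 1), so a gap of exactly 0 is possible in principle and would simply reproduce the original point.

```python
import math
import random


def smote_sample(point, minority, k=5, rng=random):
    """Create one synthetic point from `point` and one of its k nearest minority neighbors."""
    # Rank the other minority points by Euclidean distance to `point`.
    others = [p for p in minority if p is not point]
    neighbors = sorted(others, key=lambda p: math.dist(point, p))[:k]
    neighbor = rng.choice(neighbors)
    # Random gap in [0, 1): values near 0 stay close to `point`,
    # values near 1 approach the chosen neighbor.
    gap = rng.random()
    return [x + gap * (n - x) for x, n in zip(point, neighbor)]


# Hypothetical minority-class samples with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
rng = random.Random(42)
synthetic = smote_sample(minority[0], minority, k=2, rng=rng)
print(synthetic)
```

The synthetic point always lies on the line segment between the original sample and the chosen neighbor, so it stays inside the region already occupied by the minority class.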
Implementation of SMOTE
Let’s now have a look at the implementation of SMOTE:
- We have used a cardiovascular dataset with two classes:
- 1: The patient is at risk of a deadly disease.
- 0: The patient is not at risk of a deadly disease.
- The dataset is not very large; it has around 3,400 records.
Let’s have a glimpse of what the dataset looks like:
- We will now see how both classes are distributed:
- Class 0 (no risk of disease) is the majority class. If the model learns from this data as it is, its predictions will be biased toward that class. Thus, we will use SMOTE to counter the issue.
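A quick way to check how the classes are distributed is to count the labels. The sketch below uses a made-up list of labels rather than the actual dataset, so the numbers are purely illustrative. With a pandas DataFrame, calling `value_counts()` on the target column gives the same kind of summary.

```python
from collections import Counter

# Hypothetical labels; in practice these would come from the target column.
labels = [0] * 85 + [1] * 15

counts = Counter(labels)
total = len(labels)
for cls, n in sorted(counts.items()):
    print(f"class {cls}: {n} samples ({n / total:.0%})")

# Imbalance ratio: majority count divided by minority count.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio = {imbalance_ratio:.2f}")
```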
Let’s have a look at the code:
```python
# Import the libraries
import pandas as pd
from imblearn.over_sampling import SMOTE

# Resample only the minority class
sm = SMOTE(sampling_strategy='minority', random_state=42)

# Fit SMOTE and generate the resampled data
X, y = sm.fit_resample(
    df_transformed.drop('TenYearCHD', axis=1),
    df_transformed['TenYearCHD'],
)
df_smote = pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)
```
- After applying SMOTE, let's have a look at how the classes are distributed:
- The classes are now equally distributed.
Advantages and Disadvantages of SMOTE
Advantages of SMOTE:
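- It keeps all of the original samples, so no information is discarded, unlike under-sampling.
- Reduces the risk of overfitting compared with random oversampling, because it creates new points instead of exact duplicates.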
- It works well with high-dimensional data.
- Easy to implement.
- Generates more diverse samples than simple duplication of existing ones.
- It can be combined with other techniques, such as Tomek Links or ENN.
Disadvantages of SMOTE:
- It generates synthetic data points from the minority class, which may sometimes introduce noisy points, for example when the samples it interpolates from are themselves noisy.
- New data points are generated from the existing data points, which can still lead to overfitting.
- Sampling parameters can be difficult to tune.
Conclusion
In this blog, we looked at what class imbalance is, why it is a problem, and how to tackle it with SMOTE. Several other techniques can also be used to address the issue, and I will try to cover them in upcoming blogs.
About CloudThat
CloudThat is an official AWS (Amazon Web Services) Advanced Consulting Partner and Training Partner and a Microsoft Gold Partner, helping people develop knowledge of the cloud and helping their businesses aim for higher goals using best-in-industry cloud computing practices and expertise. We are on a mission to build a robust cloud computing ecosystem by disseminating knowledge on technological intricacies within the cloud space. Our blogs, webinars, case studies, and white papers enable all the stakeholders in the cloud computing sphere.
Drop a query if you have any questions regarding SMOTE, and I will get back to you quickly.
To get started, go through our Consultancy page and Managed Services Package, which are part of CloudThat's offerings.
FAQs
1. What are some other techniques to address the issue of class imbalance?
ANS: – Other techniques to combat class imbalance include:
- ADASYN (Adaptive Synthetic Sampling Approach)
- Random Under-sampling
- Random Over-sampling
- Hybridization (SMOTE + Tomek Links)
- Hybridization (SMOTE + ENN)
2. What is the difference between SMOTE and random oversampling?
ANS: – Random oversampling involves duplicating instances of the minority class randomly, whereas SMOTE generates synthetic instances by interpolating between minority class instances and their k-nearest neighbors in the feature space. SMOTE is typically considered to be more effective than random oversampling.
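The difference can be illustrated with a small sketch. The toy data below is hypothetical, with only two minority samples, so each one is the other's nearest neighbor. The interpolation line is a simplified stand-in for SMOTE rather than a library call.

```python
import random

rng = random.Random(0)

# Hypothetical toy data: 6 majority samples (label 0) and 2 minority samples (label 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.4]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

minority = [x for x, label in zip(X, y) if label == 1]
n_needed = y.count(0) - y.count(1)

# Random oversampling: copy existing minority rows until the classes match.
duplicates = [list(rng.choice(minority)) for _ in range(n_needed)]

# SMOTE-style interpolation between the two minority rows, for comparison.
a, b = minority
interpolated = [[a[0] + rng.random() * (b[0] - a[0])] for _ in range(n_needed)]

print("random oversampling:", duplicates)
print("interpolation:", interpolated)
```

Random oversampling only ever repeats the values 2.0 and 2.4, while interpolation produces new values that fall between them.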
3. Why is class imbalance a problem in machine learning?
ANS: – Class imbalance can lead to biased models that perform poorly on the minority class. Machine learning algorithms tend to favor the majority class due to its larger representation in the dataset.
WRITTEN BY Parth Sharma