An article exploring techniques for outlier detection in datasets. Learn how to use data visualization, descriptive statistics, z-scores, and clustering techniques to spot outliers in your data.
Nassim Taleb writes about how “tail” events define a large part of the success (or failure) of a phenomenon in the world.
Everybody knows that you need more prevention than treatment, but few reward acts of prevention.
N. Taleb — The Black Swan
A tail event is a rare event, one whose probability lies on the tail of the distribution, either on the left or on the right.
According to Taleb, we live our lives focusing primarily on the most plausible events, those that are most likely to happen. By doing this, we are not preparing ourselves to deal with the rare events that might happen.
When rare events happen (especially the negative ones), they take us by surprise, and the actions we typically take have no effect.
Just think of our behavior when a rare event occurs, such as the bankruptcy of the FTX cryptocurrency exchange, or a powerful earthquake that devastates a region. For those directly involved, the typical reaction is panic.
Anomalies are present everywhere, and when we draw a distribution and its probability function we are actually obtaining useful information to protect ourselves or to implement strategies for these tail events, should they occur.
It is therefore necessary to inform ourselves on how to identify these anomalies, and above all to be ready to act in cases where they are observed.
In this article, we will focus on the methods and techniques used to identify outliers (the anomalies mentioned above) in data. In particular, we will explore data visualization techniques, descriptive statistics, z-scores, and clustering techniques.
An outlier is a value that deviates significantly from the other values in the dataset. This deviation can be numerical or even categorical.
For example, a numeric outlier occurs when one value is much larger or much smaller than most of the other values within the dataset.
A categorical outlier, on the other hand, occurs when labels such as “other” or “unknown” account for a disproportionately high share of the dataset compared to the other labels.
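As a quick illustration, here is a minimal sketch of this idea on a made-up label column (the data is invented for the example):

import pandas as pd

# Hypothetical label column: "unknown" is suspiciously frequent
labels = pd.Series(["red", "blue", "red", "unknown", "unknown",
                    "unknown", "unknown", "green", "unknown", "blue"])

# Proportion of each label in the data
print(labels.value_counts(normalize=True))
# A catch-all label dominating the distribution (here "unknown" at 50%)
# is a hint of a categorical anomaly worth investigating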
Outliers can be caused by measurement errors, input errors, transcription errors or simply by data that does not follow the normal trend of the dataset.
In some cases, outliers can be indicative of broader problems in the dataset or the process that produced the data and can offer important insights to the people who developed the data collection process.
There are several techniques that we can use to identify outliers in our data. These are the ones we will touch upon in this article:
- data visualization: identifying anomalies by looking at the distribution of the data with graphs suited to this purpose
- use of descriptive statistics, such as the interquartile range
- use of z-scores
- use of clustering techniques: identifying groups of similar data and flagging any “isolated” or “unclassifiable” points
Each of these methods is valid for identifying outliers and should be chosen based on our data. Let’s see them one by one.
Data visualization
One of the most common techniques for finding anomalies is through exploratory data analysis and particularly with data visualization.
Using Python, you can use libraries like Matplotlib or Seaborn to visualize the data in such a way that you can easily spot any anomalies.
For example, you can create a histogram or boxplot to visualize the distribution of your data and spot any values that deviate significantly from the rest of the data.
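As a minimal sketch of this approach (the data here is synthetic, invented just for illustration), a histogram and a boxplot can be drawn with Matplotlib like this:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: normally distributed values plus a few extreme ones
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(50, 5, 200), [90, 95, 10]])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Histogram: outliers show up as isolated bars far from the bulk
ax1.hist(data, bins=30)
ax1.set_title("Histogram")
# Boxplot: outliers are drawn as individual points beyond the whiskers
ax2.boxplot(data)
ax2.set_title("Boxplot")
plt.show()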
The anatomy of the boxplot can be understood from this Kaggle post: https://www.kaggle.com/discussions/general/219871
If you want to read more about how to perform exploratory data analysis (EDA), read this article 👇
Use of descriptive statistics
Another method of identifying anomalies is the use of descriptive statistics. For example, the interquartile range (IQR) can be used to identify values that deviate significantly from the bulk of the data.
The interquartile range (IQR) is defined as the difference between the third quartile (Q3) and the first quartile (Q1) of the dataset. Outliers are then defined as values that fall below Q1 or above Q3 by more than the IQR multiplied by a coefficient, typically 1.5.
The previously discussed boxplot is just one method that uses such descriptive metrics to identify anomalies.
An example in Python of identifying outliers using the interquartile range is as follows:
import numpy as np

def find_outliers_IQR(data, threshold=1.5):
    # Find first and third quartiles
    Q1, Q3 = np.percentile(data, [25, 75])
    # Compute IQR (interquartile range)
    IQR = Q3 - Q1
    # Compute lower and upper bound
    lower_bound = Q1 - (threshold * IQR)
    upper_bound = Q3 + (threshold * IQR)
    # Select outliers
    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    return outliers
This function calculates the first and third quartiles of the dataset, then computes the IQR and the lower and upper bounds. Finally, it identifies outliers as the values that fall outside those bounds.
This handy function can be used to identify outliers in a dataset and can be added to your toolkit of utility functions in almost any project.
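For example, a quick check on a small made-up sample:

data = [12, 13, 11, 14, 12, 13, 100, 12, 11, -40]
print(find_outliers_IQR(data))  # [100, -40]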
Use of z-scores
Another way to spot anomalies is through z-scores. Z-scores measure how much a value deviates from the mean in terms of standard deviations.
The formula for converting a value to a z-score is as follows:
z = (x − μ) / σ
where x is the original value, μ is the dataset mean, and σ is the dataset standard deviation. The z-score indicates how many standard deviations the original value is from the mean. A z-score greater than 3 (or less than -3) is usually considered an outlier.
This method is particularly useful when working with large datasets and when you want to identify anomalies in an objective and reproducible way.
In Python, the conversion to z-scores can be done with scikit-learn like this:
import numpy as np
from sklearn.preprocessing import StandardScaler

def find_outliers_zscore(data, threshold=3):
    # Standardize data (zero mean, unit variance)
    scaler = StandardScaler()
    standardized = scaler.fit_transform(np.asarray(data).reshape(-1, 1))
    # Select outliers: values more than `threshold` standard deviations from the mean
    outliers = [data[i] for i, x in enumerate(standardized) if abs(x[0]) > threshold]
    return outliers
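As before, a quick sanity check on made-up data (two extreme values injected into a standard normal sample):

import numpy as np

data = np.concatenate([np.random.normal(0, 1, 1000), [8.5, -9.0]])
print(find_outliers_zscore(data))  # typically flags the two injected extremes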
Use of clustering techniques
Finally, clustering techniques can be used to identify any “isolated” or “unclassifiable” data. This can be useful when working with very large and complex datasets, where data visualization is not enough to spot anomalies.
In this case, one option is to use the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, a clustering algorithm that identifies groups of data based on their density and locates any points that don’t belong to any cluster. These points are considered outliers.
The DBSCAN algorithm can again be implemented with Python’s scikit-learn library.
Take, for example, a synthetic dataset made of two dense clusters plus some scattered points. Applying DBSCAN identifies the two clusters and marks the scattered points as outliers, as the scatter plot produced below shows. The code to create this example and the chart is as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

def generate_data_with_outliers(n_samples=100, noise=0.05, outlier_fraction=0.05, random_state=42):
    # Create random data around two centers
    X = np.concatenate([np.random.normal(0.5, 0.1, size=(n_samples//2, 2)),
                        np.random.normal(1.5, 0.1, size=(n_samples//2, 2))], axis=0)
    # Add outliers
    n_outliers = int(outlier_fraction * n_samples)
    outliers = np.random.RandomState(seed=random_state).rand(n_outliers, 2) * 3 - 1.5
    X = np.concatenate((X, outliers), axis=0)
    # Add noise to the data to resemble real-world data
    X = X + np.random.randn(n_samples + n_outliers, 2) * noise
    return X

# Generate data
X = generate_data_with_outliers(outlier_fraction=0.2)

# Apply DBSCAN to cluster the data and find outliers
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan.fit(X)

# Select outliers (points labeled -1, i.e. noise)
outlier_indices = np.where(dbscan.labels_ == -1)[0]

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap="viridis")
plt.scatter(X[outlier_indices, 0], X[outlier_indices, 1], c="red", label="Outliers", marker="x")
plt.xticks([])
plt.yticks([])
plt.legend()
plt.show()
This code creates a DBSCAN object with the parameters eps and min_samples and fits it to the data. It then identifies outliers as the points that don’t belong to any cluster, i.e. those labeled -1.
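A note on the design choice: eps sets the radius of the neighborhood around each point, and min_samples the minimum number of points that neighborhood must contain for a region to count as dense. The values used above (0.2 and 5) were picked for this synthetic data; in practice they usually need tuning per dataset, since too large an eps merges everything into one cluster, while too small an eps marks most points as noise.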
This is just one of many clustering techniques that can be used to identify anomalies. Another family of methods, based on deep learning, relies on autoencoders: neural networks that learn a compressed representation of the input data, so that points the network reconstructs poorly stand out as anomalies.
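As a minimal sketch of this idea (not a production recipe, and using scikit-learn’s MLPRegressor as a stand-in for a dedicated deep learning framework), one can train a small network to reconstruct its own input and flag the points with the largest reconstruction error:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def find_outliers_autoencoder(X, contamination=0.05):
    # Standardize features so reconstruction errors are comparable
    X_std = StandardScaler().fit_transform(X)
    # A tiny "autoencoder": a single hidden unit forces the network
    # to compress the 2D input before reproducing it
    model = MLPRegressor(hidden_layer_sizes=(1,), max_iter=2000, random_state=42)
    model.fit(X_std, X_std)
    # Reconstruction error per point
    errors = np.mean((model.predict(X_std) - X_std) ** 2, axis=1)
    # Flag the points with the largest errors as outliers
    cutoff = np.quantile(errors, 1 - contamination)
    return np.where(errors > cutoff)[0]

# Example on synthetic data: one tight cluster plus scattered points
rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(0.5, 0.05, size=(100, 2)),
                    rng.uniform(-1.5, 1.5, size=(10, 2))])
print(find_outliers_autoencoder(X, contamination=0.1))  # the scattered points typically dominate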
In this article we have seen several techniques that can be used to identify outliers in data.
We talked about data visualization, the use of descriptive statistics and z-scores, and clustering techniques.
Each of these techniques is valid and should be chosen based on the type of data you are analyzing. The important thing is to remember that identifying outliers can provide important information to improve data collection processes and to make better decisions based on the results obtained.