What are Outliers
Outliers are values that differ markedly from the other data points. They may be unusually small, unusually large, or simply out of pattern with the rest of the values. Example: suppose we study the trend of XYZ Ltd. at 10-year intervals. Observations from 1945 and from the present year fall outside those regular intervals, so they are easy to spot as outliers and can be removed.
Causes of Outliers
- Data entry or measurement errors,
- Sampling problems and unusual conditions,
- Natural variation.
Why Outliers are a Problem
- Outliers increase the variability of the data, which decreases statistical power.
- Many machine learning algorithms are sensitive to the range and distribution of attribute values.
- Outliers can mislead the training process, resulting in longer training times, less accurate predictions, and poorer results.
How to Detect Outliers?
There are no strict statistical rules for identifying outliers. Finding them depends on subject-area knowledge and an understanding of the data collection process.
Ways to Detect Outliers
There are several ways to detect outliers. To list a few:
- Sorting the values
- Graphical representation: box plot or histogram (univariate data)
- Scatter plots (multivariate data)
- Z-score test
- Hypothesis Testing
- Finding the Interquartile range
In this post, I will discuss Z-Score, Hypothesis Testing, and Interquartile range.
Z-Score
The Z-score test quantifies how unusual an observation is when the data follow a normal distribution. A Z-score is simply the number of standard deviations by which a value lies above or below the mean: z = (x - mean) / standard deviation.
A Z-score of 2 indicates an observation is two standard deviations above the mean, while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of 0 means the value equals the mean. Converting values to Z-scores rescales the data to a mean of 0 and a standard deviation of 1; if the original data are normally distributed, the Z-scores follow the standard normal distribution.
In most cases, a threshold of 3 is used: if the Z-score is greater than 3 or less than -3, the data point is identified as an outlier.
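The threshold rule above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function name `zscore_outliers`, the sample `data`, and the choice of the population standard deviation are assumptions made here, not part of the original discussion.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    # Population standard deviation (an assumption of this sketch);
    # statistics.stdev would give the sample version instead.
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical, so nothing can be flagged
    return [x for x in values if abs((x - mean) / sd) > threshold]

# Made-up sample: 19 readings between 10 and 13 plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 100]

print(zscore_outliers(data))  # prints [100]
```

One caveat worth keeping in mind: with the population standard deviation, the largest possible absolute Z-score in a sample of n values is the square root of n - 1, so a threshold of 3 can never be exceeded when there are ten or fewer values.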
Hypothesis Testing
Grubbs' test is a hypothesis test for finding outliers. It sets up a null and an alternative hypothesis:
Null: all values are drawn from a single population that follows the same normal distribution.
Alternative: one value in the sample was not drawn from the same normally distributed population as the other values.
If the p-value is less than the significance level, reject the null hypothesis in favor of the alternative.
Limitation of Grubbs' test: it tests for only one outlier at a time. A different test is needed if the data contain several outliers.
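For readers who want to see what the test computes, the usual two-sided form of the Grubbs statistic and its rejection rule can be written as below. The notation is introduced here for illustration and does not come from the original text: N is the sample size, \bar{x} the sample mean, s the sample standard deviation, \alpha the significance level, and t_{\alpha/(2N), N-2} the upper critical value of Student's t distribution with N - 2 degrees of freedom.

```latex
G = \frac{\max_{i=1,\dots,N} \lvert x_i - \bar{x} \rvert}{s},
\qquad
\text{reject the null at level } \alpha \text{ if} \quad
G > \frac{N-1}{\sqrt{N}}
\sqrt{\frac{t_{\alpha/(2N),\,N-2}^{2}}{N - 2 + t_{\alpha/(2N),\,N-2}^{2}}}
```

Because G looks only at the single most extreme value, the statistic itself shows why the test handles one outlier at a time.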
Interquartile Range
The interquartile range (IQR), also called the midspread, the middle 50%, or technically the H-spread, is a measure of statistical dispersion. It is equal to the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles: IQR = Q3 − Q1.
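To turn the IQR into an outlier check, a widely used convention (often called Tukey's fences, and not stated in the text above) flags any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. The sketch below assumes that convention. The function name `iqr_outliers`, the multiplier `k`, the sample `data`, and the "inclusive" quartile method are all choices made for this illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Made-up sample: 19 readings between 10 and 13 plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 100]

print(iqr_outliers(data))  # prints [100]
```

Here Q1 is 11 and Q3 is 12, so the IQR is 1 and the fences sit at 9.5 and 13.5. Different quartile definitions can shift the fences slightly, so results near the boundary may vary between tools.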
Significance of Outliers
You can be an outlier in your own family, holding a distinct and different opinion. But does this mean that you are eliminated from the family? Does it mean that a difference of opinion is not to be respected?
Hence it doesn’t make sense to remove the outliers from every data set. It largely depends upon the data set and the field of study. Based on that one decides whether to remove outliers or not.
Example: consider the salaries of the employees and the VP of an organization. The VP's salary will be an outlier, but removing it on those grounds would hide the true picture of the organization. The income disparity cannot be ignored here; it gives management a chance to revise its policies. Outliers indicate an unusual pattern, and depending on the field and area of study they can be significant.
Significance of outliers:
1) They are very informative about the subject area and the data collection process.
2) Before removing them, it is essential to understand how they occur.
Sources
https://statisticsbyjim.com/basics/remove-outliers/