
What are Outliers in the Data Set


    What are Outliers

    An outlier is a value that differs markedly from the other data points. It may be unusually small, unusually large, or simply break the pattern that the other values follow. Example: suppose we study the trend of XYZ Ltd. at 10-year intervals. Outliers are easy to detect in this example: 1945 and the present year deviate from the usual interval of study, so we can easily remove them.


    Causes of Outliers

    • Data entry or measurement errors,
    • Sampling problems and unusual conditions,
    • Natural variation.

    Why Outliers are a Problem

    • Outliers increase the variability of the data, which decreases statistical power.
    • Machine learning algorithms are very sensitive to the range and distribution of attribute values.
    • Outliers can spoil and mislead the training process, resulting in longer training times, less accurate predictions, and poorer results.

    How to Detect Outliers?

    There are no strict statistical rules for identifying outliers. Finding them depends on subject-area knowledge and an understanding of the data collection process.

    Ways to Detect Outliers

    There are several ways to detect outliers. To list a few:

    • Sorting the values
    • Graphical representation of a single variable: box plot or histogram
    • Scatter plots for multivariate data
    • Z-score test
    • Hypothesis testing
    • Finding the interquartile range

    In this post, I will discuss Z-Score, Hypothesis Testing, and Interquartile range.

    Z-Score

    The Z-score test quantifies how unusual an observation is when the data follow a
    normal distribution. A Z-score is simply the number of standard deviations by which
    a value lies above or below the mean.
    A Z-score of 2 indicates an observation that is 2 standard deviations above the mean,
    while a Z-score of -2 signifies that it is 2 standard deviations below the mean. A Z-score of 0 means the value equals
    the mean. Converting values to Z-scores rescales the data so that the mean is 0 and the standard deviation is 1; if the data are normally distributed, the Z-scores follow the standard normal distribution.

    In most cases, a threshold of 3 is used: if the Z-score of a data point is
    greater than 3 or less than -3, that data point is identified as an outlier.

    Z-Score Formula

    Z = (x − μ) / σ, where x is the observed value, μ is the mean, and σ is the standard deviation.
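
    As a minimal sketch of the Z-score test, the snippet below computes each value's Z-score as (value − mean) / standard deviation and flags values beyond a threshold of 3. The function names and the sample readings are hypothetical, and using the population standard deviation is an assumption; the post does not say which variant is intended.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the Z-score of each value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("all values are identical; Z-scores are undefined")
    return [(v - mu) / sigma for v in values]


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical sample: twenty readings near 10 and one extreme reading.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 50]
outliers = zscore_outliers(readings)
print(outliers)
```

    Note that in very small samples a single extreme value inflates the standard deviation so much that its own Z-score may never reach 3, so the threshold is best treated as a rule of thumb.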

    Hypothesis Testing

    Outliers can also be found with hypothesis tests such as Grubbs' test, which sets up a null and an alternative hypothesis.
    If the p-value is less than the significance level, we reject the null hypothesis in favor of the alternative.
    Null: all values are drawn from a single population that follows the same normal
    distribution.
    Alternative: one value in the sample was not drawn from the same normally distributed population as
    the other values.
    Limitation of Grubbs' test: it can detect only one outlier. A different test is needed if the data contain many outliers.
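
    The following is a rough sketch of the statistic behind Grubbs' test, not a full implementation. It computes G, the largest absolute deviation from the mean divided by the sample standard deviation, and compares it with a critical value. The critical value depends on the sample size and the significance level and is assumed here to come from a published table, so it is passed in as a parameter. The function names and sample data are hypothetical.

```python
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return (G, suspect): the Grubbs statistic and the value farthest from the mean."""
    if len(values) < 3:
        raise ValueError("Grubbs' test needs at least three values")
    mu = mean(values)
    s = stdev(values)  # sample standard deviation
    if s == 0:
        raise ValueError("all values are identical; G is undefined")
    suspect = max(values, key=lambda v: abs(v - mu))
    return abs(suspect - mu) / s, suspect


def grubbs_flags_outlier(values, critical_value):
    """Reject the null hypothesis when G exceeds the supplied critical value.

    critical_value is an assumption of this sketch: look it up in a Grubbs
    table for the sample size and significance level in use.
    """
    g, suspect = grubbs_statistic(values)
    return g > critical_value, suspect


# Hypothetical sample with one suspicious value.
sample = [10, 11, 9, 10, 12, 10, 9, 11, 50]
g, suspect = grubbs_statistic(sample)
print(f"G = {g:.2f} for suspect value {suspect}")
```

    Because the test examines only the single most extreme value, the sketch returns one suspect at a time, which matches the limitation noted above.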

    Interquartile Range

    The interquartile range (IQR), also called the midspread, the middle 50%, or technically the H-spread, is a measure of statistical dispersion. It equals the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles: IQR = Q3 − Q1.

    IQR=Q3-Q1
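
    The post does not spell out how the IQR is turned into an outlier rule. A common convention, assumed here, is to flag values that lie more than 1.5 × IQR below Q1 or above Q3. The sketch below applies that rule; the function name, the multiplier `k`, and the salary figures are hypothetical, and quartiles are computed with the default method of Python's `statistics.quantiles`, which may differ slightly from other tools.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower, upper)) using fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points Q1, Q2, Q3
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)


# Hypothetical salaries in thousands, with one much larger value.
salaries = [42, 45, 47, 50, 52, 55, 58, 60, 250]
flagged, fences = iqr_outliers(salaries)
print(flagged, fences)
```

    Unlike the Z-score, the quartiles are barely affected by the extreme value itself, so this rule stays usable on small or skewed samples.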

    Significance of Outliers

    You can be an outlier in your own family: you may hold a distinct and different opinion, and be distinct in your own right. But does this mean that you should be eliminated from the family? Does it mean that differences of opinion are not to be respected?

    Hence it doesn’t make sense to remove the outliers from every data set. It largely depends upon the data set and the field of study. Based on that one decides whether to remove outliers or not.

    Example: consider the salaries of the employees and the VP of an organization. The salary of the VP will be an outlier, but removing it on that basis would hide the true picture of the organization. The income disparity cannot be ignored here, and seeing it gives management a chance to revise its policies. Outliers indicate an unusual pattern, and depending on the field and area of study they can be significant.

    Significance of outliers:
    1) They can be very informative about the subject area and the data collection process.
    2) Before removing them, it is essential to understand how they occurred.

    Sources

    https://statisticsbyjim.com/basics/remove-outliers/

