Outlier Test in Python/v3
Learn how to test for outliers in datasets using Python.
See our Version 4 Migration Guide for information about how to upgrade.
New to Plotly?¶
Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF
import numpy as np
import pandas as pd
import scipy
Import Data¶
In order to start performing outlier tests, we will import some data of average wind speed sampled every 10 minutes, also used in the Normality Test Tutorial.
data = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/wind_speed_laurel_nebraska.csv')
df = data[0:10]
table = FF.create_table(df)
py.iplot(table, filename='wind-data-sample')
In any set of data, an outlier is a datum point that is not consistent with the other data points. If the data are sampled from a particular distribution, then with high probability an outlier would not belong to that distribution. There are various tests for deciding whether a particular point is an outlier, and they rely on the same null-hypothesis testing used in Normality Tests.
Q Test¶
Dixon's Q-Test is used to help determine whether there is evidence for a given point to be an outlier of a 1D dataset. It is assumed that the dataset is normally distributed. Since we have very strong evidence that our dataset above is normal from all our normality tests, we can use the Q-Test here. As with the normality tests, we are assuming a significance level of $0.05$ and for simplicity, we are only considering the smallest datum point in the set.
For more information on the choice of 0.05 for a significance level, check out this page.
def q_test_for_smallest_point(dataset):
    # dataset must already be sorted in increasing order
    q_ref = 0.29  # the reference Q value for a 95% confidence level (significance level 0.05) and 30 data points
    # gap between the two smallest values, divided by the range of the data
    q_stat = (dataset[1] - dataset[0])/(dataset[-1] - dataset[0])
    if q_stat > q_ref:
        print("Since our Q-statistic is %f and %f > %f, we have evidence that our "
              "minimum point IS an outlier to the data." % (q_stat, q_stat, q_ref))
    else:
        print("Since our Q-statistic is %f and %f < %f, we have evidence that our "
              "minimum point is NOT an outlier to the data." % (q_stat, q_stat, q_ref))
For our example, the Q-statistic is the ratio of the absolute distance between the smallest value and its nearest neighbor in the set, to the range of our dataset. This means:
$$ \begin{align*} Q = \frac{\text{gap}}{\text{range}} \end{align*} $$
For our example, we will take 30 values from our dataset that contain the minimum value of the full dataset, and apply the test to that sample. Before running the test, we convert our array to a list and sort it by increasing value.
dataset = data[100:130]['10 Min Sampled Avg'].values.tolist()
dataset.sort()
q_test_for_smallest_point(dataset)
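The function above prints its conclusion and only looks at the smallest point. As a minimal sketch of the same gap-over-range idea, the hypothetical helper `q_statistic` below returns the statistic instead of printing it, and can also examine the largest point. The helper name, its `which` parameter, and the toy samples are assumptions made for illustration and are not part of the original tutorial. The reference value 0.29 is the one used above for 30 points.

```python
def q_statistic(values, which="min"):
    """Return Dixon's Q statistic (gap / range) for the smallest or largest value."""
    ordered = sorted(values)  # sort a copy so the caller's list is untouched
    if len(ordered) < 3:
        raise ValueError("need at least three values")
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        raise ValueError("all values are identical; Q is undefined")
    if which == "min":
        gap = ordered[1] - ordered[0]     # distance from the minimum to its neighbor
    elif which == "max":
        gap = ordered[-1] - ordered[-2]   # distance from the maximum to its neighbor
    else:
        raise ValueError("which must be 'min' or 'max'")
    return gap / float(data_range)


Q_REF = 0.29  # reference value used in this tutorial for 30 data points

# Toy samples of 30 points each (made up, not taken from the wind-speed data).
mild_sample = [0.0] + [float(v) for v in range(10, 39)]
extreme_sample = [-5.0] + [float(v) for v in range(10, 39)]

q_mild = q_statistic(mild_sample, which="min")        # gap 10 over range 38
q_extreme = q_statistic(extreme_sample, which="min")  # gap 15 over range 43

print("mild sample:    Q = %f, flagged = %s" % (q_mild, q_mild > Q_REF))
print("extreme sample: Q = %f, flagged = %s" % (q_extreme, q_extreme > Q_REF))
```

Returning the statistic keeps the comparison with the reference value in the caller's hands, so a different table value can be used for other sample sizes or significance levels.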
Visualize the Q Test¶
To properly visualize our critical height, we can make a scatter plot with the dataset points in increasing order and draw a line for our critical height. This critical height is the threshold such that if the lowest point in the dataset were lower than it, then it would be considered an outlier. To derive this value, we just take the reference $Q$ value of 0.29 from a look-up table and then plug it into our formula for $Q$ above, replacing our smallest value with an unknown $x$ while keeping the range fixed at its observed value:
$$ \begin{align*} 0.29 = \frac{5.5 - x}{26.0} \end{align*} $$
and therefore we get
$$ \begin{align*} x = -2.04 \end{align*} $$
x = [j for j in range(len(dataset))]
y1 = dataset
y2 = [-2.04 for j in range(len(dataset))]
trace1 = go.Scatter(
    x=x,
    y=y1,
    mode='lines+markers',
    name='Dataset',
    marker=dict(symbol=[100, 0])
)
trace2 = go.Scatter(
    x=x,
    y=y2,
    mode='lines',
    name='Critical Line'
)
data = [trace1, trace2]
py.iplot(data, filename='q-test-scatter')
Since our smallest value (the hollowed-out circle) is higher than the critical line, the plot agrees with the result of the test: the point is NOT an outlier.
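The critical height derived above can also be computed directly. The sketch below rearranges the equation to x = 5.5 - 0.29 * 26.0, where 5.5 is the second-smallest value and 26.0 is the range appearing in that equation. The helper name `critical_height` is hypothetical, and like the derivation it treats the range as fixed.

```python
def critical_height(second_smallest, data_range, q_ref):
    """Solve q_ref = (second_smallest - x) / data_range for x."""
    return second_smallest - q_ref * data_range


# Values taken from the equation in this section.
x_crit = critical_height(second_smallest=5.5, data_range=26.0, q_ref=0.29)
print(round(x_crit, 2))  # -2.04
```

Computing the value this way avoids hard-coding -2.04 when the sample or the reference value changes.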