Anova in Python/v3
Learn how to perform one-way and two-way ANOVA tests using Python.
See our Version 4 Migration Guide for information about how to upgrade.
New to Plotly?¶
Plotly's Python library is free and open source! Get started by downloading the client and reading the primer.
You can set up Plotly to work in online or offline mode, or in Jupyter notebooks.
We also have a quick-reference cheatsheet (new!) to help you get started!
Imports¶
The tutorial below imports NumPy, Pandas, SciPy, and Statsmodels.
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF
import numpy as np
import pandas as pd
import scipy
import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
One-Way ANOVA¶
An Analysis of Variance test, or ANOVA, is a generalization of the t-test to more than two groups. Our null hypothesis states that the populations from which the groups of data were sampled have equal means. More succinctly:
$$H_0: \mu_1 = \mu_2 = \dots = \mu_n$$
for $n$ groups of data. Our alternative hypothesis is that at least one of the equalities in the above equation fails to hold.
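As a minimal sketch of the hypothesis test described above, the snippet below runs a one-way ANOVA on three synthetic groups with scipy.stats.f_oneway and re-derives the F statistic by hand as the ratio of between-group to within-group mean squares. The group names, sample sizes, and distribution parameters are made up for illustration and do not come from this tutorial's datasets.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 samples each (illustrative only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=10.5, scale=2.0, size=30)
group_c = rng.normal(loc=13.0, scale=2.0, size=30)
groups = [group_a, group_b, group_c]

# One-way ANOVA: H0 is that all group means are equal.
f_stat, p_value = stats.f_oneway(*groups)

# The same F statistic computed by hand.
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)              # number of groups
n_total = all_values.size    # total number of observations

ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

print("F (scipy)  = %.4f" % f_stat)
print("F (manual) = %.4f" % f_manual)
print("p-value    = %.4g" % p_value)
```

A large F means the group means differ by more than the within-group scatter would suggest, which is evidence against the null hypothesis.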
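# Load the Moore dataset from Rdatasets; recent Rdatasets releases ship it
# in the "carData" package (older releases used "car").
moore = sm.datasets.get_rdataset("Moore", "carData", cache=True)
data = moore.data
data = data.rename(columns={"partner.status": "partner_status"})  # make name pythonic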
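# Fit a linear model with both categorical factors and their interaction,
# using sum-to-zero (Sum) contrasts.
moore_lm = ols('conformity ~ C(fcategory, Sum)*C(partner_status, Sum)', data=data).fit()
table = sm.stats.anova_lm(moore_lm, typ=2)  # Type 2 ANOVA DataFrame
print(table)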
In this ANOVA test, we are dealing with an F-statistic and not a p-value. The two are closely connected, as they are two ways of expressing the same thing. When we set a significance level at the start of our statistical tests (usually 0.05), we are saying that if our test statistic falls in the most extreme 5% of the distribution it would follow under the null hypothesis, then we can start to make the case that there is evidence against the null, which states that the data belong to this particular distribution.
The F value is the point such that the area under the curve from that point out to the tail is exactly the p-value. Therefore:
$$Pr(>F) = p$$
For more information on the choice of 0.05 for a significance level, check out this page.
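The relationship between the F value and the p-value can be sketched with SciPy's F-distribution. The observed F value and the degrees of freedom below are hypothetical numbers chosen for illustration; in practice they come from the ANOVA table.

```python
from scipy import stats

# Hypothetical values (not taken from the tables in this tutorial).
f_value = 4.0   # observed F statistic
df_num = 2      # numerator (between-group) degrees of freedom
df_den = 27     # denominator (residual) degrees of freedom
alpha = 0.05    # significance level

# p-value: area under the F density to the right of the observed F.
p_value = stats.f.sf(f_value, df_num, df_den)

# Critical F: the point with exactly alpha of the area to its right.
f_critical = stats.f.isf(alpha, df_num, df_den)

# Comparing p to alpha is equivalent to comparing F to the critical value.
reject_null = p_value < alpha

print("p-value    =", p_value)
print("critical F =", f_critical)
print("reject H0  =", reject_null)
```

Here stats.f.sf is the survival function (one minus the CDF) and stats.f.isf is its inverse, so either quantity can be recovered from the other once the degrees of freedom are fixed.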
Let us import some data for our next analysis, this time on tooth growth:
data = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/tooth_growth_csv')
df = data[0:10]
table = FF.create_table(df)
py.iplot(table, filename='tooth-data-sample')
Two-Way ANOVA¶
In a Two-Way ANOVA, there are two explanatory variables to consider. The question is whether our variable in question (tooth length, len) is related to the two other variables, supp and dose, by the equation:
$$len = supp + dose + supp \times dose$$
formula = 'len ~ C(supp) + C(dose) + C(supp):C(dose)'
model = ols(formula, data).fit()
aov_table = statsmodels.stats.anova.anova_lm(model, typ=2)
print(aov_table)
