Group By in MATLAB®
How to use Group By in MATLAB® with Plotly.
Note: We are retiring documentation for R, MATLAB, Julia, and F#. Learn more about this change here.
Dataset Array Summary Statistics Organized by Group
Load the sample data.
load('hospital');
The dataset array hospital has 100 observations and 7 variables.
Create a dataset array with only the variables Sex, Age, Weight, and Smoker.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
Sex is a nominal array, with levels Male and Female. The variables Age and Weight have numeric values, and Smoker has logical values.
Compute the mean for the numeric and logical arrays, Age, Weight, and Smoker, grouped by the levels in Sex.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,'Sex')
statarray =
              Sex       GroupCount    mean_Age    mean_Weight    mean_Smoker
    Female    Female    53            37.717      130.47         0.24528
    Male      Male      47            38.915      180.53         0.44681
statarray is a dataset array with two rows, corresponding to the levels in Sex. GroupCount is the number of observations in each group. The means of Age, Weight, and Smoker, grouped by Sex, are given in mean_Age, mean_Weight, and mean_Smoker.
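To make the grouped means concrete, here is a minimal sketch of the same idea using the base MATLAB functions findgroups and splitapply. The data and variable names are made up for illustration (they are not the hospital data), and this is not meant to describe how grpstats works internally.

```matlab
% Hypothetical toy data (not the hospital dataset)
sex    = categorical({'Female';'Male';'Female';'Male';'Female'});
age    = [30; 40; 50; 20; 40];
smoker = [true; false; false; true; false];

% Assign each observation a group index; levels holds the group labels
[g, levels] = findgroups(sex);

groupCount = splitapply(@numel, age, g);            % observations per group
meanAge    = splitapply(@mean,  age, g);            % mean of a numeric variable
meanSmoker = splitapply(@mean,  double(smoker), g); % mean of a logical = proportion true

disp(table(levels, groupCount, meanAge, meanSmoker))
```

The mean of a logical variable such as Smoker is the proportion of true values in each group, which is how mean_Smoker in the output above can be read.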
Compute the mean for Age and Weight, grouped by the values in Smoker.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,'Smoker','mean','DataVars',{'Age','Weight'})
statarray =
         Smoker    GroupCount    mean_Age    mean_Weight
    0    false     66            37.97       149.91
    1    true      34            38.882      161.94
In this case, not all variables in dsa (excluding the grouping variable, Smoker) are numeric or logical arrays; the variable Sex is a nominal array. When not all variables in the input dataset array are numeric or logical arrays, you must specify the variables for which you want to calculate summary statistics using DataVars.
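One possible way to build such a list of data variables programmatically is sketched below. It uses a small hypothetical table rather than the hospital dataset array, and the names T, isStat, and dataVars are assumptions introduced only for this example.

```matlab
% Hypothetical table standing in for a mixed-type data set
T = table(categorical({'F';'M';'F'}), [30; 40; 50], [120; 180; 130], ...
    [true; false; true], 'VariableNames', {'Sex','Age','Weight','Smoker'});

% Flag the variables that are numeric or logical
isStat = varfun(@(v) isnumeric(v) || islogical(v), T, 'OutputFormat', 'uniform');
dataVars = T.Properties.VariableNames(isStat);

% Drop the grouping variable, keeping the original variable order
dataVars = setdiff(dataVars, {'Smoker'}, 'stable');

disp(dataVars)
```

The resulting cell array of names could then be passed as the DataVars value.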
Compute the minimum and maximum weight, grouped by the combinations of values in Sex and Smoker.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight')
statarray =
                Sex       Smoker    GroupCount    min_Weight    max_Weight
    Female_0    Female    false     40            111           147
    Female_1    Female    true      13            115           146
    Male_0      Male      false     26            158           194
    Male_1      Male      true      21            164           202
There are two unique values in Smoker and two levels in Sex, for a total of four possible combinations of values: Female Nonsmoker (Female_0), Female Smoker (Female_1), Male Nonsmoker (Male_0), and Male Smoker (Male_1).
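As an illustration of grouping by combinations of two variables, the following sketch uses findgroups with two grouping inputs on made-up data. The values and variable names are hypothetical, and the group ordering produced by findgroups is not guaranteed to match the ordering grpstats uses.

```matlab
% Hypothetical toy data (not the hospital dataset)
sex    = categorical({'Female';'Female';'Male';'Male';'Female';'Male'});
smoker = [false; true; false; true; false; false];
weight = [120; 130; 180; 190; 125; 175];

% One group per observed (sex, smoker) combination
[g, sexLevel, smokerLevel] = findgroups(sex, smoker);

minWeight = splitapply(@min, weight, g);
maxWeight = splitapply(@max, weight, g);

disp(table(sexLevel, smokerLevel, minWeight, maxWeight))
```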
Specify the names for the columns in the output.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight','VarNames',{'Gender','Smoker',...
'GroupCount','LowestWeight','HighestWeight'})
statarray =
                Gender    Smoker    GroupCount    LowestWeight    HighestWeight
    Female_0    Female    false     40            111             147
    Female_1    Female    true      13            115             146
    Male_0      Male      false     26            158             194
    Male_1      Male      true      21            164             202
Summary Statistics for a Dataset Array Without Grouping
Load the sample data.
load('hospital');
The dataset array hospital has 100 observations and 7 variables.
Create a dataset array with only the variables Age, Weight, and Smoker.
load('hospital');
dsa = hospital(:,{'Age','Weight','Smoker'});
The variables Age and Weight have numeric values, and Smoker has logical values.
Compute the mean, minimum, and maximum for the numeric and logical arrays, Age, Weight, and Smoker, with no grouping.
load('hospital');
dsa = hospital(:,{'Age','Weight','Smoker'});
statarray = grpstats(dsa,[],{'mean','min','max'})
statarray =
           GroupCount    mean_Age    min_Age    max_Age    mean_Weight
    All    100           38.28       25         50         154

           min_Weight    max_Weight    mean_Smoker    min_Smoker    max_Smoker
    All    111           202           0.34           false         true
The observation name All indicates that all observations in dsa were used to compute the summary statistics.
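With no grouping, each statistic is simply computed over every observation. The short sketch below shows this on made-up data (not the hospital dataset) and also illustrates why the logical variable reports a fractional mean but logical minimum and maximum values.

```matlab
% Hypothetical toy data (not the hospital dataset)
age    = [25; 50; 40; 37];
smoker = [true; false; false; false];

% Numeric variable: mean, min, and max over all observations
ageStats = [mean(age), min(age), max(age)];

% Logical variable: the mean is the proportion of true values,
% while min and max stay logical
meanSmoker = mean(smoker);
minSmoker  = min(smoker);
maxSmoker  = max(smoker);

disp(ageStats)
disp(meanSmoker)
```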
Group Means for a Matrix Using One or More Grouping Variables
Load the sample data.
load('carsmall')
All variables are measured for 100 cars. Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA). Cylinders has three unique values, 4, 6, and 8, indicating the number of cylinders in each car.
Calculate the mean acceleration, grouped by country of origin.
load('carsmall')
means = grpstats(Acceleration,Origin)
means =
14.4377
18.0500
15.8867
16.3778
16.6000
15.5000
means is a 6-by-1 vector of mean accelerations, where each value corresponds to a country of origin.
Calculate the mean acceleration, grouped by both country of origin and number of cylinders.
load('carsmall')
means = grpstats(Acceleration,{Origin,Cylinders})
means =
17.0818
16.5267
11.6406
18.0500
15.9143
15.5000
16.3375
16.7000
16.6000
15.5000
There are 18 possible combinations of grouping variable values because Origin has 6 unique values and Cylinders has 3 unique values. Only 10 of the possible combinations appear in the data, so means is a 10-by-1 vector of group means corresponding to the observed combinations of values.
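The difference between possible and observed combinations can be checked with a few lines of base MATLAB. This is a minimal sketch on made-up data (not carsmall); the variable names are assumptions for illustration.

```matlab
% Hypothetical toy data (not carsmall)
origin    = {'USA'; 'USA'; 'Japan'; 'Japan'; 'Germany'};
cylinders = [4; 8; 4; 4; 6];

% Every pairing of a unique origin with a unique cylinder count
nPossible = numel(unique(origin)) * numel(unique(cylinders));

% findgroups creates a group only for combinations that occur in the data
g = findgroups(origin, cylinders);
nObserved = max(g);

fprintf('%d possible, %d observed\n', nPossible, nObserved);
```

In this toy sample there are 3 origins and 3 cylinder counts, so 9 combinations are possible, but only 4 occur.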
Return the group names along with the mean acceleration for each group.
load('carsmall')
[means,grps] = grpstats(Acceleration,{Origin,Cylinders},{'mean','gname'})
means =
17.0818
16.5267
11.6406
18.0500
15.9143
15.5000
16.3375
16.7000
16.6000
15.5000
grps =
10x2 cell array
{'USA' } {'4'}
{'USA' } {'6'}
{'USA' } {'8'}
{'France' } {'4'}
{'Japan' } {'4'}
{'Japan' } {'6'}
{'Germany'} {'4'}
{'Germany'} {'6'}
{'Sweden' } {'4'}
{'Italy' } {'4'}
The output grps shows the 10 observed combinations of grouping variable values. For example, the mean acceleration of 4-cylinder cars made in France is 18.05.
Multiple Summary Statistics for a Matrix Organized by Group
Load the sample data.
load carsmall
The variable Acceleration was measured for 100 cars. The variable Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA).
Return the minimum and maximum acceleration grouped by country of origin.
load carsmall
[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'})
grpMin =
8.0000
15.3000
13.9000
12.2000
15.7000
15.5000
grpMax =
22.2000
21.9000
18.2000
24.6000
17.5000
15.5000
grp =
6x1 cell array
{'USA' }
{'France' }
{'Japan' }
{'Germany'}
{'Sweden' }
{'Italy' }
The sample car with the lowest acceleration is made in the USA, and the sample car with the highest acceleration is made in Germany.
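A conclusion like this can also be read off programmatically by locating the smallest group minimum and the largest group maximum. The following sketch does so on made-up data (not carsmall), using findgroups and splitapply as stand-ins for grpstats.

```matlab
% Hypothetical toy data (not carsmall)
origin = categorical({'USA'; 'USA'; 'Japan'; 'Germany'; 'Germany'});
accel  = [8.0; 15.0; 14.0; 12.5; 24.5];

[g, names] = findgroups(origin);
grpMin = splitapply(@min, accel, g);
grpMax = splitapply(@max, accel, g);

% Index of the group holding the overall lowest and highest value
[~, iLow]  = min(grpMin);
[~, iHigh] = max(grpMax);

lowName  = char(names(iLow));
highName = char(names(iHigh));
fprintf('Lowest: %s, highest: %s\n', lowName, highName);
```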
Plot Prediction Intervals for a New Observation in Each Group
Load the sample data.
load('carsmall')
The variable Weight was measured for 100 cars. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.
Calculate the mean weight and 90% prediction intervals for each model year.
load('carsmall')
[means,pred,grp] = grpstats(Weight,Model_Year,...
{'mean','predci','gname'},'Alpha',0.1);
Plot error bars showing the mean weight and 90% prediction intervals, grouped by model year. Label the horizontal axis with the group names.
load('carsmall')
[means,pred,grp] = grpstats(Weight,Model_Year,...
{'mean','predci','gname'},'Alpha',0.1);
ngrps = length(grp); % Number of groups
errorbar((1:ngrps)',means,pred(:,2)-means)
xlim([0.5 3.5])
set(gca,'xtick',1:ngrps,'xticklabel',grp)
title('90% Prediction Intervals for Weight by Year')
fig2plotly(gcf);
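For intuition about what a prediction interval for a new observation involves, here is a minimal sketch for a single group. It assumes the standard normal-theory formula, mean plus or minus t times s times sqrt(1 + 1/n); whether grpstats computes 'predci' in exactly this way is an assumption, and the sample values are made up. The tinv function requires Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical sample for a single group (not carsmall)
x     = [2800; 3100; 2950; 3300; 3050];
alpha = 0.1;                        % 90% interval, matching 'Alpha',0.1

n = numel(x);
m = mean(x);
s = std(x);                         % sample standard deviation
t = tinv(1 - alpha/2, n - 1);       % two-sided t critical value

% Half-width covers the spread of a new observation plus uncertainty in the mean
halfWidth    = t * s * sqrt(1 + 1/n);
predInterval = [m - halfWidth, m + halfWidth];

disp(predInterval)
```

The interval is symmetric about the mean, which is why the errorbar call above can use pred(:,2)-means as the error bar length on both sides.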
Plot Group Means and Confidence Intervals
Load the sample data.
load('carsmall')
The variables Acceleration and Weight are the acceleration and weight values measured for 100 cars. The variable Cylinders is the number of cylinders in each car. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.
Plot mean acceleration, grouped by Cylinders, with 95% confidence intervals.
load('carsmall')
grpstats(Acceleration,Cylinders,0.05);
fig2plotly(gcf);
The mean acceleration for cars with 8 cylinders is significantly lower than for cars with 4 or 6 cylinders.
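The sketch below shows one way to compute a confidence interval for a group mean by hand and to check whether two intervals overlap. It assumes the usual t-based interval, mean plus or minus t times s/sqrt(n), and uses made-up samples; the helper name ciOfMean is hypothetical. Non-overlapping intervals are only an informal indication of a difference, not a formal hypothesis test.

```matlab
% Hypothetical samples for two groups (not carsmall)
a = [16.0; 17.5; 15.5; 16.5; 17.0];
b = [11.0; 12.5; 12.0; 11.5; 13.0];
alpha = 0.05;                       % 95% confidence level

% Hypothetical helper: returns [lower, upper] for one sample
ciOfMean = @(x) mean(x) + [-1, 1] * ...
    tinv(1 - alpha/2, numel(x) - 1) * std(x) / sqrt(numel(x));

ciA = ciOfMean(a);
ciB = ciOfMean(b);

% Informal check: do the two intervals overlap?
overlap = ciA(1) <= ciB(2) && ciB(1) <= ciA(2);

disp(ciA)
disp(ciB)
disp(overlap)
```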
Plot mean acceleration and mean weight, grouped by Cylinders, with 95% confidence intervals. Divide the Weight values by 1000 so that the means of Weight and Acceleration are of the same order of magnitude.
load('carsmall')
grpstats([Acceleration,Weight/1000],Cylinders,0.05);
fig2plotly(gcf);
The average weight of cars increases with the number of cylinders, and the average acceleration decreases with the number of cylinders.
Plot mean acceleration, grouped by both Cylinders and Model_Year. Specify 95% confidence intervals.
load('carsmall')
grpstats(Acceleration,{Cylinders,Model_Year},0.05);
fig2plotly(gcf);
There are nine possible combinations of grouping variable values because there are three unique values in Cylinders and three unique values in Model_Year. The plot does not show 8-cylinder cars with model year 1982 because the data did not include this combination.
The mean acceleration of 8-cylinder cars made in 1976 is significantly larger than the mean acceleration of 8-cylinder cars made in 1970.