Group By in MATLAB®
How to use Group By in MATLAB® with Plotly.
Dataset Array Summary Statistics Organized by Group
Load the sample data.
load('hospital');
The dataset array hospital has 100 observations and 7 variables.
Create a dataset array with only the variables Sex, Age, Weight, and Smoker.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
Sex is a nominal array, with levels Male and Female. The variables Age and Weight have numeric values, and Smoker has logical values.
Compute the mean for the numeric and logical arrays, Age, Weight, and Smoker, grouped by the levels in Sex.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,'Sex')
statarray =
              Sex       GroupCount    mean_Age    mean_Weight    mean_Smoker
    Female    Female    53            37.717      130.47         0.24528
    Male      Male      47            38.915      180.53         0.44681
statarray is a dataset array with two rows, corresponding to the levels in Sex. GroupCount is the number of observations in each group. The means of Age, Weight, and Smoker, grouped by Sex, are given in mean_Age, mean_Weight, and mean_Smoker.
Compute the mean for Age and Weight, grouped by the values in Smoker.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,'Smoker','mean','DataVars',{'Age','Weight'})
statarray =
         Smoker    GroupCount    mean_Age    mean_Weight
    0    false     66            37.97       149.91
    1    true      34            38.882      161.94
In this case, not all variables in dsa (excluding the grouping variable, Smoker) are numeric or logical arrays; the variable Sex is a nominal array. When not all variables in the input dataset array are numeric or logical arrays, you must specify the variables for which you want to calculate summary statistics using DataVars.
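The rule above can also be applied programmatically. The following is a minimal sketch, not part of the original example: it assumes that datasetfun returns one logical value per variable and uses the hypothetical helper names groupVar, isStatVar, and dataVars to build the DataVars list from the variable types.

```matlab
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
groupVar = 'Smoker';                      % grouping variable (illustrative name)

% True for each variable that is a numeric or logical array.
isStatVar = datasetfun(@(x) isnumeric(x) || islogical(x), dsa);

% Keep the numeric/logical variables, minus the grouping variable.
varNames = dsa.Properties.VarNames;
dataVars = setdiff(varNames(isStatVar), groupVar, 'stable');

% Same call as above, with DataVars derived instead of typed by hand.
statarray = grpstats(dsa, groupVar, 'mean', 'DataVars', dataVars)
```

For this dataset array, dataVars should contain Age and Weight, because Sex is nominal and Smoker is the grouping variable.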
Compute the minimum and maximum weight, grouped by the combinations of values in Sex and Smoker.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight')
statarray =
                Sex       Smoker    GroupCount    min_Weight    max_Weight
    Female_0    Female    false     40            111           147
    Female_1    Female    true      13            115           146
    Male_0      Male      false     26            158           194
    Male_1      Male      true      21            164           202
There are two unique values in Smoker and two levels in Sex, for a total of four possible combinations of values: Female Nonsmoker (Female_0), Female Smoker (Female_1), Male Nonsmoker (Male_0), and Male Smoker (Male_1).
Specify the names for the columns in the output.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight','VarNames',{'Gender','Smoker',...
'GroupCount','LowestWeight','HighestWeight'})
statarray =
                Gender    Smoker    GroupCount    LowestWeight    HighestWeight
    Female_0    Female    false     40            111             147
    Female_1    Female    true      13            115             146
    Male_0      Male      false     26            158             194
    Male_1      Male      true      21            164             202
Summary Statistics for a Dataset Array Without Grouping
Load the sample data.
load('hospital');
The dataset array hospital has 100 observations and 7 variables.
Create a dataset array with only the variables Age, Weight, and Smoker.
load('hospital');
dsa = hospital(:,{'Age','Weight','Smoker'});
The variables Age and Weight have numeric values, and Smoker has logical values.
Compute the mean, minimum, and maximum for the numeric and logical arrays, Age, Weight, and Smoker, with no grouping.
load('hospital');
dsa = hospital(:,{'Age','Weight','Smoker'});
statarray = grpstats(dsa,[],{'mean','min','max'})
statarray =
           GroupCount    mean_Age    min_Age    max_Age    mean_Weight
    All    100           38.28       25         50         154

           min_Weight    max_Weight    mean_Smoker    min_Smoker    max_Smoker
    All    111           202           0.34           false         true
The observation name All indicates that all observations in dsa were used to compute the summary statistics.
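As a quick cross-check, the ungrouped statistics can be reproduced by calling the summary functions directly on each variable. This is only a sketch; the variable names ageStats, weightStats, and smokerRate are illustrative and do not appear in the original example.

```matlab
load('hospital');

% Statistics over all 100 observations, one variable at a time.
ageStats    = [mean(hospital.Age), min(hospital.Age), max(hospital.Age)]
weightStats = [mean(hospital.Weight), min(hospital.Weight), max(hospital.Weight)]

% The mean of a logical array is the fraction of true values.
smokerRate  = mean(hospital.Smoker)
```

The values should agree with the All row returned by grpstats.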
Group Means for a Matrix Using One or More Grouping Variables
Load the sample data.
load('carsmall')
All variables are measured for 100 cars. Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA). Cylinders has three unique values, 4, 6, and 8, indicating the number of cylinders in each car.
Calculate the mean acceleration, grouped by country of origin.
load('carsmall')
means = grpstats(Acceleration,Origin)
means =
   14.4377
   18.0500
   15.8867
   16.3778
   16.6000
   15.5000
means is a 6-by-1 vector of mean accelerations, where each value corresponds to a country of origin.
Calculate the mean acceleration, grouped by both country of origin and number of cylinders.
load('carsmall')
means = grpstats(Acceleration,{Origin,Cylinders})
means =
   17.0818
   16.5267
   11.6406
   18.0500
   15.9143
   15.5000
   16.3375
   16.7000
   16.6000
   15.5000
There are 18 possible combinations of grouping variable values because Origin has 6 unique values and Cylinders has 3 unique values. Only 10 of the possible combinations appear in the data, so means is a 10-by-1 vector of group means corresponding to the observed combinations of values.
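The counting argument above can be checked directly from the grouping variables. The sketch below assumes Origin is stored as a character matrix (hence the cellstr conversion); the names originIdx, cylIdx, nPossible, and nObserved are illustrative.

```matlab
load('carsmall')

% Convert each grouping variable to integer group indices.
[~,~,originIdx] = unique(cellstr(Origin));   % cellstr also trims padding
[~,~,cylIdx]    = unique(Cylinders);

% Possible combinations: product of the numbers of unique values.
nPossible = max(originIdx) * max(cylIdx)

% Observed combinations: distinct (origin, cylinders) index pairs.
nObserved = size(unique([originIdx cylIdx],'rows'),1)
```

nObserved should equal the number of rows that grpstats returns for the two grouping variables.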
Return the group names along with the mean acceleration for each group.
load('carsmall')
[means,grps] = grpstats(Acceleration,{Origin,Cylinders},{'mean','gname'})
means =
   17.0818
   16.5267
   11.6406
   18.0500
   15.9143
   15.5000
   16.3375
   16.7000
   16.6000
   15.5000

grps =
  10x2 cell array
    {'USA'    }    {'4'}
    {'USA'    }    {'6'}
    {'USA'    }    {'8'}
    {'France' }    {'4'}
    {'Japan'  }    {'4'}
    {'Japan'  }    {'6'}
    {'Germany'}    {'4'}
    {'Germany'}    {'6'}
    {'Sweden' }    {'4'}
    {'Italy'  }    {'4'}
The output grps shows the 10 observed combinations of grouping variable values. For example, the mean acceleration of 4-cylinder cars made in France is 18.05.
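A single group mean can be verified with logical indexing. This is a small sketch with an illustrative variable name, frenchFourCylMean, again assuming Origin is a character matrix.

```matlab
load('carsmall')

% Rows for 4-cylinder cars whose country of origin is France.
isFrenchFourCyl = strcmp(cellstr(Origin),'France') & Cylinders == 4;

% Mean acceleration for that one group.
frenchFourCylMean = mean(Acceleration(isFrenchFourCyl))
```

The result should match the France, 4-cylinder entry in means.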
Multiple Summary Statistics for a Matrix Organized by Group
Load the sample data.
load carsmall
The variable Acceleration was measured for 100 cars. The variable Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA).
Return the minimum and maximum acceleration grouped by country of origin.
load carsmall
[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'})
grpMin =
    8.0000
   15.3000
   13.9000
   12.2000
   15.7000
   15.5000

grpMax =
   22.2000
   21.9000
   18.2000
   24.6000
   17.5000
   15.5000

grp =
  6x1 cell array
    {'USA'    }
    {'France' }
    {'Japan'  }
    {'Germany'}
    {'Sweden' }
    {'Italy'  }
The sample car with the lowest acceleration is made in the USA, and the sample car with the highest acceleration is made in Germany.
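Rather than reading the extremes off the printed vectors, the group names can be looked up by index. This is a brief sketch; iLow and iHigh are illustrative names.

```matlab
load carsmall
[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'});

% Position of the smallest group minimum and the largest group maximum.
[~,iLow]  = min(grpMin);
[~,iHigh] = max(grpMax);

fprintf('Lowest acceleration: %s, highest acceleration: %s\n', ...
    grp{iLow}, grp{iHigh});
```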
Plot Prediction Intervals for a New Observation in Each Group
Load the sample data.
load('carsmall')
The variable Weight was measured for 100 cars. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.
Calculate the mean weight and 90% prediction intervals for each model year.
load('carsmall')
[means,pred,grp] = grpstats(Weight,Model_Year,...
{'mean','predci','gname'},'Alpha',0.1);
Plot error bars showing the mean weight and 90% prediction intervals, grouped by model year. Label the horizontal axis with the group names.
load('carsmall')
[means,pred,grp] = grpstats(Weight,Model_Year,...
{'mean','predci','gname'},'Alpha',0.1);
ngrps = length(grp); % Number of groups
errorbar((1:ngrps)',means,pred(:,2)-means)
xlim([0.5 3.5])
set(gca,'xtick',1:ngrps,'xticklabel',grp)
title('90% Prediction Intervals for Weight by Year')
fig2plotly(gcf);
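The errorbar call above passes only the upper distance, pred(:,2)-means, which draws a symmetric bar. The sketch below checks that the interval is symmetric about the mean and compares it with the textbook t-based half-width for a new observation, t * s * sqrt(1 + 1/n). That this is the formula grpstats uses internally, and that the groups are returned in ascending order of Model_Year, are assumptions made here, not statements from the original example.

```matlab
load('carsmall')
alpha = 0.1;
[means,pred,grp] = grpstats(Weight,Model_Year,...
    {'mean','predci','gname'},'Alpha',alpha);

% Distances from the mean to each end of the interval.
upperDist = pred(:,2) - means;
lowerDist = means - pred(:,1);

% Textbook half-width per model year (assumed formula, see lead-in).
years = unique(Model_Year);               % sorted ascending
halfWidth = zeros(numel(years),1);
for k = 1:numel(years)
    w = Weight(Model_Year == years(k));
    n = numel(w);
    halfWidth(k) = tinv(1 - alpha/2, n - 1) * std(w) * sqrt(1 + 1/n);
end

% Side-by-side comparison: assumed formula, upper distance, lower distance.
disp([halfWidth upperDist lowerDist])
```

If the three columns agree, plotting only the upper distance loses no information.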
Plot Group Means and Confidence Intervals
Load the sample data.
load('carsmall')
The variables Acceleration and Weight are the acceleration and weight values measured for 100 cars. The variable Cylinders is the number of cylinders in each car. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.
Plot mean acceleration, grouped by Cylinders, with 95% confidence intervals.
load('carsmall')
grpstats(Acceleration,Cylinders,0.05);
fig2plotly(gcf);
The mean acceleration for cars with 8 cylinders is significantly lower than for cars with 4 or 6 cylinders.
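The numbers behind the plot can be retrieved by requesting 'mean' and 'meanci' as outputs instead of plotting. The sketch below uses illustrative names (avgAccel, ci, cyl, ciTable, separated). Treating non-overlapping intervals as evidence of a significant difference is a conservative visual rule assumed here, not a formal test.

```matlab
load('carsmall')

% Group means, 95% confidence intervals (default Alpha = 0.05), group names.
[avgAccel,ci,cyl] = grpstats(Acceleration,Cylinders,{'mean','meanci','gname'});

ciTable = table(cyl, avgAccel, ci(:,1), ci(:,2), ...
    'VariableNames', {'Cylinders','Mean','Lower','Upper'})

% Does the 8-cylinder interval lie entirely below the other intervals?
is8 = strcmp(cyl,'8');
separated = ci(is8,2) < min(ci(~is8,1))
```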
Plot mean acceleration and weight, grouped by Cylinders, with 95% confidence intervals. Divide the Weight values by 1000 so the means of Weight and Acceleration are the same order of magnitude.
load('carsmall')
grpstats([Acceleration,Weight/1000],Cylinders,0.05);
fig2plotly(gcf);
The average weight of cars increases with the number of cylinders, and the average acceleration decreases with the number of cylinders.
Plot mean acceleration, grouped by both Cylinders and Model_Year. Specify 95% confidence intervals.
load('carsmall')
grpstats(Acceleration,{Cylinders,Model_Year},0.05);
fig2plotly(gcf);
There are nine possible combinations of grouping variable values because there are three unique values in Cylinders and three unique values in Model_Year. The plot does not show 8-cylinder cars with model year 1982 because the data did not include this combination.
The mean acceleration of 8-cylinder cars made in 1976 is significantly larger than the mean acceleration of 8-cylinder cars made in 1970.