# Group By in MATLAB®

How to use Group By in MATLAB® with Plotly.

## Dataset Array Summary Statistics Organized by Group

Load the sample data.

load('hospital');


The dataset array hospital has 100 observations and 7 variables.

Create a dataset array with only the variables Sex, Age, Weight, and Smoker.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});


Sex is a nominal array, with levels Male and Female. The variables Age and Weight have numeric values, and Smoker has logical values.

Compute the mean for the numeric and logical arrays, Age, Weight, and Smoker, grouped by the levels in Sex.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

statarray = grpstats(dsa,'Sex')

statarray =

              Sex       GroupCount    mean_Age    mean_Weight    mean_Smoker
    Female    Female    53            37.717      130.47         0.24528
    Male      Male      47            38.915      180.53         0.44681


statarray is a dataset array with two rows, corresponding to the levels in Sex. GroupCount is the number of observations in each group. The means of Age, Weight, and Smoker, grouped by Sex, are given in mean_Age, mean_Weight, and mean_Smoker.
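MathWorks recommends the table data type over dataset arrays in recent releases, and grpstats also accepts tables. The following is a minimal sketch of the same grouped means computed on a table, assuming a release that provides dataset2table; the variable names tbl and stattable are illustrative and not part of the original example.

```matlab
% Sketch: grouped means on a table instead of a dataset array.
load('hospital');                                    % sample data shipped with the toolbox
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});   % same subset as above
tbl = dataset2table(dsa);                            % convert the dataset array to a table
stattable = grpstats(tbl,'Sex');                     % the default summary statistic is the mean
disp(stattable)
```

The output should have the same layout as statarray: one row per level of Sex, with GroupCount and the mean_ columns.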

Compute the mean for Age and Weight, grouped by the values in Smoker.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

statarray = grpstats(dsa,'Smoker','mean','DataVars',{'Age','Weight'})

statarray =

         Smoker    GroupCount    mean_Age    mean_Weight
    0    false     66            37.97       149.91
    1    true      34            38.882      161.94


In this case, one of the non-grouping variables in dsa, Sex, is a nominal array rather than a numeric or logical array. When the input dataset array contains variables that are not numeric or logical, you must use DataVars to specify the variables for which you want to calculate summary statistics.
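One way to build the DataVars list programmatically is to test each variable's type first. The sketch below does this on a table copy of dsa using varfun; the names tbl, isSummarizable, and candidateVars are hypothetical helpers introduced only for illustration.

```matlab
% Sketch: find the numeric or logical variables, then pass them as DataVars.
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
tbl = dataset2table(dsa);                            % table copy, used only for the type check

% One logical flag per variable: true if numeric or logical
isSummarizable = varfun(@(x) isnumeric(x) || islogical(x), tbl, ...
    'OutputFormat','uniform');
candidateVars = tbl.Properties.VariableNames(isSummarizable);

% Remove the grouping variable from the list, keeping the original order
candidateVars = setdiff(candidateVars, {'Smoker'}, 'stable');

statarray = grpstats(dsa,'Smoker','mean','DataVars',candidateVars);
```

For this dataset, candidateVars contains Age and Weight, so the call is equivalent to the one shown above.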

Compute the minimum and maximum weight, grouped by the combinations of values in Sex and Smoker.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight')

statarray =

                Sex       Smoker    GroupCount    min_Weight    max_Weight
    Female_0    Female    false     40            111           147
    Female_1    Female    true      13            115           146
    Male_0      Male      false     26            158           194
    Male_1      Male      true      21            164           202


There are two unique values in Smoker and two levels in Sex, for a total of four possible combinations of values: Female Nonsmoker (Female_0), Female Smoker (Female_1), Male Nonsmoker (Male_0), and Male Smoker (Male_1).
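To make the four combinations concrete, the sketch below reproduces the group counts and weight ranges with findgroups and splitapply (available in R2015b and later). Converting Sex with cellstr is an assumption made here so that the grouping input is a plain cell array of character vectors; the variable names are illustrative.

```matlab
% Sketch: enumerate the Sex-by-Smoker combinations without grpstats.
load('hospital');
sex = cellstr(hospital.Sex);                         % nominal levels as character vectors

% g holds one group index per observation; the ID outputs describe each group
[g, sexID, smokerID] = findgroups(sex, hospital.Smoker);

counts = splitapply(@numel, hospital.Weight, g);     % observations per combination
minW   = splitapply(@min,   hospital.Weight, g);     % lowest weight per combination
maxW   = splitapply(@max,   hospital.Weight, g);     % highest weight per combination

combos = table(sexID, smokerID, counts, minW, maxW)
```

The rows of combos should match the GroupCount, min_Weight, and max_Weight columns of statarray.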

Specify the names for the columns in the output.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight','VarNames',{'Gender','Smoker',...
'GroupCount','LowestWeight','HighestWeight'})

statarray =

                Gender    Smoker    GroupCount    LowestWeight    HighestWeight
    Female_0    Female    false     40            111             147
    Female_1    Female    true      13            115             146
    Male_0      Male      false     26            158             194
    Male_1      Male      true      21            164             202


## Summary Statistics for a Dataset Array Without Grouping

Load the sample data.

load('hospital');


The dataset array hospital has 100 observations and 7 variables.

Create a dataset array with only the variables Age, Weight, and Smoker.

load('hospital');

dsa = hospital(:,{'Age','Weight','Smoker'});


The variables Age and Weight have numeric values, and Smoker has logical values.

Compute the mean, minimum, and maximum for the numeric and logical arrays, Age, Weight, and Smoker, with no grouping.

load('hospital');

dsa = hospital(:,{'Age','Weight','Smoker'});

statarray = grpstats(dsa,[],{'mean','min','max'})

statarray =

           GroupCount    mean_Age    min_Age    max_Age    mean_Weight
    All    100           38.28       25         50         154

           min_Weight    max_Weight    mean_Smoker    min_Smoker    max_Smoker
    All    111           202           0.34           false         true


The observation name All indicates that all observations in dsa were used to compute the summary statistics.
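With no grouping variable, each statistic reduces to an ordinary column summary. As a quick cross-check, the following sketch computes the same values directly with mean, min, and max; the variable names are illustrative.

```matlab
% Sketch: ungrouped summary statistics computed directly.
load('hospital');

ageStats    = [mean(hospital.Age),    min(hospital.Age),    max(hospital.Age)];
weightStats = [mean(hospital.Weight), min(hospital.Weight), max(hospital.Weight)];
smokerRate  = mean(hospital.Smoker);   % mean of a logical vector is the fraction of true values

fprintf('Age:    mean %.2f, min %g, max %g\n', ageStats);
fprintf('Weight: mean %.2f, min %g, max %g\n', weightStats);
fprintf('Smoker: mean %.2f\n', smokerRate);
```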

## Group Means for a Matrix Using One or More Grouping Variables

Load the sample data.

load('carsmall')


All variables are measured for 100 cars. Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA). Cylinders has three unique values, 4, 6, and 8, indicating the number of cylinders in each car.

Calculate the mean acceleration, grouped by country of origin.

load('carsmall')

means = grpstats(Acceleration,Origin)

means =

14.4377
18.0500
15.8867
16.3778
16.6000
15.5000


means is a 6-by-1 vector of mean accelerations, where each value corresponds to a country of origin.
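The vector alone does not show which mean belongs to which country. A small sketch that requests the group names with 'gname' and pairs them with the means in a table follows; the table and its column names are illustrative.

```matlab
% Sketch: label each group mean with its country of origin.
load('carsmall')

[means, names] = grpstats(Acceleration, Origin, {'mean','gname'});

byOrigin = table(names, means, ...
    'VariableNames', {'Origin','MeanAcceleration'})
```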

Calculate the mean acceleration, grouped by both country of origin and number of cylinders.

load('carsmall')

means = grpstats(Acceleration,{Origin,Cylinders})

means =

17.0818
16.5267
11.6406
18.0500
15.9143
15.5000
16.3375
16.7000
16.6000
15.5000


There are 18 possible combinations of grouping variable values because Origin has 6 unique values and Cylinders has 3 unique values. Only 10 of the possible combinations appear in the data, so means is a 10-by-1 vector of group means corresponding to the observed combinations of values.
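The count of possible versus observed combinations can be checked with findgroups (R2015b and later). This is a sketch only: Origin is converted with cellstr because it is stored as a character matrix, and the variable names are illustrative.

```matlab
% Sketch: count possible and observed Origin-by-Cylinders combinations.
load('carsmall')

% One group index per car; the ID outputs list the observed combinations
[g, originID, cylID] = findgroups(cellstr(Origin), Cylinders);

nPossible = numel(unique(originID)) * numel(unique(cylID));   % all pairings of observed values
nObserved = max(g);                                           % combinations present in the data

fprintf('%d of %d possible combinations appear in the data\n', nObserved, nPossible);
```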

Return the group names along with the mean acceleration for each group.

load('carsmall')

[means,grps] = grpstats(Acceleration,{Origin,Cylinders},{'mean','gname'})

means =

17.0818
16.5267
11.6406
18.0500
15.9143
15.5000
16.3375
16.7000
16.6000
15.5000

grps =

10x2 cell array

{'USA'    }    {'4'}
{'USA'    }    {'6'}
{'USA'    }    {'8'}
{'France' }    {'4'}
{'Japan'  }    {'4'}
{'Japan'  }    {'6'}
{'Germany'}    {'4'}
{'Germany'}    {'6'}
{'Sweden' }    {'4'}
{'Italy'  }    {'4'}


The output grps shows the 10 observed combinations of grouping variable values. For example, the mean acceleration of 4-cylinder cars made in France is 18.05.

## Multiple Summary Statistics for a Matrix Organized by Group

Load the sample data.

load('carsmall')


The variable Acceleration was measured for 100 cars. The variable Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA).

Return the minimum and maximum acceleration grouped by country of origin.

load('carsmall')

[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'})

grpMin =

8.0000
15.3000
13.9000
12.2000
15.7000
15.5000

grpMax =

22.2000
21.9000
18.2000
24.6000
17.5000
15.5000

grp =

6x1 cell array

{'USA'    }
{'France' }
{'Japan'  }
{'Germany'}
{'Sweden' }
{'Italy'  }


The sample car with the lowest acceleration is made in the USA, and the sample car with the highest acceleration is made in Germany.
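Rather than reading the extremes off the printed vectors, the groups can be located programmatically. The sketch below indexes into grp with the positions returned by min and max; iLow and iHigh are illustrative names.

```matlab
% Sketch: find which country holds the overall minimum and maximum.
load('carsmall')

[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'});

[~, iLow]  = min(grpMin);    % position of the smallest group minimum
[~, iHigh] = max(grpMax);    % position of the largest group maximum

fprintf('Lowest acceleration:  %s (%.1f)\n', grp{iLow},  grpMin(iLow));
fprintf('Highest acceleration: %s (%.1f)\n', grp{iHigh}, grpMax(iHigh));
```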

## Plot Prediction Intervals for a New Observation in Each Group

Load the sample data.

load('carsmall')


The variable Weight was measured for 100 cars. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.

Calculate the mean weight and 90% prediction intervals for each model year.

load('carsmall')

[means,pred,grp] = grpstats(Weight,Model_Year,...
{'mean','predci','gname'},'Alpha',0.1);


Plot error bars showing the mean weight and 90% prediction intervals, grouped by model year. Label the horizontal axis with the group names.

load('carsmall')

[means,pred,grp] = grpstats(Weight,Model_Year,...
{'mean','predci','gname'},'Alpha',0.1);

ngrps = length(grp); % Number of groups
errorbar((1:ngrps)',means,pred(:,2)-means)
xlim([0.5 3.5])
set(gca,'xtick',1:ngrps,'xticklabel',grp)
title('90% Prediction Intervals for Weight by Year')

fig2plotly(gcf);
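The errorbar call uses the upper half-width, pred(:,2)-means, for both directions, which relies on each interval being symmetric about its mean. The sketch below checks that symmetry and compares one interval against the usual t-based prediction interval for a new observation. That formula is an assumption about how 'predci' is computed, not something stated in this example, and the variable names are illustrative.

```matlab
% Sketch: check symmetry of the prediction intervals and reproduce one by hand.
load('carsmall')
alpha = 0.1;

[means,pred,grp] = grpstats(Weight,Model_Year, ...
    {'mean','predci','gname'},'Alpha',alpha);

% Distances from the mean to the lower and upper limits
lowerHalf = means - pred(:,1);
upperHalf = pred(:,2) - means;
maxAsymmetry = max(abs(upperHalf - lowerHalf));

% Assumed formula: mean +/- t(1-alpha/2, n-1) * s * sqrt(1 + 1/n)
k = find(strcmp(grp,'70'));              % row of the 1970 group
w = Weight(Model_Year == 70);
w = w(~isnan(w));                        % ignore missing values, if any
n = numel(w);
halfWidth = tinv(1 - alpha/2, n - 1) * std(w) * sqrt(1 + 1/n);
manualInterval = [mean(w) - halfWidth, mean(w) + halfWidth];

difference = max(abs(manualInterval - pred(k,:)));
```

If maxAsymmetry and difference are both close to zero, the plotted bars represent the full intervals and the assumed formula matches the grpstats output for that group.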


## Plot Group Means and Confidence Intervals

Load the sample data.

load('carsmall')


The variables Acceleration and Weight are the acceleration and weight values measured for 100 cars. The variable Cylinders is the number of cylinders in each car. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.

Plot mean acceleration, grouped by Cylinders, with 95% confidence intervals.

load('carsmall')

grpstats(Acceleration,Cylinders,0.05);

fig2plotly(gcf);


The mean acceleration for cars with 8 cylinders is significantly lower than for cars with 4 or 6 cylinders.
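The numbers behind the plot can be retrieved with the 'mean' and 'meanci' statistics. The sketch below tabulates them and tests whether the 8-cylinder interval lies entirely below the other two; treating non-overlapping intervals as evidence of a difference is a simplification made here, and the variable names are illustrative.

```matlab
% Sketch: numeric group means and 95% confidence intervals by Cylinders.
load('carsmall')

[m, ci, names] = grpstats(Acceleration, Cylinders, ...
    {'mean','meanci','gname'}, 'Alpha', 0.05);

summary = table(names, m, ci(:,1), ci(:,2), ...
    'VariableNames', {'Cylinders','Mean','Lower','Upper'})

% True if the 8-cylinder upper limit is below every other group's lower limit
is8 = strcmp(names, '8');
separated = all(ci(is8,2) < ci(~is8,1))
```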

Plot mean acceleration and weight, grouped by Cylinders, with 95% confidence intervals. Divide the Weight values by 1000 so that the means of Weight and Acceleration are of the same order of magnitude.

load('carsmall')

grpstats([Acceleration,Weight/1000],Cylinders,0.05);

fig2plotly(gcf);


The average weight of cars increases with the number of cylinders, and the average acceleration decreases with the number of cylinders.

Plot mean acceleration, grouped by both Cylinders and Model_Year. Specify 95% confidence intervals.

load('carsmall')

grpstats(Acceleration,{Cylinders,Model_Year},0.05);

fig2plotly(gcf);


There are nine possible combinations of grouping variable values because there are three unique values in Cylinders and three unique values in Model_Year. The plot does not show 8-cylinder cars with model year 1982 because the data did not include this combination.
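A cross-tabulation makes the missing combination visible. The following sketch uses crosstab to count cars for each pairing of Cylinders and Model_Year and lists any pairing with a zero count; the variable names are illustrative.

```matlab
% Sketch: count cars for every Cylinders-by-Model_Year combination.
load('carsmall')

% counts(i,j) is the number of cars in row group i and column group j;
% labels(:,1) names the rows (Cylinders), labels(:,2) the columns (Model_Year)
[counts, ~, ~, labels] = crosstab(Cylinders, Model_Year);
disp(counts)

% Report combinations that never occur in the data
[r, c] = find(counts == 0);
for k = 1:numel(r)
    fprintf('No %s-cylinder cars with model year %s\n', ...
        labels{r(k),1}, labels{c(k),2});
end
```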

The mean acceleration of 8-cylinder cars made in 1976 is significantly larger than the mean acceleration of 8-cylinder cars made in 1970.