# Group By in MATLAB®

How to use Group By in MATLAB® with Plotly.

## Dataset Array Summary Statistics Organized by Group

Load the sample data.

```
load('hospital');
```

The dataset array `hospital` has 100 observations and 7 variables.

Create a dataset array with only the variables `Sex`, `Age`, `Weight`, and `Smoker`.

```
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
```

`Sex` is a nominal array, with levels `Male` and `Female`. The variables `Age` and `Weight` have numeric values, and `Smoker` has logical values.

Compute the mean for the numeric and logical arrays, `Age`, `Weight`, and `Smoker`, grouped by the levels in `Sex`.

```
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,'Sex')
```

    statarray =

                  Sex       GroupCount    mean_Age    mean_Weight    mean_Smoker
        Female    Female    53            37.717      130.47         0.24528
        Male      Male      47            38.915      180.53         0.44681

`statarray` is a dataset array with two rows, corresponding to the levels in `Sex`. `GroupCount` is the number of observations in each group. The means of `Age`, `Weight`, and `Smoker`, grouped by `Sex`, are given in `mean_Age`, `mean_Weight`, and `mean_Smoker`.
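As a quick sanity check on what `grpstats` computes, one group's count and mean can be reproduced by hand with logical indexing. This is a minimal sketch; the variable names `isFemale`, `femaleCount`, and `femaleMeanAge` are illustrative and not part of the original example.

```
load('hospital');
% Logical mask selecting the observations in the Female group
isFemale = hospital.Sex == 'Female';
% Number of observations and mean age within that group
femaleCount = sum(isFemale)
femaleMeanAge = mean(hospital.Age(isFemale))
```

These values should agree with the `GroupCount` and `mean_Age` entries in the `Female` row of `statarray`.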

Compute the mean for `Age` and `Weight`, grouped by the values in `Smoker`.

```
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,'Smoker','mean','DataVars',{'Age','Weight'})
```

    statarray =

             Smoker    GroupCount    mean_Age    mean_Weight
        0    false     66            37.97       149.91
        1    true      34            38.882      161.94

In this case, not all variables in `dsa` (excluding the grouping variable, `Smoker`) are numeric or logical arrays; the variable `Sex` is a nominal array. When not all variables in the input dataset array are numeric or logical arrays, you must specify the variables for which you want to calculate summary statistics using `DataVars`.

Compute the minimum and maximum weight, grouped by the combinations of values in `Sex` and `Smoker`.

```
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight')
```

    statarray =

                    Sex       Smoker    GroupCount    min_Weight    max_Weight
        Female_0    Female    false     40            111           147
        Female_1    Female    true      13            115           146
        Male_0      Male      false     26            158           194
        Male_1      Male      true      21            164           202

There are two unique values in `Smoker` and two levels in `Sex`, for a total of four possible combinations of values: Female Nonsmoker (`Female_0`), Female Smoker (`Female_1`), Male Nonsmoker (`Male_0`), and Male Smoker (`Male_1`).

Specify the names for the columns in the output.

```
load('hospital');
dsa = hospital(:,{'Sex','Age','Weight','Smoker'});
statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
'DataVars','Weight','VarNames',{'Gender','Smoker',...
'GroupCount','LowestWeight','HighestWeight'})
```

    statarray =

                    Gender    Smoker    GroupCount    LowestWeight    HighestWeight
        Female_0    Female    false     40            111             147
        Female_1    Female    true      13            115             146
        Male_0      Male      false     26            158             194
        Male_1      Male      true      21            164             202

## Summary Statistics for a Dataset Array Without Grouping

Load the sample data.

```
load('hospital');
```

The dataset array `hospital` has 100 observations and 7 variables.

Create a dataset array with only the variables `Age`, `Weight`, and `Smoker`.

```
load('hospital');
dsa = hospital(:,{'Age','Weight','Smoker'});
```

The variables `Age` and `Weight` have numeric values, and `Smoker` has logical values.

Compute the mean, minimum, and maximum for the numeric and logical arrays, `Age`, `Weight`, and `Smoker`, with no grouping.

```
load('hospital');
dsa = hospital(:,{'Age','Weight','Smoker'});
statarray = grpstats(dsa,[],{'mean','min','max'})
```

    statarray =

               GroupCount    mean_Age    min_Age    max_Age    mean_Weight
        All    100           38.28       25         50         154

               min_Weight    max_Weight    mean_Smoker    min_Smoker    max_Smoker
        All    111           202           0.34           false         true

The observation name `All` indicates that all observations in `dsa` were used to compute the summary statistics.

## Group Means for a Matrix Using One or More Grouping Variables

Load the sample data.

```
load('carsmall')
```

All variables are measured for 100 cars. `Origin` is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA). `Cylinders` has three unique values, `4`, `6`, and `8`, indicating the number of cylinders in each car.

Calculate the mean acceleration, grouped by country of origin.

```
load('carsmall')
means = grpstats(Acceleration,Origin)
```

    means =

       14.4377
       18.0500
       15.8867
       16.3778
       16.6000
       15.5000

`means` is a 6-by-1 vector of mean accelerations, where each value corresponds to a country of origin.

Calculate the mean acceleration, grouped by both country of origin and number of cylinders.

```
load('carsmall')
means = grpstats(Acceleration,{Origin,Cylinders})
```

    means =

       17.0818
       16.5267
       11.6406
       18.0500
       15.9143
       15.5000
       16.3375
       16.7000
       16.6000
       15.5000

There are 18 possible combinations of grouping variable values because `Origin` has 6 unique values and `Cylinders` has 3 unique values. Only 10 of the possible combinations appear in the data, so `means` is a 10-by-1 vector of group means corresponding to the observed combinations of values.
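The difference between possible and observed combinations can be illustrated on a small made-up data set. The sketch below uses hypothetical values that are not taken from `carsmall`; the names `origin`, `cyl`, `accel`, `nPossible`, and `nObserved` are placeholders chosen for this illustration.

```
% Hypothetical toy data: 2 origins x 2 cylinder counts = 4 possible combinations
origin = {'USA';'USA';'Japan';'Japan';'USA'};
cyl    = [4; 8; 4; 4; 8];
accel  = [16; 11; 15; 17; 12];

% grpstats creates a group only for combinations that occur in the data
[means,grps] = grpstats(accel,{origin,cyl},{'mean','gname'});

nPossible = numel(unique(origin)) * numel(unique(cyl));
nObserved = numel(means);
fprintf('%d possible, %d observed\n',nPossible,nObserved)
```

In this toy data the combination (Japan, 8) never occurs, so `means` has three entries rather than four.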

Return the group names along with the mean acceleration for each group.

```
load('carsmall')
[means,grps] = grpstats(Acceleration,{Origin,Cylinders},{'mean','gname'})
```

    means =

       17.0818
       16.5267
       11.6406
       18.0500
       15.9143
       15.5000
       16.3375
       16.7000
       16.6000
       15.5000

    grps =

      10x2 cell array

        {'USA'    }    {'4'}
        {'USA'    }    {'6'}
        {'USA'    }    {'8'}
        {'France' }    {'4'}
        {'Japan'  }    {'4'}
        {'Japan'  }    {'6'}
        {'Germany'}    {'4'}
        {'Germany'}    {'6'}
        {'Sweden' }    {'4'}
        {'Italy'  }    {'4'}

The output `grps` shows the 10 observed combinations of grouping variable values. For example, the mean acceleration of 4-cylinder cars made in France is 18.05.

## Multiple Summary Statistics for a Matrix Organized by Group

Load the sample data.

```
load carsmall
```

The variable `Acceleration` was measured for 100 cars. The variable `Origin` is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA).

Return the minimum and maximum acceleration grouped by country of origin.

```
load carsmall
[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'})
```

    grpMin =

        8.0000
       15.3000
       13.9000
       12.2000
       15.7000
       15.5000

    grpMax =

       22.2000
       21.9000
       18.2000
       24.6000
       17.5000
       15.5000

    grp =

      6x1 cell array

        {'USA'    }
        {'France' }
        {'Japan'  }
        {'Germany'}
        {'Sweden' }
        {'Italy'  }

The sample car with the lowest acceleration is made in the USA, and the sample car with the highest acceleration is made in Germany.
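The same conclusion can be read off programmatically rather than by eye. This is a minimal sketch that reuses the `grpstats` call above; the names `iMin`, `iMax`, `lowestOrigin`, and `highestOrigin` are illustrative.

```
load carsmall
[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'});
% Index of the group holding the overall minimum and the overall maximum
[~,iMin] = min(grpMin);
[~,iMax] = max(grpMax);
% Look up the corresponding group names
lowestOrigin  = grp{iMin}
highestOrigin = grp{iMax}
```

Because `grp` lists the group names in the same order as `grpMin` and `grpMax`, indexing it with `iMin` and `iMax` recovers the countries named in the sentence above.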

## Plot Prediction Intervals for a New Observation in Each Group

Load the sample data.

```
load('carsmall')
```

The variable `Weight` was measured for 100 cars. The variable `Model_Year` has three unique values, `70`, `76`, and `82`, which correspond to model years 1970, 1976, and 1982.

Calculate the mean weight and 90% prediction intervals for each model year.

```
load('carsmall')
[means,pred,grp] = grpstats(Weight,Model_Year,...
{'mean','predci','gname'},'Alpha',0.1);
```

Plot error bars showing the mean weight and 90% prediction intervals, grouped by model year. Label the horizontal axis with the group names.

```
load('carsmall')
[means,pred,grp] = grpstats(Weight,Model_Year,...
{'mean','predci','gname'},'Alpha',0.1);
ngrps = length(grp); % Number of groups
errorbar((1:ngrps)',means,pred(:,2)-means)
xlim([0.5 3.5])
set(gca,'xtick',1:ngrps,'xticklabel',grp)
title('90% Prediction Intervals for Weight by Year')
fig2plotly(gcf);
```
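The `errorbar` call above passes only the upper half-width, `pred(:,2)-means`, which relies on the intervals being symmetric about the mean. If that assumption is in doubt, the lower and upper half-widths can be passed separately. The following is a sketch using the four-argument form `errorbar(x,y,neg,pos)`; the names `lowerErr` and `upperErr` are illustrative.

```
load('carsmall')
[means,pred,grp] = grpstats(Weight,Model_Year,...
    {'mean','predci','gname'},'Alpha',0.1);
ngrps = length(grp);            % Number of groups
lowerErr = means - pred(:,1);   % Distance from the mean down to the lower bound
upperErr = pred(:,2) - means;   % Distance from the mean up to the upper bound
errorbar((1:ngrps)',means,lowerErr,upperErr)
xlim([0.5 ngrps+0.5])
set(gca,'xtick',1:ngrps,'xticklabel',grp)
```

Deriving the axis limits from `ngrps` instead of hard-coding `[0.5 3.5]` keeps the plot correct if the number of groups changes.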

## Plot Group Means and Confidence Intervals

Load the sample data.

```
load('carsmall')
```

The variables `Acceleration` and `Weight` are the acceleration and weight values measured for 100 cars. The variable `Cylinders` is the number of cylinders in each car. The variable `Model_Year` has three unique values, `70`, `76`, and `82`, which correspond to model years 1970, 1976, and 1982.

Plot mean acceleration, grouped by `Cylinders`, with 95% confidence intervals.

```
load('carsmall')
grpstats(Acceleration,Cylinders,0.05);
fig2plotly(gcf);
```

The mean acceleration for cars with 8 cylinders is significantly lower than for cars with 4 or 6 cylinders.

Plot mean acceleration and weight, grouped by `Cylinders`, with 95% confidence intervals. Divide the `Weight` values by 1000 so the means of `Weight` and `Acceleration` are the same order of magnitude.

```
load('carsmall')
grpstats([Acceleration,Weight/1000],Cylinders,0.05);
fig2plotly(gcf);
```

The average weight of cars increases with the number of cylinders, and the average acceleration decreases with the number of cylinders.

Plot mean acceleration, grouped by both `Cylinders` and `Model_Year`. Specify 95% confidence intervals.

```
load('carsmall')
grpstats(Acceleration,{Cylinders,Model_Year},0.05);
fig2plotly(gcf);
```

There are nine possible combinations of grouping variable values because there are three unique values in `Cylinders` and three unique values in `Model_Year`. The plot does not show 8-cylinder cars with model year 1982 because the data did not include this combination.

The mean acceleration of 8-cylinder cars made in 1976 is significantly larger than the mean acceleration of 8-cylinder cars made in 1970.