Group By in MATLAB®

How to use Group By in MATLAB® with Plotly.


Dataset Array Summary Statistics Organized by Group

Load the sample data.

load('hospital');

The dataset array hospital has 100 observations and 7 variables.

Create a dataset array with only the variables Sex, Age, Weight, and Smoker.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

Sex is a nominal array, with levels Male and Female. The variables Age and Weight have numeric values, and Smoker has logical values.

Compute the means of the numeric and logical variables (Age, Weight, and Smoker), grouped by the levels in Sex.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

statarray = grpstats(dsa,'Sex')
statarray = 

              Sex       GroupCount    mean_Age    mean_Weight    mean_Smoker
    Female    Female    53            37.717      130.47         0.24528    
    Male      Male      47            38.915      180.53         0.44681

statarray is a dataset array with two rows, corresponding to the levels in Sex. GroupCount is the number of observations in each group. The means of Age, Weight, and Smoker, grouped by Sex, are given in mean_Age, mean_Weight, and mean_Smoker.
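To make the grouping step concrete, the following sketch reproduces the group count and group mean for a single numeric variable using the base MATLAB functions findgroups and splitapply. The data values are made up for illustration and are not taken from hospital, and this is only one way to obtain the same kind of summary, not a description of how grpstats works internally.

```matlab
% Illustrative data (hypothetical values, not from the hospital data set)
sex    = categorical({'Female';'Male';'Female';'Male';'Female'});
weight = [130; 180; 120; 190; 140];

% Assign each observation a group index, one index per level of sex
[g, level] = findgroups(sex);

% Count the observations and average the weights within each group
groupCount = splitapply(@numel, weight, g);
meanWeight = splitapply(@mean,  weight, g);

% Collect the results in a table with one row per group
summary = table(level, groupCount, meanWeight)
```

Each row of summary plays the same role as a row of statarray: one group, its size, and its mean.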

Compute the mean for Age and Weight, grouped by the values in Smoker.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

statarray = grpstats(dsa,'Smoker','mean','DataVars',{'Age','Weight'})
statarray = 

         Smoker    GroupCount    mean_Age    mean_Weight
    0    false     66             37.97      149.91     
    1    true      34            38.882      161.94

In this case, not all variables in dsa (excluding the grouping variable, Smoker) are numeric or logical arrays; the variable Sex is a nominal array. When not all variables in the input dataset array are numeric or logical arrays, you must specify the variables for which you want to calculate summary statistics using DataVars.
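If you do not want to type the list of data variables by hand, it can be built programmatically. The sketch below assumes the data is held in a table rather than a dataset array, and uses made-up values. The names T, groupVar, and dataVars are chosen here for illustration.

```matlab
% Illustrative table (hypothetical values). Sex is categorical, like the
% nominal variable in the text, so it cannot be averaged.
T = table(categorical({'Female';'Male';'Female';'Male'}), ...
          [38; 43; 29; 51], [131; 176; 119; 188], ...
          [false; true; false; true], ...
          'VariableNames', {'Sex','Age','Weight','Smoker'});

groupVar = 'Smoker';
names    = T.Properties.VariableNames;

% Flag the variables that hold numeric or logical values
isNumOrLogical = cellfun(@(n) isnumeric(T.(n)) || islogical(T.(n)), names);

% Keep those variables, excluding the grouping variable itself
dataVars = names(isNumOrLogical & ~strcmp(names, groupVar))
```

The resulting cell array could then be passed as the 'DataVars' value, for example grpstats(T,groupVar,'mean','DataVars',dataVars), assuming a release in which grpstats accepts tables.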

Compute the minimum and maximum weight, grouped by the combinations of values in Sex and Smoker.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
                     'DataVars','Weight')
statarray = 

                Sex       Smoker    GroupCount    min_Weight    max_Weight
    Female_0    Female    false     40            111           147       
    Female_1    Female    true      13            115           146       
    Male_0      Male      false     26            158           194       
    Male_1      Male      true      21            164           202

There are two unique values in Smoker and two levels in Sex, for a total of four possible combinations of values: Female Nonsmoker (Female_0), Female Smoker (Female_1), Male Nonsmoker (Male_0), and Male Smoker (Male_1).
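The way two grouping variables combine into groups, and how labels such as Female_0 can be derived from them, is sketched below with base MATLAB functions and made-up data. The label construction is an illustration of the naming pattern seen in the output, not the code grpstats itself uses.

```matlab
% Illustrative data (hypothetical values)
sex    = categorical({'Female';'Male';'Female';'Male';'Female'});
smoker = [false; true; true; true; false];
weight = [120; 180; 115; 200; 130];

% One group index per observed (sex, smoker) combination
[g, sexLevel, smokerLevel] = findgroups(sex, smoker);

% Minimum and maximum weight within each combination
minWeight = splitapply(@min, weight, g);
maxWeight = splitapply(@max, weight, g);

% Build labels in the style of the row names, for example Female_0
labels = strcat(cellstr(sexLevel), '_', ...
                cellstr(num2str(double(smokerLevel))))
```

In this small sample there are no male nonsmokers, so only three of the four possible combinations produce a group.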

Specify the names for the columns in the output.

load('hospital');

dsa = hospital(:,{'Sex','Age','Weight','Smoker'});

statarray = grpstats(dsa,{'Sex','Smoker'},{'min','max'},...
          'DataVars','Weight','VarNames',{'Gender','Smoker',...
                    'GroupCount','LowestWeight','HighestWeight'})
statarray = 

                Gender    Smoker    GroupCount    LowestWeight    HighestWeight
    Female_0    Female    false     40            111             147          
    Female_1    Female    true      13            115             146          
    Male_0      Male      false     26            158             194          
    Male_1      Male      true      21            164             202

Summary Statistics for a Dataset Array Without Grouping

Load the sample data.

load('hospital');

The dataset array hospital has 100 observations and 7 variables.

Create a dataset array with only the variables Age, Weight, and Smoker.

load('hospital');

dsa = hospital(:,{'Age','Weight','Smoker'});

The variables Age and Weight have numeric values, and Smoker has logical values.

Compute the mean, minimum, and maximum of the numeric and logical variables (Age, Weight, and Smoker), with no grouping.

load('hospital');

dsa = hospital(:,{'Age','Weight','Smoker'});

statarray = grpstats(dsa,[],{'mean','min','max'})
statarray = 

           GroupCount    mean_Age    min_Age    max_Age    mean_Weight
    All    100           38.28       25         50         154        


           min_Weight    max_Weight    mean_Smoker    min_Smoker    max_Smoker
    All    111           202           0.34           false         true

The observation name All indicates that all observations in dsa were used to compute the summary statistics.
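The statistics for the logical variable Smoker are easier to read once you recall that MATLAB treats logical values as 0 and 1 when averaging. The mean_Smoker value of 0.34 is therefore the proportion of smokers, which agrees with the 34 smokers counted among the 100 observations earlier. A small sketch with made-up values:

```matlab
% Illustrative logical data (hypothetical values): 2 smokers out of 5
smoker = [true; false; false; true; false];

% The mean of a logical vector is the fraction of true values
propSmokers = mean(smoker)

% The minimum is false when at least one nonsmoker is present, and the
% maximum is true when at least one smoker is present
minSmoker = min(smoker)
maxSmoker = max(smoker)
```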

Group Means for a Matrix Using One or More Grouping Variables

Load the sample data.

load('carsmall')

All variables are measured for 100 cars. Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA). Cylinders has three unique values, 4, 6, and 8, indicating the number of cylinders in each car.

Calculate the mean acceleration, grouped by country of origin.

load('carsmall')

means = grpstats(Acceleration,Origin)
means =

   14.4377
   18.0500
   15.8867
   16.3778
   16.6000
   15.5000

means is a 6-by-1 vector of mean accelerations, one per country of origin. The groups are listed in the order in which they first appear in Origin.

Calculate the mean acceleration, grouped by both country of origin and number of cylinders.

load('carsmall')

means = grpstats(Acceleration,{Origin,Cylinders})
means =

   17.0818
   16.5267
   11.6406
   18.0500
   15.9143
   15.5000
   16.3375
   16.7000
   16.6000
   15.5000

There are 18 possible combinations of grouping variable values because Origin has 6 unique values and Cylinders has 3 unique values. Only 10 of the possible combinations appear in the data, so means is a 10-by-1 vector of group means corresponding to the observed combinations of values.
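The difference between possible and observed combinations can be counted directly. The following sketch uses a small made-up sample rather than carsmall, and the variable names are chosen for illustration.

```matlab
% Illustrative data (hypothetical values)
origin    = {'USA'; 'USA'; 'Japan'; 'Germany'; 'Japan'; 'USA'};
cylinders = [4; 8; 4; 4; 4; 6];

% Possible combinations: product of the numbers of unique values
[originNames, ~, originIdx] = unique(origin);
nPossible = numel(originNames) * numel(unique(cylinders));

% Observed combinations: unique (origin, cylinders) pairs in the data
observed  = unique([originIdx, cylinders], 'rows');
nObserved = size(observed, 1);

fprintf('%d possible, %d observed\n', nPossible, nObserved);
```

Here three origins and three cylinder counts give nine possible combinations, of which five occur in the sample.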

Return the group names along with the mean acceleration for each group.

load('carsmall')

[means,grps] = grpstats(Acceleration,{Origin,Cylinders},{'mean','gname'})
means =

   17.0818
   16.5267
   11.6406
   18.0500
   15.9143
   15.5000
   16.3375
   16.7000
   16.6000
   15.5000


grps =

  10x2 cell array

    {'USA'    }    {'4'}
    {'USA'    }    {'6'}
    {'USA'    }    {'8'}
    {'France' }    {'4'}
    {'Japan'  }    {'4'}
    {'Japan'  }    {'6'}
    {'Germany'}    {'4'}
    {'Germany'}    {'6'}
    {'Sweden' }    {'4'}
    {'Italy'  }    {'4'}

The output grps shows the 10 observed combinations of grouping variable values. For example, the mean acceleration of 4-cylinder cars made in France is 18.05.
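Because means and grps are returned as separate outputs, it can be convenient to place them side by side. The sketch below collects them in a table and looks up a single group. The table and its column names are illustrative choices, not part of the grpstats output.

```matlab
load('carsmall')

[means,grps] = grpstats(Acceleration,{Origin,Cylinders},{'mean','gname'});

% Put each group's labels next to its mean; the cylinder labels are
% returned as text, so convert them to numbers
summary = table(grps(:,1), str2double(grps(:,2)), means, ...
    'VariableNames', {'Origin','Cylinders','MeanAcceleration'});

% Look up one group, for example 4-cylinder cars made in France
isFrance4 = strcmp(strtrim(summary.Origin),'France') & summary.Cylinders == 4;
summary.MeanAcceleration(isFrance4)
```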

Multiple Summary Statistics for a Matrix Organized by Group

Load the sample data.

load carsmall

The variable Acceleration was measured for 100 cars. The variable Origin is the country of origin for each car (France, Germany, Italy, Japan, Sweden, or USA).

Return the minimum and maximum acceleration grouped by country of origin.

load carsmall

[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'})
grpMin =

    8.0000
   15.3000
   13.9000
   12.2000
   15.7000
   15.5000


grpMax =

   22.2000
   21.9000
   18.2000
   24.6000
   17.5000
   15.5000


grp =

  6x1 cell array

    {'USA'    }
    {'France' }
    {'Japan'  }
    {'Germany'}
    {'Sweden' }
    {'Italy'  }

The sample car with the lowest acceleration is made in the USA, and the sample car with the highest acceleration is made in Germany.
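Rather than reading the extremes off the printed output, you can locate them programmatically. This is a minimal sketch that reuses the outputs from the call above. The names iLow and iHigh are chosen for illustration.

```matlab
load carsmall

[grpMin,grpMax,grp] = grpstats(Acceleration,Origin,{'min','max','gname'});

% Index of the group with the smallest minimum and the largest maximum
[lowest,  iLow]  = min(grpMin);
[highest, iHigh] = max(grpMax);

fprintf('Lowest acceleration:  %.1f (%s)\n', lowest,  grp{iLow});
fprintf('Highest acceleration: %.1f (%s)\n', highest, grp{iHigh});
```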

Plot Prediction Intervals for a New Observation in Each Group

Load the sample data.

load('carsmall')

The variable Weight was measured for 100 cars. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.

Calculate the mean weight and 90% prediction intervals for each model year.

load('carsmall')

[means,pred,grp] = grpstats(Weight,Model_Year,...
                      {'mean','predci','gname'},'Alpha',0.1);

Plot error bars showing the mean weight and 90% prediction intervals, grouped by model year. Label the horizontal axis with the group names.

load('carsmall')

[means,pred,grp] = grpstats(Weight,Model_Year,...
                      {'mean','predci','gname'},'Alpha',0.1);

ngrps = length(grp); % Number of groups
errorbar((1:ngrps)',means,pred(:,2)-means)
xlim([0.5 3.5])
set(gca,'xtick',1:ngrps,'xticklabel',grp)
title('90% Prediction Intervals for Weight by Year')

fig2plotly(gcf);
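The errorbar call uses pred(:,2)-means as the bar length in both directions, which works because a t-based prediction interval is symmetric about the group mean. The sketch below computes such an interval by hand for a single made-up sample. It assumes the standard formula for a prediction interval for one new observation from a normal population. The source does not confirm that grpstats uses exactly this computation, so treat it as an illustration of the idea.

```matlab
% Illustrative sample (hypothetical values)
y     = [2; 4; 4; 4; 5; 5; 7; 9];
alpha = 0.1;                      % 90% interval, as in the text

n = numel(y);
m = mean(y);
s = std(y);                       % sample standard deviation

% Half-width of the t-based prediction interval for one new observation
tcrit     = tinv(1 - alpha/2, n - 1);
halfWidth = tcrit * s * sqrt(1 + 1/n);

predInterval = [m - halfWidth, m + halfWidth]
```

The factor sqrt(1 + 1/n) is what makes a prediction interval wider than the confidence interval for the mean, which would use sqrt(1/n) instead.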

Plot Group Means and Confidence Intervals

Load the sample data.

load('carsmall')

The variables Acceleration and Weight are the acceleration and weight values measured for 100 cars. The variable Cylinders is the number of cylinders in each car. The variable Model_Year has three unique values, 70, 76, and 82, which correspond to model years 1970, 1976, and 1982.

Plot mean acceleration, grouped by Cylinders, with 95% confidence intervals.

load('carsmall')

grpstats(Acceleration,Cylinders,0.05);

fig2plotly(gcf);

The mean acceleration for cars with 8 cylinders is significantly lower than for cars with 4 or 6 cylinders.
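The comparison read from the plot can also be inspected numerically by requesting the group means and their confidence limits as outputs. This is a sketch using the 'meanci' statistic. The table and its column names are illustrative. Comparing intervals for overlap is an informal check, not a formal hypothesis test.

```matlab
load('carsmall')

% Group means and 95% confidence intervals for the mean, by Cylinders
[m,ci,g] = grpstats(Acceleration,Cylinders,{'mean','meanci','gname'},...
                    'Alpha',0.05);

% First column of ci is the lower limit, second column the upper limit
result = table(g, m, ci(:,1), ci(:,2), ...
    'VariableNames', {'Cylinders','Mean','Lower','Upper'})
```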

Plot mean acceleration and weight, grouped by Cylinders, with 95% confidence intervals. Scale the Weight values by 1000 so the means of Weight and Acceleration are the same order of magnitude.

load('carsmall')

grpstats([Acceleration,Weight/1000],Cylinders,0.05);

fig2plotly(gcf);

The average weight of cars increases with the number of cylinders, and the average acceleration decreases with the number of cylinders.

Plot mean acceleration, grouped by both Cylinders and Model_Year. Specify 95% confidence intervals.

load('carsmall')

grpstats(Acceleration,{Cylinders,Model_Year},0.05);

fig2plotly(gcf);

There are nine possible combinations of grouping variable values because there are three unique values in Cylinders and three unique values in Model_Year. The plot does not show 8-cylinder cars with model year 1982 because the data did not include this combination.
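One way to see which combinations are missing from the data is to cross-tabulate the two grouping variables. The sketch below uses crosstab. A zero in the count matrix marks a combination that does not occur and therefore has no point in the plot.

```matlab
load('carsmall')

% Rows correspond to the values of Cylinders, columns to Model_Year;
% labels holds the row labels in column 1 and the column labels in column 2
[counts,~,~,labels] = crosstab(Cylinders,Model_Year)

nPossible = numel(counts);   % all combinations of the unique values
nObserved = nnz(counts);     % combinations that occur at least once
```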

The mean acceleration of 8-cylinder cars made in 1976 is significantly larger than the mean acceleration of 8-cylinder cars made in 1970.