Egyptian skulls

This example shows how discriminant analysis (DA) can be used to analyse whether there are any differences between male Egyptian skulls of different time periods from the area of Thebes. Various measurements were made on male Egyptian skulls from five different time periods, namely:

- 4000 BC
- 3300 BC
- 1850 BC
- 200 BC
- 150 AD

Thirty skulls are available from each time period. For each skull, the following four variables were measured:

- Maximal Breadth of Skull (MB)
- Basibregmatic Height of Skull (BH)
- Basialveolar Length of Skull (BL)
- Nasal Height of Skull (NH)

The format of the data is illustrated in Table 1. The underlying question is whether there is a change in skull size over time. If this is the case, a possible explanation could be interbreeding of the Egyptians with immigrant populations over the years. The question can also be formulated in a more general context: can we discriminate between the samples of different groups (periods)? The most appropriate multivariate technique to answer this question is, without doubt, discriminant analysis.

Table 1: Male Egyptian skull data.

Period     MB    BH    BL    NH
4000 BC    131   138   89    49
4000 BC    125   131   92    48
4000 BC    131   132   99    50
...        ...   ...   ...   ...
4000 BC    127   129   106   48
4000 BC    131   136   114   54
4000 BC    124   138   101   46
3300 BC    124   138   101   48
3300 BC    133   134   97    48
3300 BC    138   134   98    45
...        ...   ...   ...   ...
3300 BC    130   132   93    52
3300 BC    135   132   98    54
3300 BC    130   128   101   51
1850 BC    137   141   96    52
1850 BC    129   133   93    47
1850 BC    132   138   87    48
...        ...   ...   ...   ...
1850 BC    133   131   96    49
1850 BC    138   133   100   55
1850 BC    138   133   91    46
200 BC     137   134   107   54
200 BC     141   128   95    53
200 BC     141   130   87    49
...        ...   ...   ...   ...
200 BC     136   138   94    55
200 BC     132   136   92    52
200 BC     135   130   100   51
150 AD     137   123   91    50
150 AD     136   131   95    49
150 AD     128   126   91    57
...        ...   ...   ...   ...
150 AD     140   135   103   48
150 AD     147   129   87    48
150 AD     136   133   97    51

The data matrix is of dimension 150x5. The 150 samples (rows) can be divided into 5 groups of 30 (indicated by a change in color). The first column is the group identifier. To apply discriminant analysis to the Egyptian skull data yourself, download the Excel file skulls.xls, and copy and paste the data into Brodgar. Please note that the response variables are in columns, and the first row and column contain labels. It may be slightly confusing that the four variables MB, BH, BL and NH are called response variables in this text. From a statistical point of view, the column labelled Period is the response variable, whereas MB, BH, BL and NH are explanatory variables. However, in Brodgar MB, BH, BL and NH need to be imported as response variables (which is done by default, see the window Info Y from the Import data button) and the row labels (Period) are used as group identifier.

 

Running DA in Brodgar

To apply discriminant analysis in Brodgar, select the "Dimension reduction techniques" window via the "Data exploration" button. Choose "Discriminant analysis" and click on the "Go" button. The user now has the option to deselect samples (which will not be done in this example) and identify the groups. Click on the "Go" button for identifying groups. Groups are defined by marking the last sample of each group as an endpoint. Brodgar places all subsequent samples in the same group until it encounters a sample labelled as ENDPOINT. This means that:

- Samples of the same group need to be placed together in the spreadsheet (see also Table 1).
- The column containing the group identifier should not be used in the analysis as a response variable.
- The column containing the group identifier should be the first column in the spreadsheet, and it should be imported as labels.
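The endpoint mechanism can be illustrated with a short Python sketch. This is not Brodgar's code; the function name `assign_groups` and the convention that endpoints are given as 1-based row positions (with the last sample always being an endpoint) are assumptions made for illustration only.

```python
def assign_groups(n_samples, endpoints):
    """Return a 1-based group number for each of n_samples rows.

    endpoints: 1-based row positions of the LAST sample of each group,
    in increasing order (hypothetical convention, not Brodgar's API).
    """
    endpoints = list(endpoints)
    if not endpoints or endpoints[0] < 1:
        raise ValueError("at least one positive endpoint is required")
    if endpoints != sorted(set(endpoints)):
        raise ValueError("endpoints must be strictly increasing")
    if endpoints[-1] != n_samples:
        raise ValueError("the last sample must be an endpoint")

    groups = []
    start = 0  # number of rows already assigned to a group
    for group_number, end in enumerate(endpoints, start=1):
        # Rows start+1 .. end (1-based) all belong to this group.
        groups.extend([group_number] * (end - start))
        start = end
    return groups


# Skull data: 150 rows, with the last sample of each period at rows 30, 60, ...
groups = assign_groups(150, [30, 60, 90, 120, 150])
print(groups[0], groups[29], groups[30], groups[149])
```

Because group membership is inferred purely from row position, the sketch only works if samples of the same group are contiguous, which is the first requirement in the list above.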

Once all endpoints have been selected, click the Finish button. A text file will appear, giving the following information:

Discriminant analysis can only be applied if:
-At least 2 groups of samples have been selected.
-The number of samples in the smallest group is larger than the number of response variables.
Brodgar will check this now.

Number of selected groups: 5
Total number of groups is larger or equal than 2: OK.

The number of selected response variables is: 4

The groups consist of the following number of samples:
Group 1: 30 samples 
Group 2: 30 samples 
Group 3: 30 samples 
Group 4: 30 samples 
Group 5: 30 samples 
The number of samples in each group is larger than 4: OK.

DA CAN BE APPLIED. CLOSE THIS WINDOW AND CLICK ON GO-BUTTON OF STEP 3.
----------------------------------------------------------------------

The groups consist of the following samples:
Group 1, (30 samples): 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
4000 BC 
Group 2, (30 samples): 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
3300 BC 
Group 3, (30 samples): 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
1850 BC 
Group 4, (30 samples): 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
200 BC 
Group 5, (30 samples): 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
150 AD 
Please note that samples containing missing values will be deleted.

If everything is correct, the "Go" button for applying DA is enabled. Clicking it starts the DA calculations and a graphical window containing the results will appear automatically.
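The two conditions that Brodgar reports in the text file above can be written down as a small check. The following Python sketch is an illustration under the assumption that group membership is available as one identifier per sample; the function name `check_da_preconditions` is hypothetical and not part of Brodgar.

```python
from collections import Counter


def check_da_preconditions(group_ids, n_variables):
    """Return a list of problems; an empty list means DA can be applied.

    group_ids: one group identifier per sample.
    n_variables: number of selected response variables.
    """
    sizes = Counter(group_ids)
    problems = []
    # Condition 1: at least 2 groups of samples.
    if len(sizes) < 2:
        problems.append("fewer than 2 groups selected")
    # Condition 2: the smallest group must be larger than the number of variables.
    if sizes and min(sizes.values()) <= n_variables:
        problems.append("smallest group is not larger than the number of variables")
    return problems


# Skull data: 5 groups of 30 samples and 4 response variables.
group_ids = [g for g in range(1, 6) for _ in range(30)]
print(check_da_preconditions(group_ids, 4))
```

For the skull data the list of problems is empty, in line with the "DA CAN BE APPLIED" message.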

 

Results for the Egyptian skull data

Figure 1 shows the graphical output produced by Brodgar for the Egyptian skull data. The window labelled Sample scores shows the discriminant scores of the first 2 axes. The label "1" corresponds to a sample of group 1 (4000 BC). The triangles represent group averages. In Graph settings, colors, fonts and font sizes of labels and variables can be changed.

Figure 1: Sample scores of the first 2 axes. Each sample is represented by its group number. The five groups are 4000 BC (1), 3300 BC (2), 1850 BC (3), 200 BC (4) and 150 AD (5). Triangles represent group averages.

Figure 2 shows the same information, but now triangles are replaced by group numbers and the circles refer to areas containing 90% of the samples. If the circles or labels do not fit in the graph, go to the discriminant analysis options, multiply the maximum (or minimum) value by a larger factor, and redo the analysis.

Figure 2: Sample scores of the first 2 axes. Numbers represent group averages and the circles the area containing 90% of the samples.

Note that the group averages of periods 1 and 2 (4000 and 3300 BC) are on the right-hand side and the group averages of periods 4 and 5 (200 BC and 150 AD) are on the left-hand side. This might indicate that skulls from 4000 and 3300 BC differ from those of 200 BC and 150 AD (in terms of the four measurements). Two questions now arise:

- Are the differences between the groups significant?
- And if so, which of the 4 variables caused these differences?

We start with the first question. Clicking the button labelled "Numerical information" gives the following information:

Output for discriminant analysis

The number of variables is 4
The number of observations is 150
The number of groups is 5
The number of discriminant functions is 4
The number of rows (observations) containing missing values is 0
Please note that these rows were omitted.

Observations per group
1 30
2 30
3 30
4 30
5 30

How many dimensions?
Eigenvalues (=lambda)
axis 	lambda lambda as % lambda cumulative %
1 	0.425 	88.227 		88.227
2 	0.039 	8.094 		96.321
3 	0.016 	3.259 		99.581
4 	0.002 	0.419 		100.000

Dimensionality tests for group separation
See p. 211-212 in Huberty (1994)
H0: No separation on any dimension
B0= 59.259
B0 is a chi-squared statistic with degrees of freedom: 16.000

H0: Separation on at most one dimension
B1= 8.072
B1 is a chi-squared statistic with degrees of freedom: 9.000

H0: Separation on at most two dimension
B2= 2.543
B2 is a chi-squared statistic with degrees of freedom: 4.000

Critical values for the Chi-square distribution are available
from the menu in the DA window.
......

The first two eigenvalues are 0.43 and 0.04 respectively. This indicates that the discrimination in the x-direction is far more important than the discrimination in the y-direction: the first axis accounts for 88% of the sum of the eigenvalues. We are particularly interested in the dimensionality tests. These can be used as a rough guide to judge whether the separation between the groups is significant. In the first test, the null hypothesis is that there is no separation between the groups on any dimension (axis). The test statistic B0 = 59.259 follows a Chi-square distribution with 16 degrees of freedom. The critical value of the Chi-square distribution for alpha=0.05 and df=16 is 26.296. Because B0 exceeds this value, the null hypothesis is rejected. Hence, there is a significant separation between the five groups on at least one axis. Critical values of the Chi-square (and F) distribution are available in Brodgar (click on Statistical tables from the menu in Figure 2). The null hypothesis in the second test is that there is separation on at most one dimension. The test statistic is B1 = 8.072 with 9 degrees of freedom. The critical value for alpha=0.05 and df=9 is 16.919, hence the null hypothesis is not rejected. Based on the first two eigenvalues, this does not come as a surprise.
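As a cross-check on the dimensionality tests, the p-values can be computed directly instead of looking up critical values. The Python sketch below is an illustration, not Brodgar's implementation. It assumes that the degrees of freedom follow the usual Bartlett form (p - k)(g - 1 - k) for p variables, g groups and k axes already accounted for, which reproduces the 16, 9 and 4 in the output above. The names `chi2_sf` and `bartlett_df` are hypothetical; the tail probability uses the standard closed-form series for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2)
    #         * sum_{r=1}^{(df-1)/2} x^(r - 1/2) / (1*3*...*(2r-1))
    total = 0.0
    term = math.sqrt(x)  # r = 1 term: x^(1/2) / 1
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def bartlett_df(p, g, k):
    """Degrees of freedom for H0: separation on at most k dimensions (assumed form)."""
    return (p - k) * (g - 1 - k)


p, g = 4, 5  # 4 variables, 5 groups
for k, stat in enumerate([59.259, 8.072, 2.543]):  # B0, B1, B2 from the output
    df = bartlett_df(p, g, k)
    print(f"B{k} = {stat}, df = {df}, p-value = {chi2_sf(stat, df):.4f}")
```

Only the first p-value falls below 0.05, so the conclusion is the same as with the critical values: there is significant separation on one axis only.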

We now discuss tools to identify which of the variables MB, BH, BL and NH are the most important for discrimination. Canonical correlations are the correlations between the 2 discriminant functions (axes) and the original four variables, see Figure 3. These are extremely useful for the interpretation of the axes. For example, the direction of the line for the variable BL (Basialveolar Length of Skull) indicates that the separation of groups 1 and 2 was mainly due to this variable. Similarly, the separation of samples from periods 4 and 5 was mainly caused by MB (and, to a lesser extent, NH). The minor discrimination in the y-direction was mainly caused by BH.

Figure 3: Canonical correlations. The x and y coordinates of the head of each line represent the correlation between the variable and the first and second discriminant axes respectively.

The numerical output with respect to selecting and ordering variables is as follows:

....
Selecting and ordering response variables
Each variable is left out in turn
Variable Wilks lambda F to remove R2 naive rank
1 		0.741 	4.118 	0.104 	2
2 		0.695 	1.673 	0.045 	3
3 		0.780 	6.235 	0.149 	1
4 		0.688 	1.280 	0.035 	4
Degrees of freedom for F to remove: 4 and 142

Total sum of Mahalanobis distances between group means
Each variable is left out in turn
Variable Mah. dist. naive rank
1 	8.112 		2
2 	10.431 		3
3 	6.545 		1
4 	10.594 		4
....

To determine the importance of the variables for discriminating between the groups, Brodgar carries out a backward selection. The underlying idea is as follows. The discrimination between all the groups is measured by the total sum of Mahalanobis distances. In the backward selection approach, Brodgar leaves out each variable in turn and the total sum of Mahalanobis distances between group means is calculated again. Leaving out an important variable (with respect to discrimination between groups) results in a larger decrease in the total sum of Mahalanobis distances than leaving out a less important variable. The naive rank indicates which variable resulted in the largest decrease (labelled 1) and which variable was the least important for discriminating between group means (highest rank). The latter variable can then be omitted from the analysis (choose: deselect Y in the data import process) and the discriminant analysis repeated with the remaining variables. This process can be continued until no significant discrimination can be made along the X-th axis. In this case, X=1. Alternatively, scree plots or prior knowledge might help in deciding when to stop.

Brodgar also calculates the so-called F to remove statistic. This is an alternative to the total sum of Mahalanobis distances between groups. We prefer to work with Mahalanobis distances. Both approaches indicated that variable 4 (NH) was the least important for discrimination between the five groups. This does not come as a surprise because of the short length of the corresponding line in Figure 3.

We decided to omit this variable from the analysis and repeated the discriminant analysis using the remaining 3 variables. The hypothesis tests indicated that the discrimination along the first axis was still significant. The numerical output with respect to the Mahalanobis distances was as follows:

Total sum of Mahalanobis distances between group means
Each variable is left out in turn
Variable Mah. dist. naive rank
1 	6.590 		2
2 	9.539 		3
3 	5.607 		1 
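The naive rank can be reproduced from the reported distances. The Python sketch below is an illustration only: it takes the total Mahalanobis distances listed in the two outputs above as given (it does not recompute them), and the function name `naive_rank` is hypothetical. Variables 1 to 4 are mapped to MB, BH, BL and NH, following the order used in the text.

```python
def naive_rank(distance_without):
    """Rank variables by the total distance that remains when each is left out.

    The smaller the remaining distance, the more discrimination was lost,
    so the more important the variable (rank 1 = most important).
    """
    ordered = sorted(distance_without, key=distance_without.get)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# Distances reported with all four variables (each left out in turn).
four_vars = {"MB": 8.112, "BH": 10.431, "BL": 6.545, "NH": 10.594}
ranks4 = naive_rank(four_vars)
least4 = max(ranks4, key=ranks4.get)  # candidate for removal

# Distances reported after NH was removed.
three_vars = {"MB": 6.590, "BH": 9.539, "BL": 5.607}
ranks3 = naive_rank(three_vars)
least3 = max(ranks3, key=ranks3.get)

print(ranks4, least4)
print(ranks3, least3)
```

The ranks agree with the naive rank columns in the output: NH is the removal candidate in the first round and BH in the second.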

Hence, the second variable BH (Basibregmatic Height of Skull) was the least important one for discrimination. Because Brodgar needs at least 2 variables, the backward selection was not taken any further. Hence, the variables Maximal Breadth of Skull (MB) and Basialveolar Length of Skull (BL) were the most important variables for discriminating between the 5 different periods along the first axis. Keep in mind that the hypothesis tests look at differences between all the groups, not between the two most extreme groups (groups 1 and 5). Looking at differences between these two extreme groups is an analysis in itself, which can also be carried out with discriminant analysis in Brodgar. An interesting alternative analysis would be to consider samples from the periods 4000, 3300 and 1850 BC as one group, and samples from 200 BC and 150 AD as a second group.

Conclusion

There is a significant difference between the samples of the five different periods. Figure 2 shows a trend along the first axis from right (4000 BC) to left (150 AD). Based on the canonical correlations and the backward selection, Maximal Breadth of Skull (MB) and Basialveolar Length of Skull (BL) were the most important variables for discriminating between the periods.

 
