Egyptian skulls

This example shows how discriminant analysis (DA) can be used to analyse whether male Egyptian skulls from the area of Thebes differ between time periods. Various measurements were made on male Egyptian skulls from five different time periods, namely 4000 BC, 3300 BC, 1850 BC, 200 BC and 150 AD.
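Thirty skulls are available from each time period. For each skull, the following four variables were measured: Maximal Breadth of Skull (MB), Basibregmatic Height of Skull (BH), Basialveolar Length of Skull (BL) and Nasal Height of Skull (NH).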
The format of the data is illustrated in Table 1. The underlying question is whether there is a change in skull size over time. If this is the case, a possible explanation could be interbreeding of the Egyptians with immigrant populations over the years. The question can also be formulated in a more general context: can we discriminate between the samples of different groups (periods)? The most appropriate multivariate technique to answer this question is without doubt discriminant analysis.

Table 1: Male Egyptian skull data.
The data matrix is of dimension 150x5. The 150 samples (rows) can be divided into 5 groups of 30 (indicated by a change in color). The first column is the group identifier.

To apply discriminant analysis to the Egyptian skull data yourself, download the Excel file skulls.xls, and copy and paste the data into Brodgar. Please note that the response variables are in columns, and the first row and column contain labels.

It may be slightly confusing that the four variables MB, BH, BL and NH are called response variables in this text. From a statistical point of view, the column labelled Period is the response variable, whereas MB, BH, BL and NH are explanatory variables. However, in Brodgar MB, BH, BL and NH need to be imported as response variables (which is done by default, see the window Info Y from the Import data button) and the row labels (Period) are used as the group identifier.
Running DA in Brodgar

To apply discriminant analysis in Brodgar, select the "Dimension reduction techniques" window via the "Data exploration" button. Choose "Discriminant analysis" and click on the "Go" button. The user now has the option to deselect samples (which will not be done in this example) and to identify the groups. Click on the "Go" button for identifying groups. By identifying the last sample in each group, Brodgar knows which samples belong to a particular group. So, Brodgar places all subsequent samples in the same group until it encounters a sample labelled as ENDPOINT. This means that:
Once all endpoints have been selected, click the Finish button. A text file will appear, giving the following information:

Discriminant analysis can only be applied if:
-At least 2 groups of samples have been selected.
-The number of samples in the smallest group is larger than the number of response variables.
Brodgar will check this now.
Number of selected groups: 5
Total number of groups is larger or equal than 2: OK.
The number of selected response variables is: 4
The groups consist of the following number of samples:
Group 1: 30 samples
Group 2: 30 samples
Group 3: 30 samples
Group 4: 30 samples
Group 5: 30 samples
The number of samples in each group is larger than 4: OK.
DA CAN BE APPLIED. CLOSE THIS WINDOW AND CLICK ON GO-BUTTON OF STEP 3.
----------------------------------------------------------------------
The groups consist of the following samples:
Group 1, (30 samples):
4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC
4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC
4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC 4000 BC
Group 2, (30 samples):
3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC
3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC
3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC 3300 BC
Group 3, (30 samples):
1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC
1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC
1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC 1850 BC
Group 4, (30 samples):
200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC
200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC
200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC 200 BC
Group 5, (30 samples):
150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD
150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD
150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD 150 AD
Please note that samples containing missing values will be deleted.

If everything is correct, the "Go" button for applying DA is enabled. Clicking it starts the DA calculations and a graphical window containing the results will appear automatically.
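The endpoint rule and the two checks reported in the text file can be sketched in a few lines of Python. This is only an illustration of the logic described above, not Brodgar's code: the function names are hypothetical, the endpoints are given as 0-based row indices, and the requirement that the last sample is itself an endpoint is an assumption.

```python
def groups_from_endpoints(n_samples, endpoints):
    """Give every sample (0-based index) a 1-based group number.

    `endpoints` holds the index of the last sample of each group.
    Assumption: the final sample must itself be an endpoint.
    """
    ends = sorted(set(endpoints))
    if not ends or ends[-1] != n_samples - 1:
        raise ValueError("the last sample must be marked as an endpoint")
    groups = []
    group = 1
    for i in range(n_samples):
        groups.append(group)
        if i == ends[group - 1]:
            # An endpoint closes the current group; the next sample starts a new one.
            group += 1
    return groups


def da_can_be_applied(groups, n_variables):
    """Check the two conditions listed in the text file."""
    sizes = {}
    for g in groups:
        sizes[g] = sizes.get(g, 0) + 1
    enough_groups = len(sizes) >= 2
    enough_samples = min(sizes.values()) > n_variables
    return enough_groups and enough_samples, sizes


# Skull data: 150 samples, the last sample of each period is an endpoint.
groups = groups_from_endpoints(150, [29, 59, 89, 119, 149])
ok, sizes = da_can_be_applied(groups, n_variables=4)
print(sizes)
print("DA can be applied:", ok)
```

For the skull data this gives five groups of 30 samples, and both conditions hold because 30 is larger than the 4 response variables.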
Results for the Egyptian skull data

Figure 1 shows the graphical output produced by Brodgar for the Egyptian skull data. The window labelled Sample scores shows the discriminant scores of the first 2 axes. The label "1" corresponds to a sample of group 1 (4000 BC). The triangles represent group averages. In Graph settings, colors, fonts and font sizes of labels and variables can be changed.
Figure 1: Sample scores of the first 2 axes. Each sample is represented by its group number. The five groups are 4000 BC (1), 3300 BC (2), 1850 BC (3), 200 BC (4) and 150 AD (5). Triangles represent group averages.

Figure 2 shows the same information, but now the triangles are replaced by group numbers and the circles enclose the areas containing 90% of the samples. If the circles or labels do not fit in the graph, go to the discriminant analysis options, multiply the maximum (or minimum) value by a larger factor, and redo the analysis.
Figure 2: Sample scores of the first 2 axes. Numbers represent group averages and the circles the areas containing 90% of the samples.

Note that the group averages of periods 1 and 2 (4000 and 3300 BC) are on the right hand side and the group averages of periods 4 and 5 (200 BC and 150 AD) are on the left hand side. This might indicate that skulls from 4000 and 3300 BC differ from those of 200 BC and 150 AD in terms of the four measurements. Two questions now arise:
We start with the first question. Clicking the button labelled "Numerical information" gives the following information:

Output for discriminant analysis
The number of variables is 4
The number of observations is 150
The number of groups is 5
The number of discriminant functions is 4
The number of rows (observations) containing missing values is 0
Please note that these rows were omitted.

Observations per group
1   30
2   30
3   30
4   30
5   30

How many dimensions?
Eigenvalues (=lambda)
axis   lambda   lambda as %   lambda cumulative %
1      0.425    88.227        88.227
2      0.039    8.094         96.321
3      0.016    3.259         99.581
4      0.002    0.419         100.000

Dimensionality tests for group separation
See p. 211-212 in Huberty (1994)
H0: No separation on any dimension
B0= 59.259
B0 is a chi-squared statistic with degrees of freedom: 16.000
H0: Separation on at most one dimension
B1= 8.072
B1 is a chi-squared statistic with degrees of freedom: 9.000
H0: Separation on at most two dimension
B2= 2.543
B2 is a chi-squared statistic with degrees of freedom: 4.000
Critical values for the Chi-square distribution are available from the menu in the DA window.
......

The first two eigenvalues are 0.43 and 0.04 respectively. This indicates that the discrimination in the x-direction is far more important than the discrimination in the y-direction. The first axis accounts for 88% of the sum of the eigenvalues.

We are particularly interested in the dimensionality tests. These can be used as a rough guide to judge whether the separation between the groups is significant. In the first test, the null hypothesis is that there is no separation between the groups on any dimension (or axis). The test statistic is B0 = 59.259, which under the null hypothesis follows a Chi-square distribution with 16 degrees of freedom. The critical value of the Chi-square distribution for alpha=0.05 and df=16 is 26.296. Because B0 exceeds this critical value, the null hypothesis is rejected. Hence, there is a significant separation between the five groups on at least one axis.
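The percentages and the statistics B0, B1 and B2 in the output can be approximately recomputed from the four eigenvalues. The sketch below assumes that Brodgar uses Bartlett's chi-square approximation, B_k = [N - 1 - (p + g)/2] * sum of ln(1 + lambda_j) over the axes after the k-th, with (p - k)(g - k - 1) degrees of freedom. This formula is not stated in the text and is an assumption; it does reproduce the reported degrees of freedom exactly and the reported statistics up to the rounding of the printed eigenvalues. The critical values are standard Chi-square table values for alpha=0.05, and the function name is hypothetical.

```python
import math

# Eigenvalues as printed in the Brodgar output (rounded to 3 decimals).
eigenvalues = [0.425, 0.039, 0.016, 0.002]
n_obs, n_vars, n_groups = 150, 4, 5

# Standard table values of the Chi-square distribution for alpha = 0.05.
critical_05 = {16: 26.296, 9: 16.919, 4: 9.488}


def dimensionality_tests(lams, n, p, g):
    """Return (k, B_k, df) under the assumed Bartlett approximation."""
    factor = n - 1 - (p + g) / 2
    results = []
    for k in range(len(lams) - 1):
        stat = factor * sum(math.log(1 + lam) for lam in lams[k:])
        df = (p - k) * (g - k - 1)
        results.append((k, stat, df))
    return results


total = sum(eigenvalues)
percentages = [100 * lam / total for lam in eigenvalues]
tests = dimensionality_tests(eigenvalues, n_obs, n_vars, n_groups)

for k, stat, df in tests:
    if df in critical_05:
        reject = stat > critical_05[df]
        print(f"B{k} = {stat:.2f}, df = {df}, reject H0 at 5%: {reject}")
```

Under this assumption only the first null hypothesis (no separation on any dimension) is rejected at the 5% level.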
Critical values of the Chi-square (and F) distribution are available in Brodgar (click on Statistical tables from the menu in Figure 2). The null hypothesis in the second test is that there is separation on at most one dimension. The test statistic is B1 = 8.07 with 9 degrees of freedom, as reported in the numerical output. The critical value for alpha=0.05 and df=9 is 16.92, hence the null hypothesis is not rejected. Based on the first two eigenvalues, this does not come as a surprise.

We now discuss tools to identify which of the variables MB, BH, BL and NH are the most important for discrimination. Canonical correlations are the correlations between the 2 discriminant functions (axes) and the original four variables, see Figure 3. These are extremely useful for the interpretation of the axes. For example, the direction of the line for the variable BL (Basialveolar Length of Skull) indicates that the separation of groups 1 and 2 was mainly due to this variable. Similarly, the separation of samples from periods 4 and 5 was mainly caused by MB (and to a lesser extent by NH). The minor discrimination in the y-direction was mainly caused by BH.
Figure 3: Canonical correlations. The x and y coordinates of the head of each line represent the correlation between the variable and the first and second discriminant axes respectively.

The numerical output with respect to selecting and ordering variables is as follows:

....
Selecting and ordering response variables
Each variable is left out in turn
Variable   Wilks lambda   F to remove   R2      naive rank
1          0.741          4.118         0.104   2
2          0.695          1.673         0.045   3
3          0.780          6.235         0.149   1
4          0.688          1.280         0.035   4
Degrees of freedom for F to remove: 4 and 142

Total sum of Mahalanobis distances between group means
Each variable is left out in turn
Variable   Mah. dist.   naive rank
1          8.112        2
2          10.431       3
3          6.545        1
4          10.594       4
....

To determine the importance of the variables for discriminating between the groups, Brodgar carries out a backward selection. The underlying idea is as follows. The discrimination between all the groups is measured by the total sum of Mahalanobis distances. In the backward selection approach, Brodgar leaves out each variable in turn and the total sum of Mahalanobis distances between group means is calculated again. Leaving out an important variable (with respect to discrimination between groups) will result in a larger decrease in the total sum of Mahalanobis distances than leaving out a less important variable. The naive rank indicates which variable resulted in the largest decrease (labelled 1) and which variable was the least important for discriminating between group means (highest rank). The latter variable can then be omitted from the analysis (choose: deselect Y in the data import process) and the discriminant analysis repeated with the remaining variables. This process can be continued until no significant discrimination can be made along the X-th axis. In this case, X=1. Alternatively, scree plots or prior knowledge might help in deciding when to stop. Brodgar also calculates the so-called F to remove statistic.
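As an illustration, the F to remove, R2 and naive rank columns can be approximately recovered from the Wilks lambda values in the output. The sketch below assumes the standard partial Wilks lambda formulation: the lambda of the full model is the product of 1/(1 + eigenvalue) over all axes, R2 = 1 - lambda_full/lambda_without, and F = (lambda_without/lambda_full - 1) * (N - g - p + 1)/(g - 1). The text does not state which formulas Brodgar uses, so this is an assumption; it agrees with the reported degrees of freedom (4 and 142) and with the reported values up to rounding. The variable order MB, BH, BL, NH follows the text.

```python
import math

n_obs, n_groups, n_vars = 150, 5, 4
eigenvalues = [0.425, 0.039, 0.016, 0.002]

# Values copied from the Brodgar output (each variable left out in turn).
wilks_without = {"MB": 0.741, "BH": 0.695, "BL": 0.780, "NH": 0.688}
mahalanobis_without = {"MB": 8.112, "BH": 10.431, "BL": 6.545, "NH": 10.594}

# Wilks lambda of the full four-variable model (assumed formula).
wilks_full = 1 / math.prod(1 + lam for lam in eigenvalues)

df1 = n_groups - 1                    # 4
df2 = n_obs - n_groups - n_vars + 1   # 142

f_to_remove = {}
r_squared = {}
for name, lam_without in wilks_without.items():
    ratio = lam_without / wilks_full
    f_to_remove[name] = (ratio - 1) * df2 / df1
    r_squared[name] = 1 - 1 / ratio

# Rank 1 = most important: largest F, or smallest remaining distance.
rank_by_f = sorted(f_to_remove, key=f_to_remove.get, reverse=True)
rank_by_distance = sorted(mahalanobis_without, key=mahalanobis_without.get)

for name in wilks_without:
    print(f"{name}: F to remove = {f_to_remove[name]:.2f}, R2 = {r_squared[name]:.3f}")
print("Order by F to remove:", rank_by_f)
print("Order by Mahalanobis distance:", rank_by_distance)
```

Both orderings place BL first and NH last, in line with the naive ranks in the output.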
The F to remove statistic is an alternative to the total sum of Mahalanobis distances between groups. We prefer to work with Mahalanobis distances. Both approaches indicated that variable 4 (NH) was the least important for discrimination between the five groups. This does not come as a surprise because of the short length of the corresponding line in Figure 3. We decided to omit this variable from the analysis and repeated the discriminant analysis using the remaining 3 variables. The hypothesis tests indicated that the discrimination along the first axis was still significant. The numerical output with respect to the Mahalanobis distances was as follows:

Total sum of Mahalanobis distances between group means
Each variable is left out in turn
Variable   Mah. dist.   naive rank
1          6.590        2
2          9.539        3
3          5.607        1

Hence, the second variable BH (Basibregmatic Height of Skull) was the least important one for discrimination. Because Brodgar needs at least 2 variables, the backward selection was not taken any further. Hence, the variables Maximal Breadth of Skull (MB) and Basialveolar Length of Skull (BL) were the most important variables for discriminating between the 5 different periods along the first axis.

Keep in mind that the hypothesis tests look at differences between all the groups, not between the two most extreme groups (groups 1 and 5). Looking at differences between these two extreme groups is an analysis in itself, which can also be carried out with discriminant analysis in Brodgar. An interesting alternative analysis would be to consider samples from the periods 4000, 3300 and 1850 BC as one group, and samples from 200 BC and 150 AD as a second group.

Conclusion

There is a significant difference between the samples of the five different periods. Figure 2 shows a trend along the first axis from right (4000 BC) to left (150 AD). Based on the canonical correlations and the backward selection, Maximal Breadth of Skull (MB) and Basialveolar Length of Skull (BL) were the most important variables for discriminating between the periods.