On this page you will find:
A few key facts:
1. Screenshots
Below, links are given to various screenshots and examples of Brodgar:
Examples of ordination methods
Examples of time series methods
General screenshots
To reduce download time, the quality of the figures was reduced.
2. Brodgar and R
Version 2.0.6+ of Brodgar contains an interface to the statistics package R (version 1.8.1). In this section, the following information is provided:
What is R?
R is a free implementation of the S language and has become popular since the late 1990s. Various textbooks on R and S-Plus have been published, and Internet newsgroups and mailing lists provide useful information. S-Plus is one of the best all-round statistical software packages, but unfortunately it is rather expensive. The syntax of R is about 95% identical to that of S-Plus, and the same textbooks and manuals can be used. The only disadvantage is that R requires programming, whereas S-Plus contains a GUI allowing users to click buttons. Someone who is not used to programming might find R off-putting in the beginning.

Biological data analyses require specialised methods which are only available in software packages like Brodgar or CANOCO, but also more general methods (data exploration, regression, generalised linear modelling (GLM) and generalised additive modelling (GAM)). These latter methods are all available in R. To make these general statistical methods available to Brodgar users, an R interface was added to Brodgar. With two mouse clicks in Brodgar, the user can obtain R lattice graphs or apply GAM without having to write 20-30 lines of code.

Examples of R utilities available from Brodgar
Brodgar v.2.0.6+ can access various tools from R, for example:
Some of the methods are illustrated below.

Coplots
In multivariate data, the relationship between two variables may be obscured by a third one. If one plots y against x, effects of z are ignored. The coplot allows one to plot y against x while taking account of a third variable z. Figure 1 shows an example. The Dune Meadow data set consists of abundances of 33 plant species measured at 20 sites in a dune area. Various explanatory variables (soil and management related) were measured at each site. For each site, the total abundance was calculated. The response variable (total abundance) is on the y axis and A1 (a soil variable) is on the x axis, with six separate plots conditional on the values of Moisture (a soil variable) shown in the top panel.
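The idea behind a coplot can be sketched numerically in a few lines of Python. This is a simplified illustration, not Brodgar's or R's implementation. It conditions on discrete levels of z (a real coplot uses overlapping intervals of z), fits a least-squares slope of y on x within each level, and compares the result with the slope obtained when z is ignored. The data and the function names `slope` and `conditional_slopes` are hypothetical.

```python
from collections import defaultdict


def slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def conditional_slopes(x, y, z):
    """Slope of y on x, computed separately for each level of z."""
    groups = defaultdict(list)
    for xi, yi, zi in zip(x, y, z):
        groups[zi].append((xi, yi))
    return {
        level: slope([p[0] for p in pts], [p[1] for p in pts])
        for level, pts in sorted(groups.items())
    }


# Hypothetical data: y increases with x when z = 1 and decreases when z = 2.
x = [1, 2, 3, 1, 2, 3]
y = [2, 4, 6, 6, 4, 2]
z = [1, 1, 1, 2, 2, 2]

pooled = slope(x, y)                    # ignores z
by_level = conditional_slopes(x, y, z)  # conditions on z
print(pooled, by_level)
```

In this constructed example the pooled slope is zero although the relationship is clearly positive at one level of z and negative at the other, which is exactly the situation a coplot is meant to reveal.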
Figure 1. Coplot of total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data. The panels are ordered from the lower left to the upper right. This order corresponds to increasing values of moisture. The six lines in the upper panel show the range of moisture per graph.

Results show that for lower values of moisture, the relationship between total abundance of species and A1 is positive, whereas for larger values of moisture the relationship becomes negative. Brodgar allows one to use regression lines or smoothing curves in the plots. It is also possible to have no lines.

Pairs
Another useful tool is the pairs function. It shows the pair-wise scatterplots, and these can be used to detect relationships between variables and multicollinearity. Figure 2 shows an example for the same data as in Figure 1. Note that there are no clear linear relationships between the variables. Two values of A1 are rather large, which suggests that a transformation of A1 may be appropriate. The lines are obtained by smoothing x on y. It is also possible to use a regression line or no line at all.
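A numerical companion to a pairs plot is the matrix of pair-wise correlations. The sketch below is only an illustration with hypothetical data and hypothetical function names (`pearson`, `pairwise_correlations`). It is not part of Brodgar, and a correlation coefficient only captures linear association, whereas the scatterplots also show non-linear patterns and outliers.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(variables):
    """Correlation for every pair of named variables (names in sorted order)."""
    return {
        (a, b): pearson(variables[a], variables[b])
        for a, b in combinations(sorted(variables), 2)
    }


# Hypothetical values; "A1" and "Moisture" are perfectly collinear here.
data = {
    "A1": [1, 2, 3, 4],
    "Moisture": [2, 4, 6, 8],
    "Total": [4, 1, 3, 2],
}
for pair, r in pairwise_correlations(data).items():
    print(pair, round(r, 3))
```

A correlation close to 1 or -1 between two explanatory variables is a warning sign of collinearity.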
Figure 2. Pair plot of total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data. The lines are obtained by smoothing x on y.

Dotplots
This is a plot in which each observation is represented by a single dot. The value is presented along the horizontal axis. Dotplots can be used to identify outliers. Dotplots for four dune species are given in Figure 3. The 20 sites are plotted along the vertical axes and the horizontal axes show the values (square root transformed) at the sites. Isolated points on the right-hand side would indicate outliers; there are none for these four species. However, the dotplots do show the large number of zero observations. It is useful to make dotplots for species, explanatory variables and index functions.
Figure 3. Dotplots of four plant species from the Dune Meadow data. The vertical axes contain the samples and the values of the species are along the horizontal axes.

Lattice (Trellis) graphs
These are probably the most useful graphical exploration tools in S-Plus and R. The name “Trellis” is copyright protected and for that reason Trellis graphs are called lattice plots in R. An example for the Dune Meadow data is presented in Figure 4. Along all the x-axes, the explanatory variable A1 (soil related) is plotted. The panels contain the abundances of species, and a smoothing curve is added. Lattice plots give a good indication of what kind of relationships can be expected, e.g. linear or non-linear.
Figure 4. Lattice plots of 6 species from the Dune Meadow data. A1 is an explanatory (soil) variable.

Boxplots and histograms
A boxplot visualises the centre and spread of a univariate variable. The midpoint of a boxplot is given by the median. The 25% and 75% quartiles define the hinges (the ends of the box). The difference between the hinges is called the spread. Lines are drawn from each hinge up to a distance of 1.5 times the spread. Any point beyond these lines is called an outlier. Figure 5 shows the boxplots of all 30 plant species. No transformation was used. It is interesting to make boxplots and histograms of all species and explanatory variables, print the graphs and redraw them for another transformation. This will indicate which transformation (if any) should be applied.
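The boxplot quantities described above can be computed directly. The following Python sketch is a minimal illustration under stated assumptions: the hinges are taken as the medians of the lower and upper halves of the sorted data (statistical packages use slightly different quartile definitions), and the abundance values and the function name `boxplot_stats` are hypothetical.

```python
from statistics import median


def boxplot_stats(values):
    """Median, hinges, spread, fences and outliers for one variable."""
    v = sorted(values)
    n = len(v)
    half = (n + 1) // 2  # the middle value is included in both halves when n is odd
    lower_hinge = median(v[:half])
    upper_hinge = median(v[n - half:])
    spread = upper_hinge - lower_hinge
    lower_fence = lower_hinge - 1.5 * spread
    upper_fence = upper_hinge + 1.5 * spread
    return {
        "median": median(v),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "spread": spread,
        "fences": (lower_fence, upper_fence),
        "outliers": [x for x in v if x < lower_fence or x > upper_fence],
    }


# Hypothetical abundances of one species at nine sites.
abundances = [0, 0, 1, 1, 2, 2, 3, 3, 20]
stats = boxplot_stats(abundances)
print(stats)
```

Running the same function on transformed values (for example square roots) shows how a transformation pulls in large observations, which is the comparison suggested in the text.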
Figure 5. Boxplots of various plant species (no transformation was applied) of the Dune Meadow data.

Regression trees
Two aspects which can cause problems in multivariate data are:
A useful tool to investigate relationships between one response variable and multiple explanatory variables is the regression tree. This is a simple tool which is best explained with the help of an example. The spider species data set consists of abundances of 12 spider species measured in 28 traps. Five explanatory variables were measured at each site. Total abundance per site was calculated, and the relationship between total abundance and the 5 explanatory variables is explored with the help of a regression tree, see Figure 6. The response variable (total abundance) is a vector of length 28. The regression tree indicates that the 28 values of the index function (total abundance) can be split up into two groups: group 1 consists of 20 samples with herb cover smaller than 4.283, and group 2 consists of 8 samples with herb cover larger than or equal to 4.283. The latter group can be further split up into two groups, namely those with moss cover smaller than 0.89 (5 sites with an average of 38 species) and larger than 0.89 (3 samples with an average of 50 species). Similar statements can be made for the left branch. Regression trees are a useful extension of generalised additive modelling. Further details can be found in Quinn and Keough (2002).
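The basic step of a regression tree, choosing one split such as "herb cover smaller than 4.283", can be sketched as follows. This is a simplified illustration and not the algorithm of any particular R library. It assumes the common criterion of minimising the within-group sum of squares, and it looks at a single explanatory variable. The data and the function names `sum_of_squares` and `best_split` are hypothetical.

```python
def sum_of_squares(ys):
    """Sum of squared deviations from the group mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(x, y):
    """Find the threshold on x that minimises the within-group sum of squares of y."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        cost = sum_of_squares(left) + sum_of_squares(right)
        if best is None or cost < best["cost"]:
            best = {
                "threshold": (pairs[i - 1][0] + pairs[i][0]) / 2,
                "cost": cost,
                "left_mean": sum(left) / len(left),
                "right_mean": sum(right) / len(right),
            }
    return best


# Hypothetical data: total abundance jumps once herb cover exceeds about 4.
herb_cover = [1, 2, 3, 5, 6, 7]
total_abundance = [10, 12, 11, 40, 42, 41]
split = best_split(herb_cover, total_abundance)
print(split)
```

A full tree repeats this search over all explanatory variables and then recursively within each resulting group, which is how the second split on moss cover arises in Figure 6.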
Figure 6. Regression tree for total abundance and 5 explanatory variables for the spider data set.

Additive modelling
A multiple linear regression model is given by:

$$y_i = \alpha + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i$$

The additive model is a special case of the generalised additive model, and is defined by:

$$y_i = \alpha + f_1(x_{i1}) + \dots + f_p(x_{ip}) + \varepsilon_i$$

where each of the functions $f_j(\cdot)$ is a smoothing curve (e.g. a loess curve). The shape of these curves can be used to get an idea of the relationship between the response variable and the explanatory variables.

Loyn (1987) analysed the abundance of birds measured in 56 forest patches. For each patch, mean bird abundance, area (size of patch), years since isolation and distance to nearest patch are available. In the first instance, we use the following additive model:

$$\text{Bird}_i = \alpha + f_1(\text{Year}_i) + f_2(\text{Patch Area}_i) + f_3(\text{Distance}_i) + \varepsilon_i$$

The index i refers to forest patch, where i = 1, ..., 56. One option is to make a scatterplot (pairs) of the data, but the problem is that these plots only show pair-wise relationships. The additive model overcomes this. The estimated smoothing curves and 95% point-wise confidence intervals are presented in Figure 7. The effect of Area is slightly non-linear (though this is only due to one site), whereas distance and year show a linear relationship.
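One classical way to estimate the smoothing curves of an additive model is backfitting: each curve is repeatedly re-estimated by smoothing the partial residuals against its own explanatory variable. The Python sketch below illustrates the idea under clearly stated assumptions. It uses a simple running-mean smoother instead of loess, a fixed number of iterations, and hypothetical data and function names (`running_mean`, `backfit`). It is not necessarily the procedure used by Brodgar or by the R libraries it calls.

```python
def running_mean(x, r, k=1):
    """Smooth r against x: average r over the k nearest neighbours on each side (in x order)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        lo = max(0, pos - k)
        hi = min(len(x), pos + k + 1)
        window = [r[order[j]] for j in range(lo, hi)]
        fitted[i] = sum(window) / len(window)
    return fitted


def backfit(columns, y, k=1, n_iter=20):
    """Fit y = alpha + f_1(x_1) + ... + f_p(x_p) by backfitting."""
    n = len(y)
    p = len(columns)
    alpha = sum(y) / n                      # intercept: mean of the response
    f = [[0.0] * n for _ in range(p)]       # fitted curve values, one list per variable
    for _ in range(n_iter):
        for j, x in enumerate(columns):
            # Partial residuals: remove the intercept and all other curves.
            partial = [
                y[i] - alpha - sum(f[m][i] for m in range(p) if m != j)
                for i in range(n)
            ]
            fj = running_mean(x, partial, k)
            mean_fj = sum(fj) / n
            f[j] = [v - mean_fj for v in fj]  # centre each curve around zero
    return alpha, f


# Hypothetical data with two explanatory variables.
year = [1, 2, 3, 4, 5, 6]
area = [3, 1, 6, 2, 5, 4]
birds = [5.0, 3.5, 9.0, 5.5, 9.5, 9.0]
alpha, curves = backfit([year, area], birds)
print(alpha)
print(curves[0])
print(curves[1])
```

Plotting each list in `curves` against its explanatory variable gives the kind of estimated smoothing curves that are inspected for linearity or non-linearity.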
Clustering
Brodgar contains hierarchical clustering. The process consists of the following steps.
Figure 9 shows an example for the Dune Meadow data. Hierarchical clustering with the Jaccard index and average linkage was applied.
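To make the combination of the Jaccard index and average linkage concrete, here is a small Python sketch. It is an illustration only, not Brodgar's implementation. It treats each sample as a set of species that are present, uses one minus the Jaccard index as the distance, and merges the two closest clusters until one remains. The sample data and the function names `jaccard_distance` and `average_linkage` are hypothetical.

```python
def jaccard_distance(a, b):
    """One minus the Jaccard index of two sets of species."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def average_linkage(samples):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance)."""
    clusters = [(name,) for name in sorted(samples)]
    merges = []

    def dist(c1, c2):
        # Average linkage: mean distance over all pairs of members.
        total = sum(jaccard_distance(samples[i], samples[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, c1, c2 = min(
            ((dist(a, b), a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, d))
    return merges


# Hypothetical presence data: species found at four sites.
sites = {
    "s1": {"A", "B", "C"},
    "s2": {"A", "B"},
    "s3": {"D", "E"},
    "s4": {"D", "E", "F"},
}
for c1, c2, d in average_linkage(sites):
    print(c1, c2, round(d, 3))
```

The sequence of merges and their distances is the information that a dendrogram displays.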
Figure 9. Dendrogram for Dune Meadow data. Clustering was applied on the samples.

What do you need to do?
The R interface in Brodgar works as follows. The user selects a method, e.g. GAM, and clicks on a button, and then:
The user does not see anything of R, except for the graphs and/or numerical output. Obviously, you will need to download and install R.

References
Fox, J. (2002). An R and S-Plus companion to applied regression. Sage Publications.
Jongman, R.H.G., Ter Braak, C.J.F. and van Tongeren, O.F.R. (1995). Data analysis in community and landscape ecology. Cambridge University Press, Cambridge.
Legendre, P. and Legendre, L. (1998). Numerical Ecology. Second English Edition. Elsevier Science B.V.
Loyn, R.H. (1987). Effects of patch area and habitat on bird abundances, species numbers and tree health in fragmented Victorian forests. In: Nature Conservation: The role of Remnants of Native Vegetation (Saunders, D.A., Arnold, G.W., Burbidge, A.A. and Hopkins, A.J.M. eds.). pp. 65-77. Surrey Beatty and Sons, Chipping Norton, NSW.
Quinn, G.P. and Keough, M.J. (2002). Experimental design and data analysis for biologists. Cambridge University Press.
3A. Freebie 1: Time series analysis examples
For the fast reader: links to DFA, MAFA and chronological clustering examples:
Two papers on DFA:
Examples
Background information
Underlying questions in time series studies
Common characteristics in environmental time series studies are that the series are (i) short, (ii) non-stationary, (iii) made up of many response variables which are interacting with each other, and (iv) have missing values. Common questions in these studies are:
These questions can be summarised by one simple question: what's going on? Brodgar contains various different time series techniques to answer this question. Three important ones are:
Each of these methods is discussed next.

Dynamic factor analysis
Biological time series are in general too short for techniques like spectral analysis, wavelet analysis, auto regressive (AR) models and auto regressive integrated moving average (ARIMA) models. Furthermore, aspects like missing values and the presence of many response variables are not handled well by these techniques. To address the multivariate nature of the response variables, standard multivariate techniques such as canonical correspondence analysis, principal component analysis and multidimensional scaling are sometimes used. However, these techniques do not handle missing data properly, and furthermore they do not take dependencies over time into account.

A more promising approach is structural time series analysis (Harvey 1989). Although this set of techniques originates from fields related to econometrics and psychology, it has several aspects that are of interest to biologists. Its main feature is that the time series are modelled in terms of a trend, seasonal effects, a cycle, explanatory variables and noise, each of which is allowed to be stochastic. This means that one might end up with a seasonal component which changes slightly from year to year, a cyclic component which is not necessarily a cosine function, a trend which is not restricted to be a straight line or a polynomial, or explanatory variables which only have a significant influence in a certain period.

One particularly interesting approach is dynamic factor analysis. In this multivariate technique, underlying common components are identified, namely: common trends, common seasonal patterns, common cycles, and effects of explanatory variables. If the time series are short, as in most environmental studies, cycles and seasonal patterns can be omitted, resulting in the estimation of common trends and effects of explanatory variables only.
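For readers who prefer a formula, the reduced model with only common trends and explanatory variables is commonly written in state-space form as below. The notation is an assumption made here for illustration and is not taken from this page: $y_t$ is the vector of the N observed series at time t, $\alpha_t$ holds the M common trends (with M smaller than N), $Z$ contains the factor loadings, $x_t$ the explanatory variables with regression coefficients $\beta$, and $\varepsilon_t$ and $\eta_t$ are noise terms.

$$y_t = Z \alpha_t + \beta x_t + \varepsilon_t, \qquad \alpha_t = \alpha_{t-1} + \eta_t$$

The second equation makes each common trend a random walk, which is why a trend is not restricted to a straight line or a polynomial.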
Traditionally, the parameter estimation process in dynamic factor analysis was carried out by direct optimisation of a maximum likelihood criterion (Harvey 1989). Due to numerical problems, this limits the number of time series that can be analysed. Zuur et al. (2003a,b, 2004), however, addressed this limitation by using a different estimation procedure, the so-called EM algorithm. Furthermore, they extended the technique by including explanatory variables.

Links to dynamic factor analysis examples:

MAFA
MAFA stands for min/max autocorrelation factor analysis. MAFA can be described in various ways, e.g.:
MAFA is perhaps best explained with an example:

Chronological clustering
MAFA and dynamic factor analysis are techniques which can be used to estimate trends in multivariate time series. Application of these techniques to biological data assumes that the underlying ecosystem is gradually changing over time. However, these techniques are less appropriate if the ecosystem changes rapidly from one state to another. Ordinary clustering techniques might be applied to identify sudden changes, but these methods are likely to result in groups of years that are difficult to interpret. For example, how does one explain a group containing 1970, 1976, 1992 and 2003? Chronological clustering, as the name already suggests, is especially designed for clustering of time series. The method is fully described in Legendre et al. (1985), Bell and Legendre (1987), and Legendre and Legendre (1998). The first two papers are downloadable from Legendre's website (search on "chronological clustering" in Google), and are easy to read for non-statisticians. Explaining chronological clustering is best done with examples:
To identify breakpoints in multivariate time series, Brodgar can also apply regime shift analysis, as explained in Hare & Mantua (2000); see the Brodgar manual for an example.

Besides dynamic factor analysis, MAFA and chronological clustering, Brodgar is capable of carrying out ‘standard’ multivariate techniques and multivariate time series techniques like principal component analysis, canonical correspondence analysis, discriminant analysis, redundancy analysis, multidimensional scaling, ARIMAX, spectral analysis, etc. For short time series data (say fewer than 15 points in time), some of the multivariate methods can be used. For example, partial RDA and partial CCA can be used to determine how much variation in the response variables is due to time.

The emphasis in Brodgar is on biological and environmental time series. However, Brodgar can be used in many other fields. For example, various Brodgar users work on economic time series data.

References
3B. Freebie 2: Examples of ordination techniques
The following dimension reduction techniques are available in Brodgar:
A short non-statistical introduction of each method is presented below. We explain what each of these techniques can do and various examples are presented. References to easy-to-read text books and publications are given at the end of this page.
Example 1: Canonical correspondence analysis applied on hunting spider data
Keywords: Species-environmental relationships. How to read a triplot and a biplot. Canonical correspondence analysis.

Example 2: Discriminant analysis applied on Egyptian skulls of different time periods from the area of Thebes
Keywords: Discrimination between groups of samples. Discriminant analysis. Hypothesis tests (is the discrimination significant?). Identification of which variables contributed most to the discrimination.

Example 3: Zoobenthic species measured in an intertidal area in Argentina
Keywords: Zoobenthic species-environmental relationships. When to use PCA and RDA and not CA or CCA? Biplots and triplots. Superimposing explanatory variables on a biplot. Indirect gradient analysis. Coenoclines. Discriminant analysis to detect differences in species behavior at 3 transects.

Example 4: Using coenoclines to detect linear or unimodal species-environmental relationships
Keywords: Coenoclines. Using PCA or CA? Using RDA or CCA?

Example 5: Variance partitioning for dune meadow data
Keywords: Identify the amount of variation in response variables due to a subset of explanatory variables. Partial CCA. Identify pure management effects, spatial effects, pure temporal effects, etc.
4. Brodgar compliance statement on R GNU license
The software package R is distributed under the GNU general public license. Brodgar, which is not distributed under the GNU license, creates ASCII files containing R script commands which are sourced into the binary version of R (using the BATCH mode). Although Brodgar is linked to R, it does not contain it. The user will need to (i) download a compiled version of R, (ii) install it and (iii) tell Brodgar where it can find R. Hence, R and Brodgar are two different packages (Brodgar is the shell and R the compiler) which are kept at arm's length. As a consequence, Brodgar complies with the GNU license. See http://www.gnu.org/copyleft/gpl.html for details.

As an extra service, Brodgar's own R library files are freely available in the Brodgar installation directory (also in the online evaluation version). The interested user can modify and extend these library files, although faulty modifications affect your warranty. Brodgar also contains a considerable number of techniques (50%) which do not make use of R.

The R libraries gam and mvpart are distributed under the GPL license as well, and the vioplot library under BSD. Therefore, the same holds for these libraries.
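The mechanism described above, a front end that writes a plain-text R script and hands it to a separately installed R in BATCH mode, can be sketched in Python as follows. This is only an illustration of the general pattern. The file names, the R commands in the script and the helper names `write_r_script` and `batch_command` are hypothetical, and the exact executable name and options that Brodgar uses are not stated on this page. The sketch builds the command line but does not run R.

```python
import os
import tempfile


def write_r_script(directory, commands, name="brodgar_job.R"):
    """Write R commands to a plain-text script file and return its path."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(commands) + "\n")
    return path


def batch_command(r_executable, script_path, output_path):
    """Build the command line that runs a script with R in BATCH mode."""
    return [r_executable, "CMD", "BATCH", "--no-save", script_path, output_path]


# Hypothetical script: a lattice plot of total abundance against A1 per moisture level.
commands = [
    "library(lattice)",
    "dune <- read.table('dune.txt', header = TRUE)",
    "print(xyplot(Total ~ A1 | Moisture, data = dune))",
]
workdir = tempfile.mkdtemp()
script = write_r_script(workdir, commands)
command = batch_command("R", script, os.path.join(workdir, "brodgar_job.Rout"))
print(" ".join(command))
# To execute it, R must be installed and on the PATH:
#   import subprocess; subprocess.run(command, check=True)
```

Because the front end only produces a text file and starts R as a separate program, the two packages remain independent, which is the point of the compliance statement.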