Brodgar software

On this page you will find:

  1. Brodgar screenshots

  2. Information on the R interface in Brodgar.

  3. Two statistical freebies (information on some of the methods; time series and ordination).

  4. Brodgar compliance statement on R GNU license

A few key facts:

  - 95% of the statistical methods described in the 700-page Springer book Analysing Ecological Data by Zuur et al. (2007) were carried out in Brodgar. The book is priced at 79 USD.
  - Brodgar is cheap: a student license starts at 150 GBP (approx. 290 USD), and users in developing countries receive a 50% reduction.
  - Classroom and departmental licenses are cheap too. Brodgar is well suited to classroom teaching; we have used it in classes with 100 MSc students.
  - Importing data is simple and the statistical analysis is a matter of 'click-and-go'.
  - Various universities have purchased departmental and 'use-at-home' licenses for their staff.
  - Many published papers report analyses carried out with Brodgar.

 

1. Screenshots

The links below lead to various screenshots and examples of Brodgar.

Examples of ordination methods:

  - Canonical correspondence analysis applied on hunting spider data
  - Discriminant analysis applied on Egyptian skulls of different time periods from the area of Thebes.
  - Zoobenthic species measured in an intertidal area in Argentina
  - Using coenoclines to detect linear or unimodal species-environmental relationships
  - Variance partitioning for dune meadow data

Examples of time series methods:

  - DFA applied on fisheries time series data
  - MAFA applied on zoobenthic time series data
  - Chronological clustering applied to 100 climatic time series: identifying regime shifts in Pacific North America.

General screenshots

  - R interface
  - Import data window
  - Brodgar's own spreadsheet
  - Data exploration
  - PCA biplot
  - CCA options
  - CCA triplot (1)
  - CCA triplot (2)
  - CCA triplot (3)
  - DFA model formulation
  - DFA output during runtime
  - DFA validation
  - DFA all fitted curves in one graph
  - DFA estimated common trends
  - DFA estimated factor loadings
  - DFA fitted curves
  - MAFA smoothing functions

To reduce download time, the quality of the figures was reduced.

2. Brodgar and R

Version 2.0.6+ of Brodgar contains an interface to the statistics package R (version 1.8.1). In this section, the following information is provided:

  - What is R?
  - Which tools are accessible from Brodgar?
  - Examples of R utilities available from Brodgar
  - What do you need to do?
  - Potential problems
  - Graphs
  - What is next?

What is R?

R is a free implementation of the S language and has become popular since the late 1990s. Various textbooks on R and S-Plus have been published, and Internet newsgroups and mailing lists provide useful information. S-Plus is one of the best all-round statistical software packages, but unfortunately it is rather expensive. About 95% of the R syntax is identical to that of S-Plus, and the same textbooks and manuals can be used. The only disadvantage is that R requires programming, whereas S-Plus contains a GUI that allows users to click buttons. Someone who is not used to programming might find R off-putting in the beginning.

Biological data analyses require specialised methods which are only available in software packages like Brodgar or CANOCO, but also more general methods (data exploration, regression, generalised linear modelling (GLM) and generalised additive modelling (GAM)). These latter methods are all available in R. To make these general statistical methods available to Brodgar users, an R interface was added to Brodgar. With two mouse clicks in Brodgar, the user can obtain R lattice graphs or apply GAM without having to write 20-30 lines of code.

Examples of R utilities available from Brodgar

Brodgar v.2.0.6+ can access various tools from R, for example:

  1. Boxplots. Shows distribution of data and identifies possible outliers.
  2. Dotplots. Outlier identification.
  3. Histograms. Shows distribution and possible outliers.
  4. Lattice graphs. These are the S-Plus Trellis graphs. A lattice graph is one of the most useful tools in the data exploration. In one graph, one can see the type of relationship between response variables and explanatory variables.
  5. Pairs. Shows two-way interaction between variables.
  6. Conditional boxplots. Shows boxplots for different values of a nominal variable.
  7. Coplot. Visualises the relationship between Y and X while conditioning on a third explanatory variable Z (or even a fourth explanatory variable).
  8. Linear regression. Model the relationship between one response variable and multiple explanatory variables. 
  9. Generalised linear modeling (GLM). For count data, 0-1 data and proportional data. 
  10. Additive modeling. Use smoothing methods to find the relationship between one response variable and multiple explanatory variables.
  11. Generalised additive modeling (GAM). For count data, 0-1 data and proportional data.
  12. Mixed modelling.
  13. Regression trees. Useful tool to find relationships between one response variable and multiple explanatory variables.
  14. Clustering (using the  Jaccard coefficient,  Community coefficient, Similarity ratio, Percentage similarity, Ochiai coefficient, Chord distance, Euclidean distance, squared Euclidean distance, correlation coefficient, covariance coefficient, maximum distance, Manhattan distance, Canberra coefficient or binary distance).

Some of the methods are illustrated below.

Coplots

In multivariate data, the relationship between two variables may be obscured by a third one. If one plots y against x, the effects of z are ignored. The coplot allows one to plot y against x while taking account of a third variable z. Figure 1 shows an example. The Dune Meadow data set consists of abundances of 30 plant species measured at 20 sites in a dune area. Various explanatory variables (soil and management related) were measured at each site. For each site, the total abundance was calculated. Total abundance (the response variable) is on the y axis and A1 (a soil variable) is on the x axis, with six separate plots conditional on the values of Moisture (also a soil variable), whose ranges are shown in the top panel.

Figure 1. Coplot of total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data.

The panels are ordered from the lower left to the upper right. This order corresponds to increasing values of moisture. The six lines in the upper panel show the range of moisture per graph. Results show that for lower values of moisture, the relationship between total abundance of species and A1 is positive, whereas for larger values of moisture the relationship becomes negative. Brodgar allows one to use regression lines or smoothing curves in the plots. It is also possible to have no lines.
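The idea behind the coplot can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Brodgar's or R's implementation: the data are made up, the function names `slope` and `conditional_slopes` are hypothetical, and the conditioning variable is cut into non-overlapping groups (R's coplot uses overlapping intervals). It shows how a relationship that is invisible in the pooled data appears once y is regressed on x within each range of z.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def conditional_slopes(x, y, z, n_panels=2):
    """Split the observations into n_panels groups of increasing z and
    return the slope of y on x within each group (one value per panel)."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    size = len(order) // n_panels
    slopes = []
    for p in range(n_panels):
        start = p * size
        stop = len(order) if p == n_panels - 1 else start + size
        idx = order[start:stop]
        slopes.append(slope([x[i] for i in idx], [y[i] for i in idx]))
    return slopes

# Made-up data: y increases with x when z is low and decreases when z is high.
x = [1, 2, 3, 4, 1, 2, 3, 4]
z = [1, 1, 1, 1, 5, 5, 5, 5]
y = [2, 4, 6, 8, 8, 6, 4, 2]

print(slope(x, y))                  # pooled slope: 0.0
print(conditional_slopes(x, y, z))  # per-panel slopes: [2.0, -2.0]
```

The pooled slope is zero, while the two panels show a positive and a negative slope, which is the same kind of pattern as described for Moisture and A1 above.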

Pairs

Another useful tool is the pairs function. It shows the pair-wise scatterplots, which can be used to detect relationships between variables and multicollinearity. Figure 2 shows an example for the same data as in Figure 1. Note that there are no clear linear relationships between the variables. Two values of A1 are rather large, which suggests that a transformation of A1 may be needed. The lines are obtained by smoothing x on y. It is also possible to use a regression line or no line at all.

Figure 2. Pair plot of total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data. The lines are obtained by smoothing x on y.

Dotplots

This is a plot in which each observation is represented by a single dot. The value is shown along the horizontal axis. Dotplots can be used to identify outliers. Dotplots for four dune species are given in Figure 3. The 20 sites are plotted along the vertical axes and the horizontal axes show the values (square root transformed) at the sites. Isolated points on the right-hand side would indicate outliers; there are none for these four species. However, the dotplots do show the large number of zero observations. It is useful to make dotplots for species, explanatory variables and index functions.

Figure 3. Dotplots of four plant species from the Dune Meadow data. The vertical axes contain the samples and the values of the species are along the horizontal axes.

Lattice (Trellis) graphs

These are probably the most useful graphical exploration tools in S-Plus and R. The name “Trellis” is copyright protected and for that reason Trellis graphs are called lattice plots in R. An example for the Dune Meadow data is presented in Figure 4. The explanatory variable A1 (soil related) is plotted along the x-axis of every panel. The panels contain the abundances of the species, and a smoothing curve is added. Lattice plots give a good indication of what kind of relationship can be expected, e.g. linear or non-linear.

Figure 4. Lattice plots of 6 species from the Dune Meadow data. A1 is an explanatory (soil) variable.

Boxplots and histograms

A boxplot visualises the median and spread of a single variable. The midpoint of a boxplot is given by the median. The 25% and 75% quartiles define the hinges (the ends of the box). The difference between the hinges is called the spread. Lines are drawn from each hinge to the most extreme observation that lies within 1.5 times the spread of that hinge. Any point beyond these lines is called an outlier. Figure 5 shows the boxplots of all 30 plant species. No transformation was used. It is worthwhile to make boxplots and histograms of all species and explanatory variables, print the graphs and redraw them for another transformation. This indicates which transformation (if any) should be applied.

Figure 5. Boxplots of various plant species (no transformation was applied) of the Dune Meadow data.
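The boxplot rule described above can be written out as a short Python sketch. The data and the function name `boxplot_summary` are hypothetical, and ordinary quartiles from the standard library are used in place of hinges, which can differ slightly for small samples.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Median, hinges, spread and outliers according to the 1.5 x spread rule."""
    # Quartiles are used as a stand-in for the hinges (an assumption of this sketch).
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower_fence = q1 - 1.5 * spread
    upper_fence = q3 + 1.5 * spread
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"median": median, "hinges": (q1, q3),
            "spread": spread, "outliers": outliers}

# Made-up abundances for one species at ten sites.
abundance = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = boxplot_summary(abundance)
print(summary["hinges"], summary["spread"], summary["outliers"])
# (3.25, 7.75) 4.5 [30]
```

Repeating the calculation on transformed values (for example square roots) shows whether a transformation pulls such points back inside the fences.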

Regression trees

Two aspects which can cause problems in multivariate data are:

  1. The explanatory variables may interact with each other.

  2. The relationship between response variables and explanatory variables may be non-linear.

A useful tool to investigate relationships between one response variable and multiple explanatory variables is the regression tree. This is a simple tool which is best explained with the help of an example. The spider data set consists of abundances of 12 spider species measured in 28 traps. Five explanatory variables were measured at each site. Total abundance per site was calculated, and the relationship between total abundance and the 5 explanatory variables is explored with a regression tree; see Figure 6. The response variable (total abundance) is a vector of length 28. The regression tree indicates that the 28 values of the index function (total abundance) can be split into two groups: one group of 20 samples with herb cover smaller than 4.283, and one group of 8 samples with herb cover larger than or equal to 4.283. The latter group can be further split into two groups, namely those with moss cover smaller than 0.89 (5 sites with an average of 38 species) and larger than 0.89 (3 samples with an average of 50 species). Similar statements can be made for the left branch. Regression trees are a useful extension of generalised additive modelling. Further details can be found in Quinn and Keough (2002).

Figure 6. Regression tree for total abundance and 5 explanatory variables for the spider data set.
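How a regression tree arrives at a cut-off such as "herb cover smaller than 4.283" can be illustrated with a small Python sketch. It assumes the usual least-squares criterion (choose the split that minimises the sum of squares within the two groups) and handles a single explanatory variable and a single split only; the data and the function name `best_split` are hypothetical, and the packages used by Brodgar may differ in detail.

```python
def best_split(x, y):
    """Return (threshold, within_group_sum_of_squares) for the binary split
    'x < threshold' that minimises the within-group sum of squares of y."""
    def sum_of_squares(values):
        if not values:
            return 0.0
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    unique_x = sorted(set(x))
    # Candidate thresholds lie halfway between consecutive observed values.
    for low, high in zip(unique_x, unique_x[1:]):
        threshold = (low + high) / 2
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        total = sum_of_squares(left) + sum_of_squares(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Made-up data: abundance jumps once herb cover exceeds about 4.
herb_cover = [1, 2, 3, 5, 6, 7]
total_abundance = [10, 12, 11, 40, 42, 41]
print(best_split(herb_cover, total_abundance))  # (4.0, 4.0)
```

A full tree repeats the same search within each resulting group and over all explanatory variables, which is how the second split on moss cover arises.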

Additive modelling

A multiple linear regression model is given by:

$$y_i = \alpha + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i$$

The additive model is a special case of the generalised additive model, and is defined by:

$$y_i = \alpha + f_1(x_{i1}) + \dots + f_p(x_{ip}) + \varepsilon_i$$

where each function $f_j(\cdot)$ is a smoothing curve (e.g. a loess curve). The shape of these curves can be used to get an idea of the relationship between the response variable and the explanatory variables.

Loyn (1987) analysed the abundance of birds measured in 56 forest patches. For each patch, mean bird abundance, area (size of patch), years since isolation and distance to nearest patch are available. As a first step, we use the following additive model:

$$\text{Bird}_i = \alpha + f_1(\text{Year}_i) + f_2(\text{Patch Area}_i) + f_3(\text{Distance}_i) + \varepsilon_i$$

The index i refers to the forest patch, where i = 1, ..., 56. One option is to make a scatterplot (pairs) of the data, but the problem is that these plots only show pair-wise relationships. The additive model overcomes this. The estimated smoothing curves and 95% point-wise confidence intervals are presented in Figure 7. The effect of Area is slightly non-linear (though this is only due to one site), whereas distance and year show a linear relationship.

Figure 7. Results of additive modelling for the bird abundance data using 4 degrees of freedom for each smoother.
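One standard way to estimate the smoothing curves in an additive model is backfitting: each curve is re-estimated in turn by smoothing the partial residuals that remain after the other curves have been subtracted. The Python sketch below illustrates that idea only. It is not the algorithm used by Brodgar or by R's gam library; the smoother is a crude running mean rather than loess, and the data and the names `running_mean` and `backfit` are hypothetical.

```python
def running_mean(x, r, k=2):
    """Smooth r against x: each point receives the mean of r over itself
    and up to k neighbours on either side, taken in order of increasing x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    smoothed = [0.0] * len(x)
    for pos, i in enumerate(order):
        window = order[max(0, pos - k): pos + k + 1]
        smoothed[i] = sum(r[j] for j in window) / len(window)
    return smoothed

def backfit(columns, y, n_iter=20, k=2):
    """Fit y_i = alpha + f_1(x_i1) + ... + f_p(x_ip) + noise by backfitting.
    Returns alpha and the fitted curve values f[j][i] at the observations."""
    n = len(y)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in columns]
    for _ in range(n_iter):
        for j, xj in enumerate(columns):
            # Partial residuals: what is left after removing all other terms.
            partial = [y[i] - alpha
                       - sum(f[m][i] for m in range(len(columns)) if m != j)
                       for i in range(n)]
            fj = running_mean(xj, partial, k)
            centre = sum(fj) / n
            f[j] = [v - centre for v in fj]  # keep each curve centred on zero
    return alpha, f

# Made-up data for eight patches (not Loyn's values).
year = [1, 2, 3, 4, 5, 6, 7, 8]
area = [3, 8, 1, 6, 2, 7, 4, 5]
bird = [12.0, 20.5, 10.0, 19.0, 13.5, 23.0, 18.0, 20.0]

alpha, curves = backfit([year, area], bird)
fitted = [alpha + curves[0][i] + curves[1][i] for i in range(len(bird))]
print(round(alpha, 3))
```

Plotting `curves[0]` against `year` and `curves[1]` against `area` gives the kind of partial-effect curves shown in Figure 7, without the confidence bands.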

Clustering

Brodgar contains hierarchical clustering. The process consists of the following steps.

  1. Choose whether clustering should be applied to the samples or to the variables.
  2. Choose a measure of similarity. The following options are available: Jaccard coefficient, Community coefficient, Similarity ratio, Percentage similarity, Ochiai coefficient, Chord distance, Euclidean distance, squared Euclidean distance, correlation coefficient, covariance coefficient, maximum distance, Manhattan distance, Canberra coefficient, and binary distance. Some of these coefficients treat the data as presence/absence (e.g. Jaccard, community coefficient, binary). An excellent description of these measures of similarity can be found in Jongman et al. (1995) and Legendre and Legendre (1998).
  3. Choose an agglomeration method. This determines how groups are connected into new groups. We advise using "average".
  4. Select samples and variables. The buttons "Select all variables" and "Select all samples" can be used as well.

Figure 9 shows an example for the Dune Meadow data. Hierarchical clustering using the Jaccard index and average linkage was used.

Figure 9. Dendrogram for Dune Meadow data. Clustering was applied on the samples.
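The combination used for Figure 9 (Jaccard index with average linkage) can be sketched in plain Python as follows. This is a minimal illustration on made-up presence/absence data; the function names are hypothetical and it is not the routine Brodgar calls.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard coefficient of two presence/absence vectors."""
    both = sum(1 for u, v in zip(a, b) if u and v)
    either = sum(1 for u, v in zip(a, b) if u or v)
    return 1.0 - both / either if either else 0.0

def average_linkage(samples):
    """Agglomerative clustering of samples. Returns the merges in order as
    (cluster_a, cluster_b, distance); clusters are tuples of sample indices."""
    clusters = [(i,) for i in range(len(samples))]
    merges = []

    def distance(c1, c2):
        # Average linkage: mean distance over all between-cluster pairs.
        total = sum(jaccard_distance(samples[i], samples[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min((distance(c1, c2), c1, c2)
                      for n, c1 in enumerate(clusters)
                      for c2 in clusters[n + 1:])
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges

# Made-up data: rows are samples, columns are species (1 = present).
sites = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
for a, b, d in average_linkage(sites):
    print(a, b, round(d, 4))
```

The merge distances are the heights at which branches join in a dendrogram such as Figure 9.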

What do you need to do?

The R interface in Brodgar works as follows. The user selects a method, e.g. GAM and clicks on a button, and then: 

  1. Brodgar writes the required R code to a file.
  2. Brodgar starts R in BATCH mode. 
  3. R does all the calculations.
  4. R gives the output to Brodgar.
  5. Brodgar shows the results to the user.

The user does not see anything of R, except for the graphs and/or numerical output. Obviously, you will need to download and install R. 
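The first two steps of this workflow (write the R code to a file, then start R in batch mode) can be sketched as follows. The sketch is an assumption about how such a front end could be organised, not Brodgar's actual code: the function name `prepare_r_batch` and the file names `analysis.R` and `analysis.Rout` are hypothetical. The command it builds uses R's standard `R CMD BATCH infile outfile` form.

```python
import tempfile
from pathlib import Path

def prepare_r_batch(r_code, r_executable="R", workdir=None):
    """Write r_code to a script file and build the batch command for it.
    Returns (command, script_path, output_path); nothing is executed here."""
    workdir = Path(workdir or tempfile.mkdtemp())
    script = workdir / "analysis.R"      # hypothetical file name
    output = workdir / "analysis.Rout"   # R writes its console output here
    script.write_text(r_code)
    command = [r_executable, "CMD", "BATCH", str(script), str(output)]
    return command, script, output

r_code = "print(summary(c(1, 2, 3, 10)))\n"
command, script, output = prepare_r_batch(r_code)
print(" ".join(command))
```

On a machine where R is installed, `subprocess.run(command, check=True)` would carry out steps 2 and 3, after which the front end reads the output file and shows the results to the user.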

References

Fox, J. (2002). An R and S-Plus companion to applied regression. Sage Publications.

Jongman, R.H.G., Ter Braak, C.J.F. and van Tongeren, O.F.R. (1995). Data analysis in community and landscape ecology. Cambridge University Press, Cambridge.

Legendre, P. and Legendre, L. (1998). Numerical Ecology. Second English Edition. Elsevier Science B.V.

Loyn, R.H. (1987). Effects of patch area and habitat on bird abundances, species numbers and tree health in fragmented Victorian forests. In: Nature Conservation: The role of Remnants of Native Vegetation (Saunders, D.A., Arnold, G.W., Burbidge, A.A. and Hopkins A.J.M. eds.). pp. 65-77. Surrey Beatty and Sons, Chipping Norton, NSW.  

Quinn, G.P. and Keough, M.J. (2002). Experimental design  and data analysis for biologists. Cambridge University Press.

 

3A. Freebie 1: Time series analysis examples

For the fast reader: links to DFA, MAFA and chronological clustering examples.

Two papers on DFA:

  - Zuur, A.F., Tuck, I.D. and Bailey, N. (2003b). Dynamic factor analysis to estimate common trends in fisheries time series. Canadian Journal of Fisheries and Aquatic Sciences, 60: 542-552.
  - Zuur, A.F. and Pierce, G.J. (2004). Common trends in Northeast Atlantic Squid time series. Journal of Sea Research, 52: 57-72.

Examples:

  - Dynamic factor analysis explained in more detail
  - DFA applied on fisheries time series data
  - MAFA applied on zoobenthic time series data
  - Chronological clustering applied to sticklebacks time series data (Bell & Legendre 1987).

Background information

Underlying questions in time series studies

Time series in environmental studies typically (i) are short, (ii) are non-stationary, (iii) consist of many response variables which interact with each other, and (iv) have missing values. Common questions in these studies are:

  - What are the general patterns over time in the measured variables?
  - Are there interactions between the measured variables?
  - Are there any underlying explanatory variables?
  - Are there any shifts or sudden changes over time?

These questions can be summarised by one simple question: what's going on? Brodgar contains various time series techniques to answer this question. Three important ones are:

  - Dynamic factor analysis (Zuur et al. 2003a,b; Zuur & Pierce 2004).
  - MAFA (Solow 1994, Shapiro & Switzer 1989).
  - Chronological clustering (Legendre et al. 1985, Bell & Legendre 1987, Legendre & Legendre 1998).

Each of these methods is discussed next.

Dynamic factor analysis

Biological time series are in general too short for techniques like spectral analysis, wavelet analysis, autoregressive (AR) models and autoregressive integrated moving average (ARIMA) models. Furthermore, missing values and the presence of many response variables are not handled well by these techniques. To address the multivariate nature of the response variables, standard multivariate techniques such as canonical correspondence analysis, principal component analysis and multidimensional scaling are sometimes used. However, these techniques do not handle missing data properly, and they do not take dependencies over time into account. A more promising approach is structural time series analysis (Harvey 1989). Although this set of techniques originates from fields related to econometrics and psychology, it has several aspects that are of interest to biologists. Its main feature is that the time series are modelled in terms of a trend, seasonal effects, a cycle, explanatory variables and noise, each of which is allowed to be stochastic. This means that one might end up with a seasonal component which changes slightly from year to year, a cyclic component which is not necessarily a cosine function, a trend which is not restricted to be a straight line or a polynomial, or explanatory variables which only have a significant influence in a certain period.

One particularly interesting approach is dynamic factor analysis. In this multivariate technique, underlying common components are identified, namely common trends, common seasonal patterns, common cycles and effects of explanatory variables. If the time series are short, as in most environmental studies, cycles and seasonal patterns can be omitted, resulting in the estimation of common trends and effects of explanatory variables only. Traditionally, the parameter estimation process in dynamic factor analysis was carried out by direct optimisation of a maximum likelihood criterion (Harvey 1989). Due to numerical problems, this limits the number of time series that can be analysed. Zuur et al. (2003a,b) and Zuur and Pierce (2004) addressed this limitation by using a different estimation procedure, the so-called EM algorithm. Furthermore, they extended the technique by including explanatory variables.
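For orientation, the model with common trends only can be written in a common state-space form. The notation below is chosen for this page and is an assumption; the symbols in the cited papers may differ.

$$y_t = Z\,\alpha_t + \varepsilon_t, \qquad \alpha_t = \alpha_{t-1} + \eta_t$$

Here $y_t$ holds the values of the $N$ time series at time $t$, $\alpha_t$ holds the $M$ common trends (with $M$ much smaller than $N$), $Z$ is the $N \times M$ matrix of factor loadings, and $\varepsilon_t$ and $\eta_t$ are noise terms. The second equation makes each trend a random walk, which is what allows it to be something other than a straight line or a polynomial. Explanatory variables $x_t$ can be added as an extra term, for example $y_t = c + Z\,\alpha_t + D\,x_t + \varepsilon_t$, where $c$ is a vector of constant levels and $D$ a matrix of regression coefficients.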

Links to dynamic factor analysis examples:

  - Dynamic factor analysis explained in more detail
  - DFA applied on fisheries time series data

MAFA

MAFA stands for min/max autocorrelation factor analysis. MAFA can be described in various ways, e.g.:

  - A type of principal component analysis especially for (short) time series.
  - A method for extracting trends from multiple time series.
  - A method for estimating index functions from time series.
  - A smoothing method.
  - A signal extraction procedure.

MAFA is perhaps best explained with an example:

  - MAFA applied on zoobenthic time series data
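As the name suggests, MAFA looks for combinations of the original series whose autocorrelation is as large (or as small) as possible; a combination with high autocorrelation at lag one behaves like a smooth trend. The Python sketch below does not carry out MAFA itself. It only computes the lag-one autocorrelation that serves as the measure of smoothness, using made-up series and a hypothetical function name.

```python
def lag1_autocorrelation(series):
    """Sample autocorrelation of a series at time lag one."""
    n = len(series)
    m = sum(series) / n
    numerator = sum((series[t] - m) * (series[t + 1] - m) for t in range(n - 1))
    denominator = sum((v - m) ** 2 for v in series)
    return numerator / denominator

trend = [1, 2, 3, 4, 5, 6]       # smooth increase over time
zigzag = [1, -1, 1, -1, 1, -1]   # no trend, alternates every time step

print(lag1_autocorrelation(trend))             # 0.5
print(round(lag1_autocorrelation(zigzag), 3))  # -0.833
```

The smooth series scores much higher than the alternating one, which is why ranking axes by this quantity puts trend-like signals first.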

Chronological clustering

MAFA and dynamic factor analysis are techniques which can be used to estimate trends in multivariate time series. Applying these techniques to biological data assumes that the underlying ecosystem is changing gradually over time. They are less appropriate if the ecosystem changes rapidly from one state to another. Ordinary clustering techniques might be applied to identify sudden changes, but these methods are likely to result in groups of years that are difficult to interpret. For example, how does one explain a group containing 1970, 1976, 1992 and 2003? Chronological clustering, as the name suggests, is specially designed for clustering time series. The method is fully described in Legendre et al. (1985), Bell and Legendre (1987), and Legendre and Legendre (1998). The first two papers are downloadable from Legendre's website (search on "chronological clustering" in Google), and are easy to read for non-statisticians. Chronological clustering is best explained with examples:

  - Chronological clustering applied to 100 climatic time series: identifying regime shifts in Pacific North America.
  - Chronological clustering applied to sticklebacks time series data (Bell & Legendre 1987).
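The key constraint, that only groups which are neighbours in time may be joined, can be illustrated with a short Python sketch. This is a deliberately simplified, time-constrained clustering of a single series and not the method of Legendre et al. (1985), which works on multivariate data and uses a statistical test to decide whether two groups may be fused. The data and the function name are hypothetical.

```python
def chronological_groups(values, n_groups):
    """Merge neighbouring time points until n_groups groups remain.
    At each step the two adjacent groups with the closest means are joined,
    so every group is an unbroken run of consecutive time points."""
    groups = [[i] for i in range(len(values))]

    def group_mean(group):
        return sum(values[i] for i in group) / len(group)

    while len(groups) > n_groups:
        gaps = [abs(group_mean(groups[k]) - group_mean(groups[k + 1]))
                for k in range(len(groups) - 1)]
        k = gaps.index(min(gaps))
        groups[k:k + 2] = [groups[k] + groups[k + 1]]
    return groups

# Made-up yearly index with a sudden shift after the third year.
index = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
print(chronological_groups(index, 2))  # [[0, 1, 2], [3, 4, 5]]
```

Because groups can only grow by absorbing a neighbour, a result such as a group containing 1970, 1976, 1992 and 2003 cannot occur, and the boundary between the two groups marks the sudden change.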

To identify breakpoints in multivariate time series, Brodgar can also apply regime shift analysis, as explained in Hare & Mantua (2000); see the Brodgar manual for an example.

Besides dynamic factor analysis, MAFA and chronological clustering, Brodgar is capable of carrying out ‘standard’ multivariate techniques and multivariate time series techniques like principal component analysis, canonical correspondence analysis, discriminant analysis, redundancy analysis, multidimensional scaling, ARIMAX, spectral analysis, etc. For short time series data (say less than 15 points in time), some of the multivariate methods can be used. For example, partial RDA and partial CCA can be used to determine how much variation in the response variables is due to time. 

The emphasis in Brodgar is on biological and environmental time series. However, Brodgar can be used in many other fields. For example, various Brodgar users work on economic time series data.

References

  - Bell, M.A. and Legendre, P. (1987). Multicharacter Chronological Clustering in a Sequence of Fossil Sticklebacks. Systematic Zoology, 36: 52-61.
  - Hare, S.R. and Mantua, N.J. (2000). Empirical evidence for North Pacific regime shifts in 1977 and 1989. Progress in Oceanography, 47: 103-145.
  - Legendre, P., Dallot, S. and Legendre, L. (1985). Succession of species within a community: Chronological clustering, with application to marine and freshwater zooplankton. Am. Nat., 125: 257-288.
  - Legendre, P. and Legendre, L. (1998). Numerical Ecology. Second English Edition. Elsevier Science B.V.
  - Solow, A.R. (1994). Detecting Change in the Composition of a Multispecies Community. Biometrics, 50: 556-565.
  - Shapiro, D.E. and Switzer, P. (1989). Extracting time trends from multiple monitoring sites. Technical report No. 132. Department of Statistics, Stanford University, California.
  - Zuur, A.F., Fryer, R.J., Jolliffe, I.T., Dekker, R. and Beukema, J.J. (2003a). Estimating common trends in multivariate time series using dynamic factor analysis. Environmetrics, 14(7): 665-685.
  - Zuur, A.F., Tuck, I.D. and Bailey, N. (2003b). Dynamic factor analysis to estimate common trends in fisheries time series. Canadian Journal of Fisheries and Aquatic Sciences, 60: 542-552.
  - Zuur, A.F. and Pierce, G.J. (2004). Common trends in Northeast Atlantic Squid time series. Journal of Sea Research, 52: 57-72.

3B. Freebie 2: Examples of ordination techniques

The following dimension reduction techniques are available in Brodgar:

  - Principal component analysis (PCA)
  - Correspondence analysis (CA)
  - Redundancy analysis (RDA)
  - Canonical correspondence analysis (CCA)
  - Partial CCA
  - Partial RDA
  - Canonical correlation analysis
  - Variance partitioning
  - Factor analysis (FA)
  - Multidimensional scaling (MDS)
  - Discriminant analysis (DA)
  - Procrustes analysis
  - Alternative transformations to distance based RDA (db-RDA)
  - Permutation tests and forward selection methods in RDA (and CCA).

A short non-statistical introduction to each method is presented below. We explain what each of these techniques can do, and various examples are presented. References to easy-to-read textbooks and publications are given at the end of this page.

  - The PCA biplot is an M-dimensional graphical representation of the correlation (or covariance) matrix of the response variables. In most cases, M=2.
  - The correspondence analysis biplot is an M-dimensional graphical representation of Chi-squared distances between response variables.
  - Redundancy analysis is a principal component analysis in which the axes are restricted to be linear combinations of explanatory variables.
  - Canonical correspondence analysis is a correspondence analysis in which the axes are restricted to be linear combinations of explanatory variables.
  - In partial CCA and partial RDA, effects of particular (e.g. spatial or temporal) explanatory variables are removed. CCA and partial CCA (or RDA and partial RDA) can be used for variance partitioning. In this method the explanatory variables are divided into two sets: X and W. Using the total sum of eigenvalues of the various analyses, it is then possible to quantify the amount of variance in the response variables that is related to X only, to W only, and to both (the shared amount).
  - Discriminant analysis (alias canonical variate analysis) can be used to find differences between groups of samples and to identify which of the response variables are causing these differences.
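The arithmetic behind variance partitioning, as described in the list above, is straightforward once the sums of eigenvalues are available. The Python sketch below uses made-up numbers and a hypothetical function name; it assumes that three analyses have been run (X and W together, X alone, W alone) and that the total variation in the response variables is known.

```python
def partition_variance(total, explained_both, explained_x, explained_w):
    """Split the total variation into percentages due to X only, W only,
    shared by X and W, and unexplained. Each input is a sum of eigenvalues:
    total variation, X and W together, X alone, W alone."""
    pure_x = explained_both - explained_w   # same as a partial analysis of X given W
    pure_w = explained_both - explained_x   # same as a partial analysis of W given X
    shared = explained_x + explained_w - explained_both
    unexplained = total - explained_both
    parts = {"pure X": pure_x, "pure W": pure_w,
             "shared": shared, "unexplained": unexplained}
    return {name: 100.0 * value / total for name, value in parts.items()}

# Made-up sums of eigenvalues.
result = partition_variance(total=2.0, explained_both=1.2,
                            explained_x=0.9, explained_w=0.7)
for name, percentage in result.items():
    print(name, round(percentage, 1))
# pure X 25.0
# pure W 15.0
# shared 20.0
# unexplained 40.0
```

The four percentages add up to 100, and the "pure" components are the quantities that a partial CCA or partial RDA reports directly.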

Example 1: Canonical correspondence analysis applied on hunting spider data

Keywords: Species-environmental relationships. How to read a triplot and  a biplot. Canonical correspondence analysis.

Example 2: Discriminant analysis applied on Egyptian skulls of different time periods from the area of Thebes.

Keywords: Discrimination between groups of samples. Discriminant analysis. Hypothesis tests (is the discrimination significant?). Identification which variables contributed most to the discrimination.

Example 3: Zoobenthic species measured in an intertidal area  in Argentina

Keywords: Zoobenthic species-environmental relationships. When to use PCA and RDA and not CA or CCA? Biplots and triplots. Superimposing explanatory variables on a biplot. Indirect gradient analysis. Coenoclines. Discriminant analysis to detect differences in species behavior at 3 transects.

Example 4: Using coenoclines to detect linear or unimodal species-environmental relationships

Keywords: Coenoclines. Using PCA or CA? Using RDA or CCA?

Example 5: Variance partitioning for dune meadow data

Keywords: Identify the amount of variation in response variables due to a subset of explanatory variables. Partial CCA. Identify pure management effects, spatial effects, pure temporal effects, etc.

4. Brodgar compliance statement on R GNU license

The software package R is distributed under the GNU General Public License (GPL). Brodgar, which is not distributed under the GPL, creates ASCII files containing R script commands, which are sourced into the binary version of R (using BATCH mode). Although Brodgar is linked to R, it does not contain it. The user needs to (i) download a compiled version of R, (ii) install it and (iii) tell Brodgar where it can find R. Hence, R and Brodgar are two different packages (Brodgar is the shell and R the compiler) which are at arm's length from each other. As a consequence, Brodgar complies with the GPL. See http://www.gnu.org/copyleft/gpl.html for details. As an extra service, Brodgar's own R library files are freely available in the Brodgar installation directory (also in the online evaluation version). The interested user can modify and extend these library files, although faulty modifications affect your warranty. Brodgar also contains a considerable number of techniques (50%) which do not make use of R. The R libraries gam and mvpart are distributed under the GPL as well, and the vioplot library under the BSD license. The same reasoning therefore applies to these libraries.