Brodgar compliance statement on R GNU license

The software package R is distributed under the GNU General Public License (GPL). Brodgar, which is not distributed under the GPL, creates ASCII files containing R script commands, which are sourced into the binary version of R (using BATCH mode). Although Brodgar is linked to R, it does not contain it. The user needs to (i) download a compiled version of R, (ii) install it and (iii) tell Brodgar where it can find R. Hence, R and Brodgar are two different packages (Brodgar is the shell and R the compiler) that are at arm's length from each other. As a consequence, Brodgar complies with the GNU license. See http://www.gnu.org/copyleft/gpl.html for details. As an extra service, Brodgar's own R library files are freely available in the Brodgar installation directory (also in the online evaluation version). The interested user can modify and extend these library files, although faulty modifications affect the warranty. Brodgar also contains a considerable number of techniques (50%) which do not make use of R.

 

Brodgar and R

Version 2.0.6+ of Brodgar contains an interface to the statistics package R (version 1.8.1). In this section, the following information is provided:

- What is R?
- Which tools are accessible from Brodgar?
- Examples of R utilities available from Brodgar
- What do you need to do?
- Potential problems
- Graphs
- What is next?

What is R?

R is a free implementation of the S language and has become popular since the late 1990s. Various textbooks on R and S-Plus have been published, and Internet newsgroups and mailing lists provide useful information. S-Plus is one of the best all-round statistical software packages, but unfortunately it is rather expensive. The syntax of R is about 95% identical to that of S-Plus, and the same textbooks and manuals can be used. The only disadvantage is that R requires programming, whereas S-Plus contains a GUI allowing users to click buttons. Someone who is not used to programming might find R off-putting in the beginning.

Biological data analyses require specialised methods which are only available in software packages like Brodgar or CANOCO, but also more general methods (data exploration, regression, generalised linear modelling (GLM) and generalised additive modelling (GAM)). These latter methods are all available in R. To make these general statistical methods available to Brodgar users, an R interface was added to Brodgar. With two mouse clicks in Brodgar, the user can obtain R lattice graphs or apply GAM without having to write 20-30 lines of code.

Examples of R utilities available from Brodgar

Brodgar v.2.0.6+ can access various tools from R, for example:

  1. Boxplots. Shows distribution of data and identifies possible outliers.
  2. Dotplots. Outlier identification.
  3. Histograms. Shows distribution and possible outliers.
  4. Lattice graphs. These are the S-Plus Trellis graphs. A lattice graph is one of the most useful tools in data exploration. In one graph, one can see the type of relationship between response variables and explanatory variables.
  5. Pairs. Shows pair-wise relationships between variables.
  6. Conditional boxplots. Shows boxplots for different values of a nominal variable.
  7. Coplot. Visualises the relationship between Y and X, while conditioning on a third explanatory variable Z (or even a fourth explanatory variable).
  8. Linear regression. Model the relationship between one response variable and multiple explanatory variables. 
  9. Generalised linear modelling (GLM). For count data, 0-1 data and proportional data.
  10. Additive modelling. Uses smoothing methods to find the relationship between one response variable and multiple explanatory variables.
  11. Generalised additive modelling (GAM). For count data, 0-1 data and proportional data.
  12. Mixed modelling.
  13. Regression trees. Useful tool to find relationships between one response variable and multiple explanatory variables.
  14. Clustering (using the Jaccard coefficient, Community coefficient, Similarity ratio, Percentage similarity, Ochiai coefficient, Chord distance, Euclidean distance, squared Euclidean distance, correlation coefficient, covariance coefficient, maximum distance, Manhattan distance, Canberra coefficient or binary distance).

Some of the methods are illustrated below.

Coplots

In multivariate data, the relationship between two variables may be obscured by a third one. If one plots y against x, effects of z are ignored. The coplot allows one to plot y against x while taking account of a third variable z. Figure 1 shows an example. The Dune Meadow data set consists of abundances of 30 plant species measured at 20 sites in a dune area. Various explanatory variables (soil and management related) were measured at each site. For each site, the total abundance was calculated. The response variable (total abundance) is on the y-axis and A1 (a soil variable) is on the x-axis, with six separate plots conditional on the values of Moisture (a soil variable), which are shown in the top panel.

Figure 1. Coplot of total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data.

The panels are ordered from the lower left to the upper right. This order corresponds to increasing values of moisture. The six lines in the upper panel show the range of moisture per graph. Results show that for lower values of moisture, the relationship between total abundance of species and A1 is positive, whereas for larger values of moisture the relationship becomes negative. Brodgar allows one to use regression lines or smoothing curves in the plots. It is also possible to have no lines.
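The idea of conditioning can be illustrated with a small numerical sketch. It is written in Python rather than R, purely as an illustration, and is not the code that Brodgar or R uses. The function names and the data are made up. As a simplification, the observations are cut into non-overlapping groups of equal size along z, whereas R's coplot uses overlapping intervals. A straight line is then fitted per group.

```python
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def conditional_slopes(x, y, z, n_panels=2):
    """Sort observations by z, cut them into n_panels groups, fit a line per group."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    size = len(order) // n_panels
    panels = []
    for p in range(n_panels):
        start = p * size
        end = (p + 1) * size if p < n_panels - 1 else len(order)
        idx = order[start:end]
        z_range = (z[idx[0]], z[idx[-1]])
        panels.append((z_range, slope([x[i] for i in idx], [y[i] for i in idx])))
    return panels

# Made-up data: y rises with x when z is low and falls with x when z is high.
x = [1, 2, 3, 4, 1, 2, 3, 4]
z = [1, 1, 2, 2, 8, 8, 9, 9]
y = [2, 4, 6, 8, 8, 6, 4, 2]

panels = conditional_slopes(x, y, z)
overall = slope(x, y)
for z_range, b in panels:
    print("z in", z_range, "slope", b)
print("slope ignoring z:", overall)
```

In this made-up example the slope is positive for low z and negative for high z, while the slope that ignores z is zero. This is the kind of pattern that a coplot is meant to reveal.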

Pairs

Another useful tool is the pairs function. It shows the pair-wise scatterplots, which can be used to detect relationships between variables and multicollinearity. Figure 2 shows an example for the same data as in Figure 1. Note that there are no clear linear relationships between the variables. Two values of A1 are rather large, which suggests that a transformation of A1 may be appropriate. The lines are obtained by smoothing x on y. It is also possible to use a regression line or no line at all.

Figure 2. Pair plot of total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data. The lines are obtained by smoothing x on y.

Dotplots

This is a plot in which each observation is represented by a single dot. The value is shown along the horizontal axis. Dotplots can be used to identify outliers. Dotplots for four dune species are given in Figure 3. The 20 sites are plotted along the vertical axes and the horizontal axes show the values (square root transformed) at the sites. Isolated points on the right-hand side would indicate outliers; there are none for these four species. However, the dotplots do show the large number of zero observations. It is useful to make dotplots for species, explanatory variables and index functions.

Figure 3. Dotplots of four plant species from the Dune Meadow data. The vertical axes contain the samples and the values of the species are along the horizontal axes.

Lattice (Trellis) graphs

These are probably the most useful graphical exploration tools in S-Plus and R. The name “Trellis” is copyright protected and for that reason Trellis graphs are called lattice plots in R. An example for the Dune Meadow data is presented in Figure 4. The explanatory variable A1 (soil related) is plotted along all the x-axes. The panels contain the abundances of species, and a smoothing curve is added. Lattice plots give a good indication of what kind of relationships can be expected, e.g. linear or non-linear.

Figure 4. Lattice plots of 6 species from the Dune Meadow data. A1 is an explanatory (soil) variable.

Boxplots and histograms

A boxplot visualises the median and spread of a univariate variable. The midpoint of a boxplot is given by the median. The 25% and 75% quartiles define the hinges (the ends of the box). The difference between the hinges is called the spread. Lines are drawn from each hinge to the most extreme data point within 1.5 times the spread of that hinge. Any point beyond these lines is called an outlier. Figure 5 shows the boxplots of all 30 plant species. No transformation was used. It is useful to make boxplots and histograms of all species and explanatory variables, print the graphs, and redraw them for another transformation. This indicates which transformation (if any) should be applied.

Figure 5. Boxplots of various plant species (no transformation was applied) of the Dune Meadow data.
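The quantities that make up a boxplot can be computed in a few lines. The sketch below is written in Python rather than R, purely as an illustration, and is not Brodgar's or R's own code. The function name boxplot_stats and the example values are made up. The hinges are taken as the medians of the lower and upper halves of the sorted data, which is one common convention; other quartile definitions give slightly different values.

```python
from statistics import median

def boxplot_stats(values, coef=1.5):
    """Median, hinges, spread, whisker ends and outliers of a sample."""
    xs = sorted(values)
    n = len(xs)
    # Hinges: medians of the lower and upper halves (the middle value is
    # included in both halves when n is odd).
    lower_hinge = median(xs[: (n + 1) // 2])
    upper_hinge = median(xs[n // 2:])
    spread = upper_hinge - lower_hinge
    low_fence = lower_hinge - coef * spread
    high_fence = upper_hinge + coef * spread
    inside = [v for v in xs if low_fence <= v <= high_fence]
    return {
        "median": median(xs),
        "hinges": (lower_hinge, upper_hinge),
        "spread": spread,
        "whiskers": (inside[0], inside[-1]),  # most extreme points within the fences
        "outliers": [v for v in xs if v < low_fence or v > high_fence],
    }

# Made-up abundances with one very large value.
stats = boxplot_stats([0, 0, 1, 2, 2, 3, 4, 5, 30])
print(stats)
```

For these made-up values the hinges are 1 and 4, so the spread is 3 and any value above 4 + 1.5 x 3 = 8.5 is flagged; only the value 30 qualifies.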

Regression trees

Two aspects which can cause problems in multivariate data are:

  1. The explanatory variables may interact with each other.

  2. The relationship between response variables and explanatory variables may be non-linear.

A useful tool to investigate relationships between one response variable and multiple explanatory variables is the regression tree. This is a simple tool which is best explained with the help of an example. The spider data set consists of abundances of 12 spider species measured in 28 traps. Five explanatory variables were measured at each site. Total abundance per site was calculated and the relationship between total abundance and the 5 explanatory variables is explored with the help of a regression tree, see Figure 6. The response variable (total abundance) is a vector of length 28. The regression tree indicates that the 28 values of the index function (total abundance) can be split into two groups: the first consists of 20 samples with herb cover smaller than 4.283, and the second of 8 samples with herb cover larger than or equal to 4.283. The latter group can be further split into two groups, namely those with moss cover smaller than 0.89 (5 sites with an average of 38 species) and larger than 0.89 (3 samples with an average of 50 species). Similar statements can be made for the left branch. Regression trees are a useful extension of generalised additive modelling. Further details can be found in Quinn and Keough (2002).

Figure 6. Regression tree for total abundance and 5 explanatory variables for the spider data set.
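A single split of a regression tree can be sketched as follows. The sketch is in Python rather than R, purely as an illustration, and is not the algorithm as implemented in R or Brodgar. It assumes the usual least-squares criterion: try every cut point of one explanatory variable and keep the one that gives the smallest within-group sum of squares. The function names and the data are made up; they are not the spider data.

```python
def sum_sq(values):
    """Sum of squared deviations from the group mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, left_mean, right_mean) for the best least-squares split."""
    pairs = sorted(zip(x, y))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between two identical x values
        left = [p[1] for p in pairs[:k]]
        right = [p[1] for p in pairs[k:]]
        cost = sum_sq(left) + sum_sq(right)
        if best is None or cost < best[0]:
            threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("all x values are identical; no split possible")
    return best[1:]

# Made-up data: total abundance jumps once herb cover exceeds about 3.5.
herb = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
total = [10, 12, 11, 40, 42, 41]

threshold, left_mean, right_mean = best_split(herb, total)
print(threshold, left_mean, right_mean)
```

A full tree repeats this search over all explanatory variables and then within each resulting group, which is how the nested herb cover and moss cover splits described above arise.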

Additive modelling

A multiple linear regression model is given by:

y_i = α + β_1 x_{i1} + … + β_p x_{ip} + ε_i

The additive model is a special case of the generalised additive model, and is defined by:

y_i = α + f_1(x_{i1}) + … + f_p(x_{ip}) + ε_i

where each of the functions f_j(.) is a smoothing curve (e.g. a loess curve). The shape of these curves can be used to get an idea of the relationship between the response variable and the explanatory variables.

Loyn (1987) analysed the abundance of birds measured in 56 forest patches. For each patch, mean bird abundance, area (size of patch), years since isolation and distance to the nearest patch are available. As a first step, we use the following additive model:

Bird_i = α + f_1(Year_i) + f_2(Patch Area_i) + f_3(Distance_i) + ε_i

The index i refers to forest patch, where i = 1, …, 56. One option is to make a scatterplot (pairs) of the data, but the problem is that these plots only show pair-wise relationships. The additive model overcomes this. The estimated smoothing curves and 95% point-wise confidence intervals are presented in Figure 7. The effect of Area is slightly non-linear (though this is only due to one site), whereas Distance and Year show a linear relationship.

Figure 7. Results of additive modelling for the bird abundance data using 4 degrees of freedom for each smoother.
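One common way to estimate the smoothing curves of an additive model is backfitting: each curve is re-estimated in turn by smoothing the partial residuals, i.e. the response minus the intercept and minus all other curves. The sketch below shows the idea in Python rather than R; it is an illustration under stated assumptions and not the method used by R or Brodgar. In particular, a simple running mean over the k nearest x values is swapped in for a loess smoother, and the function names and the data are made up.

```python
def running_mean(x, r, k=5):
    """Smooth r against x: each fitted value is the mean of r at the k nearest x values."""
    fitted = []
    for xi in x:
        nearest = sorted(range(len(x)), key=lambda j: abs(x[j] - xi))[:k]
        fitted.append(sum(r[j] for j in nearest) / k)
    return fitted

def backfit(columns, y, k=5, n_iter=20):
    """Estimate alpha and one centred smoother per explanatory variable."""
    n = len(y)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in columns]
    for _ in range(n_iter):
        for j, x in enumerate(columns):
            # Partial residuals: remove the intercept and all other smoothers.
            partial = [
                y[i] - alpha - sum(f[m][i] for m in range(len(columns)) if m != j)
                for i in range(n)
            ]
            smooth = running_mean(x, partial, k)
            centre = sum(smooth) / n
            f[j] = [v - centre for v in smooth]  # keep each curve centred at zero
    return alpha, f

# Made-up data: a curved effect of x1 plus a linear effect of x2.
x1 = [i / 10 for i in range(30)]
x2 = [((7 * i) % 30) / 10 for i in range(30)]
y = [1.0 + (a - 1.5) ** 2 + 0.5 * b for a, b in zip(x1, x2)]

alpha, f = backfit([x1, x2], y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(len(y))]
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
tss = sum((yi - alpha) ** 2 for yi in y)
print("residual and total sum of squares:", rss, tss)
```

Plotting f[0] against x1 and f[1] against x2 would give the kind of curves shown in Figure 7, although without confidence intervals.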

Clustering

Brodgar contains hierarchical clustering. The process consists of the following steps.

  1. Choose whether clustering should be applied on the samples or on the variables.
  2. Choose a measure of similarity. The following options are available: Jaccard coefficient, Community coefficient, Similarity ratio, Percentage similarity, Ochiai coefficient, Chord distance, Euclidean distance, squared Euclidean distance, correlation coefficient, covariance coefficient, maximum distance, Manhattan distance, Canberra coefficient, and binary distance. Some of these coefficients treat the data as presence/absence (e.g. Jaccard, community coefficient, binary). An excellent description of these measures of similarity can be found in Jongman et al. (1995) and Legendre and Legendre (1998).
  3. Choose an agglomeration method. This determines how groups are connected into new groups. We advise using "average".
  4. Select samples and variables. The buttons "Select all variables" and "Select all samples" can be used as well.

Figure 9 shows an example for the Dune Meadow data. Hierarchical clustering using the Jaccard index and average linkage was used.

Figure 9. Dendrogram for Dune Meadow data. Clustering was applied on the samples.
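The combination of the Jaccard index and average linkage can be sketched as follows. The sketch is in Python rather than R, purely as an illustration, and is not the code used by R or Brodgar. The function names and the small abundance table are made up. Abundances are treated as presence/absence, the distance between two samples is taken as 1 minus the Jaccard similarity, and the distance between two groups is the mean of all pair-wise sample distances.

```python
def jaccard(a, b):
    """Jaccard similarity of two abundance vectors, treated as presence/absence."""
    both = sum(1 for u, v in zip(a, b) if u > 0 and v > 0)
    either = sum(1 for u, v in zip(a, b) if u > 0 or v > 0)
    return both / either if either else 1.0

def average_linkage(samples):
    """Agglomerate samples; return the merges as (group_a, group_b, distance)."""
    def dist(i, j):
        return 1.0 - jaccard(samples[i], samples[j])

    clusters = [(i,) for i in range(len(samples))]
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Average of all pair-wise distances between the two groups.
                d = sum(dist(i, j) for i in clusters[p] for j in clusters[q])
                d /= len(clusters[p]) * len(clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return merges

# Made-up abundances: rows are samples (sites), columns are species.
sites = [
    [3, 0, 2, 0],  # site 0
    [1, 0, 4, 0],  # site 1
    [0, 5, 0, 1],  # site 2
    [0, 2, 1, 3],  # site 3
]

merges = average_linkage(sites)
for group_a, group_b, d in merges:
    print(group_a, group_b, round(d, 3))
```

The list of merges, with the distance at which each merge happens, is the information that a dendrogram displays.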

What do you need to do?

The R interface in Brodgar works as follows. The user selects a method (e.g. GAM) and clicks on a button. Then:

  1. Brodgar writes the required R code to a file.
  2. Brodgar starts R in BATCH mode. 
  3. R does all the calculations.
  4. R gives the output to Brodgar.
  5. Brodgar shows the results to the user.

The user does not see anything of R, except for the graphs and/or numerical output. Obviously, you will need to download and install R. 
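The steps above can be imitated by any program that writes an R script to a file and then calls R non-interactively. The sketch below does this in Python; it is a minimal illustration of the workflow and not Brodgar's actual code. The file names, the column names (Bird, Year, Area, Distance) and the function names are hypothetical. The command line uses R CMD BATCH, which runs a script and writes the output to a text file; the executable name "R" is assumed to be on the search path.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def write_r_script(directory, data_file, response, predictors):
    """Step 1: write a small R script (here a linear regression) to a file."""
    formula = f"{response} ~ {' + '.join(predictors)}"
    lines = [
        f'dat <- read.table("{data_file}", header = TRUE)',
        f"fit <- lm({formula}, data = dat)",
        "print(summary(fit))",
    ]
    script = Path(directory) / "analysis.R"
    script.write_text("\n".join(lines) + "\n")
    return script

def batch_command(script, r_executable="R"):
    """Step 2: build the command that runs the script in BATCH mode."""
    out_file = script.with_suffix(".Rout")
    command = [r_executable, "CMD", "BATCH", "--no-save", str(script), str(out_file)]
    return command, out_file

def run_batch(command, out_file, workdir):
    """Steps 3 to 5: let R do the calculations, then read back its output.

    Returns None when R is not installed or produced no output file.
    """
    if shutil.which(command[0]) is None:
        return None
    subprocess.run(command, cwd=workdir, check=False)
    return out_file.read_text() if out_file.exists() else None

workdir = tempfile.mkdtemp()
script = write_r_script(workdir, "birds.txt", "Bird", ["Year", "Area", "Distance"])
command, out_file = batch_command(script)
print(" ".join(command))
```

Calling run_batch(command, out_file, workdir) would then start R, provided that R is installed and that the hypothetical data file birds.txt exists in the working directory.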

References.

Fox, J. (2002). An R and S-Plus companion to applied regression. Sage Publications.

Jongman, R.H.G., Ter Braak, C.J.F. and van Tongeren, O.F.R. (1995). Data analysis in community and landscape ecology. Cambridge University Press, Cambridge.

Legendre, P. and Legendre, L. (1998). Numerical Ecology. Second English Edition. Elsevier Science B.V.

Loyn, R.H. (1987). Effects of patch area and habitat on bird abundances, species numbers and tree health in fragmented Victorian forests. In: Nature Conservation: The Role of Remnants of Native Vegetation (Saunders, D.A., Arnold, G.W., Burbidge, A.A. and Hopkins, A.J.M. eds.). pp. 65-77. Surrey Beatty and Sons, Chipping Norton, NSW.

Quinn, G.P. and Keough, M.J. (2002). Experimental design and data analysis for biologists. Cambridge University Press.

Home: www.brodgar.com