Brodgar compliance statement on R GNU license

The software package R is distributed under the GNU General Public License. Brodgar, which is not distributed under the GNU license, creates ASCII files containing R script commands, which are sourced into the binary version of R (using BATCH mode). Although Brodgar is linked to R, it does not contain it. The user needs to (i) download a compiled version of R, (ii) install it and (iii) tell Brodgar where it can find R. Hence, R and Brodgar are two different packages (Brodgar is the shell and R the compiler) which are at arm's length from each other. As a consequence, Brodgar complies with the GNU license. See http://www.gnu.org/copyleft/gpl.html for details.

As an extra service, Brodgar's own R library files are freely available in the Brodgar installation directory (also in the online evaluation version). The interested user can modify and extend these library files, although faulty modifications affect the warranty. Brodgar also contains a considerable number of techniques (50%) which do not make use of R.

Brodgar and R

Version 2.0.6+ of Brodgar contains an interface to the statistics package R (version 1.8.1). This section explains what R is, gives examples of the R utilities available from Brodgar, and describes what you need to do to use the interface.
What is R?

R is a free implementation of the S language and has become popular since the late 1990s. Various textbooks on R and S-Plus have been published, and Internet newsgroups and mailing lists provide useful information. S-Plus is one of the best all-round statistical software packages, but unfortunately it is rather expensive. The syntax of R is about 95% identical to that of S-Plus, and the same textbooks and manuals can be used. The only disadvantage is that R requires programming, whereas S-Plus contains a GUI allowing users to click buttons. Someone who is not used to programming might find R off-putting in the beginning.

Biological data analyses require specialised methods that are only available in software packages like Brodgar or CANOCO, but also more general methods (data exploration, regression, generalised linear modelling (GLM) and generalised additive modelling (GAM)). These latter methods are all available in R. To make these general statistical methods available to Brodgar users, an R interface was added to Brodgar. With two mouse clicks in Brodgar, the user can obtain R lattice graphs or apply GAM without having to write 20-30 lines of code.
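Examples of R utilities available from Brodgar

Brodgar v.2.0.6+ can access various tools from R, for example coplots, pairs plots, dotplots, lattice (Trellis) graphs, boxplots and histograms, regression trees, additive modelling and clustering.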
Some of these methods are illustrated below.

Coplots

In multivariate data, the relationship between two variables may be obscured by a third one. If one plots y against x, the effects of z are ignored. The coplot allows one to plot y against x while taking account of a third variable z. Figure 1 shows an example. The Dune Meadow data set consists of abundances of 30 plant species measured at 20 sites in a dune area. Various explanatory variables (soil and management related) were measured at each site. For each site, the total abundance was calculated. The response variable (total abundance) is on the y axis and A1 (a soil variable) is on the x axis, with six separate plots conditional on the values of Moisture (also a soil variable), which are shown in the top panel.
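The conditioning idea behind a coplot can also be sketched numerically. The Python fragment below is only an illustration and is not the code that Brodgar or R runs: it sorts the observations by z, cuts them into equal-sized groups, and fits a separate straight line of y on x within each group. The function names and the toy data are made up, and the groups do not overlap, whereas a coplot normally uses overlapping intervals of z.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def conditional_slopes(x, y, z, n_groups):
    """Sort observations by z, cut them into n_groups equal-count groups,
    and return (z_min, z_max, slope of y on x) for each group.
    Hypothetical helper: non-overlapping groups, unlike a real coplot."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    size = len(order) // n_groups
    result = []
    for g in range(n_groups):
        start = g * size
        stop = len(order) if g == n_groups - 1 else start + size
        idx = order[start:stop]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        zs = [z[i] for i in idx]
        result.append((min(zs), max(zs), slope(xs, ys)))
    return result


# Toy data (made up): y rises with x when z is low and falls when z is high.
x = [1, 2, 3, 4, 1, 2, 3, 4]
z = [1, 1, 2, 2, 8, 8, 9, 9]
y = [2, 4, 6, 8, 8, 6, 4, 2]

groups = conditional_slopes(x, y, z, n_groups=2)
overall = slope(x, y)

for z_min, z_max, b in groups:
    print(f"z in [{z_min}, {z_max}]: slope = {b:+.2f}")
print(f"ignoring z: slope = {overall:+.2f}")
```

In this toy data the slope is positive in the low-z group and negative in the high-z group, while the slope computed without conditioning is zero, which is exactly the kind of obscured relationship a coplot is meant to reveal.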
Figure 1. Coplot of the total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data. The panels are ordered from the lower left to the upper right. This order corresponds to increasing values of moisture. The six lines in the upper panel show the range of moisture per graph.

The results show that for lower values of moisture the relationship between total abundance of species and A1 is positive, whereas for larger values of moisture the relationship becomes negative. Brodgar allows one to use regression lines or smoothing curves in the plots. It is also possible to have no lines.

Pairs

Another useful tool is the pairs function. It shows the pair-wise scatterplots, which can be used to detect relationships between variables and multicollinearity. Figure 2 shows an example for the same data as in Figure 1. Note that there are no clear linear relationships between the variables. Two values of A1 are rather large, which suggests that a transformation of A1 might be useful. The lines are obtained by smoothing x on y. It is also possible to use a regression line or no line at all.
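A numerical companion to the pairs plot is the matrix of pair-wise correlations. The sketch below is Python for illustration only, not Brodgar or R code; the column names, the values and the 0.9 cut-off are all assumptions. It computes the Pearson correlation for every pair of variables and flags pairs that are nearly collinear.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(va * vb)


def pairwise_correlations(columns):
    """columns maps a variable name to its values.
    Returns a dict mapping (name1, name2) to the correlation."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Made-up values; "twice_A1" is deliberately an exact multiple of "A1".
data = {
    "total": [10, 12, 15, 11, 18, 20],
    "A1": [1, 2, 3, 4, 5, 6],
    "twice_A1": [2, 4, 6, 8, 10, 12],
}

corr = pairwise_correlations(data)
# 0.9 is an arbitrary threshold chosen for this example.
collinear = [pair for pair, r in corr.items() if abs(r) > 0.9]

for pair, r in corr.items():
    print(pair, round(r, 3))
print("nearly collinear:", collinear)
```

Correlations only capture linear association, so they complement rather than replace the scatterplots, which also reveal non-linear patterns and unusually large values.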
Figure 2. Pair plot of the total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data. The lines are obtained by smoothing x on y.

Dotplots

This is a plot in which each observation is represented by a single dot. The value is shown along the horizontal axis. Dotplots can be used to identify outliers. Dotplots for four dune species are given in Figure 3. The 20 sites are plotted along the vertical axes and the horizontal axes show the values (square root transformed) at the sites. Isolated points on the right-hand side would indicate outliers; there are none for these four species. However, the dotplots do show the large number of zero observations. It is useful to make dotplots for species, explanatory variables and index functions.
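The layout of a dotplot is simple enough to mimic in text. The Python sketch below is an illustration only (the function name and the abundances are made up): it square-root transforms the values, prints one row per observation with a single mark positioned along the horizontal axis, and counts the zeros.

```python
from math import sqrt


def text_dotplot(values, width=40):
    """Return one text row per observation, with a single '*' placed
    along the horizontal axis in proportion to the value."""
    top = max(values)
    rows = []
    for i, v in enumerate(values, start=1):
        pos = 0 if top == 0 else round(v / top * (width - 1))
        rows.append(f"{i:>3} |" + " " * pos + "*")
    return rows


# Made-up abundances for one species at eight sites.
abundance = [0, 0, 4, 0, 9, 1, 0, 16]
transformed = [sqrt(v) for v in abundance]

rows = text_dotplot(transformed, width=21)
n_zero = sum(1 for v in abundance if v == 0)

print("\n".join(rows))
print("zero observations:", n_zero)
```

A mark that sits far to the right of all the others would be a candidate outlier, and a column of marks at the left edge shows the zeros.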
Figure 3. Dotplots of four plant species from the Dune Meadow data. The vertical axes contain the samples and the values of the species are along the horizontal axes.

Lattice (Trellis) graphs

These are probably the most useful graphical exploration tools in S-Plus and R. The name “Trellis” is copyright protected and for that reason Trellis graphs are called lattice plots in R. An example for the Dune Meadow data is presented in Figure 4. The explanatory variable A1 (soil related) is plotted along all the x-axes. The panels contain the abundances of the species, and a smoothing curve is added. Lattice plots give a good indication of what kind of relationships can be expected, e.g. linear or non-linear.
Figure 4. Lattice plots of 6 species from the Dune Meadow data. A1 is an explanatory (soil) variable.

Boxplots and histograms

A boxplot visualises the centre and spread of a univariate variable. The midpoint of a boxplot is given by the median. The 25% and 75% quartiles define the hinges (the ends of the box). The difference between the hinges is called the spread. Lines are drawn from each hinge out to at most 1.5 times the spread. Any point beyond these lines is called an outlier. Figure 5 shows the boxplots of all 30 plant species. No transformation was used. It is useful to make boxplots and histograms of all species and explanatory variables, print the graphs and redraw them for another transformation. This indicates which transformation (if any) should be applied.
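The boxplot quantities described above can be computed by hand. The Python sketch below is an illustration under stated assumptions, not the routine used by Brodgar or R: the hinges are taken as the medians of the lower and upper halves of the sorted data (each half including the overall median when the sample size is odd), and the function name and data are made up.

```python
from statistics import median


def boxplot_summary(values):
    """Median, hinges, spread, fences and outliers for one variable."""
    s = sorted(values)
    n = len(s)
    half = (n + 1) // 2  # each half includes the median when n is odd
    lower_hinge = median(s[:half])
    upper_hinge = median(s[n - half:])
    spread = upper_hinge - lower_hinge
    low_fence = lower_hinge - 1.5 * spread
    high_fence = upper_hinge + 1.5 * spread
    return {
        "median": median(s),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "spread": spread,
        "fences": (low_fence, high_fence),
        "outliers": [v for v in s if v < low_fence or v > high_fence],
    }


# Made-up values with one clearly extreme observation.
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
summary = boxplot_summary(values)
print(summary)
```

For these values the hinges are 3 and 7, the spread is 4, the fences lie at -3 and 13, and the value 30 is the only point flagged as an outlier.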
Figure 5. Boxplots of various plant species (no transformation was applied) of the Dune Meadow data.

Regression trees

Two aspects which can cause problems in multivariate data are:
A useful tool to investigate relationships between one response variable and multiple explanatory variables is the regression tree. This is a simple tool which is best explained with the help of an example. The spider data set consists of abundances of 12 spider species measured in 28 traps. Five explanatory variables were measured at each site. Total abundance per site was calculated, and the relationship between total abundance and the 5 explanatory variables is explored with the help of a regression tree; see Figure 6. The response variable (total abundance) is a vector of length 28. The regression tree indicates that the 28 values of the index function (total abundance) can be split into two groups: the first consists of 20 samples with herb cover smaller than 4.283, and the second of 8 samples with herb cover larger than or equal to 4.283. The latter group can be further split into two groups, namely those with moss cover smaller than 0.89 (5 samples with an average total abundance of 38) and those with moss cover larger than 0.89 (3 samples with an average total abundance of 50). Similar statements can be made for the left branch. Regression trees are a useful complement to generalised additive modelling. Further details can be found in Quinn and Keough (2002).
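The basic step of a regression tree, finding the threshold on one explanatory variable that best separates the response into two groups, can be sketched as follows. This Python fragment is only an illustration and assumes the common criterion of minimising the within-group sum of squares; the source does not state which criterion Brodgar or R uses. The function names and the data are made up and are not the spider data.

```python
def sum_sq(ys):
    """Sum of squared deviations from the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(x, y):
    """Try every threshold midway between consecutive distinct x values.
    Returns (threshold, left_mean, right_mean, rss) for the split with
    the smallest within-group sum of squares, or None if x is constant."""
    pairs = sorted(zip(x, y))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k][0] == pairs[k - 1][0]:
            continue  # no threshold can separate tied x values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [yy for _, yy in pairs[:k]]
        right = [yy for _, yy in pairs[k:]]
        rss = sum_sq(left) + sum_sq(right)
        if best is None or rss < best[3]:
            best = (threshold, sum(left) / len(left),
                    sum(right) / len(right), rss)
    return best


# Made-up data: abundance jumps once herb cover exceeds about 3.5.
herb_cover = [1, 2, 3, 4, 5, 6]
abundance = [10, 12, 11, 40, 42, 41]

split = best_split(herb_cover, abundance)
print(split)
```

A full tree repeats this search over all explanatory variables and then recursively within each resulting group, which is how the second split on moss cover in the example above arises.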
Figure 6. Regression tree for total abundance and 5 explanatory variables for the spider data set.

Additive modelling

A multiple linear regression model is given by:

$$y_i = \alpha + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i$$

The additive model is a special case of the generalised additive model, and is defined by:

$$y_i = \alpha + f_1(x_{i1}) + \dots + f_p(x_{ip}) + \varepsilon_i$$

where each of the functions $f_j(\cdot)$ is a smoothing curve (e.g. a loess curve). The shape of these curves can be used to get an idea of the relationship between the response variable and the explanatory variables.

Loyn (1987) analysed the abundance of birds measured in 56 forest patches. For each patch, mean bird abundance, area (size of patch), years since isolation and distance to the nearest patch are available. As a first step, we use the following additive model:

$$\text{Bird}_i = \alpha + f_1(\text{Year}_i) + f_2(\text{Patch Area}_i) + f_3(\text{Distance}_i) + \varepsilon_i$$

The index $i$ refers to the forest patch, where $i = 1, \dots, 56$. One option is to make a scatterplot (pairs) of the data, but the problem is that these plots only show pair-wise relationships. The additive model overcomes this. The estimated smoothing curves and 95% point-wise confidence intervals are presented in Figure 7. The effect of Area is slightly non-linear (though this is only due to one site), whereas distance and year show a linear relationship.
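One standard way to estimate the smoothing curves in an additive model is backfitting: each curve is re-estimated in turn by smoothing the partial residuals against its own explanatory variable. The Python sketch below is a hedged illustration of that idea only. The source does not say which algorithm Brodgar or R uses, a crude running mean is swapped in for loess, and the function names, the `span` parameter and the data are all made up (they are not the Loyn data).

```python
def running_mean_smooth(x, r, span=3):
    """Smooth r against x: for each point, average r over the point itself
    and up to `span` neighbours on each side in x order.
    A crude stand-in for a loess smoother."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        lo = max(0, pos - span)
        hi = min(len(order), pos + span + 1)
        window = [r[order[k]] for k in range(lo, hi)]
        fitted[i] = sum(window) / len(window)
    return fitted


def backfit(columns, y, n_iter=20, span=3):
    """Backfitting for y_i = alpha + sum_j f_j(x_ij) + error.
    columns is a list of explanatory variables (lists of equal length).
    Returns alpha and one list of fitted curve values per variable."""
    n = len(y)
    p = len(columns)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove alpha and all other curves.
            partial = [
                y[i] - alpha - sum(f[k][i] for k in range(p) if k != j)
                for i in range(n)
            ]
            fj = running_mean_smooth(columns[j], partial, span)
            mean_fj = sum(fj) / n
            f[j] = [v - mean_fj for v in fj]  # centre each curve on zero
    return alpha, f


# Made-up data: linear effect of x1, curved effect of x2.
n = 20
x1 = [float(i) for i in range(n)]
x2 = [float((7 * i) % n) for i in range(n)]
y = [5.0 + 2.0 * a + (b - 10.0) ** 2 / 10.0 for a, b in zip(x1, x2)]

alpha, f = backfit([x1, x2], y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(n)]
```

Plotting `f[0]` against `x1` and `f[1]` against `x2` would give the analogue of the estimated smoothing curves; the confidence intervals are not computed in this sketch.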
Clustering

Brodgar contains hierarchical clustering. The process consists of the following steps.
Figure 9 shows an example for the Dune Meadow data. Hierarchical clustering using the Jaccard index and average linkage was used.
Figure 9. Dendrogram for Dune Meadow data. Clustering was applied on the samples.
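A combination of the Jaccard index and average linkage can be sketched in a few lines. The Python code below is an illustration only, not Brodgar's or R's implementation: it assumes the Jaccard index is computed on presence/absence of species, converts it to a distance as one minus the index, and repeatedly merges the two clusters with the smallest average between-cluster distance. The function names and the four samples are made up and are not the Dune Meadow data.

```python
def jaccard_similarity(a, b):
    """Jaccard index on presence/absence:
    number of shared species / number of species present in either sample."""
    present_a = {i for i, v in enumerate(a) if v > 0}
    present_b = {i for i, v in enumerate(b) if v > 0}
    union = present_a | present_b
    if not union:
        return 1.0  # convention chosen here for two empty samples
    return len(present_a & present_b) / len(union)


def average_linkage(samples):
    """Agglomerative clustering of samples (dict: name -> abundances).
    Returns the merges in order as (cluster_a, cluster_b, distance)."""
    names = list(samples)
    dist = {
        frozenset((p, q)): 1.0 - jaccard_similarity(samples[p], samples[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist[frozenset((p, q))]
                            for p in clusters[i] for q in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up abundances of four species at four sites.
samples = {
    "s1": [3, 1, 0, 0],
    "s2": [2, 4, 0, 0],
    "s3": [0, 0, 5, 1],
    "s4": [0, 1, 2, 2],
}

merges = average_linkage(samples)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

The sequence of merges and their distances is the information a dendrogram draws: each merge becomes a join, placed at the height of its distance.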
What do you need to do?

The R interface in Brodgar works as follows. The user selects a method, e.g. GAM, and clicks on a button, and then:
The user does not see anything of R, except for the graphs and/or numerical output. Obviously, you will need to download and install R.

References

Fox, J. (2002). An R and S-Plus companion to applied regression. Sage Publications.
Jongman, R.H.G., Ter Braak, C.J.F. and van Tongeren, O.F.R. (1995). Data analysis in community and landscape ecology. Cambridge University Press, Cambridge.
Legendre, P. and Legendre, L. (1998). Numerical Ecology. Second English Edition. Elsevier Science B.V.
Loyn, R.H. (1987). Effects of patch area and habitat on bird abundances, species numbers and tree health in fragmented Victorian forests. In: Nature Conservation: The Role of Remnants of Native Vegetation (Saunders, D.A., Arnold, G.W., Burbidge, A.A. and Hopkins, A.J.M. eds.), pp. 65-77. Surrey Beatty and Sons, Chipping Norton, NSW.
Quinn, G.P. and Keough, M.J. (2002). Experimental design and data analysis for biologists. Cambridge University Press.