We have used Brodgar in several of our own papers.
Most of the examples in "A protocol for data exploration to avoid common statistical problems" by Zuur et al. (2010) can be carried out in Brodgar.
Examples of R utilities available from Brodgar
In multivariate data, the relationship between two variables may be obscured by a third one. If one plots y against x, effects of z are ignored. The coplot allows one to plot y against x while taking account of a third variable z. Figure 1 shows an example. The Dune Meadow data set consists of abundances of 30 plant species measured at 20 sites in a dune area. Various explanatory variables (soil and management related) were measured at each site. For each site, the total abundance was calculated. The response variable (total abundance) is on the y axis and A1 (a soil variable) is on the x axis, with six separate plots conditional on the values of Moisture (also a soil variable), which are shown in the top panel.
Figure 1. Coplot of total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data.
The panels are ordered from the lower left to the upper right. This order corresponds to increasing values of moisture. The six lines in the upper panel show the range of moisture per graph. Results show that for lower values of moisture, the relationship between total abundance of species and A1 is positive, whereas for larger values of moisture the relationship becomes negative. Brodgar allows one to use regression lines or smoothing curves in the plots. It is also possible to have no lines.
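The idea behind the coplot can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Brodgar's or R's implementation: the data values are hypothetical (they are not the Dune Meadow data), the helper names `slope` and `conditional_slopes` are made up for this example, and the panels are simple non-overlapping groups, whereas coplots commonly use overlapping intervals of the conditioning variable.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def conditional_slopes(x, y, z, n_panels=2):
    """Split observations into n_panels groups of increasing z and
    return (z_min, z_max, slope of y on x) for each group."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    size = len(order) // n_panels
    panels = []
    for p in range(n_panels):
        start = p * size
        # The last panel takes any leftover observations.
        end = len(order) if p == n_panels - 1 else start + size
        idx = order[start:end]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        zs = [z[i] for i in idx]
        panels.append((min(zs), max(zs), slope(xs, ys)))
    return panels


# Hypothetical data: y rises with x when z is low and falls when z is high.
x = [1, 2, 3, 4, 1, 2, 3, 4]
z = [1, 1, 2, 2, 4, 4, 5, 5]
y = [10, 12, 14, 16, 20, 18, 16, 14]

panels = conditional_slopes(x, y, z, n_panels=2)
for z_min, z_max, b in panels:
    print(f"z in [{z_min}, {z_max}]: slope of y on x = {b:.2f}")
```

Regressing y on x over all observations at once would hide the change of sign in the slope. Conditioning on z makes it visible, which is what the panels of a coplot do graphically.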
Another useful tool is the pairs function. It shows the pair-wise scatterplots, which can be used to detect relationships between variables and multicollinearity. Figure 2 shows an example for the same data as in Figure 1. Note that there are no clear linear relationships between the variables. Two values of A1 are rather large, which suggests that a transformation of A1 may be appropriate. The lines are obtained by smoothing x on y. It is also possible to use a regression line or no line at all.
Figure 2. Pair plot of total abundance index function and two explanatory variables (A1 and moisture) for the Dune Meadow data. The lines are obtained by smoothing x on y.
A dotplot is a plot in which each observation is represented by a single dot. The value is shown along the horizontal axis. Dotplots can be used to identify outliers. Dotplots for four dune species are given in Figure 3. The 20 sites are plotted along the vertical axes and the horizontal axes show the values (square root transformed) at the sites. Isolated points on the right hand side would indicate outliers; there are none for these four species. However, the dotplots do show the large number of zero observations. It is useful to make dotplots for species, explanatory variables and index functions.
Figure 3. Dotplots of four plant species from the Dune Meadow data. The vertical axes contain the samples and the values of the species are along the horizontal axes.
Lattice (Trellis) graphs
These are probably the most useful graphical exploration tools in S-Plus and R. The name “Trellis” is a protected name and for that reason Trellis graphs are called lattice plots in R. An example for the Dune Meadow data is presented in Figure 4. The explanatory variable A1 (soil related) is plotted along the x-axis of every panel. The panels contain the abundances of species, and a smoothing curve is added. Lattice plots give a good indication of the kind of relationship that can be expected, e.g. linear or non-linear.
Figure 4. Lattice plots of 6 species from the Dune Meadow data. A1 is an explanatory (soil) variable.
Boxplots and histograms
A boxplot visualises the centre and spread of a univariate variable. The midpoint of a boxplot is given by the median. The 25% and 75% quartiles define the hinges (the ends of the box). The difference between the hinges is called the spread. Lines are drawn from each hinge out to at most 1.5 times the spread. Any point beyond these lines is called an outlier. Figure 5 shows the boxplots of all 30 plant species. No transformation was used. It is worthwhile to make boxplots and histograms of all species and explanatory variables, print the graphs and redraw them for another transformation. This gives information on which transformation (if any) should be applied.
Figure 5. Boxplots of various plant species (no transformation was applied) of the Dune Meadow data.
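The quantities that make up a boxplot can be computed by hand, as in the following sketch. It assumes one common convention for the hinges (the medians of the lower and upper halves of the sorted data); statistical packages may use slightly different quartile definitions. The function name `boxplot_stats` and the abundance values are hypothetical.

```python
from statistics import median


def boxplot_stats(values):
    """Return the median, hinges, spread, fences and outliers of a sample."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # each half includes the median when n is odd
    lower_hinge = median(data[:half])
    upper_hinge = median(data[n - half:])
    spread = upper_hinge - lower_hinge
    # Points further than 1.5 times the spread from a hinge are flagged.
    low_fence = lower_hinge - 1.5 * spread
    high_fence = upper_hinge + 1.5 * spread
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median(data),
        "lower_hinge": lower_hinge,
        "upper_hinge": upper_hinge,
        "spread": spread,
        "fences": (low_fence, high_fence),
        "outliers": outliers,
    }


# Hypothetical abundances of one species at eight sites.
stats = boxplot_stats([0, 0, 1, 2, 2, 3, 4, 20])
print(stats)
```

Rerunning the function on transformed values (for example, square roots) shows how a transformation changes which points are flagged as outliers.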
Two aspects that can cause problems in multivariate data are that the explanatory variables may interact with each other, and that the relationship between response variables and explanatory variables may be non-linear. A useful tool to investigate relationships between one response variable and multiple explanatory variables is the regression tree. This is a simple tool which is best explained with the help of an example. The spider species data set consists of abundances of 12 spider species measured in 28 traps. Five explanatory variables were measured at each site. Total abundance per site was calculated, and the relationship between total abundance and the 5 explanatory variables is explored with the help of a regression tree, see Figure 6. The response variable (total abundance) is a vector of length 28. The regression tree indicates that the 28 values of the index function (total abundance) can be split up into two groups: one group of 20 samples with herb cover smaller than 4.283, and one group of 8 samples with herb cover larger than or equal to 4.283. The latter group can be split further into two groups, namely those with moss cover smaller than 0.89 (5 sites with an average of 38 species) and those with moss cover larger than 0.89 (3 samples with an average of 50 species). Similar statements can be made for the left branch. Regression trees are a useful extension of generalised additive modelling. Further details can be found in Quinn and Keough (2002).
Figure 6. Regression tree for total abundance and 5 explanatory variables for the spider data set.
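A single split of a regression tree can be illustrated with a short sketch. The assumption here is the usual least-squares criterion: each candidate threshold on an explanatory variable is tried, and the one that minimises the residual sum of squares around the two group means is kept. A full tree repeats this over all explanatory variables and then recursively within each group. The function names and the data values are hypothetical and are not the spider data.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the threshold on x that minimises the residual sum of squares
    of y around the two group means.

    Returns (threshold, rss, left_mean, right_mean)."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        rss = sum_of_squares(left) + sum_of_squares(right)
        if best is None or rss < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, rss,
                    sum(left) / len(left), sum(right) / len(right))
    return best


# Hypothetical herb cover and total abundance at six sites.
herb_cover = [1, 2, 3, 5, 6, 7]
abundance = [10, 12, 11, 40, 42, 38]

threshold, rss, left_mean, right_mean = best_split(herb_cover, abundance)
print(f"split at herb cover < {threshold}: "
      f"group means {left_mean} and {right_mean}")
```

The group means play the role of the values printed at the leaves of the tree in Figure 6.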
A multiple linear regression model is given by:
$$y_i = \alpha + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i$$
The additive model is a special case of the generalised additive model, and is defined by:
$$y_i = \alpha + f_1(x_{i1}) + \dots + f_p(x_{ip}) + \varepsilon_i$$
where each function f_j(.) is a smoothing curve (e.g. a loess curve). The shape of these curves can be used to get an idea of the relationship between the response variable and the explanatory variables.
Loyn (1987) analysed the abundance of birds measured in 56 forest patches. For each patch, mean bird abundance, area (size of patch), years since isolation and distance to nearest patch are available. In the first instance, we use the following additive model:
$$\text{Bird}_i = \alpha + f_1(\text{Year}_i) + f_2(\text{Patch Area}_i) + f_3(\text{Distance}_i) + \varepsilon_i$$
The index i refers to forest patch, where i = 1, ..., 56. One option is to make a scatterplot (pairs) of the data, but the problem is that these plots only show pair-wise relationships. The additive model overcomes this by estimating the effect of each explanatory variable while accounting for the others. The estimated smoothing curves and 95% point-wise confidence intervals are presented in Figure 7. The effect of Area is slightly non-linear (though this is only due to one site), whereas distance and year show a linear relationship.
Figure 7. Results of additive modelling for the bird abundance data using 4 degrees of freedom for each smoother.
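Additive models are commonly fitted with the backfitting algorithm: each smoother is re-estimated in turn on the partial residuals left by the intercept and the other smoothers. The sketch below illustrates this under explicit assumptions. It uses a crude running-mean smoother in place of the loess smoothers mentioned above, the function names are made up for this example, and the data are hypothetical values, not the Loyn (1987) data. It does not reproduce the degrees of freedom or the confidence bands of Figure 7.

```python
def running_mean_smooth(x, r, window=3):
    """Smooth r against x: each fitted value is the mean of r over the
    `window` observations nearest in x-rank (fewer at the edges)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = window // 2
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        neighbours = [r[order[k]] for k in range(lo, hi)]
        fitted[i] = sum(neighbours) / len(neighbours)
    return fitted


def backfit(y, predictors, n_iter=20, window=3):
    """Fit y_i = alpha + sum_j f_j(x_ij) + error by backfitting.

    Returns alpha and a list with the fitted values of each f_j."""
    n = len(y)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in predictors]
    for _ in range(n_iter):
        for j, xj in enumerate(predictors):
            # Partial residuals: remove the intercept and all other smoothers.
            partial = [
                y[i] - alpha
                - sum(f[k][i] for k in range(len(predictors)) if k != j)
                for i in range(n)
            ]
            fj = running_mean_smooth(xj, partial, window)
            mean_fj = sum(fj) / n
            f[j] = [v - mean_fj for v in fj]  # centre each smoother at zero
    return alpha, f


# Hypothetical patch data: abundance rises with area and falls with year.
area = [1, 2, 3, 4, 5, 6, 7, 8]
year = [8, 3, 6, 1, 7, 2, 5, 4]
bird = [10 + 2 * a - t for a, t in zip(area, year)]

alpha, smoothers = backfit(bird, [area, year])
print("alpha:", alpha)
print("f_area:", [round(v, 2) for v in smoothers[0]])
print("f_year:", [round(v, 2) for v in smoothers[1]])
```

Plotting each list of fitted values against its explanatory variable gives curves analogous to the panels of Figure 7.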
Brodgar contains hierarchical clustering. The process consists of the following steps.
- Choose whether clustering should be applied on the samples or on the variables.
- Choose a measure of similarity. The following options are available: Jaccard coefficient, Community coefficient, Similarity ratio, Percentage similarity, Ochiai coefficient, Chord distance, Euclidean distance, squared Euclidean distance, correlation coefficient, covariance coefficient, maximum distance, Manhattan distance, Canberra coefficient, and binary distance. Some of these coefficients treat the data as presence/absence (e.g. Jaccard, community coefficient, binary). An excellent description of these measures of similarity can be found in Jongman et al. (1995), and Legendre and Legendre (1998).
- Choose an agglomeration method. This determines how groups are connected into new groups. We advise using "average".
- Select samples and variables. The buttons "Select all variables" and "Select all samples" can be used as well.
Figure 9 shows an example for the Dune Meadow data. Hierarchical clustering using the Jaccard index and average linkage was used.
Figure 9. Dendrogram for Dune Meadow data. Clustering was applied on the samples.
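The combination used for Figure 9 (Jaccard index with average linkage) can be sketched as follows. This is a minimal illustration rather than Brodgar's implementation: the function names are hypothetical, the abundance table is made up (four sites, four species), and the Jaccard coefficient is turned into a distance as one minus the coefficient, with any non-zero abundance counted as presence.

```python
def jaccard_distance(a, b):
    """1 - Jaccard coefficient, treating non-zero abundances as presence."""
    both = sum(1 for u, v in zip(a, b) if u > 0 and v > 0)
    either = sum(1 for u, v in zip(a, b) if u > 0 or v > 0)
    return 0.0 if either == 0 else 1.0 - both / either


def average_linkage(samples):
    """Agglomerative clustering of samples with average linkage.

    Returns the merges in order as (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(samples))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all between-cluster pairs.
                dists = [jaccard_distance(samples[i], samples[j])
                         for i in clusters[a] for j in clusters[b]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Hypothetical abundances: rows are sites, columns are species.
sites = [
    [3, 1, 0, 0],
    [2, 4, 0, 0],
    [0, 0, 5, 1],
    [0, 1, 2, 2],
]

merges = average_linkage(sites)
for left, right, d in merges:
    print(f"merge {left} and {right} at distance {d:.3f}")
```

The merge distances correspond to the heights at which branches join in a dendrogram such as Figure 9.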
Jongman, R.H.G., Ter Braak, C.J.F. and van Tongeren, O.F.R. (1995). Data analysis in community and landscape ecology. Cambridge University Press, Cambridge.
Legendre, P. and Legendre, L. (1998). Numerical Ecology. Second English Edition. Elsevier Science B.V.
Loyn, R.H. (1987). Effects of patch area and habitat on bird abundances, species numbers and tree health in fragmented Victorian forests. In: Nature Conservation: The role of Remnants of Native Vegetation (Saunders, D.A., Arnold, G.W., Burbidge, A.A. and Hopkins A.J.M. eds.). pp. 65-77. Surrey Beatty and Sons, Chipping Norton, NSW.
Quinn, G.P. and Keough, M.J. (2002). Experimental design and data analysis for biologists. Cambridge University Press.