Skip to contents

Compute pairwise similarity metrics

The function similarity computes well-known and customized pairwise similarity metrics based on a co-occurrence matrix, such as vegemat. In the example below, the Simpson similarity index is computed between each pair of sites.

sim <- similarity(vegemat, 
                  metric = "Simpson", 
                  formula = NULL, 
                  method = "prodmat")
sim[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of similarity metrics:  1 
## 
## 
##    Site1 Site2   Simpson
## 2     35    36 0.9767442
## 3     35    37 0.9689922
## 4     35    38 0.9457364
## 5     35    39 0.9457364
## 6     35    84 0.2790698
## 7     35    85 0.9147287
## 8     35    86 1.0000000
## 9     35    87 0.9922481
## 10    35    88 0.9844961
## 11    35    89 0.6821705

The resulting data.frame is stored in a bioregion.pairwise.metric object, which contains the Simpson similarity metric between each pair of sites. The function similarity can handle three types of metrics: metrics based on abc, metrics based on ABC, and one metric based on the Euclidean distance.

The first kind of metrics such as Jaccard, the turnover component of Jaccard (Baselga, 2012), Simpson or Sørensen are based on presence data with \(a\) the number of species shared by a pair of sites, \(b\) species only present in the first site and \(c\) species only present in the second site.

\[\displaystyle Jaccard = 1 - \frac{b + c}{a + b + c}\] \[\displaystyle Jaccardturn = 1 - \frac{2 \cdot min(b, c)}{a + 2 \cdot min(b, c)}\] \[\displaystyle Sorensen = 1 - \frac{b + c}{2 \cdot a + b + c}\] \[\displaystyle Simpson = 1 - \frac{min(b, c)}{a + min(b, c)}\] Two methods can be used to compute the abc based metrics. The first method is based on a matrix product (performed with the tcrossprod function from the R package Matrix). The method is fast but is greedy in memory… The second method is based on a three loops function coded in C++ and largely inspired by the bcdist function from the R package ecodist (version 2.0.7). It is less efficient than the matrix product but allows to handle co-occurrence matrix with a large number of sites and/or species.

The second kind of metrics such as Bray-Curtis and the turnover component of Bray-Curtis (Baselga, 2012) are based on abundance data with \(A\) the sum of the lesser values for common species shared by a pair of sites. \(B\) and \(C\) are the total number of specimens counted at both sites minus \(A\). Only three loops function is available for the ABC based metrics.

\[\displaystyle Bray = 1 - \frac{B + C}{2 \cdot A + B + C}\] \[\displaystyle Brayturn = 1 - \frac{min(B, C)}{A + min(B, C)}\]

The main advantage of the similarity function is to compute and return several metrics, to allow the computation of customized metric with the formula argument and to include the possibility of returning the quantities \(a\), \(b\) and \(c\) and/or \(A\), \(B\) and \(C\). This feature is particularly interesting to compute similarity metrics on large co-occurrence matrix.

sim <- similarity(vegemat, 
                  metric = c("abc","ABC","Simpson","Bray"), 
                  formula = c("1 - pmin(b,c) / (a + pmin(b,c))", "1 - (B + C) / (2*A + B + C)"))
sim[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of similarity metrics:  4 
## 
## 
##    Site1 Site2   Simpson       Bray   a  b   c   A   B     C
## 2     35    36 0.9767442 0.01901485 126  3 741 420   3 43333
## 3     35    37 0.9689922 0.03745203 125  4 534 366  57 18756
## 4     35    38 0.9457364 0.04025289 122  7 440 347  76 16471
## 5     35    39 0.9457364 0.09754761 122  7 501 356  67  6520
## 6     35    84 0.2790698 0.18757921  36 93 177  74 349   292
## 7     35    85 0.9147287 0.13256181 118 11 614 378  45  4902
## 8     35    86 1.0000000 0.02663928 129  0 753 415   8 30319
## 9     35    87 0.9922481 0.02332663 128  1 909 406  17 33981
## 10    35    88 0.9844961 0.02198536 127  2 812 395  28 35115
## 11    35    89 0.6821705 0.15954416  88 41 177 196 227  1838
##    1 - pmin(b,c) / (a + pmin(b,c)) 1 - (B + C) / (2*A + B + C)
## 2                        0.9767442                  0.01901485
## 3                        0.9689922                  0.03745203
## 4                        0.9457364                  0.04025289
## 5                        0.9457364                  0.09754761
## 6                        0.2790698                  0.18757921
## 7                        0.9147287                  0.13256181
## 8                        1.0000000                  0.02663928
## 9                        0.9922481                  0.02332663
## 10                       0.9844961                  0.02198536
## 11                       0.6821705                  0.15954416

It is also possible to compute Euclidean similarity between each pair of sites following this equation:

\[\displaystyle Euclidean = \frac{1}{1 + d_{ij}}\] where \(d_{ij}\) is the Euclidean distance between site \(i\) and site \(j\) in terms of species composition.

Compute pairwise dissimilarity metrics

The dissimilarity function is very similar, with the sole exception that it computes the dissimilarity version of the available metrics. It corresponds to 1 minus the similarity metrics, except for the Euclidean dissimilarity, which corresponds to the Euclidean distance (i.e. \(d_{ij}\)).

dissim <- dissimilarity(vegemat, 
                        metric = "Sorensen", 
                        formula = "(b + c) / (2*a + b + c)")
dissim[1:10,]
## Data.frame of dissimilarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of dissimilarity metrics:  2 
## 
## 
##    Site1 Site2  Sorensen (b + c) / (2*a + b + c)
## 2     35    36 0.7469880               0.7469880
## 3     35    37 0.6827411               0.6827411
## 4     35    38 0.6468886               0.6468886
## 5     35    39 0.6755319               0.6755319
## 6     35    84 0.7894737               0.7894737
## 7     35    85 0.7259001               0.7259001
## 8     35    86 0.7448071               0.7448071
## 9     35    87 0.7804460               0.7804460
## 10    35    88 0.7621723               0.7621723
## 11    35    89 0.5532995               0.5532995

From similarity to dissimilarity (and back again)

The functions similarity_to_dissimilarity and dissimilarity_to_similarity and can be used to switch easily switch between similarity and dissimilarity metrics.

sim1 <- similarity(vegemat, 
                  metric = c("abc","Sorensen"), 
                  formula = "1 - (b + c) / (2*a + b + c)")
sim1[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of similarity metrics:  2 
## 
## 
##    Site1 Site2  Sorensen   a  b   c 1 - (b + c) / (2*a + b + c)
## 2     35    36 0.2530120 126  3 741                   0.2530120
## 3     35    37 0.3172589 125  4 534                   0.3172589
## 4     35    38 0.3531114 122  7 440                   0.3531114
## 5     35    39 0.3244681 122  7 501                   0.3244681
## 6     35    84 0.2105263  36 93 177                   0.2105263
## 7     35    85 0.2740999 118 11 614                   0.2740999
## 8     35    86 0.2551929 129  0 753                   0.2551929
## 9     35    87 0.2195540 128  1 909                   0.2195540
## 10    35    88 0.2378277 127  2 812                   0.2378277
## 11    35    89 0.4467005  88 41 177                   0.4467005
dissim1 <- dissimilarity(vegemat, 
                         metric = c("abc","Sorensen"), 
                         formula = "(b + c) / (2*a + b + c)")
dissim1[1:10,]
## Data.frame of dissimilarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of dissimilarity metrics:  2 
## 
## 
##    Site1 Site2  Sorensen   a  b   c (b + c) / (2*a + b + c)
## 2     35    36 0.7469880 126  3 741               0.7469880
## 3     35    37 0.6827411 125  4 534               0.6827411
## 4     35    38 0.6468886 122  7 440               0.6468886
## 5     35    39 0.6755319 122  7 501               0.6755319
## 6     35    84 0.7894737  36 93 177               0.7894737
## 7     35    85 0.7259001 118 11 614               0.7259001
## 8     35    86 0.7448071 129  0 753               0.7448071
## 9     35    87 0.7804460 128  1 909               0.7804460
## 10    35    88 0.7621723 127  2 812               0.7621723
## 11    35    89 0.5532995  88 41 177               0.5532995
dissim2 <- similarity_to_dissimilarity(sim1)
dissim2[1:10,]
## Data.frame of dissimilarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of dissimilarity metrics:  2 
## 
## 
##    Site1 Site2  Sorensen   a  b   c 1 - (b + c) / (2*a + b + c)
## 2     35    36 0.7469880 126  3 741                   0.7469880
## 3     35    37 0.6827411 125  4 534                   0.6827411
## 4     35    38 0.6468886 122  7 440                   0.6468886
## 5     35    39 0.6755319 122  7 501                   0.6755319
## 6     35    84 0.7894737  36 93 177                   0.7894737
## 7     35    85 0.7259001 118 11 614                   0.7259001
## 8     35    86 0.7448071 129  0 753                   0.7448071
## 9     35    87 0.7804460 128  1 909                   0.7804460
## 10    35    88 0.7621723 127  2 812                   0.7621723
## 11    35    89 0.5532995  88 41 177                   0.5532995
sim2 <- dissimilarity_to_similarity(dissim1)
sim2[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of similarity metrics:  2 
## 
## 
##    Site1 Site2  Sorensen   a  b   c (b + c) / (2*a + b + c)
## 2     35    36 0.2530120 126  3 741               0.2530120
## 3     35    37 0.3172589 125  4 534               0.3172589
## 4     35    38 0.3531114 122  7 440               0.3531114
## 5     35    39 0.3244681 122  7 501               0.3244681
## 6     35    84 0.2105263  36 93 177               0.2105263
## 7     35    85 0.2740999 118 11 614               0.2740999
## 8     35    86 0.2551929 129  0 753               0.2551929
## 9     35    87 0.2195540 128  1 909               0.2195540
## 10    35    88 0.2378277 127  2 812               0.2378277
## 11    35    89 0.4467005  88 41 177               0.4467005

Comparison with other R packages

bioregion is not the only R package that allows the computation of similarity, dissimilarity, distance, and \(\beta\)-diversity metrics based on a (site-species) co-occurrence matrix. In this section, we focus on several functions provided by the packages adespatial, betapart, ecodist, and vegan.

The table below displays the main differences between these packages in terms of dissimilarity metric computation. The bioregion package is the only one that allows the computation of several metrics returned as a data.frame in a network format. It also supports the use of custom formulas based on the a, b, c, A, B and C components. This flexibility is also available in vegan via the designdist function.

bioregion also allows the computation of the a, b, c, A, B and C components. Users can choose to extract none of the components, only the three abc components, only the three ABC components, or all six components. These components (abc or ABC separately) can also be accessed in adespatial using the beta.div.comp function, and in betapart using betapart.core and betapart.core.abund. Finally, the vegan package, via the designdist function, also allows the extraction of the a, b, c, A, B or C components separately.

adespatial (0.3-28) betapart (1.6) bioregion (1.2.0) ecodist (2.1.3) vegan (2.6-10)
Format matrix matrix data.frame (network) matrix matrix
Custom formulas no no yes no yes
abc/ABC yes (abc or ABC) yes (abc or ABC) yes (abc and ABC) no yes (a, b, c, A, B or C)
Several metrics no no yes no no

Basic comparison (Jaccard and Bray-Curtis)

We first present a basic comparison of the computation of two well-known metrics, Jaccard and Bray-Curtis, based on a small co-occurrence matrix.

nbsite <- 100
nbsp <- 200

comat <- matrix(runif(nbsite*nbsp), nbsite, nbsp)
rownames(comat) <- paste0("s", 1:nbsite)
colnames(comat) <- paste0("sp", 1:nbsp)

comatbin <- comat
comatbin[comat > 0.7] <- 1
comatbin[comat <= 0.7] <- 0

Jaccard

We propose below a comparison of computation time for generating a Jaccard dissimilarity matrix based on the comatbin defined above. It should be noted that the output of beta.div and dissimilarityrequires an additional step to obtain a dissimilarity matrix. Specifically, beta.div returns the root mean square of the desired quantity, and as shown above, dissimilarity outputs the metric in a network (data.frame) format, as it supports the computation of multiple metrics.

The beta.pair function is used for the betapart package. Two functions, distance and bcdist, are used from the ecodist package. We also use the two available functions, vegdist and designdist, from the vegan package.

dissim <- list()

comp_j <- microbenchmark(
  adespatial = { d <- beta.div(comatbin, 
                               method = "jaccard",
                               samp = FALSE,
                               nperm = 1,
                               save.D = TRUE) 
                 d <- as.matrix(d$D)
                 d <- d*d
                 dissim$adespatial <- d},
  betapart = { d <- beta.pair(comatbin, 
                              index.family = "jaccard")
               d <- as.matrix(d$beta.jac)
               dissim$betapart <- d},
  bioregion = { d <- dissimilarity(comatbin, 
                                   metric = "Jaccard")
                d <- net_to_mat(d,   
                                weight = TRUE,
                                squared = TRUE,
                                symmetrical = TRUE)
                     dissim$bioregion <- d },
  ecodist_dist = { d <- distance(comatbin, 
                                 method = "jaccard")
                   d <- as.matrix(d)
                   dissim$ecodist_dist <- d },
  ecodist_bcdist = { d <- bcdist(comatbin)
                     d <- as.matrix(d)
                     dissim$ecodist_bcdist <- d },
  vegan_veg = { d <- vegdist(comatbin,
                             method = "jaccard")
                d <- as.matrix(d)
                dissim$vegan_veg <- d },
  vegan_design = { d <- designdist(comatbin,
                                   method = "(A+B-2*J)/(A+B-J)",
                                   terms = "binary")
                   d <- as.matrix(d)
                   dissim$vegan_design <- d },
                      
  times = 10
)

After ensuring that all the functions return the same dissimilarity matrix,

all_identical <- all(
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$betapart, digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$bioregion, digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$ecodist_dist , digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$ecodist_bcdist , digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$vegan_veg, digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$vegan_design, digits=4))
)
print(all_identical)
## [1] TRUE

we proceed with a comparison of their computation times.

comp_j
## Unit: milliseconds
##            expr        min         lq       mean     median         uq
##      adespatial 266.907806 273.464811 276.157895 275.234227 278.524567
##        betapart   1.719514   1.819800   1.895796   1.850317   1.945454
##       bioregion  14.264644  14.600399  15.489273  15.442287  16.277023
##    ecodist_dist 149.756243 153.309953 184.150184 157.443764 166.562952
##  ecodist_bcdist   3.334351   3.591531   3.578092   3.620690   3.639270
##       vegan_veg   1.275797   1.314749   2.018940   2.271892   2.296659
##    vegan_design   1.271639   1.356777   1.443011   1.403630   1.421038
##         max neval
##  285.461520    10
##    2.219655    10
##   16.927915    10
##  417.257313    10
##    3.753963    10
##    2.646380    10
##    2.043176    10

On this very small example, the functions from vegan outperform the others in terms of computation time. They are followed by bcdist and beta.pair. Our dissimilarity function performs slightly slower, partly due to the additional reformatting step (from matrix to network and back). The distance and beta.div functions rank significantly lower in terms of speed.

Bray-Curtis

We present below a similar example using the Bray-Curtis dissimilarity metric, based on the comat defined above.

dissim <- list()

comp_bc <- microbenchmark(
  adespatial = { d <- beta.div(comat, 
                               method = "percentdiff",
                               samp = FALSE,
                               nperm = 1,
                               save.D = TRUE) 
                 d <- as.matrix(d$D)
                 d <- d*d
                 dissim$adespatial <- d},
  betapart = { d <- beta.pair.abund(comat, 
                                    index.family = "bray")
               d <- as.matrix(d$beta.bray)
               dissim$betapart <- d},
  bioregion = { d <- dissimilarity(comatbin, 
                                    metric = "Bray")
                 d <- net_to_mat(d,   
                                 weight = TRUE,
                                 squared = TRUE,
                                 symmetrical = TRUE)
                 dissim$bioregion <- d },
  ecodist_dist = { d <- distance(comat, 
                                 method = "bray-curtis")
                   d <- as.matrix(d)
                   dissim$ecodist_dist <- d },
  ecodist_bcdist = { d <- bcdist(comat)
                     d <- as.matrix(d)
                     dissim$ecodist_bcdist <- d },
  vegan_veg = { d <- vegdist(comatbin,
                             method = "bray")
                d <- as.matrix(d)
                dissim$vegan_veg <- d },
  vegan_design = { d <- designdist(comat,
                                   method = "(A+B-2*J)/(A+B)",
                                   terms = "minimum")
                   d <- as.matrix(d)
                   dissim$vegan_design <- d },
                      
  times = 10
)

Here again after ensuring that all the functions return the same dissimilarity matrix,

all_identical <- all(
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$betapart, digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$bioregion, digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$ecodist_dist , digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$ecodist_bcdist , digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$vegan_veg, digits=4)),
  identical(trunc(dissim$adespatial, digits=4), 
            trunc(dissim$vegan_design, digits=4))
)
print(all_identical)
## [1] TRUE

we proceed with a comparison of their computation times.

comp_bc
## Unit: milliseconds
##            expr        min         lq       mean     median         uq
##      adespatial 274.685615 277.470533 286.220730 284.254156 292.046957
##        betapart 303.083768 318.810145 329.650220 324.713853 348.828076
##       bioregion   9.987406  10.436712  11.505953  10.736557  13.388361
##    ecodist_dist  80.630066  89.549193 106.905346 104.900510 128.176130
##  ecodist_bcdist   5.930017   5.988696   6.084050   6.037327   6.207645
##       vegan_veg   1.253766   1.268053   1.776226   1.771245   2.279186
##    vegan_design   1.304370   1.366526   1.483375   1.455547   1.594911
##         max neval
##  307.025081    10
##  353.413589    10
##   14.040496    10
##  129.245964    10
##    6.245815    10
##    2.320954    10
##    1.681292    10

The functions from vegan continue to outperform the others. They are again followed by bcdist, while our dissimilarity function remains slightly slower. The distance and beta.div functions, joined by beta.pair.abund, rank significantly lower in terms of computation speed.

Systematic comparison (Jaccard and Bray-Curtis)

These differences in computation times become noticeable only for (site-species) co-occurrence matrix objects of significant size. To better understand the computation times for such matrices, we perform a similar experiment using a subset of functions (beta.pair and beta.pair.abund, dissimilarity, bcdist, vegdist, and designdist) on datasets with varying numbers of sites and species.

The plot below shows the results obtained for the Jaccard dissimilarity metric based on 16 combinations of numbers of sites and species (chosen among 500, 1000, 2000, and 5000), averaged over three replications. We also include the results from a single simulation with 10,000 sites and species.

A similar plot showing the results obtained for the Bray-Curtis dissimilarity metric is available below. In this case, the results based on a single simulation with 10,000 sites and species exclude beta.pair.abund because it took too much time to compute.


Reference

Baselga, A. (2012). The relationship between species replacement, dissimilarity derived from nestedness, and nestedness. Global Ecology and Biogeography, 21(12), 1223–1232.