3. Pairwise similarity/dissimilarity metrics
Maxime Lenormand, Boris Leroy and Pierre Denelle
2025-05-22
Source:vignettes/a3_pairwise_metrics.Rmd
a3_pairwise_metrics.Rmd
Compute pairwise similarity metrics
The function similarity
computes well-known and customized pairwise similarity metrics based on
a co-occurrence matrix
, such as vegemat.
In the example below, the Simpson
similarity index is
computed between each pair of sites.
sim <- similarity(vegemat,
metric = "Simpson",
formula = NULL,
method = "prodmat")
sim[1:10,]
## Data.frame of similarity between sites
## - Total number of sites: 715
## - Total number of species: 3697
## - Number of rows: 255255
## - Number of similarity metrics: 1
##
##
## Site1 Site2 Simpson
## 2 35 36 0.9767442
## 3 35 37 0.9689922
## 4 35 38 0.9457364
## 5 35 39 0.9457364
## 6 35 84 0.2790698
## 7 35 85 0.9147287
## 8 35 86 1.0000000
## 9 35 87 0.9922481
## 10 35 88 0.9844961
## 11 35 89 0.6821705
The resulting data.frame
is stored in a
bioregion.pairwise.metric
object, which contains the
Simpson similarity metric between each pair of sites. The function similarity
can handle three types of metrics: metrics based on abc
,
metrics based on ABC
, and one metric based on the Euclidean
distance.
The first kind of metrics such as Jaccard, the turnover component of Jaccard (Baselga, 2012), Simpson or Sørensen are based on presence data with \(a\) the number of species shared by a pair of sites, \(b\) species only present in the first site and \(c\) species only present in the second site.
\[\displaystyle Jaccard = 1 - \frac{b +
c}{a + b + c}\] \[\displaystyle
Jaccardturn = 1 - \frac{2 \cdot min(b, c)}{a + 2 \cdot min(b,
c)}\] \[\displaystyle Sorensen = 1 -
\frac{b + c}{2 \cdot a + b + c}\] \[\displaystyle Simpson = 1 - \frac{min(b, c)}{a +
min(b, c)}\] Two methods can be used to compute the
abc
based metrics. The first method is based on a matrix
product (performed with the tcrossprod
function from the R package Matrix).
The method is fast but is greedy in memory… The second method is based
on a three
loops function coded in C++ and largely inspired by the bcdist
function from the R package ecodist
(version 2.0.7). It is less efficient than the matrix product but allows
to handle co-occurrence matrix with a large number of sites and/or
species.
The second kind of metrics such as Bray-Curtis and the turnover
component of Bray-Curtis (Baselga, 2012)
are based on abundance data with \(A\)
the sum of the lesser values for common species shared by a pair of
sites. \(B\) and \(C\) are the total number of specimens
counted at both sites minus \(A\). Only
three loops function is available for the ABC
based
metrics.
\[\displaystyle Bray = 1 - \frac{B + C}{2 \cdot A + B + C}\] \[\displaystyle Brayturn = 1 - \frac{min(B, C)}{A + min(B, C)}\]
The main advantage of the similarity
function is to compute and return several metrics, to allow the
computation of customized metric with the formula
argument
and to include the possibility of returning the quantities \(a\), \(b\)
and \(c\) and/or \(A\), \(B\)
and \(C\). This feature is particularly
interesting to compute similarity metrics on large co-occurrence
matrix.
sim <- similarity(vegemat,
metric = c("abc","ABC","Simpson","Bray"),
formula = c("1 - pmin(b,c) / (a + pmin(b,c))", "1 - (B + C) / (2*A + B + C)"))
sim[1:10,]
## Data.frame of similarity between sites
## - Total number of sites: 715
## - Total number of species: 3697
## - Number of rows: 255255
## - Number of similarity metrics: 4
##
##
## Site1 Site2 Simpson Bray a b c A B C
## 2 35 36 0.9767442 0.01901485 126 3 741 420 3 43333
## 3 35 37 0.9689922 0.03745203 125 4 534 366 57 18756
## 4 35 38 0.9457364 0.04025289 122 7 440 347 76 16471
## 5 35 39 0.9457364 0.09754761 122 7 501 356 67 6520
## 6 35 84 0.2790698 0.18757921 36 93 177 74 349 292
## 7 35 85 0.9147287 0.13256181 118 11 614 378 45 4902
## 8 35 86 1.0000000 0.02663928 129 0 753 415 8 30319
## 9 35 87 0.9922481 0.02332663 128 1 909 406 17 33981
## 10 35 88 0.9844961 0.02198536 127 2 812 395 28 35115
## 11 35 89 0.6821705 0.15954416 88 41 177 196 227 1838
## 1 - pmin(b,c) / (a + pmin(b,c)) 1 - (B + C) / (2*A + B + C)
## 2 0.9767442 0.01901485
## 3 0.9689922 0.03745203
## 4 0.9457364 0.04025289
## 5 0.9457364 0.09754761
## 6 0.2790698 0.18757921
## 7 0.9147287 0.13256181
## 8 1.0000000 0.02663928
## 9 0.9922481 0.02332663
## 10 0.9844961 0.02198536
## 11 0.6821705 0.15954416
It is also possible to compute Euclidean similarity between each pair of sites following this equation:
\[\displaystyle Euclidean = \frac{1}{1 + d_{ij}}\] where \(d_{ij}\) is the Euclidean distance between site \(i\) and site \(j\) in terms of species composition.
Compute pairwise dissimilarity metrics
The dissimilarity function is very similar, with the sole exception that it computes the dissimilarity version of the available metrics. It corresponds to 1 minus the similarity metrics, except for the Euclidean dissimilarity, which corresponds to the Euclidean distance (i.e. \(d_{ij}\)).
dissim <- dissimilarity(vegemat,
metric = "Sorensen",
formula = "(b + c) / (2*a + b + c)")
dissim[1:10,]
## Data.frame of dissimilarity between sites
## - Total number of sites: 715
## - Total number of species: 3697
## - Number of rows: 255255
## - Number of dissimilarity metrics: 2
##
##
## Site1 Site2 Sorensen (b + c) / (2*a + b + c)
## 2 35 36 0.7469880 0.7469880
## 3 35 37 0.6827411 0.6827411
## 4 35 38 0.6468886 0.6468886
## 5 35 39 0.6755319 0.6755319
## 6 35 84 0.7894737 0.7894737
## 7 35 85 0.7259001 0.7259001
## 8 35 86 0.7448071 0.7448071
## 9 35 87 0.7804460 0.7804460
## 10 35 88 0.7621723 0.7621723
## 11 35 89 0.5532995 0.5532995
From similarity to dissimilarity (and back again)
The functions similarity_to_dissimilarity and dissimilarity_to_similarity and can be used to switch easily switch between similarity and dissimilarity metrics.
sim1 <- similarity(vegemat,
metric = c("abc","Sorensen"),
formula = "1 - (b + c) / (2*a + b + c)")
sim1[1:10,]
## Data.frame of similarity between sites
## - Total number of sites: 715
## - Total number of species: 3697
## - Number of rows: 255255
## - Number of similarity metrics: 2
##
##
## Site1 Site2 Sorensen a b c 1 - (b + c) / (2*a + b + c)
## 2 35 36 0.2530120 126 3 741 0.2530120
## 3 35 37 0.3172589 125 4 534 0.3172589
## 4 35 38 0.3531114 122 7 440 0.3531114
## 5 35 39 0.3244681 122 7 501 0.3244681
## 6 35 84 0.2105263 36 93 177 0.2105263
## 7 35 85 0.2740999 118 11 614 0.2740999
## 8 35 86 0.2551929 129 0 753 0.2551929
## 9 35 87 0.2195540 128 1 909 0.2195540
## 10 35 88 0.2378277 127 2 812 0.2378277
## 11 35 89 0.4467005 88 41 177 0.4467005
dissim1 <- dissimilarity(vegemat,
metric = c("abc","Sorensen"),
formula = "(b + c) / (2*a + b + c)")
dissim1[1:10,]
## Data.frame of dissimilarity between sites
## - Total number of sites: 715
## - Total number of species: 3697
## - Number of rows: 255255
## - Number of dissimilarity metrics: 2
##
##
## Site1 Site2 Sorensen a b c (b + c) / (2*a + b + c)
## 2 35 36 0.7469880 126 3 741 0.7469880
## 3 35 37 0.6827411 125 4 534 0.6827411
## 4 35 38 0.6468886 122 7 440 0.6468886
## 5 35 39 0.6755319 122 7 501 0.6755319
## 6 35 84 0.7894737 36 93 177 0.7894737
## 7 35 85 0.7259001 118 11 614 0.7259001
## 8 35 86 0.7448071 129 0 753 0.7448071
## 9 35 87 0.7804460 128 1 909 0.7804460
## 10 35 88 0.7621723 127 2 812 0.7621723
## 11 35 89 0.5532995 88 41 177 0.5532995
dissim2 <- similarity_to_dissimilarity(sim1)
dissim2[1:10,]
## Data.frame of dissimilarity between sites
## - Total number of sites: 715
## - Total number of species: 3697
## - Number of rows: 255255
## - Number of dissimilarity metrics: 2
##
##
## Site1 Site2 Sorensen a b c 1 - (b + c) / (2*a + b + c)
## 2 35 36 0.7469880 126 3 741 0.7469880
## 3 35 37 0.6827411 125 4 534 0.6827411
## 4 35 38 0.6468886 122 7 440 0.6468886
## 5 35 39 0.6755319 122 7 501 0.6755319
## 6 35 84 0.7894737 36 93 177 0.7894737
## 7 35 85 0.7259001 118 11 614 0.7259001
## 8 35 86 0.7448071 129 0 753 0.7448071
## 9 35 87 0.7804460 128 1 909 0.7804460
## 10 35 88 0.7621723 127 2 812 0.7621723
## 11 35 89 0.5532995 88 41 177 0.5532995
sim2 <- dissimilarity_to_similarity(dissim1)
sim2[1:10,]
## Data.frame of similarity between sites
## - Total number of sites: 715
## - Total number of species: 3697
## - Number of rows: 255255
## - Number of similarity metrics: 2
##
##
## Site1 Site2 Sorensen a b c (b + c) / (2*a + b + c)
## 2 35 36 0.2530120 126 3 741 0.2530120
## 3 35 37 0.3172589 125 4 534 0.3172589
## 4 35 38 0.3531114 122 7 440 0.3531114
## 5 35 39 0.3244681 122 7 501 0.3244681
## 6 35 84 0.2105263 36 93 177 0.2105263
## 7 35 85 0.2740999 118 11 614 0.2740999
## 8 35 86 0.2551929 129 0 753 0.2551929
## 9 35 87 0.2195540 128 1 909 0.2195540
## 10 35 88 0.2378277 127 2 812 0.2378277
## 11 35 89 0.4467005 88 41 177 0.4467005
Comparison with other R packages
bioregion
is not the only R package that allows the
computation of similarity, dissimilarity, distance, and \(\beta\)-diversity metrics based on a
(site-species) co-occurrence matrix
. In this section, we
focus on several functions provided by the packages adespatial, betapart, ecodist, and vegan.
The table below displays the main differences between these packages
in terms of dissimilarity metric computation. The bioregion
package is the only one that allows the computation of several metrics
returned as a data.frame
in a network format. It also
supports the use of custom formulas based on the a
,
b
, c
, A
, B
and
C
components. This flexibility is also available in vegan via the
designdist
function.
bioregion
also allows the computation of the
a
, b
, c
, A
,
B
and C
components. Users can choose to
extract none of the components, only the three abc
components, only the three ABC
components, or all six
components. These components (abc
or ABC
separately) can also be accessed in adespatial
using the beta.div.comp
function, and in betapart using
betapart.core
and betapart.core.abund
.
Finally, the vegan package, via
the designdist
function, also allows the extraction of the
a
, b
, c
, A
,
B
or C
components separately.
adespatial (0.3-28) | betapart (1.6) | bioregion (1.2.0) | ecodist (2.1.3) | vegan (2.6-10) | |
---|---|---|---|---|---|
Format | matrix | matrix | data.frame (network) | matrix | matrix |
Custom formulas | no | no | yes | no | yes |
abc/ABC | yes (abc or ABC) | yes (abc or ABC) | yes (abc and ABC) | no | yes (a, b, c, A, B or C) |
Several metrics | no | no | yes | no | no |
Basic comparison (Jaccard and Bray-Curtis)
We first present a basic comparison of the computation of two
well-known metrics, Jaccard and Bray-Curtis, based on a small
co-occurrence matrix
.
nbsite <- 100
nbsp <- 200
comat <- matrix(runif(nbsite*nbsp), nbsite, nbsp)
rownames(comat) <- paste0("s", 1:nbsite)
colnames(comat) <- paste0("sp", 1:nbsp)
comatbin <- comat
comatbin[comat > 0.7] <- 1
comatbin[comat <= 0.7] <- 0
Jaccard
We propose below a comparison of computation time for generating a
Jaccard dissimilarity matrix based on the comatbin
defined
above. It should be noted that the output of beta.div
and
dissimilarity
requires an additional step to obtain a
dissimilarity matrix. Specifically, beta.div
returns the
root mean square of the desired quantity, and as shown above,
dissimilarity
outputs the metric in a network
(data.frame
) format, as it supports the computation of
multiple metrics.
The beta.pair
function is used for the betapart package.
Two functions, distance
and bcdist
, are used
from the ecodist package.
We also use the two available functions, vegdist
and
designdist
, from the vegan package.
dissim <- list()
comp_j <- microbenchmark(
adespatial = { d <- beta.div(comatbin,
method = "jaccard",
samp = FALSE,
nperm = 1,
save.D = TRUE)
d <- as.matrix(d$D)
d <- d*d
dissim$adespatial <- d},
betapart = { d <- beta.pair(comatbin,
index.family = "jaccard")
d <- as.matrix(d$beta.jac)
dissim$betapart <- d},
bioregion = { d <- dissimilarity(comatbin,
metric = "Jaccard")
d <- net_to_mat(d,
weight = TRUE,
squared = TRUE,
symmetrical = TRUE)
dissim$bioregion <- d },
ecodist_dist = { d <- distance(comatbin,
method = "jaccard")
d <- as.matrix(d)
dissim$ecodist_dist <- d },
ecodist_bcdist = { d <- bcdist(comatbin)
d <- as.matrix(d)
dissim$ecodist_bcdist <- d },
vegan_veg = { d <- vegdist(comatbin,
method = "jaccard")
d <- as.matrix(d)
dissim$vegan_veg <- d },
vegan_design = { d <- designdist(comatbin,
method = "(A+B-2*J)/(A+B-J)",
terms = "binary")
d <- as.matrix(d)
dissim$vegan_design <- d },
times = 10
)
After ensuring that all the functions return the same dissimilarity matrix,
all_identical <- all(
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$betapart, digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$bioregion, digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$ecodist_dist , digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$ecodist_bcdist , digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$vegan_veg, digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$vegan_design, digits=4))
)
print(all_identical)
## [1] TRUE
we proceed with a comparison of their computation times.
comp_j
## Unit: milliseconds
## expr min lq mean median uq
## adespatial 266.907806 273.464811 276.157895 275.234227 278.524567
## betapart 1.719514 1.819800 1.895796 1.850317 1.945454
## bioregion 14.264644 14.600399 15.489273 15.442287 16.277023
## ecodist_dist 149.756243 153.309953 184.150184 157.443764 166.562952
## ecodist_bcdist 3.334351 3.591531 3.578092 3.620690 3.639270
## vegan_veg 1.275797 1.314749 2.018940 2.271892 2.296659
## vegan_design 1.271639 1.356777 1.443011 1.403630 1.421038
## max neval
## 285.461520 10
## 2.219655 10
## 16.927915 10
## 417.257313 10
## 3.753963 10
## 2.646380 10
## 2.043176 10
On this very small example, the functions from vegan outperform the
others in terms of computation time. They are followed by
bcdist
and beta.pair
. Our
dissimilarity
function performs slightly slower, partly due
to the additional reformatting step (from matrix to network and back).
The distance
and beta.div
functions rank
significantly lower in terms of speed.
Bray-Curtis
We present below a similar example using the Bray-Curtis
dissimilarity metric, based on the comat
defined above.
dissim <- list()
comp_bc <- microbenchmark(
adespatial = { d <- beta.div(comat,
method = "percentdiff",
samp = FALSE,
nperm = 1,
save.D = TRUE)
d <- as.matrix(d$D)
d <- d*d
dissim$adespatial <- d},
betapart = { d <- beta.pair.abund(comat,
index.family = "bray")
d <- as.matrix(d$beta.bray)
dissim$betapart <- d},
bioregion = { d <- dissimilarity(comatbin,
metric = "Bray")
d <- net_to_mat(d,
weight = TRUE,
squared = TRUE,
symmetrical = TRUE)
dissim$bioregion <- d },
ecodist_dist = { d <- distance(comat,
method = "bray-curtis")
d <- as.matrix(d)
dissim$ecodist_dist <- d },
ecodist_bcdist = { d <- bcdist(comat)
d <- as.matrix(d)
dissim$ecodist_bcdist <- d },
vegan_veg = { d <- vegdist(comatbin,
method = "bray")
d <- as.matrix(d)
dissim$vegan_veg <- d },
vegan_design = { d <- designdist(comat,
method = "(A+B-2*J)/(A+B)",
terms = "minimum")
d <- as.matrix(d)
dissim$vegan_design <- d },
times = 10
)
Here again after ensuring that all the functions return the same dissimilarity matrix,
all_identical <- all(
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$betapart, digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$bioregion, digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$ecodist_dist , digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$ecodist_bcdist , digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$vegan_veg, digits=4)),
identical(trunc(dissim$adespatial, digits=4),
trunc(dissim$vegan_design, digits=4))
)
print(all_identical)
## [1] TRUE
we proceed with a comparison of their computation times.
comp_bc
## Unit: milliseconds
## expr min lq mean median uq
## adespatial 274.685615 277.470533 286.220730 284.254156 292.046957
## betapart 303.083768 318.810145 329.650220 324.713853 348.828076
## bioregion 9.987406 10.436712 11.505953 10.736557 13.388361
## ecodist_dist 80.630066 89.549193 106.905346 104.900510 128.176130
## ecodist_bcdist 5.930017 5.988696 6.084050 6.037327 6.207645
## vegan_veg 1.253766 1.268053 1.776226 1.771245 2.279186
## vegan_design 1.304370 1.366526 1.483375 1.455547 1.594911
## max neval
## 307.025081 10
## 353.413589 10
## 14.040496 10
## 129.245964 10
## 6.245815 10
## 2.320954 10
## 1.681292 10
The functions from vegan continue to
outperform the others. They are again followed by bcdist
,
while our dissimilarity
function remains slightly slower.
The distance
and beta.div
functions, joined by
beta.pair.abund
, rank significantly lower in terms of
computation speed.
Systematic comparison (Jaccard and Bray-Curtis)
These differences in computation times become noticeable only for
(site-species) co-occurrence matrix
objects of significant
size. To better understand the computation times for such matrices, we
perform a similar experiment using a subset of functions
(beta.pair
and beta.pair.abund
,
dissimilarity
, bcdist
, vegdist
,
and designdist
) on datasets with varying numbers of sites
and species.
The plot below shows the results obtained for the Jaccard dissimilarity metric based on 16 combinations of numbers of sites and species (chosen among 500, 1000, 2000, and 5000), averaged over three replications. We also include the results from a single simulation with 10,000 sites and species.

A similar plot showing the results obtained for the Bray-Curtis
dissimilarity metric is available below. In this case, the results based
on a single simulation with 10,000 sites and species exclude
beta.pair.abund
because it took too much time to
compute.
