The function similarity computes well-known and customized pairwise similarity metrics from a co-occurrence matrix such as vegemat. In the example below, the Simpson similarity index is computed between each pair of sites.

sim <- similarity(vegemat, metric = "Simpson", formula = NULL, method = "prodmat")
sim[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of similarity metrics:  1 
## 
## 
##    Site1 Site2   Simpson
## 2     35    36 0.9767442
## 3     35    37 0.9689922
## 4     35    38 0.9457364
## 5     35    39 0.9457364
## 6     35    84 0.2790698
## 7     35    85 0.9147287
## 8     35    86 1.0000000
## 9     35    87 0.9922481
## 10    35    88 0.9844961
## 11    35    89 0.6821705

The resulting data.frame is stored in a bioregion.pairwise.metric object containing the Simpson similarity metric between each pair of sites. The function similarity can handle three types of metrics: metrics based on abc, metrics based on ABC, and one metric based on the Euclidean distance.

The first kind of metrics, such as Jaccard, the turnover component of Jaccard, Simpson or Sorensen, are based on presence data, with a the number of species shared by a pair of sites, b the number of species present only in the first site and c the number of species present only in the second site. Two methods can be used to compute the abc-based metrics. The first method is based on a matrix product (performed with the tcrossprod function from the R package Matrix). It is fast but memory-intensive. The second method is based on a three-loop function coded in C++ and largely inspired by the bcdist function from the R package ecodist (version 2.0.7). It is slower than the matrix product but can handle co-occurrence matrices with a large number of sites and/or species.
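The matrix-product idea can be illustrated on a toy presence/absence matrix. This is only a minimal sketch in base R, not the package's implementation: the matrix comat, its site and species names, and the abc data.frame are made up for illustration, and base tcrossprod is used on a dense matrix.

```r
# Toy presence/absence matrix: 3 sites (rows) x 5 species (columns).
# All names and values are hypothetical.
comat <- matrix(c(1, 1, 1, 0, 0,
                  1, 1, 0, 1, 0,
                  0, 0, 1, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("s1", "s2", "s3"), paste0("sp", 1:5)))

# One matrix product gives the number of shared species (a) for all pairs.
a_mat <- tcrossprod(comat)            # same as comat %*% t(comat)
richness <- unname(diag(a_mat))       # number of species at each site

# Keep each pair of sites once (upper triangle of the matrix).
pairs <- which(upper.tri(a_mat), arr.ind = TRUE)
abc <- data.frame(Site1 = rownames(comat)[pairs[, "row"]],
                  Site2 = rownames(comat)[pairs[, "col"]],
                  a = a_mat[pairs])
abc$b <- richness[pairs[, "row"]] - abc$a   # species only in Site1
abc$c <- richness[pairs[, "col"]] - abc$a   # species only in Site2

# Similarity versions of two abc-based metrics.
abc$Jaccard <- abc$a / (abc$a + abc$b + abc$c)
abc$Simpson <- abc$a / (abc$a + pmin(abc$b, abc$c))
abc
```

The full sites-by-sites matrix a_mat is held in memory at once, which is why this approach trades memory for speed.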

The second kind of metrics, such as Bray-Curtis and the turnover component of Bray-Curtis, are based on abundance data, with A the sum of the lesser abundances of the species shared by a pair of sites, and B and C the total number of specimens counted at the first and second site, respectively, minus A. Only the three-loop function is available for the ABC-based metrics.
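As a rough illustration of these definitions, the sketch below computes A, B and C for a single pair of sites in base R and derives the Bray-Curtis similarity from them. The abundance matrix abund and its values are hypothetical, and the code is a didactic shortcut rather than the C++ routine used by the package.

```r
# Toy abundance matrix: 2 sites x 4 species (made-up counts).
abund <- matrix(c(10, 0, 5, 2,
                   4, 3, 0, 6),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("s1", "s2"), paste0("sp", 1:4)))

x <- abund["s1", ]
y <- abund["s2", ]

A <- sum(pmin(x, y))   # sum of the lesser abundance of each species
B <- sum(x) - A        # specimens counted at s1, minus A
C <- sum(y) - A        # specimens counted at s2, minus A

# Bray-Curtis dissimilarity and its similarity counterpart.
bray_dissim <- (B + C) / (2 * A + B + C)
bray_sim <- 1 - bray_dissim
c(A = A, B = B, C = C, Bray = bray_sim)
```

Species absent from one of the two sites contribute zero to A, so only shared species add to it.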

The main advantages of the similarity function are that it can compute and return several metrics at once, supports customized metrics through the formula argument, and can return the quantities a, b and c and/or A, B and C. These features are particularly useful when computing similarity metrics on large co-occurrence matrices.

sim <- similarity(vegemat, metric = c("abc", "ABC", "Simpson", "Bray"), formula = c("(b + c) / (a + b + c)", "(B + C) / (2*A + B + C)"))
sim[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of similarity metrics:  4 
## 
## 
##    Site1 Site2   Simpson       Bray   a  b   c   A   B     C
## 2     35    36 0.9767442 0.01901485 126  3 741 420   3 43333
## 3     35    37 0.9689922 0.03745203 125  4 534 366  57 18756
## 4     35    38 0.9457364 0.04025289 122  7 440 347  76 16471
## 5     35    39 0.9457364 0.09754761 122  7 501 356  67  6520
## 6     35    84 0.2790698 0.18757921  36 93 177  74 349   292
## 7     35    85 0.9147287 0.13256181 118 11 614 378  45  4902
## 8     35    86 1.0000000 0.02663928 129  0 753 415   8 30319
## 9     35    87 0.9922481 0.02332663 128  1 909 406  17 33981
## 10    35    88 0.9844961 0.02198536 127  2 812 395  28 35115
## 11    35    89 0.6821705 0.15954416  88 41 177 196 227  1838
##    (b + c) / (a + b + c) (B + C) / (2*A + B + C)
## 2              0.8551724               0.9809852
## 3              0.8114630               0.9625480
## 4              0.7855888               0.9597471
## 5              0.8063492               0.9024524
## 6              0.8823529               0.8124208
## 7              0.8411844               0.8674382
## 8              0.8537415               0.9733607
## 9              0.8766859               0.9766734
## 10             0.8650372               0.9780146
## 11             0.7124183               0.8404558
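The custom-formula columns in the output above can be reproduced by hand from the returned components. The sketch below takes the a, b, c, A, B and C values of the first three rows and evaluates each formula string with eval(parse()). This is one plausible way to evaluate such a string, not necessarily how the package does it, and the data.frame name comp is hypothetical.

```r
# Components of the first three pairs of sites, copied from the output above.
comp <- data.frame(a = c(126, 125, 122),
                   b = c(3, 4, 7),
                   c = c(741, 534, 440),
                   A = c(420, 366, 347),
                   B = c(3, 57, 76),
                   C = c(43333, 18756, 16471))

formulas <- c("(b + c) / (a + b + c)", "(B + C) / (2*A + B + C)")

# Evaluate each formula string using the columns of comp as variables,
# and store the result in a column named after the formula.
for (f in formulas) {
  comp[[f]] <- eval(parse(text = f), envir = comp)
}

round(comp[, formulas], 7)
```

Because the components are returned alongside the metrics, further custom metrics can be derived later without recomputing the pairwise comparisons.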

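The dissimilarity function works in the same way, except that it computes the dissimilarity version of the available metrics. The functions dissimilarity_to_similarity and similarity_to_dissimilarity can be used to switch between similarity and dissimilarity metrics. Note that, in the outputs below, dissimilarity evaluates a custom formula exactly as written, whereas similarity_to_dissimilarity also converts the custom-formula column to 1 minus its value.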

sim <- similarity(vegemat, metric = c("abc","Simpson"), formula = "(b + c) / (a + b + c)")
sim[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of similarity metrics:  2 
## 
## 
##    Site1 Site2   Simpson   a  b   c (b + c) / (a + b + c)
## 2     35    36 0.9767442 126  3 741             0.8551724
## 3     35    37 0.9689922 125  4 534             0.8114630
## 4     35    38 0.9457364 122  7 440             0.7855888
## 5     35    39 0.9457364 122  7 501             0.8063492
## 6     35    84 0.2790698  36 93 177             0.8823529
## 7     35    85 0.9147287 118 11 614             0.8411844
## 8     35    86 1.0000000 129  0 753             0.8537415
## 9     35    87 0.9922481 128  1 909             0.8766859
## 10    35    88 0.9844961 127  2 812             0.8650372
## 11    35    89 0.6821705  88 41 177             0.7124183
dissim1 <- dissimilarity(vegemat, metric = c("abc","Simpson"), formula = "(b + c) / (a + b + c)")
dissim1[1:10,]
## Data.frame of dissimilarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of dissimilarity metrics:  2 
## 
## 
##    Site1 Site2     Simpson   a  b   c (b + c) / (a + b + c)
## 2     35    36 0.023255814 126  3 741             0.8551724
## 3     35    37 0.031007752 125  4 534             0.8114630
## 4     35    38 0.054263566 122  7 440             0.7855888
## 5     35    39 0.054263566 122  7 501             0.8063492
## 6     35    84 0.720930233  36 93 177             0.8823529
## 7     35    85 0.085271318 118 11 614             0.8411844
## 8     35    86 0.000000000 129  0 753             0.8537415
## 9     35    87 0.007751938 128  1 909             0.8766859
## 10    35    88 0.015503876 127  2 812             0.8650372
## 11    35    89 0.317829457  88 41 177             0.7124183
dissim2 <- similarity_to_dissimilarity(sim)
dissim2[1:10,]
## Data.frame of dissimilarity between sites
##  - Total number of sites:  715 
##  - Total number of species:  3697 
##  - Number of rows:  255255 
##  - Number of dissimilarity metrics:  2 
## 
## 
##    Site1 Site2     Simpson   a  b   c (b + c) / (a + b + c)
## 2     35    36 0.023255814 126  3 741             0.1448276
## 3     35    37 0.031007752 125  4 534             0.1885370
## 4     35    38 0.054263566 122  7 440             0.2144112
## 5     35    39 0.054263566 122  7 501             0.1936508
## 6     35    84 0.720930233  36 93 177             0.1176471
## 7     35    85 0.085271318 118 11 614             0.1588156
## 8     35    86 0.000000000 129  0 753             0.1462585
## 9     35    87 0.007751938 128  1 909             0.1233141
## 10    35    88 0.015503876 127  2 812             0.1349628
## 11    35    89 0.317829457  88 41 177             0.2875817
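The values above are consistent with a conversion of the form dissimilarity = 1 - similarity. The sketch below checks this on the first three rows, assuming that relationship holds. The helper sim_to_dissim is hypothetical and only stands in for the package functions.

```r
# Hypothetical helper: assumes dissimilarity = 1 - similarity.
sim_to_dissim <- function(x) 1 - x

# a, b and c for the first three pairs of sites, copied from the output above.
a   <- c(126, 125, 122)
b   <- c(3, 4, 7)
c_n <- c(741, 534, 440)

# Simpson similarity and the custom formula, as in the sim object.
simpson_sim <- a / (a + pmin(b, c_n))
custom_sim  <- (b + c_n) / (a + b + c_n)

# Converted values, to compare with the dissim2 output.
simpson_dissim <- sim_to_dissim(simpson_sim)
custom_dissim  <- sim_to_dissim(custom_sim)
round(cbind(Simpson = simpson_dissim, custom = custom_dissim), 7)

# Applying the conversion twice recovers the original similarity.
roundtrip <- sim_to_dissim(simpson_dissim)
```

Under this assumption the conversion is its own inverse, so switching back and forth between the two representations does not change the values.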