
The function similarity computes well-known and customized pairwise similarity metrics from a co-occurrence matrix such as vegemat. In the example below, the Simpson similarity index is computed between each pair of sites.

sim <- similarity(vegemat, metric = "Simpson", formula = NULL, method = "prodmat")
sim[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  11 
##  - Number of rows:  10 
##  - Number of similarity metrics:  1 
## 
## 
##      Site1 Site2   Simpson
## 716     35    36 0.9767442
## 1431    35    37 0.9689922
## 2146    35    38 0.9457364
## 2861    35    39 0.9457364
## 3576    35    84 0.2790698
## 4291    35    85 0.9147287
## 5006    35    86 1.0000000
## 5721    35    87 0.9922481
## 6436    35    88 0.9844961
## 7151    35    89 0.6821705

The resulting data.frame is stored in a bioRgeo.pairwise.metric object, which contains the Simpson similarity metric between each pair of sites. The function similarity can handle three types of metrics: metrics based on abc, metrics based on ABC, and one metric based on the Euclidean distance.

The first kind of metrics, such as Jaccard, the turnover component of Jaccard, Simpson or Sorensen, is based on presence data, with a the number of species shared by a pair of sites, b the number of species present only in the first site, and c the number of species present only in the second site. Two methods can be used to compute the abc-based metrics. The first method is based on a matrix product (performed with the tcrossprod function from the R package Matrix). This method is fast but memory-intensive. The second method is based on a three-loop function coded in C++ and largely inspired by the bcdist function from the R package ecodist (version 2.0.7). It is less efficient than the matrix product but can handle co-occurrence matrices with a large number of sites and/or species.
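
To make the a, b, c notation concrete, here is a minimal base-R sketch of the matrix-product idea on a toy presence-absence matrix. The toy data and the object names (pa, shared, richness, n_a, n_b, n_c) are illustrative assumptions, and the sketch uses the base tcrossprod on a dense matrix rather than the package's internal code. The similarity formulas are the usual textbook forms; the Simpson form a / (a + min(b, c)) agrees with the output shown earlier (126 / (126 + 3) = 0.9767442).

```r
# Toy presence-absence matrix: 3 sites (rows) x 5 species (columns).
pa <- matrix(c(1, 1, 1, 0, 0,
               1, 1, 0, 1, 0,
               0, 0, 1, 1, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("s1", "s2", "s3"), paste0("sp", 1:5)))

# pa %*% t(pa): entry [i, j] is the number of species shared by sites i and j.
shared <- tcrossprod(pa)
richness <- rowSums(pa)                 # number of species in each site

# Enumerate each unordered pair of sites once.
pairs <- t(combn(rownames(pa), 2))
n_a <- shared[pairs]                            # species shared by both sites
n_b <- unname(richness[pairs[, 1]]) - n_a       # species only in the first site
n_c <- unname(richness[pairs[, 2]]) - n_a       # species only in the second site

abc <- data.frame(Site1 = pairs[, 1], Site2 = pairs[, 2],
                  a = n_a, b = n_b, c = n_c,
                  Jaccard  = n_a / (n_a + n_b + n_c),
                  Sorensen = 2 * n_a / (2 * n_a + n_b + n_c),
                  Simpson  = n_a / (n_a + pmin(n_b, n_c)))
abc
```

The full sites-by-sites product is what makes this approach fast and also what makes it costly in memory when the number of sites is large.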

The second kind of metrics, such as Bray-Curtis and the turnover component of Bray-Curtis, is based on abundance data, with A the sum of the lesser values for common species shared by a pair of sites. B and C are the total number of specimens counted at the first and second site, respectively, minus A. Only the three-loop function is available for the ABC-based metrics.
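
The definitions of A, B and C can be illustrated with a plain R version of a three-loop computation on a toy abundance matrix. This is only a sketch under stated assumptions: the toy data and the names ab, res and ABC are hypothetical, the loops are written in R rather than C++, and the Bray-Curtis similarity is taken as 2A / (2A + B + C), which agrees with the Bray column in the output further down this page.

```r
# Toy abundance matrix: 3 sites (rows) x 4 species (columns).
ab <- matrix(c(10, 0, 3, 2,
                4, 6, 0, 2,
                0, 1, 8, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("s1", "s2", "s3"), paste0("sp", 1:4)))

n <- nrow(ab)
res <- list()
# Loops 1 and 2 walk over pairs of sites; loop 3 walks over species.
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    A <- 0
    for (k in seq_len(ncol(ab))) {
      A <- A + min(ab[i, k], ab[j, k])   # lesser abundance of species k
    }
    B <- sum(ab[i, ]) - A                # specimens at the first site minus A
    C <- sum(ab[j, ]) - A                # specimens at the second site minus A
    res[[length(res) + 1]] <- data.frame(
      Site1 = rownames(ab)[i], Site2 = rownames(ab)[j],
      A = A, B = B, C = C,
      Bray = 2 * A / (2 * A + B + C))    # Bray-Curtis similarity
  }
}
ABC <- do.call(rbind, res)
ABC
```

Only one pair of sites is held in memory at a time, which is why a loop-based approach scales to large matrices at the cost of speed.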

The main advantages of the similarity function are that it can compute and return several metrics at once, that it allows customized metrics to be computed with the formula argument, and that it can return a, b and c and/or A, B and C. This feature is particularly useful when computing similarity metrics on large co-occurrence matrices.

sim <- similarity(vegemat, metric = c("abc","ABC","Simpson","Bray"), formula =c("(b + c) / (a + b + c)", "(B + C) / (2*A + B + C)"))
sim[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  11 
##  - Number of rows:  10 
##  - Number of similarity metrics:  4 
## 
## 
##      Site1 Site2   Simpson       Bray (b + c) / (a + b + c)
## 716     35    36 0.9767442 0.01901485             0.8551724
## 1431    35    37 0.9689922 0.03745203             0.8114630
## 2146    35    38 0.9457364 0.04025289             0.7855888
## 2861    35    39 0.9457364 0.09754761             0.8063492
## 3576    35    84 0.2790698 0.18757921             0.8823529
## 4291    35    85 0.9147287 0.13256181             0.8411844
## 5006    35    86 1.0000000 0.02663928             0.8537415
## 5721    35    87 0.9922481 0.02332663             0.8766859
## 6436    35    88 0.9844961 0.02198536             0.8650372
## 7151    35    89 0.6821705 0.15954416             0.7124183
##      (B + C) / (2*A + B + C)   a  b   c   A   B     C
## 716                0.9809852 126  3 741 420   3 43333
## 1431               0.9625480 125  4 534 366  57 18756
## 2146               0.9597471 122  7 440 347  76 16471
## 2861               0.9024524 122  7 501 356  67  6520
## 3576               0.8124208  36 93 177  74 349   292
## 4291               0.8674382 118 11 614 378  45  4902
## 5006               0.9733607 129  0 753 415   8 30319
## 5721               0.9766734 128  1 909 406  17 33981
## 6436               0.9780146 127  2 812 395  28 35115
## 7151               0.8404558  88 41 177 196 227  1838
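
As a sanity check, the custom-formula columns in the output above can be reproduced by hand from the returned a, b, c and A, B, C values. The sketch below copies the first and fifth rows of that output into a plain data.frame (chk is an illustrative name) and evaluates the same formula strings with base R. How the package evaluates formulas internally is not shown here; eval(parse()) is simply one way to do it.

```r
# Rows 716 and 3576 of the output above (sites 35-36 and 35-84).
chk <- data.frame(a = c(126, 36), b = c(3, 93), c = c(741, 177),
                  A = c(420, 74), B = c(3, 349), C = c(43333, 292))

formulas <- c("(b + c) / (a + b + c)", "(B + C) / (2*A + B + C)")
for (f in formulas) {
  # Evaluate the formula string using the columns of chk as variables.
  chk[[f]] <- eval(parse(text = f), envir = chk)
}

# Built-in metrics recomputed from the same counts, for comparison.
chk$Simpson <- chk$a / (chk$a + pmin(chk$b, chk$c))
chk$Bray    <- 2 * chk$A / (2 * chk$A + chk$B + chk$C)

round(chk[c("Simpson", "Bray", formulas)], 7)
```

Keeping a, b, c and A, B, C in the result means that further metrics can be derived afterwards without going back to the original matrix.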

The dissimilarity function works in the same way, except that it computes the dissimilarity version of the available metrics. The functions dissimilarity_to_similarity and similarity_to_dissimilarity can be used to switch between similarity and dissimilarity metrics.

sim <- similarity(vegemat, metric = c("abc","Simpson"), formula = "(b + c) / (a + b + c)")
sim[1:10,]
## Data.frame of similarity between sites
##  - Total number of sites:  11 
##  - Number of rows:  10 
##  - Number of similarity metrics:  2 
## 
## 
##      Site1 Site2   Simpson (b + c) / (a + b + c)   a  b   c
## 716     35    36 0.9767442             0.8551724 126  3 741
## 1431    35    37 0.9689922             0.8114630 125  4 534
## 2146    35    38 0.9457364             0.7855888 122  7 440
## 2861    35    39 0.9457364             0.8063492 122  7 501
## 3576    35    84 0.2790698             0.8823529  36 93 177
## 4291    35    85 0.9147287             0.8411844 118 11 614
## 5006    35    86 1.0000000             0.8537415 129  0 753
## 5721    35    87 0.9922481             0.8766859 128  1 909
## 6436    35    88 0.9844961             0.8650372 127  2 812
## 7151    35    89 0.6821705             0.7124183  88 41 177
dissim1 <- dissimilarity(vegemat, metric = c("abc","Simpson"), formula = "(b + c) / (a + b + c)")
dissim1[1:10,]
## Data.frame of dissimilarity between sites
##  - Total number of sites:  11 
##  - Number of rows:  10 
##  - Number of dissimilarity metrics:  2 
## 
## 
##      Site1 Site2     Simpson (b + c) / (a + b + c)   a  b   c
## 716     35    36 0.023255814             0.1448276 126  3 741
## 1431    35    37 0.031007752             0.1885370 125  4 534
## 2146    35    38 0.054263566             0.2144112 122  7 440
## 2861    35    39 0.054263566             0.1936508 122  7 501
## 3576    35    84 0.720930233             0.1176471  36 93 177
## 4291    35    85 0.085271318             0.1588156 118 11 614
## 5006    35    86 0.000000000             0.1462585 129  0 753
## 5721    35    87 0.007751938             0.1233141 128  1 909
## 6436    35    88 0.015503876             0.1349628 127  2 812
## 7151    35    89 0.317829457             0.2875817  88 41 177
dissim2 <- similarity_to_dissimilarity(sim)
dissim2[1:10,]
## Data.frame of dissimilarity between sites
##  - Total number of sites:  11 
##  - Number of rows:  10 
##  - Number of dissimilarity metrics:  2 
## 
## 
##      Site1 Site2     Simpson (b + c) / (a + b + c)   a  b   c
## 716     35    36 0.023255814             0.1448276 126  3 741
## 1431    35    37 0.031007752             0.1885370 125  4 534
## 2146    35    38 0.054263566             0.2144112 122  7 440
## 2861    35    39 0.054263566             0.1936508 122  7 501
## 3576    35    84 0.720930233             0.1176471  36 93 177
## 4291    35    85 0.085271318             0.1588156 118 11 614
## 5006    35    86 0.000000000             0.1462585 129  0 753
## 5721    35    87 0.007751938             0.1233141 128  1 909
## 6436    35    88 0.015503876             0.1349628 127  2 812
## 7151    35    89 0.317829457             0.2875817  88 41 177
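
Comparing the outputs above, each dissimilarity value equals one minus the corresponding similarity value, while a, b and c are unchanged. The base-R sketch below reproduces that relationship on three rows copied from the outputs. The helper to_dissim is hypothetical and only illustrates the arithmetic; it is not the package's similarity_to_dissimilarity.

```r
# Three rows copied from the similarity output above.
sim_rows <- data.frame(Site1 = c(35, 35, 35), Site2 = c(36, 84, 86),
                       Simpson = c(0.9767442, 0.2790698, 1.0000000),
                       a = c(126, 36, 129), b = c(3, 93, 0), c = c(741, 177, 753))

# Hypothetical helper: flip the chosen metric columns, leave the counts alone.
to_dissim <- function(x, metric_cols) {
  x[metric_cols] <- 1 - x[metric_cols]
  x
}

dissim_rows <- to_dissim(sim_rows, "Simpson")
dissim_rows
```

Applying the same helper a second time returns the original similarity values, which mirrors the role of dissimilarity_to_similarity.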