
This function computes pairwise comparisons between several bioregionalizations, usually outputs of the netclu_, hclu_ or nhclu_ functions. It also returns the confusion matrix of each pairwise comparison, so that users can compute additional comparison metrics.

Usage

compare_bioregionalizations(
  cluster_object,
  indices = c("rand", "jaccard"),
  cor_frequency = FALSE,
  store_pairwise_membership = TRUE,
  store_confusion_matrix = TRUE
)

Arguments

cluster_object

a bioregion.clusters object, a data.frame, or a list of data.frames containing multiple bioregionalizations. At least two bioregionalizations are required. If a list of data.frames is provided, they should all have the same number of rows (i.e., the same items are clustered in all bioregionalizations).

indices

NULL or character. Indices to compute for the pairwise comparison of bioregionalizations. Currently available metrics are "rand" and "jaccard".

cor_frequency

a boolean. If TRUE, computes the correlation between each bioregionalization and the total frequency of co-membership of items across all bioregionalizations. This is useful to identify which bioregionalizations are most representative of all the computed bioregionalizations.

store_pairwise_membership

a boolean. If TRUE, the pairwise membership of items is stored in the output object.

store_confusion_matrix

a boolean. If TRUE, the confusion matrices of pairwise bioregionalization comparisons are stored in the output object.

Value

A list with 4 to 7 elements:

  • args: arguments provided by the user

  • inputs: information on the input bioregionalizations, such as the number of items being clustered

  • (optional) pairwise_membership: only if store_pairwise_membership = TRUE. This element contains the pairwise memberships of all items for each bioregionalization, in the form of a boolean matrix where TRUE means that two items are in the same cluster, and FALSE means that two items are not in the same cluster

  • freq_item_pw_membership: a numeric vector containing the number of times each pair of items is clustered together. It corresponds to the row sums of the table in pairwise_membership

  • (optional) bioregionalization_freq_cor: only if cor_frequency = TRUE. A numeric vector indicating the correlation between individual bioregionalizations and the total frequency of pairwise membership across all bioregionalizations. It corresponds to the correlation between individual columns in pairwise_membership and freq_item_pw_membership

  • (optional) confusion_matrix: only if store_confusion_matrix = TRUE. A list containing the confusion matrices for each pair of bioregionalizations.

  • bioregionalization_comparison: a data.frame containing the results of the comparison of bioregionalizations, where the first column indicates which bioregionalizations are compared, and the next columns correspond to the requested indices.

Details

This function proceeds in two main steps:

  1. The first step is done within each bioregionalization. It will compare all pairs of items and document if they are clustered together (TRUE) or separately (FALSE) in each bioregionalization. For example, if site 1 and site 2 are clustered in the same cluster in bioregionalization 1, then the pairwise membership site1_site2 will be TRUE. The output of this first step is stored in the slot pairwise_membership if store_pairwise_membership = TRUE.

  2. The second step compares all pairs of bioregionalizations by analysing if their pairwise memberships are similar or not. To do so, for each pair of bioregionalizations, the function computes a confusion matrix with four elements:

  • a: number of pairs of items grouped in both bioregionalization 1 and bioregionalization 2

  • b: number of pairs of items grouped in bioregionalization 1 but not in bioregionalization 2

  • c: number of pairs of items not grouped in bioregionalization 1 but grouped in bioregionalization 2

  • d: number of pairs of items grouped in neither bioregionalization 1 nor bioregionalization 2

The confusion matrix is stored in confusion_matrix if store_confusion_matrix = TRUE.
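The two steps above can be sketched in base R. This is only an illustration of the logic, not the package's internal code: the helper confusion() is a hypothetical name, and the data are the toy bioregionalizations from the Examples section.

```r
# Toy data from the Examples section: 4 items (rows), 4 bioregionalizations (columns)
bioregionalizations <- data.frame(matrix(
  c(1, 2, 1, 1,
    1, 2, 2, 1,
    2, 1, 3, 1,
    2, 1, 4, 2),
  nrow = 4, ncol = 4, byrow = TRUE))

# Step 1: for every pair of items, check whether both items share a cluster
item_pairs <- combn(nrow(bioregionalizations), 2)  # one column per pair of items
pairwise_membership <- sapply(bioregionalizations, function(clusters) {
  clusters[item_pairs[1, ]] == clusters[item_pairs[2, ]]
})
rownames(pairwise_membership) <- paste(item_pairs[1, ], item_pairs[2, ], sep = "_")

# Step 2: count the four cases for a pair of bioregionalizations
# (confusion() is a hypothetical helper, not a function of the package)
confusion <- function(m1, m2) {
  c(a = sum(m1 & m2),    # grouped in both
    b = sum(m1 & !m2),   # grouped in the first only
    c = sum(!m1 & m2),   # grouped in the second only
    d = sum(!m1 & !m2))  # grouped in neither
}

cm_x1_x4 <- confusion(pairwise_membership[, "X1"], pairwise_membership[, "X4"])
cm_x1_x4
```

For X1 and X4 this yields a = 1, b = 1, c = 2, d = 2, in line with the X1%X4 confusion matrix shown in the Examples.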

Based on the confusion matrices, we can compute a range of indices to indicate the agreement among bioregionalizations. As of now, we have implemented:

  • Rand index: (a + d)/(a + b + c + d). The Rand index measures agreement among bioregionalizations by accounting for both the pairs of sites that are grouped and the pairs of sites that are not grouped.

  • Jaccard index: a/(a + b + c). The Jaccard index measures agreement among bioregionalizations by only accounting for pairs of sites that are grouped.

These two metrics are complementary: the Jaccard index tells whether bioregionalizations are similar in the pairs of sites they cluster together, whereas the Rand index also takes into account the pairs of sites that are not clustered together. For example, take two bioregionalizations that never group together the same pairs of sites. Their Jaccard index will be 0, whereas their Rand index can be > 0 because of the pairs of sites that are separated in both.
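As a worked illustration, the two formulas can be applied by hand to confusion matrices copied from the Examples section. The function names rand_index() and jaccard_index() are hypothetical helpers introduced here for illustration only.

```r
# Confusion matrices copied from the X1%X4 and X3%X4 comparisons in the Examples
x1_x4 <- c(a = 1, b = 1, c = 2, d = 2)
x3_x4 <- c(a = 0, b = 0, c = 3, d = 3)

# Hypothetical helpers implementing the two formulas given above
rand_index <- function(cm) {
  unname((cm["a"] + cm["d"]) / (cm["a"] + cm["b"] + cm["c"] + cm["d"]))
}
jaccard_index <- function(cm) {
  # Note: returns NaN when a + b + c is 0 (no pair grouped in either input)
  unname(cm["a"] / (cm["a"] + cm["b"] + cm["c"]))
}

rand_index(x1_x4)     # (1 + 2) / 6 = 0.5
jaccard_index(x1_x4)  # 1 / (1 + 1 + 2) = 0.25
rand_index(x3_x4)     # (0 + 3) / 6 = 0.5
jaccard_index(x3_x4)  # 0 / 3 = 0
```

The X3%X4 case shows the contrast described above: no pair of items is grouped in both bioregionalizations, so the Jaccard index is 0, while the Rand index is 0.5 because three pairs are separated in both.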

Additional indices can be manually computed by the users on the basis of the list of confusion matrices.
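As one possible example of such a manual computation, the sketch below derives the Fowlkes-Mallows index, a / sqrt((a + b)(a + c)), which is not provided by the function. The list is typed in by hand from the Examples output so that the snippet runs on its own; in practice one would presumably pass the confusion_matrix element of the returned object instead. fowlkes_mallows() is a hypothetical helper.

```r
# Confusion matrices copied by hand from the Examples output (first three pairs)
confusion_matrix <- list(
  "X1%X2" = c(a = 2, b = 0, c = 0, d = 4),
  "X1%X3" = c(a = 0, b = 2, c = 0, d = 4),
  "X1%X4" = c(a = 1, b = 1, c = 2, d = 2)
)

# Hypothetical helper: Fowlkes-Mallows index from one confusion matrix
fowlkes_mallows <- function(cm) {
  denom <- sqrt((cm[["a"]] + cm[["b"]]) * (cm[["a"]] + cm[["c"]]))
  # Undefined when one bioregionalization groups no pair at all (as X3 does)
  if (denom == 0) return(NA_real_)
  cm[["a"]] / denom
}

fm <- sapply(confusion_matrix, fowlkes_mallows)
fm
```

Returning NA for an undefined index is a design choice of this sketch; other conventions (such as returning 0) are equally possible.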

In some cases, users may be interested in finding which of the bioregionalizations is most representative of all bioregionalizations. To find out, we can correlate the pairwise membership of each bioregionalization with the total frequency of pairwise membership across all bioregionalizations. This correlation can be requested with cor_frequency = TRUE.
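The idea can be sketched by hand from the pairwise membership table shown in the Examples. The sketch assumes a Pearson correlation and assumes that a bioregionalization grouping no pair at all (a constant column, such as X3) is given a correlation of 0, since cor() alone would return NA for it; both assumptions are chosen to be consistent with the documented output and are not confirmed implementation details.

```r
# Pairwise membership table copied from the Examples output
pairwise_membership <- cbind(
  X1 = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  X2 = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  X3 = rep(FALSE, 6),
  X4 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)
rownames(pairwise_membership) <- c("1_2", "1_3", "1_4", "2_3", "2_4", "3_4")

# Number of bioregionalizations in which each pair of items is grouped
freq <- rowSums(pairwise_membership)

# Correlation of each bioregionalization with the overall frequency
freq_cor <- apply(pairwise_membership, 2, function(membership) {
  # Assumption: constant columns get 0 rather than NA
  if (length(unique(membership)) < 2) return(0)
  cor(as.numeric(membership), freq)
})
round(freq_cor, 4)
```

Under these assumptions, X1 and X2 have the highest correlation with the overall frequency, which suggests that they are the most representative bioregionalizations in this toy example.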

Author

Boris Leroy (leroy.boris@gmail.com)
Maxime Lenormand (maxime.lenormand@inrae.fr)
Pierre Denelle (pierre.denelle@gmail.com)

Examples

# A simple case with four bioregionalizations of four items
bioregionalizations <- data.frame(matrix(
  c(1, 2, 1, 1,
    1, 2, 2, 1,
    2, 1, 3, 1,
    2, 1, 4, 2),
  nrow = 4, ncol = 4, byrow = TRUE))
bioregionalizations
#>   X1 X2 X3 X4
#> 1  1  2  1  1
#> 2  1  2  2  1
#> 3  2  1  3  1
#> 4  2  1  4  2
compare_bioregionalizations(bioregionalizations)
#> 2024-12-16 14:06:25.470258 - Computing pairwise membership comparisons for each bioregionalization...
#> 2024-12-16 14:06:25.470819 - Comparing memberships among bioregionalizations...
#> 2024-12-16 14:06:25.471469 - Computing Rand index...
#> 2024-12-16 14:06:25.471842 - Computing Jaccard index...
#> $args
#> $args$indices
#> [1] "rand"    "jaccard"
#> 
#> $args$cor_frequency
#> [1] FALSE
#> 
#> $args$store_pairwise_membership
#> [1] TRUE
#> 
#> $args$store_confusion_matrix
#> [1] TRUE
#> 
#> 
#> $inputs
#>               number_items number_bioregionalizations 
#>                          4                          4 
#> 
#> $pairwise_membership
#>        X1    X2    X3    X4
#> 1_2  TRUE  TRUE FALSE  TRUE
#> 1_3 FALSE FALSE FALSE  TRUE
#> 1_4 FALSE FALSE FALSE FALSE
#> 2_3 FALSE FALSE FALSE  TRUE
#> 2_4 FALSE FALSE FALSE FALSE
#> 3_4  TRUE  TRUE FALSE FALSE
#> 
#> $freq_item_pw_membership
#> 1_2 1_3 1_4 2_3 2_4 3_4 
#>   3   1   0   1   0   2 
#> 
#> $confusion_matrix
#> $confusion_matrix$`X1%X2`
#> a b c d 
#> 2 0 0 4 
#> 
#> $confusion_matrix$`X1%X3`
#> a b c d 
#> 0 2 0 4 
#> 
#> $confusion_matrix$`X1%X4`
#> a b c d 
#> 1 1 2 2 
#> 
#> $confusion_matrix$`X2%X3`
#> a b c d 
#> 0 2 0 4 
#> 
#> $confusion_matrix$`X2%X4`
#> a b c d 
#> 1 1 2 2 
#> 
#> $confusion_matrix$`X3%X4`
#> a b c d 
#> 0 0 3 3 
#> 
#> 
#> $bioregionalization_comparison
#>   bioregionalization_comparison      rand jaccard
#> 1                         X1%X2 1.0000000    1.00
#> 2                         X1%X3 0.6666667    0.00
#> 3                         X1%X4 0.5000000    0.25
#> 4                         X2%X3 0.6666667    0.00
#> 5                         X2%X4 0.5000000    0.25
#> 6                         X3%X4 0.5000000    0.00
#> 
#> attr(,"class")
#> [1] "bioregion.bioregionalization.comparison"
#> [2] "list"                                   

# Find out which bioregionalizations are most representative
compare_bioregionalizations(bioregionalizations,
                            cor_frequency = TRUE)
#> 2024-12-16 14:06:25.473658 - Computing pairwise membership comparisons for each bioregionalization...
#> 2024-12-16 14:06:25.474113 - Comparing memberships among bioregionalizations...
#> 2024-12-16 14:06:25.474707 - Computing Rand index...
#> 2024-12-16 14:06:25.47504 - Computing Jaccard index...
#> 2024-12-16 14:06:25.475356 - Computing the correlation between each bioregionalization and the vector of frequency of pairwise membership...
#> $args
#> $args$indices
#> [1] "rand"    "jaccard"
#> 
#> $args$cor_frequency
#> [1] TRUE
#> 
#> $args$store_pairwise_membership
#> [1] TRUE
#> 
#> $args$store_confusion_matrix
#> [1] TRUE
#> 
#> 
#> $inputs
#>               number_items number_bioregionalizations 
#>                          4                          4 
#> 
#> $pairwise_membership
#>        X1    X2    X3    X4
#> 1_2  TRUE  TRUE FALSE  TRUE
#> 1_3 FALSE FALSE FALSE  TRUE
#> 1_4 FALSE FALSE FALSE FALSE
#> 2_3 FALSE FALSE FALSE  TRUE
#> 2_4 FALSE FALSE FALSE FALSE
#> 3_4  TRUE  TRUE FALSE FALSE
#> 
#> $freq_item_pw_membership
#> 1_2 1_3 1_4 2_3 2_4 3_4 
#>   3   1   0   1   0   2 
#> 
#> $bioregionalization_freq_cor
#>        X1        X2        X3        X4 
#> 0.8834522 0.8834522 0.0000000 0.4685213 
#> 
#> $confusion_matrix
#> $confusion_matrix$`X1%X2`
#> a b c d 
#> 2 0 0 4 
#> 
#> $confusion_matrix$`X1%X3`
#> a b c d 
#> 0 2 0 4 
#> 
#> $confusion_matrix$`X1%X4`
#> a b c d 
#> 1 1 2 2 
#> 
#> $confusion_matrix$`X2%X3`
#> a b c d 
#> 0 2 0 4 
#> 
#> $confusion_matrix$`X2%X4`
#> a b c d 
#> 1 1 2 2 
#> 
#> $confusion_matrix$`X3%X4`
#> a b c d 
#> 0 0 3 3 
#> 
#> 
#> $bioregionalization_comparison
#>   bioregionalization_comparison      rand jaccard
#> 1                         X1%X2 1.0000000    1.00
#> 2                         X1%X3 0.6666667    0.00
#> 3                         X1%X4 0.5000000    0.25
#> 4                         X2%X3 0.6666667    0.00
#> 5                         X2%X4 0.5000000    0.25
#> 6                         X3%X4 0.5000000    0.00
#> 
#> attr(,"class")
#> [1] "bioregion.bioregionalization.comparison"
#> [2] "list"