Compare cluster memberships among multiple bioregionalizations
Source: R/compare_bioregionalizations.R
This function computes pairwise comparisons for several
bioregionalizations, usually outputs from the netclu_, hclu_ or nhclu_
functions. It also provides the confusion matrix from pairwise comparisons,
so that the user can compute additional comparison metrics.
Usage
compare_bioregionalizations(
  cluster_object,
  indices = c("rand", "jaccard"),
  cor_frequency = FALSE,
  store_pairwise_membership = TRUE,
  store_confusion_matrix = TRUE
)
Arguments
- cluster_object
  a bioregion.clusters object, a data.frame, or a list of data.frame containing multiple bioregionalizations. At least two bioregionalizations are required. If a list of data.frame is provided, they should all have the same number of rows (i.e., same items in the clustering for all bioregionalizations).
- indices
  NULL or character. Indices to compute for the pairwise comparison of bioregionalizations. Currently available metrics are "rand" and "jaccard".
- cor_frequency
  a boolean. If TRUE, computes the correlation between each bioregionalization and the total frequency of co-membership of items across all bioregionalizations. Useful to identify which bioregionalization(s) are most representative of all the computed bioregionalizations.
- store_pairwise_membership
  a boolean. If TRUE, the pairwise membership of items is stored in the output object.
- store_confusion_matrix
  a boolean. If TRUE, the confusion matrices of pairwise bioregionalization comparisons are stored in the output object.
Value
A list with 4 to 7 elements:
- args: arguments provided by the user.
- inputs: information on the input bioregionalizations, such as the number of items being clustered.
- (optional) pairwise_membership: only if store_pairwise_membership = TRUE. This element contains the pairwise memberships of all items for each bioregionalization, in the form of a boolean matrix where TRUE means that two items are in the same cluster, and FALSE means that two items are not in the same cluster.
- freq_item_pw_membership: a numeric vector containing the number of times each pair of items is clustered together. It corresponds to the row sums of the table in pairwise_membership.
- (optional) bioregionalization_freq_cor: only if cor_frequency = TRUE. A numeric vector indicating the correlation between individual bioregionalizations and the total frequency of pairwise membership across all bioregionalizations. It corresponds to the correlation between individual columns in pairwise_membership and freq_item_pw_membership.
- (optional) confusion_matrix: only if store_confusion_matrix = TRUE. A list containing all confusion matrices between each pair of bioregionalizations.
- bioregionalization_comparison: a data.frame containing the results of the comparison of bioregionalizations, where the first column indicates which bioregionalizations are compared, and the next columns correspond to the requested indices.
Details
This function proceeds in two main steps:

The first step is done within each bioregionalization. It compares all pairs of items and documents whether they are clustered together (TRUE) or separately (FALSE) in each bioregionalization. For example, if site 1 and site 2 are in the same cluster in bioregionalization 1, then the pairwise membership site1_site2 will be TRUE. The output of this first step is stored in the slot pairwise_membership if store_pairwise_membership = TRUE.

The second step compares all pairs of bioregionalizations by analysing whether their pairwise memberships are similar. To do so, for each pair of bioregionalizations, the function computes a confusion matrix with four elements:
- a: number of pairs of items grouped in bioregionalization 1 and in bioregionalization 2
- b: number of pairs of items grouped in bioregionalization 1 but not in bioregionalization 2
- c: number of pairs of items not grouped in bioregionalization 1 but grouped in bioregionalization 2
- d: number of pairs of items grouped in neither bioregionalization 1 nor bioregionalization 2

The confusion matrix is stored in confusion_matrix if store_confusion_matrix = TRUE.
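The two steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the helper name confusion() and the use of combn() are choices made for this sketch, and the toy data are copied from the Examples section.

```r
# Toy input copied from the Examples section: rows are items, columns are
# bioregionalizations.
bioregionalizations <- data.frame(
  X1 = c(1, 1, 2, 2),
  X2 = c(2, 2, 1, 1),
  X3 = c(1, 2, 3, 4),
  X4 = c(1, 1, 1, 2)
)

# Step 1: for every pair of items, record whether both items fall in the
# same cluster, separately for each bioregionalization.
item_pairs <- combn(nrow(bioregionalizations), 2)
pairwise_membership <- sapply(bioregionalizations, function(clusters) {
  clusters[item_pairs[1, ]] == clusters[item_pairs[2, ]]
})
rownames(pairwise_membership) <- paste(item_pairs[1, ], item_pairs[2, ],
                                       sep = "_")

# Number of bioregionalizations in which each pair is grouped together.
freq_item_pw_membership <- rowSums(pairwise_membership)

# Step 2: confusion matrix for one pair of bioregionalizations.
# confusion() is a hypothetical helper written for this sketch.
confusion <- function(m1, m2) {
  c(a = sum(m1 & m2),    # grouped in both
    b = sum(m1 & !m2),   # grouped in the first only
    c = sum(!m1 & m2),   # grouped in the second only
    d = sum(!m1 & !m2))  # grouped in neither
}

cm_x1_x4 <- confusion(pairwise_membership[, "X1"],
                      pairwise_membership[, "X4"])
cm_x1_x4
# a = 1, b = 1, c = 2, d = 2, as in the X1%X4 entry of the example output
```

The four counts always sum to the number of item pairs, here 6 pairs for 4 items.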
Based on the confusion matrices, we can compute a range of indices to indicate the agreement among bioregionalizations. As of now, we have implemented:
- Rand index: (a + d)/(a + b + c + d). The Rand index measures agreement among bioregionalizations by accounting for both the pairs of sites that are grouped and the pairs of sites that are not grouped.
- Jaccard index: a/(a + b + c). The Jaccard index measures agreement among bioregionalizations by only accounting for pairs of sites that are grouped.
These two metrics are complementary, because the Jaccard index will tell if bioregionalizations are similar in their clustering structure, whereas the Rand index will tell if bioregionalizations are similar not only in the pairs of items clustered together, but also in terms of the pairs of sites that are not clustered together. For example, take two bioregionalizations which never group together the same pairs of sites. Their Jaccard index will be 0, whereas the Rand index can be > 0 due to the sites that are not grouped together.
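As a quick check of the two formulas, the sketch below applies them to two confusion matrices copied from the example output (X1%X4 and X3%X4). The function names rand_index() and jaccard_index() are hypothetical helpers written for this illustration, not functions exported by the package.

```r
# Hypothetical helpers implementing the two formulas given above.
rand_index <- function(cm) {
  unname((cm["a"] + cm["d"]) / (cm["a"] + cm["b"] + cm["c"] + cm["d"]))
}
jaccard_index <- function(cm) {
  unname(cm["a"] / (cm["a"] + cm["b"] + cm["c"]))
}

# Confusion matrices copied from the example output.
cm_x1_x4 <- c(a = 1, b = 1, c = 2, d = 2)
cm_x3_x4 <- c(a = 0, b = 0, c = 3, d = 3)

rand_index(cm_x1_x4)     # 0.5
jaccard_index(cm_x1_x4)  # 0.25

# X3 and X4 never group the same pair of items: Jaccard is 0 while Rand
# stays positive because three pairs are ungrouped in both.
rand_index(cm_x3_x4)     # 0.5
jaccard_index(cm_x3_x4)  # 0
```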
Additional indices can be manually computed by the users on the basis of the list of confusion matrices.
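As one possible illustration of such a manual computation, the sketch below derives the Fowlkes-Mallows index, a / sqrt((a + b)(a + c)), from a list shaped like the confusion_matrix element. This index is not stated by this page to be implemented in the function; fowlkes_mallows() is a hypothetical helper, the list is rebuilt by hand from three entries of the example output, and returning NA when the index is undefined is a choice made for this sketch.

```r
# Three confusion matrices copied from the example output. With a real
# result object, the list would come from its confusion_matrix element.
confusion_matrix <- list(
  "X1%X2" = c(a = 2, b = 0, c = 0, d = 4),
  "X1%X4" = c(a = 1, b = 1, c = 2, d = 2),
  "X3%X4" = c(a = 0, b = 0, c = 3, d = 3)
)

# Hypothetical helper: Fowlkes-Mallows index from pair counts.
# Returns NA when one bioregionalization groups no pair at all, because
# the index is then undefined (division by zero).
fowlkes_mallows <- function(cm) {
  denom <- sqrt((cm[["a"]] + cm[["b"]]) * (cm[["a"]] + cm[["c"]]))
  if (denom == 0) {
    return(NA_real_)
  }
  cm[["a"]] / denom
}

fm <- sapply(confusion_matrix, fowlkes_mallows)
fm
```

The same sapply() pattern should work for any index that can be written as a function of a, b, c and d.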
In some cases, users may be interested in finding which of the
bioregionalizations is most representative of all bioregionalizations. To
find out, we can compare the pairwise membership of each
bioregionalization with the total frequency of pairwise membership across
all bioregionalizations. This correlation can be requested with
cor_frequency = TRUE.
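The correlation can be approximated by hand with base R, as sketched below using the pairwise_membership table from the example output. Two assumptions are made here: a Pearson correlation via cor() is used because it reproduces the printed values for X1, X2 and X4, and a bioregionalization with constant pairwise membership (X3, which never groups any pair) is set to 0 to mirror the printed output, since cor() itself would return NA in that case. How the package handles this case internally is not stated on this page.

```r
# Pairwise membership table copied from the example output.
pairwise_membership <- cbind(
  X1 = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  X2 = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  X3 = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  X4 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)
rownames(pairwise_membership) <- c("1_2", "1_3", "1_4", "2_3", "2_4", "3_4")

# Total frequency of co-membership for each pair of items.
freq_item_pw_membership <- rowSums(pairwise_membership)

# Correlation of each bioregionalization with the total frequency.
freq_cor <- apply(pairwise_membership, 2, function(membership) {
  # Assumption of this sketch: a constant column has no defined
  # correlation, so 0 is returned to match the printed example.
  if (length(unique(membership)) < 2) {
    return(0)
  }
  cor(as.numeric(membership), freq_item_pw_membership)
})

round(freq_cor, 7)
```

In this toy case X1 and X2 have the highest correlation, so they would be read as the most representative bioregionalizations.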
Author
Boris Leroy (leroy.boris@gmail.com)
Maxime Lenormand (maxime.lenormand@inrae.fr)
Pierre Denelle (pierre.denelle@gmail.com)
Examples
# A simple case with four bioregionalizations of four items
bioregionalizations <- data.frame(matrix(
  c(1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 3, 1, 2, 1, 4, 2),
  nrow = 4, ncol = 4, byrow = TRUE))
bioregionalizations
#> X1 X2 X3 X4
#> 1 1 2 1 1
#> 2 1 2 2 1
#> 3 2 1 3 1
#> 4 2 1 4 2
compare_bioregionalizations(bioregionalizations)
#> 2024-12-16 14:06:25.470258 - Computing pairwise membership comparisons for each bioregionalization...
#> 2024-12-16 14:06:25.470819 - Comparing memberships among bioregionalizations...
#> 2024-12-16 14:06:25.471469 - Computing Rand index...
#> 2024-12-16 14:06:25.471842 - Computing Jaccard index...
#> $args
#> $args$indices
#> [1] "rand" "jaccard"
#>
#> $args$cor_frequency
#> [1] FALSE
#>
#> $args$store_pairwise_membership
#> [1] TRUE
#>
#> $args$store_confusion_matrix
#> [1] TRUE
#>
#>
#> $inputs
#> number_items number_bioregionalizations
#> 4 4
#>
#> $pairwise_membership
#> X1 X2 X3 X4
#> 1_2 TRUE TRUE FALSE TRUE
#> 1_3 FALSE FALSE FALSE TRUE
#> 1_4 FALSE FALSE FALSE FALSE
#> 2_3 FALSE FALSE FALSE TRUE
#> 2_4 FALSE FALSE FALSE FALSE
#> 3_4 TRUE TRUE FALSE FALSE
#>
#> $freq_item_pw_membership
#> 1_2 1_3 1_4 2_3 2_4 3_4
#> 3 1 0 1 0 2
#>
#> $confusion_matrix
#> $confusion_matrix$`X1%X2`
#> a b c d
#> 2 0 0 4
#>
#> $confusion_matrix$`X1%X3`
#> a b c d
#> 0 2 0 4
#>
#> $confusion_matrix$`X1%X4`
#> a b c d
#> 1 1 2 2
#>
#> $confusion_matrix$`X2%X3`
#> a b c d
#> 0 2 0 4
#>
#> $confusion_matrix$`X2%X4`
#> a b c d
#> 1 1 2 2
#>
#> $confusion_matrix$`X3%X4`
#> a b c d
#> 0 0 3 3
#>
#>
#> $bioregionalization_comparison
#> bioregionalization_comparison rand jaccard
#> 1 X1%X2 1.0000000 1.00
#> 2 X1%X3 0.6666667 0.00
#> 3 X1%X4 0.5000000 0.25
#> 4 X2%X3 0.6666667 0.00
#> 5 X2%X4 0.5000000 0.25
#> 6 X3%X4 0.5000000 0.00
#>
#> attr(,"class")
#> [1] "bioregion.bioregionalization.comparison"
#> [2] "list"
# Find out which bioregionalizations are most representative
compare_bioregionalizations(bioregionalizations,
                            cor_frequency = TRUE)
#> 2024-12-16 14:06:25.473658 - Computing pairwise membership comparisons for each bioregionalization...
#> 2024-12-16 14:06:25.474113 - Comparing memberships among bioregionalizations...
#> 2024-12-16 14:06:25.474707 - Computing Rand index...
#> 2024-12-16 14:06:25.47504 - Computing Jaccard index...
#> 2024-12-16 14:06:25.475356 - Computing the correlation between each bioregionalization and the vector of frequency of pairwise membership...
#> $args
#> $args$indices
#> [1] "rand" "jaccard"
#>
#> $args$cor_frequency
#> [1] TRUE
#>
#> $args$store_pairwise_membership
#> [1] TRUE
#>
#> $args$store_confusion_matrix
#> [1] TRUE
#>
#>
#> $inputs
#> number_items number_bioregionalizations
#> 4 4
#>
#> $pairwise_membership
#> X1 X2 X3 X4
#> 1_2 TRUE TRUE FALSE TRUE
#> 1_3 FALSE FALSE FALSE TRUE
#> 1_4 FALSE FALSE FALSE FALSE
#> 2_3 FALSE FALSE FALSE TRUE
#> 2_4 FALSE FALSE FALSE FALSE
#> 3_4 TRUE TRUE FALSE FALSE
#>
#> $freq_item_pw_membership
#> 1_2 1_3 1_4 2_3 2_4 3_4
#> 3 1 0 1 0 2
#>
#> $bioregionalization_freq_cor
#> X1 X2 X3 X4
#> 0.8834522 0.8834522 0.0000000 0.4685213
#>
#> $confusion_matrix
#> $confusion_matrix$`X1%X2`
#> a b c d
#> 2 0 0 4
#>
#> $confusion_matrix$`X1%X3`
#> a b c d
#> 0 2 0 4
#>
#> $confusion_matrix$`X1%X4`
#> a b c d
#> 1 1 2 2
#>
#> $confusion_matrix$`X2%X3`
#> a b c d
#> 0 2 0 4
#>
#> $confusion_matrix$`X2%X4`
#> a b c d
#> 1 1 2 2
#>
#> $confusion_matrix$`X3%X4`
#> a b c d
#> 0 0 3 3
#>
#>
#> $bioregionalization_comparison
#> bioregionalization_comparison rand jaccard
#> 1 X1%X2 1.0000000 1.00
#> 2 X1%X3 0.6666667 0.00
#> 3 X1%X4 0.5000000 0.25
#> 4 X2%X3 0.6666667 0.00
#> 5 X2%X4 0.5000000 0.25
#> 6 X3%X4 0.5000000 0.00
#>
#> attr(,"class")
#> [1] "bioregion.bioregionalization.comparison"
#> [2] "list"