In this vignette, we compare how sites are assigned to bioregions across multiple bioregionalizations, using the function compare_partitions().

Data

We use the vegetation dataset that comes with bioregion.

data("vegedf")
data("vegemat")

# Calculation of (dis)similarity matrices
vegedissim <- dissimilarity(vegemat, metric = c("Simpson"))
vegesim <- dissimilarity_to_similarity(vegedissim)

Bioregionalization

We use the same three bioregionalization algorithms as in the visualization vignette, i.e. a non-hierarchical, a hierarchical and a network bioregionalization.
We chose 3 bioregions for the non-hierarchical and hierarchical bioregionalizations.

# Non hierarchical bioregionalization
vege_nhclu_kmeans <- nhclu_kmeans(vegedissim, n_clust = 3, index = "Simpson")
vege_nhclu_kmeans$cluster_info # 3
##     partition_name n_clust
## K_3            K_3       3
# Hierarchical bioregionalization
set.seed(1)
vege_hclu_hierarclust <- hclu_hierarclust(dissimilarity = vegedissim,
                                          index = names(vegedissim)[3],
                                          method = "mcquitty", n_clust = 3)
vege_hclu_hierarclust$cluster_info # 3
##   partition_name n_clust requested_n_clust output_cut_height
## 1            K_3       3                 3             0.625
# Network bioregionalization
set.seed(1)
vege_netclu_walktrap <- netclu_walktrap(vegesim,
                                        index = names(vegesim)[3])
vege_netclu_walktrap$cluster_info # 3
##     partition_name n_clust
## K_3            K_3       3

Compare the partitions

Before comparing the partitions, we build a common data.frame containing the three distinct bioregionalizations.

comp <- dplyr::left_join(vege_hclu_hierarclust$clusters,
                         vege_netclu_walktrap$clusters,
                         by = "ID")
colnames(comp) <- c("ID", "K_3_hclu", "K_3_netclu")
comp <- dplyr::left_join(comp,
                         vege_nhclu_kmeans$clusters,
                         by = "ID")
colnames(comp) <- c("ID", "K_3_hclu", "K_3_netclu", "K_3_nhclu")

head(comp)
##    ID K_3_hclu K_3_netclu K_3_nhclu
## 1 505        1          1         2
## 2 988        2          3         1
## 3 920        2          3         1
## 4 904        2          3         1
## 5 962        2          3         1
## 6 762        2          3         1

We can now run the function compare_partitions().

hclu_vs_netclu <- compare_partitions(
  cluster_object = comp[, c("K_3_hclu", "K_3_netclu", "K_3_nhclu")],
  store_pairwise_membership = TRUE,
  cor_frequency = TRUE,
  store_confusion_matrix = TRUE)
hclu_vs_netclu
## Partition comparison:
##  - 3 partitions compared
##  - 715 items in the clustering
##  - No metrics computed
##  - Correlation between each partition and the total frequency of item  pairwise membership computed:
##    # Range:  0.772  -  0.859 
##    # Partition(s) most representative (i.e., highest correlation): 
##  K_3_nhclu 
##  Correlation =  0.859 
##  - Item pairwise membership  stored in outputs
##  - Confusion matrices of partition comparisons  stored in outputs

compare_partitions() produces several outputs which:

- look, within each partition/bioregionalization, at how sites are assigned to bioregions
- compare different partitions/bioregionalizations by analysing whether they produce similar pairwise memberships

Let’s first look at pairwise membership within bioregionalization.

Pairwise membership

The number of pairwise combinations for \(n\) sites equals \(n(n-1)/2\). In our case, with 715 sites, we end up with \(715 \times 714 / 2 = 255255\) pairwise combinations.

nrow(hclu_vs_netclu$pairwise_membership) == nrow(comp)*(nrow(comp)-1)/2
## [1] TRUE
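
As a quick check of the arithmetic, the same count can be reproduced in base R. This is only a sketch: the value 715 is hard-coded from the comparison output above, and the variable names are illustrative.

```r
# Number of sites, taken from the comparison output above (715 items)
n_sites <- 715

# Number of unordered pairs of sites, computed in two equivalent ways
n_pairs_formula <- n_sites * (n_sites - 1) / 2
n_pairs_choose <- choose(n_sites, 2)

n_pairs_formula == n_pairs_choose
```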

Pairwise membership indicates, for each pair of sites, whether the two sites are assigned to the same bioregion or to different bioregions. Let’s look at sites 1 and 9 across the different bioregionalizations:

comp[c(1, 9), ]
##    ID K_3_hclu K_3_netclu K_3_nhclu
## 1 505        1          1         2
## 9 557        1          1         1

We can see that the sites 1 and 9 are classified in the same bioregion in the first two bioregionalizations, but not in the third one.
The $pairwise_membership output of compare_partitions() shows this as a TRUE/FALSE statement.

hclu_vs_netclu$pairwise_membership[8:10, ]
##      K_3_hclu K_3_netclu K_3_nhclu
## 1_9      TRUE       TRUE     FALSE
## 1_10    FALSE      FALSE     FALSE
## 1_11    FALSE       TRUE     FALSE
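
To make the idea concrete, pairwise membership can be derived by hand for a small example. The sketch below uses a hypothetical toy partition of five sites (names and values are invented for illustration) and is not meant to reflect how compare_partitions() is implemented internally.

```r
# Hypothetical toy partition: bioregion assigned to each of five sites
toy_partition <- c(s1 = 1, s2 = 2, s3 = 1, s4 = 2, s5 = 3)

# All unordered pairs of sites, one pair per column
site_pairs <- combn(names(toy_partition), 2)

# TRUE when both sites of a pair fall in the same bioregion
same_bioregion <- toy_partition[site_pairs[1, ]] == toy_partition[site_pairs[2, ]]
names(same_bioregion) <- paste(site_pairs[1, ], site_pairs[2, ], sep = "_")

same_bioregion
```

With five sites there are 10 pairs, and only the pairs whose two sites share a bioregion are flagged TRUE.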

The number of times each pair of sites is clustered together (i.e. the row sums of the table in $pairwise_membership) is available in the $freq_item_pw_membership output:

hclu_vs_netclu$freq_item_pw_membership[c(1, 8)]
## 1_2 1_9 
##   0   2

The sites 1 and 2 were never classified in the same bioregion across the three bioregionalizations. Sites 1 and 9 were classified in the same bioregion in two bioregionalizations. If we look at the total frequencies:

table(hclu_vs_netclu$freq_item_pw_membership)
## 
##      0      1      2      3 
## 115150  43508  42433  54164

we see that the most common case is a pair of sites that is never assigned to the same bioregion (115150 of the 255255 pairs).
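
The frequencies are easier to compare as proportions. The sketch below hard-codes the counts from the table above (the vector name is illustrative) rather than recomputing them from the package output.

```r
# Counts copied from table(hclu_vs_netclu$freq_item_pw_membership) above
freq_counts <- c("0" = 115150, "1" = 43508, "2" = 42433, "3" = 54164)

# The counts add up to the total number of site pairs
total_pairs <- sum(freq_counts)

# Share of site pairs grouped together in 0, 1, 2 or 3 bioregionalizations
freq_share <- freq_counts / total_pairs
round(freq_share, 3)
```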

Confusion matrix

The confusion matrix makes it possible to compare different bioregionalizations by looking at the similarity of their pairwise memberships. To do so, the function computes, for each pair of bioregionalizations, a confusion matrix with four elements:

- \(a\): number of pairs of sites grouped in bioregionalization 1 and in bioregionalization 2
- \(b\): number of pairs of sites grouped in bioregionalization 1 but not in bioregionalization 2
- \(c\): number of pairs of sites not grouped in bioregionalization 1 but grouped in bioregionalization 2
- \(d\): number of pairs of sites grouped in neither bioregionalization 1 nor bioregionalization 2

hclu_vs_netclu$confusion_matrix
## $`K_3_hclu%K_3_netclu`
##      a      b      c      d 
##  63430  30529  36686 124610 
## 
## $`K_3_hclu%K_3_nhclu`
##      a      b      c      d 
##  75181  18778  21610 139686 
## 
## $`K_3_netclu%K_3_nhclu`
##      a      b      c      d 
##  66314  33802  30477 124662
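
The four elements can be illustrated with two logical pairwise-membership vectors. This is a minimal sketch on hypothetical data for six site pairs; it follows the definitions of \(a\), \(b\), \(c\) and \(d\) given above and is not the package's own code.

```r
# Hypothetical pairwise memberships of six site pairs in two bioregionalizations
pw_1 <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
pw_2 <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

confusion <- c(
  a = sum(pw_1 & pw_2),    # pairs grouped in both bioregionalizations
  b = sum(pw_1 & !pw_2),   # pairs grouped in bioregionalization 1 only
  c = sum(!pw_1 & pw_2),   # pairs grouped in bioregionalization 2 only
  d = sum(!pw_1 & !pw_2)   # pairs grouped in neither bioregionalization
)

confusion
```

The four counts always sum to the total number of site pairs, which is also the case for each matrix in the output above.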

Based on the confusion matrices, we can compute a range of indices to indicate the agreement among partitions. As of now, we have implemented:

Rand index: \((a+d)/(a+b+c+d)\). The Rand index measures agreement among partitions by accounting for both the pairs of sites that are grouped and the pairs of sites that are not grouped.

Jaccard index: \(a/(a+b+c)\). The Jaccard index measures agreement among partitions by only accounting for pairs of sites that are grouped.

These two metrics are complementary: the Jaccard index tells whether partitions are similar in the pairs of sites they cluster together, whereas the Rand index tells whether partitions are similar both in the pairs of sites clustered together and in the pairs of sites that are not clustered together. For example, take two partitions which never group together the same pairs of sites. Their Jaccard index will be 0, whereas their Rand index can be > 0 because of the pairs of sites that are grouped in neither partition.

Additional indices can be manually computed by the users on the basis of the list of confusion matrices.
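
As an illustration of such a manual computation, the sketch below applies the Rand and Jaccard formulas given above to the three confusion matrices. The values are copied by hand from the output of hclu_vs_netclu$confusion_matrix, so the block runs without the package; the object names are illustrative.

```r
# Confusion matrices copied from the output of hclu_vs_netclu$confusion_matrix
confusion_matrices <- list(
  "K_3_hclu%K_3_netclu"  = c(a = 63430, b = 30529, c = 36686, d = 124610),
  "K_3_hclu%K_3_nhclu"   = c(a = 75181, b = 18778, c = 21610, d = 139686),
  "K_3_netclu%K_3_nhclu" = c(a = 66314, b = 33802, c = 30477, d = 124662)
)

# Rand index: (a + d) / (a + b + c + d)
rand <- sapply(confusion_matrices, function(m) {
  unname((m["a"] + m["d"]) / sum(m))
})

# Jaccard index: a / (a + b + c)
jaccard <- sapply(confusion_matrices, function(m) {
  unname(m["a"] / (m["a"] + m["b"] + m["c"]))
})

round(cbind(rand, jaccard), 3)
```

Any other index based on \(a\), \(b\), \(c\) and \(d\) can be computed the same way by changing the function passed to sapply().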

In some cases, users may be interested in finding which of the partitions is most representative of all partitions. To find out, we can correlate the pairwise membership of each partition with the total frequency of pairwise membership across all partitions. This correlation can be requested with cor_frequency = TRUE.

hclu_vs_netclu$partition_freq_cor
##   K_3_hclu K_3_netclu  K_3_nhclu 
##  0.8475913  0.7723665  0.8592373

Here the third bioregionalization (K_3_nhclu) has the highest correlation (0.859) and is therefore the most representative of all partitions.
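
The logic of this correlation can be sketched on a small example. The data below are hypothetical (six site pairs, three partitions with invented names), and the use of the default Pearson correlation from cor() is an assumption: the package may compute the correlation differently.

```r
# Hypothetical pairwise membership of six site pairs in three partitions
pw <- cbind(
  part_A = c(TRUE,  TRUE,  FALSE, FALSE, TRUE,  FALSE),
  part_B = c(TRUE,  FALSE, FALSE, FALSE, TRUE,  TRUE),
  part_C = c(TRUE,  TRUE,  FALSE, TRUE,  TRUE,  FALSE)
)

# Number of partitions in which each pair of sites is grouped together
freq <- rowSums(pw)

# Correlation of each partition's membership with that total frequency
# (assumption: Pearson correlation, the default of cor())
partition_cor <- apply(pw, 2, function(x) cor(as.numeric(x), freq))

# Partition with the highest correlation, i.e. the most representative one
names(which.max(partition_cor))
```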