# Compare cluster memberships among multiple partitions

Source: `R/compare_partitions.R`

This function computes pairwise comparisons for several partitions, usually
outputs from the `netclu_`, `hclu_` or `nhclu_` functions. It also provides
the confusion matrix from pairwise comparisons, so that the user can compute
additional comparison metrics.

## Usage

```
compare_partitions(
  cluster_object,
  sample_items = NULL,
  indices = c("rand", "jaccard"),
  cor_frequency = FALSE,
  store_pairwise_membership = TRUE,
  store_confusion_matrix = TRUE
)
```

## Arguments

- cluster_object

  a `bioregion.clusters` object, a `data.frame`, or a list of `data.frame`s containing multiple partitions. At least two partitions are required. If a list of `data.frame`s is provided, they should all have the same number of rows (i.e., the same items in the clustering for all partitions).

- sample_items

  `NULL` or a positive integer. Reduces the number of items used in the comparison of partitions. Useful if the number of items is high and pairwise comparisons cannot be computed. Suggested values: 5000 or 10000.

- indices

  `NULL` or `character`. Indices to compute for the pairwise comparison of partitions. Currently available metrics are `"rand"` and `"jaccard"`.

- cor_frequency

  a boolean. If `TRUE`, computes the correlation between each partition and the total frequency of co-membership of items across all partitions. Useful to identify which partition(s) is (are) most representative of all the computed partitions.

- store_pairwise_membership

  a boolean. If `TRUE`, the pairwise membership of items is stored in the output object.

- store_confusion_matrix

  a boolean. If `TRUE`, the confusion matrices of pairwise partition comparisons are stored in the output object.

## Value

A `list` with 4 to 7 elements:

- `args`: arguments provided by the user.

- `inputs`: information on the input partitions, such as the number of items being clustered.

- (optional) `pairwise_membership`: only if `store_pairwise_membership = TRUE`. This element contains the pairwise memberships of all items for each partition, in the form of a `boolean matrix` where `TRUE` means that two items are in the same cluster, and `FALSE` means that two items are not in the same cluster.

- `freq_item_pw_membership`: a `numeric vector` containing the number of times each pair of items is clustered together. It corresponds to the row sums of the table in `pairwise_membership`.

- (optional) `partition_freq_cor`: only if `cor_frequency = TRUE`. A `numeric vector` indicating the correlation between individual partitions and the total frequency of pairwise membership across all partitions. It corresponds to the correlation between individual columns in `pairwise_membership` and `freq_item_pw_membership`.

- (optional) `confusion_matrix`: only if `store_confusion_matrix = TRUE`. A `list` containing all confusion matrices between each pair of partitions.

- `partition_comparison`: a `data.frame` containing the results of the comparison of partitions, where the first column indicates which partitions are compared, and the next columns correspond to the requested `indices`.

## Details

This function proceeds in two main steps:

1. The first step is done within each partition. It compares all pairs of items and records whether they are clustered together (`TRUE`) or separately (`FALSE`) in each partition. For example, if site 1 and site 2 are in the same cluster in partition 1, then the pairwise membership site1_site2 will be `TRUE`. The output of this first step is stored in the slot `pairwise_membership` if `store_pairwise_membership = TRUE`.

2. The second step compares all pairs of partitions by analysing whether their pairwise memberships are similar. To do so, for each pair of partitions, the function computes a confusion matrix with four elements:

   - *a*: number of pairs of items grouped in partition 1 and in partition 2
   - *b*: number of pairs of items grouped in partition 1 but not in partition 2
   - *c*: number of pairs of items not grouped in partition 1 but grouped in partition 2
   - *d*: number of pairs of items grouped in neither partition 1 nor partition 2

The confusion matrix is stored in `confusion_matrix` if
`store_confusion_matrix = TRUE`.
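The two steps can be illustrated with a minimal base R sketch. It is not the package's implementation: the helpers `pairwise_membership()` and `confusion_counts()` are hypothetical names introduced here, and the two partitions are columns `X1` and `X4` of the data used under Examples.

```r
# Hypothetical helper (not part of the package): for one partition, return a
# logical vector telling whether each pair of items shares a cluster
pairwise_membership <- function(partition) {
  pairs <- combn(seq_along(partition), 2)  # all unordered pairs of items
  partition[pairs[1, ]] == partition[pairs[2, ]]
}

# Hypothetical helper: the four confusion counts for two partitions
confusion_counts <- function(p1, p2) {
  m1 <- pairwise_membership(p1)
  m2 <- pairwise_membership(p2)
  c(a = sum(m1 & m2),    # grouped in both partitions
    b = sum(m1 & !m2),   # grouped in partition 1 only
    c = sum(!m1 & m2),   # grouped in partition 2 only
    d = sum(!m1 & !m2))  # grouped in neither partition
}

# Columns X1 and X4 of the example partitions
x1 <- c(1, 1, 2, 2)
x4 <- c(1, 1, 1, 2)

conf <- confusion_counts(x1, x4)
conf
#> a b c d
#> 1 1 2 2
```

With four items there are six pairs, so the four counts always sum to 6 here.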

Based on the confusion matrices, we can compute a range of indices to indicate the agreement among partitions. As of now, we have implemented:

- *Rand index*: \((a + d)/(a + b + c + d)\). The Rand index measures agreement among partitions by accounting for both the pairs of sites that are grouped and the pairs of sites that are not grouped.
- *Jaccard index*: \(a/(a + b + c)\). The Jaccard index measures agreement among partitions by accounting only for pairs of sites that are grouped.

These two metrics are complementary, because the Jaccard index will tell if partitions are similar in their clustering structure, whereas the Rand index will tell if partitions are similar not only in the pairs of items clustered together, but also in terms of the pairs of sites that are not clustered together. For example, take two partitions which never group together the same pairs of sites. Their Jaccard index will be 0, whereas the Rand index can be > 0 due to the sites that are not grouped together.
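The contrast described above can be checked by hand. The sketch below applies the two formulas to counts worked out manually for columns `X1` (clusters 1, 1, 2, 2) and `X3` (clusters 1, 2, 3, 4) of the data under Examples, two partitions that never group the same pair. The function names `rand_index()` and `jaccard_index()` are hypothetical and are not exported by the package.

```r
# Hypothetical helpers implementing the two formulas given above
rand_index <- function(a, b, c, d) (a + d) / (a + b + c + d)
jaccard_index <- function(a, b, c) a / (a + b + c)

# X1 groups two pairs (items 1-2 and items 3-4); X3 groups none.
# Hence a = 0, b = 2, c = 0, and the remaining four pairs give d = 4.
a <- 0; b <- 2; c <- 0; d <- 4

rand <- rand_index(a, b, c, d)
jaccard <- jaccard_index(a, b, c)

rand
#> [1] 0.6666667
jaccard
#> [1] 0
```

In this sketch the Jaccard index is undefined (`NaN`) when neither partition groups any pair, since `a + b + c` is then 0.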

Additional indices can be manually computed by the users on the basis of the list of confusion matrices.
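As an illustration of such a manual computation, the sketch below derives the Fowlkes-Mallows index, \(a/\sqrt{(a + b)(a + c)}\), from one set of counts. The named vector `conf` is a hypothetical stand-in for one stored confusion matrix; the actual layout of the elements of `confusion_matrix` is an assumption and should be checked, for example with `str()`, before adapting this.

```r
# Hypothetical counts for one pair of partitions (a, b, c, d as defined above);
# the actual layout of the stored confusion matrices may differ
conf <- c(a = 1, b = 1, c = 2, d = 2)

# Fowlkes-Mallows index: a / sqrt((a + b) * (a + c))
fowlkes_mallows <- function(conf) {
  unname(conf["a"] / sqrt((conf["a"] + conf["b"]) * (conf["a"] + conf["c"])))
}

fm <- fowlkes_mallows(conf)
round(fm, 3)
#> [1] 0.408
```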

In some cases, users may be interested in finding which of the partitions
is most representative of all partitions. To identify it, we can
compare the pairwise membership of each partition with the total frequency
of pairwise membership across all partitions. This correlation can be
requested with `cor_frequency = TRUE`.
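The idea can be sketched in base R on the data used under Examples. This is not the package's code. Using the Pearson correlation (the default of `cor()`) is an assumption, although it reproduces the value of 0.883 reported for `X1` and `X2` in the Examples output. Leaving the correlation as `NA` for a partition with no variation in membership is also a choice made for this sketch only.

```r
# The four example partitions, one column per partition
partitions <- data.frame(X1 = c(1, 1, 2, 2),
                         X2 = c(2, 2, 1, 1),
                         X3 = c(1, 2, 3, 4),
                         X4 = c(1, 1, 1, 2))

# Pairwise membership: one row per pair of items, one column per partition
pairs <- combn(nrow(partitions), 2)
membership <- sapply(partitions, function(p) p[pairs[1, ]] == p[pairs[2, ]])

# Number of partitions in which each pair of items is clustered together
freq <- rowSums(membership)

# Correlation between each partition's membership and the total frequency.
# X3 never groups any pair, so its membership is constant and the
# correlation is undefined; this sketch returns NA in that case.
cors <- apply(membership, 2, function(m) {
  m <- as.numeric(m)
  if (sd(m) == 0) NA_real_ else cor(m, freq)
})

round(cors, 3)
#>    X1    X2    X3    X4
#> 0.883 0.883    NA 0.469
```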

## Author

Boris Leroy (leroy.boris@gmail.com), Maxime Lenormand (maxime.lenormand@inrae.fr) and Pierre Denelle (pierre.denelle@gmail.com)

## Examples

```
# A simple case with four partitions of four items
partitions <- data.frame(matrix(nrow = 4, ncol = 4,
                                c(1,2,1,1,1,2,2,1,2,1,3,1,2,1,4,2),
                                byrow = TRUE))
partitions
#>   X1 X2 X3 X4
#> 1  1  2  1  1
#> 2  1  2  2  1
#> 3  2  1  3  1
#> 4  2  1  4  2
compare_partitions(partitions)
#> 2024-07-25 18:17:08.518328 - Computing pairwise membership comparisons for eachpartition...
#> 2024-07-25 18:17:08.519487 - Comparing memberships among partitions...
#> 2024-07-25 18:17:08.519995 - Computing Rand index...
#> 2024-07-25 18:17:08.52022 - Computing Jaccard index...
#> Partition comparison:
#> - 4 partitions compared
#> - 4 items in the clustering
#> - Requested indices: rand jaccard
#> - Metric summary:
#>           rand jaccard
#> Min  0.5000000    0.00
#> Mean 0.6388889    0.25
#> Max  1.0000000    1.00
#> - Item pairwise membership stored in outputs
#> - Confusion matrices of partition comparisons stored in outputs
# Find out which partitions are most representative
compare_partitions(partitions,
                   cor_frequency = TRUE)
#> 2024-07-25 18:17:08.522107 - Computing pairwise membership comparisons for eachpartition...
#> 2024-07-25 18:17:08.522878 - Comparing memberships among partitions...
#> 2024-07-25 18:17:08.52332 - Computing Rand index...
#> 2024-07-25 18:17:08.523515 - Computing Jaccard index...
#> 2024-07-25 18:17:08.523682 - Computing the correlation between each partition and the vector of frequency of pairwise membership...
#> Partition comparison:
#> - 4 partitions compared
#> - 4 items in the clustering
#> - Requested indices: rand jaccard
#> - Metric summary:
#>           rand jaccard
#> Min  0.5000000    0.00
#> Mean 0.6388889    0.25
#> Max  1.0000000    1.00
#> - Correlation between each partition and the total frequency of item pairwise membership computed:
#> # Range: 0 - 0.883
#> # Partition(s) most representative (i.e., highest correlation):
#> X1, X2
#> Correlation = 0.883
#> - Item pairwise membership stored in outputs
#> - Confusion matrices of partition comparisons stored in outputs
```