Search for an optimal number of clusters in a list of partitions
Source: R/find_optimal_n.R (find_optimal_n.Rd)
This function aims to optimize one or several criteria on a set of ordered partitions. It is typically used to find an optimal number of clusters at which to cut a hierarchical tree, or to choose among a range of partitions obtained with k-means or PAM. Users should exercise caution in other cases (e.g., unordered or unrelated partitions).
Usage
find_optimal_n(
  partitions,
  metrics_to_use = "all",
  criterion = "elbow",
  step_quantile = 0.99,
  step_levels = NULL,
  step_round_above = TRUE,
  metric_cutoffs = c(0.5, 0.75, 0.9, 0.95, 0.99, 0.999),
  n_breakpoints = 1,
  plot = TRUE
)
Arguments
- partitions
A bioregion.partition.metrics object (output from bioregionalization_metrics()) or a data.frame with the first two columns named K (partition name) and n_clusters (number of clusters), followed by columns containing numeric evaluation metrics.
- metrics_to_use
A character vector or single string specifying which metrics in partitions to use when calculating optimal cluster counts. Defaults to "all" (uses all metrics).
- criterion
A character string specifying the criterion used to identify optimal cluster counts. Options are "elbow", "increasing_step", "decreasing_step", "cutoff", "breakpoints", "min", or "max". Defaults to "elbow". See Details.
- step_quantile
For "increasing_step" or "decreasing_step", the quantile of differences between consecutive partitions used as the cutoff to identify significant steps in eval_metric.
- step_levels
For "increasing_step" or "decreasing_step", the number of largest steps to retain as cutoffs.
- step_round_above
A boolean indicating whether the optimal number of clusters lies above (TRUE) or below (FALSE) the identified steps. Defaults to TRUE.
- metric_cutoffs
For criterion = "cutoff", the cutoffs of eval_metric at which cluster counts are extracted.
- n_breakpoints
The number of breakpoints to find in the curve. Defaults to 1.
- plot
A boolean indicating whether to plot the first eval_metric with the identified optimal cluster counts.
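To make the expected data.frame input concrete, here is a minimal, hypothetical example (the metric values are invented for illustration only):

```r
# Hypothetical partitions table accepted by find_optimal_n():
# the first two columns must be named K and n_clusters, followed by
# one or more numeric evaluation-metric columns.
partitions_df <- data.frame(
  K = paste0("K_", 2:5),                     # partition names
  n_clusters = 2:5,                          # number of clusters per partition
  pc_distance = c(0.41, 0.58, 0.66, 0.70),   # invented metric values
  anosim = c(0.22, 0.35, 0.41, 0.44)         # invented metric values
)
```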
Value
A list of class bioregion.optimal.n with these elements:
- args: Input arguments.
- evaluation_df: The input evaluation data.frame, appended with boolean columns flagging the optimal cluster counts.
- optimal_nb_clusters: A list with the optimal cluster count(s) for each metric in metrics_to_use, based on the chosen criterion.
- plot: The plot (if requested).
Details
This function explores evaluation metric ~ cluster relationships, applying criteria to find optimal cluster counts.
Note on criteria: Several criteria can return multiple optimal cluster counts, emphasizing hierarchical or nested bioregionalizations. This approach aligns with modern recommendations for biological datasets, as seen in Ficetola et al. (2017)'s reanalysis of Holt et al. (2013).
Criteria for optimal clusters:
- elbow: Identifies the "elbow" point in the evaluation metric curve, where incremental improvements diminish. Based on finding the point at maximum distance from the straight line linking the curve's endpoints.
- increasing_step or decreasing_step: Highlights significant increases or decreases in metrics by analyzing pairwise differences between consecutive partitions. Users specify step_quantile or step_levels.
- cutoff: Derives cluster counts from specified metric cutoffs, e.g., as in Holt et al. (2013). Adjust cutoffs to the spatial scale of your study.
- breakpoints: Uses segmented regression to find breakpoints. Requires specifying n_breakpoints.
- min & max: Selects the cluster counts at the minimum or maximum metric values.
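As a sketch of the elbow criterion described above (an illustrative implementation, not the package's actual code), the elbow can be located as the point on the metric curve farthest from the straight line joining its endpoints:

```r
# Illustrative elbow finder: returns the n_clusters value whose point on
# the metric ~ n_clusters curve lies at maximum perpendicular distance
# from the line joining the first and last points of the curve.
elbow_point <- function(n_clusters, metric) {
  x1 <- n_clusters[1];                  y1 <- metric[1]
  x2 <- n_clusters[length(n_clusters)]; y2 <- metric[length(metric)]
  dists <- abs((y2 - y1) * n_clusters - (x2 - x1) * metric +
                 x2 * y1 - y2 * x1) / sqrt((y2 - y1)^2 + (x2 - x1)^2)
  n_clusters[which.max(dists)]
}

# A metric that saturates after 4 clusters:
elbow_point(2:10, c(0.2, 0.5, 0.8, 0.85, 0.87, 0.88, 0.89, 0.90, 0.90))
# returns 4
```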
Note
Please note that finding the optimal number of clusters normally requires decisions from the user and can therefore hardly be fully automated. Users are strongly advised to read the references below for guidance on choosing their optimal number(s) of clusters. Consider the "optimal" numbers of clusters returned by this function as a first approximation of the best numbers for your bioregionalization.
References
Holt BG, Lessard J, Borregaard MK, Fritz SA, Araújo MB, Dimitrov D, Fabre P, Graham CH, Graves GR, Jønsson Ka, Nogués-Bravo D, Wang Z, Whittaker RJ, Fjeldså J & Rahbek C (2013) An update of Wallace's zoogeographic regions of the world. Science 339, 74-78.
Ficetola GF, Mazel F & Thuiller W (2017) Global determinants of zoogeographical boundaries. Nature Ecology & Evolution 1, 0089.
See also
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_1_hierarchical_clustering.html#optimaln.
Associated functions: hclu_hierarclust
Author
Boris Leroy (leroy.boris@gmail.com)
Maxime Lenormand (maxime.lenormand@inrae.fr)
Pierre Denelle (pierre.denelle@gmail.com)
Examples
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)
comnet <- mat_to_net(comat)
dissim <- dissimilarity(comat, metric = "all")
# User-defined number of clusters
tree1 <- hclu_hierarclust(dissim,
n_clust = 2:15)
#> Building the iterative hierarchical consensus tree... Note that this process can take time especially if you have a lot of sites.
#>
#> Final tree has a 0.7975 cophenetic correlation coefficient with the initial dissimilarity matrix
tree1
#> Clustering results for algorithm : hclu_hierarclust
#> (hierarchical clustering based on a dissimilarity matrix)
#> - Number of sites: 20
#> - Name of dissimilarity metric: Jaccard
#> - Tree construction method: average
#> - Randomization of the dissimilarity matrix: yes, number of trials 100
#> - Method to compute the final tree: Iterative consensus hierarchical tree
#> - Cophenetic correlation coefficient: 0.798
#> - Number of clusters requested by the user: 2 3 4 5 6 7 8 9 10 11 ... (with 4 more values)
#> Clustering results:
#> - Number of partitions: 14
#> - Partitions are hierarchical
#> - Number of clusters: 2 3 4 5 6 7 8 9 10 11 ... (with 4 more values)
#> - Height of cut of the hierarchical tree: 0.266 0.25 0.234 0.219 0.217 0.215 0.188 0.164 0.156 0.141 ... (with 4 more values)
a <- bioregionalization_metrics(tree1,
dissimilarity = dissim,
net = comnet,
species_col = "Node2",
site_col = "Node1",
eval_metric = c("tot_endemism",
"avg_endemism",
"pc_distance",
"anosim"))
#> Computing similarity-based metrics...
#> - pc_distance OK
#> - anosim OK
#> Computing composition-based metrics...
#> - avg_endemism OK
#> - tot_endemism OK
find_optimal_n(a)
#> Number of partitions: 14
#> Searching for potential optimal number(s) of clusters based on the elbow method
#> * elbow found at:
#> tot_endemism 3
#> avg_endemism 3
#> pc_distance 10
#> anosim 8
#> Warning: The elbow method is likely not suitable for the ANOSIM metric. You should rather look for leaps in the curve (see criterion = 'increasing_step' or decreasing_step)
#> Plotting results...
#> Search for an optimal number of clusters:
#> - 14 partition(s) evaluated
#> - Range of clusters explored: from 2 to 15
#> - Evaluated metric(s): tot_endemism avg_endemism pc_distance anosim
#>
#> Potential optimal partition(s):
#> - Criterion chosen to optimise the number of clusters: elbow
#> - Optimal partition(s) of clusters for each metric:
#> tot_endemism - 3
#> avg_endemism - 3
#> pc_distance - 10
#> anosim - 8
find_optimal_n(a, criterion = "increasing_step")
#> Number of partitions: 14
#> Searching for potential optimal number(s) of clusters based on the increasing_step method
#> - Step method
#> Warning: Criterion 'increasing_step' cannot work properly with metric 'tot_endemism', because this metric is usually monotonously decreasing. Consider using criterion = 'decreasing_step' instead.
#> Plotting results...
#> Search for an optimal number of clusters:
#> - 14 partition(s) evaluated
#> - Range of clusters explored: from 2 to 15
#> - Evaluated metric(s): tot_endemism avg_endemism pc_distance anosim
#>
#> Potential optimal partition(s):
#> - Criterion chosen to optimise the number of clusters: increasing_step
#> (step quantile chosen: 0.99 (i.e., only the top 1 % increase in evaluation metrics are used as break points for the number of clusters)
#> - Optimal partition(s) of clusters for each metric:
#> tot_endemism -
#> avg_endemism -
#> pc_distance - 3
#> anosim - 3
find_optimal_n(a, criterion = "decreasing_step")
#> Number of partitions: 14
#> Searching for potential optimal number(s) of clusters based on the decreasing_step method
#> - Step method
#> Warning: Criterion 'decreasing_step' cannot work properly with metrics 'pc_distance' or 'avg_endemism', because these metrics are usually monotonously decreasing. Consider using criterion = 'increasing_step' instead.
#> Plotting results...
#> Search for an optimal number of clusters:
#> - 14 partition(s) evaluated
#> - Range of clusters explored: from 2 to 15
#> - Evaluated metric(s): tot_endemism avg_endemism pc_distance anosim
#>
#> Potential optimal partition(s):
#> - Criterion chosen to optimise the number of clusters: decreasing_step
#> (step quantile chosen: 0.99 (i.e., only the top 1 % decrease in evaluation metrics are used as break points for the number of clusters)
#> - Optimal partition(s) of clusters for each metric:
#> tot_endemism - 3
#> avg_endemism - 3
#> pc_distance - 13
#> anosim - 9
find_optimal_n(a, criterion = "decreasing_step",
step_levels = 3)
#> Number of partitions: 14
#> Searching for potential optimal number(s) of clusters based on the decreasing_step method
#> - Step method
#> Warning: Criterion 'decreasing_step' cannot work properly with metrics 'pc_distance' or 'avg_endemism', because these metrics are usually monotonously decreasing. Consider using criterion = 'increasing_step' instead.
#> Warning: The number of optimal N for method 'tot_endemism' is suspiciously high, consider switching between 'increasing_step' and 'decreasing_step'
#> Warning: The number of optimal N for method 'avg_endemism' is suspiciously high, consider switching between 'increasing_step' and 'decreasing_step'
#> Plotting results...
#> Search for an optimal number of clusters:
#> - 14 partition(s) evaluated
#> - Range of clusters explored: from 2 to 15
#> - Evaluated metric(s): tot_endemism avg_endemism pc_distance anosim
#>
#> Potential optimal partition(s):
#> - Criterion chosen to optimise the number of clusters: decreasing_step
#> (step quantile chosen: 0.99 (i.e., only the top 1 % decrease in evaluation metrics are used as break points for the number of clusters)
#> - Optimal partition(s) of clusters for each metric:
#> tot_endemism - 3 4 5 6 7 8 9 10 11 12 13 14 15
#> avg_endemism - 3 4 5 6 7 8 9 10 11 12 13 14 15
#> pc_distance - 6 13 15
#> anosim - 6 9 13
find_optimal_n(a, criterion = "decreasing_step",
step_quantile = .9)
#> Number of partitions: 14
#> Searching for potential optimal number(s) of clusters based on the decreasing_step method
#> - Step method
#> Warning: Criterion 'decreasing_step' cannot work properly with metrics 'pc_distance' or 'avg_endemism', because these metrics are usually monotonously decreasing. Consider using criterion = 'increasing_step' instead.
#> Plotting results...
#> Search for an optimal number of clusters:
#> - 14 partition(s) evaluated
#> - Range of clusters explored: from 2 to 15
#> - Evaluated metric(s): tot_endemism avg_endemism pc_distance anosim
#>
#> Potential optimal partition(s):
#> - Criterion chosen to optimise the number of clusters: decreasing_step
#> (step quantile chosen: 0.9 (i.e., only the top 10 % decrease in evaluation metrics are used as break points for the number of clusters)
#> - Optimal partition(s) of clusters for each metric:
#> tot_endemism - 3
#> avg_endemism - 3
#> pc_distance - 6 13
#> anosim - 6 9
find_optimal_n(a, criterion = "breakpoints")
#> Warning: Metrics tot_endemism, avg_endemism do not vary sufficiently to use breakpoints, so they were removed.
#> Number of partitions: 14
#> Searching for potential optimal number(s) of clusters based on the breakpoints method
#> Plotting results...
#> (the red line is the prediction from the segmented regression)
#> Search for an optimal number of clusters:
#> - 14 partition(s) evaluated
#> - Range of clusters explored: from 2 to 15
#> - Evaluated metric(s): pc_distance anosim
#>
#> Potential optimal partition(s):
#> - Criterion chosen to optimise the number of clusters: breakpoints
#> - Optimal partition(s) of clusters for each metric:
#> pc_distance - 10
#> anosim - 4