Hierarchical clustering based on dissimilarity or beta-diversity
Source: R/hclu_hierarclust.R
This function generates a hierarchical tree from a dissimilarity (beta-diversity) data.frame, calculates the cophenetic correlation coefficient, and can extract clusters from the tree if requested by the user. The function randomizes the order of sites in the dissimilarity matrix to generate the tree, and offers three methods to obtain the final tree.
Typically, the dissimilarity data.frame is a bioregion.pairwise.metric object obtained by running dissimilarity(), or by running similarity() and then similarity_to_dissimilarity().
Usage
hclu_hierarclust(
  dissimilarity,
  index = names(dissimilarity)[3],
  method = "average",
  randomize = TRUE,
  n_runs = 100,
  keep_trials = FALSE,
  optimal_tree_method = "iterative_consensus_tree",
  n_clust = NULL,
  cut_height = NULL,
  find_h = TRUE,
  h_max = 1,
  h_min = 0,
  consensus_p = 0.5,
  verbose = TRUE
)
Arguments
- dissimilarity
the output object from dissimilarity() or similarity_to_dissimilarity(), or a dist object. If a data.frame is used, the first two columns represent pairs of sites (or any pair of nodes), and the next column(s) are the dissimilarity indices.
- index
name or number of the dissimilarity column to use. By default, the third column name of dissimilarity is used.
- method
name of the hierarchical classification method, as in hclust. Should be one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
- randomize
a boolean indicating whether the dissimilarity matrix should be randomized, to account for the order of sites in the dissimilarity matrix.
- n_runs
number of trials used to randomize the dissimilarity matrix.
- keep_trials
a boolean indicating whether all random trial results should be stored in the output object (set to FALSE to save space if your dissimilarity object is large). Note that it cannot be set to TRUE if optimal_tree_method = "iterative_consensus_tree".
- optimal_tree_method
a character indicating how the final tree should be obtained from all trials. Possible values are iterative_consensus_tree (default), best and consensus. We recommend iterative_consensus_tree. See Details.
- n_clust
an integer or an integer vector indicating the number of clusters to be obtained from the hierarchical tree, or the output from partition_metrics. Should not be used at the same time as cut_height.
- cut_height
a numeric vector indicating the height(s) at which the tree should be cut. Should not be used at the same time as n_clust.
- find_h
a boolean indicating whether the height of cut should be found for the requested n_clust.
- h_max
a numeric indicating the maximum possible tree height for the chosen index.
- h_min
a numeric indicating the minimum possible height in the tree for the chosen index.
- consensus_p
a numeric (only if optimal_tree_method = "consensus") indicating the threshold proportion of trees that must support a region/cluster for it to be included in the final consensus tree.
- verbose
a boolean (only if optimal_tree_method = "iterative_consensus_tree"). Set to FALSE to disable the progress message.
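To make the expected data.frame layout concrete, here is a minimal base R sketch of a long-format dissimilarity table and its conversion to a dist object. The column names Site1, Site2 and Simpson and the values are invented for illustration; this is not how the package converts its inputs internally.

```r
# Hypothetical long-format table: two columns of site pairs, then one metric.
dissim_df <- data.frame(
  Site1 = c("A", "A", "B"),
  Site2 = c("B", "C", "C"),
  Simpson = c(0.2, 0.7, 0.5)
)

# Fill a symmetric square matrix by indexing on the site names.
sites <- unique(c(dissim_df$Site1, dissim_df$Site2))
m <- matrix(0, length(sites), length(sites), dimnames = list(sites, sites))
m[cbind(dissim_df$Site1, dissim_df$Site2)] <- dissim_df$Simpson
m[cbind(dissim_df$Site2, dissim_df$Site1)] <- dissim_df$Simpson

# A dist object like this one can be passed to stats::hclust().
d <- as.dist(m)
d
```

With several metric columns in the table, the index argument described above selects which one plays the role of the Simpson column in this sketch.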
Value
A list of class bioregion.clusters with the following slots:
- name: character containing the name of the algorithm
- args: list of input arguments as provided by the user
- inputs: list of characteristics of the clustering process
- algorithm: list of all objects associated with the clustering procedure, such as original cluster objects
- clusters: data.frame containing the clustering results
- cluster_info: data.frame containing information on each partition (see Examples)
In the algorithm slot, users can find the following elements:
- trials: a list containing all randomization trials. Each trial contains the dissimilarity matrix, with site order randomized, the associated tree and the cophenetic correlation coefficient (Spearman) for that tree
- final.tree: an hclust object containing the final hierarchical tree to be used
- final.tree.coph.cor: the cophenetic correlation coefficient between the initial dissimilarity matrix and final.tree
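For readers unfamiliar with the cophenetic correlation coefficient, the following base R sketch shows one way to compute it for a single tree. It uses only stats functions and random toy data, and it assumes a Spearman correlation as mentioned for the trials above; the package's own computation may differ in detail.

```r
set.seed(42)
# Toy data: 10 sites described by 4 random variables (illustrative only).
x <- matrix(runif(40), nrow = 10,
            dimnames = list(paste0("Site", 1:10), NULL))

d <- dist(x)                            # initial dissimilarity
tree <- hclust(d, method = "average")   # UPGMA tree

# Cophenetic distance: height at which two sites are first merged in the tree.
coph <- cophenetic(tree)

# Rank correlation between the original and the tree-implied distances.
coph_cor <- cor(as.vector(d), as.vector(coph), method = "spearman")
coph_cor
```

A value close to 1 indicates that the tree preserves the ranking of the original pairwise dissimilarities well.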
Details
The function is based on hclust. The default method for the hierarchical tree is average, i.e. UPGMA, as it has been recommended as the best method to generate a tree from beta-diversity dissimilarity (Kreft & Jetz, 2010).
Clusters can be obtained by two methods:
- Specifying a desired number of clusters in n_clust
- Specifying one or several heights of cut in cut_height
To find an optimal number of clusters, see partition_metrics().
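The two ways of cutting a tree can be illustrated with stats::cutree() on a toy hclust object. The helper find_height() below is hypothetical: it sketches a simple bisection between h_min and h_max to find a height giving a requested number of clusters, and is only an assumption about how such a search could work, not the package's code.

```r
# Six sites placed on a line, so that merge heights are easy to follow.
pos <- matrix(c(0, 1, 2, 10, 11, 30), ncol = 1,
              dimnames = list(paste0("Site", 1:6), NULL))
tree <- hclust(dist(pos), method = "average")

# 1. Cut by number of clusters.
by_k <- cutree(tree, k = 3)

# 2. Cut at one or several heights (one column per height).
by_h <- cutree(tree, h = c(1.2, 5))

# Hypothetical helper: bisection search for a height yielding k clusters.
find_height <- function(tree, k, h_min = 0, h_max = 1, max_iter = 50) {
  for (i in seq_len(max_iter)) {
    h <- (h_min + h_max) / 2
    n <- length(unique(cutree(tree, h = h)))
    if (n == k) return(h)
    # Too many clusters: cut higher; too few: cut lower.
    if (n > k) h_min <- h else h_max <- h
  }
  NA_real_  # no height gives exactly k clusters (e.g. tied merge heights)
}

find_height(tree, k = 3, h_min = 0, h_max = 30)
find_height(tree, k = 5, h_min = 0, h_max = 30)
```

In this toy tree two merges happen at exactly the same height, so no cut produces five clusters and the helper returns NA. This is the kind of situation in which a requested number of clusters cannot be reached.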
The order of rows in the input distance matrix influences the tree topology, as explained in Dapporto et al. (2013). To address this, the function generates multiple trees by randomizing the distance matrix. Three methods are available to obtain the final tree:
- optimal_tree_method = "iterative_consensus_tree": The Iterative Hierarchical Consensus Tree (IHCT) method reconstructs a consensus tree by iteratively splitting the dataset into two subclusters based on the pairwise dissimilarity of sites across n_runs trees built from n_runs randomizations of the distance matrix. At each iteration, it identifies the majority membership of sites into two stable groups across all trees, calculates the height based on the selected linkage method (method), and enforces monotonic constraints on node heights to produce a coherent tree structure. This approach provides a robust, hierarchical representation of site relationships, balancing cluster stability and hierarchical constraints.
- optimal_tree_method = "best": This method selects the tree with the highest cophenetic correlation coefficient among all trials, representing the best fit between the hierarchical structure and the original distance matrix.
- optimal_tree_method = "consensus": This method constructs a consensus tree using phylogenetic methods with the function consensus. When using this option, you must set the consensus_p parameter, which indicates the proportion of trees that must contain a region/cluster for it to be included in the final consensus tree. Consensus trees lack an inherent height because they represent a majority structure rather than an actual hierarchical clustering. To assign heights, we use a non-negative least squares method (nnls.tree) based on the initial distance matrix, ensuring that the consensus tree preserves approximate distances among clusters.
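The idea behind the "best" option can be sketched in base R: shuffle the site order, build a tree for each shuffle, and keep the tree with the highest cophenetic correlation. The toy matrix below contains tied values on purpose, since ties are one situation where row order can change the topology. The function one_trial() is a hypothetical name and the code is a rough illustration, not the package's implementation.

```r
set.seed(1)
sites <- paste0("Site", 1:8)

# Symmetric toy dissimilarity matrix with many tied values.
m <- matrix(0, 8, 8, dimnames = list(sites, sites))
m[upper.tri(m)] <- sample(c(0.2, 0.4, 0.6), 28, replace = TRUE)
m <- m + t(m)

# One randomization trial: permute the sites, build the tree, score it.
one_trial <- function(m, method = "average") {
  ord <- sample(rownames(m))
  tree <- hclust(as.dist(m[ord, ord]), method = method)
  # Put cophenetic distances back in the original site order before comparing.
  coph <- as.matrix(cophenetic(tree))[rownames(m), colnames(m)]
  list(
    tree = tree,
    coph_cor = cor(m[lower.tri(m)], coph[lower.tri(coph)],
                   method = "spearman")
  )
}

trials <- replicate(20, one_trial(m), simplify = FALSE)
coph_cors <- vapply(trials, function(x) x$coph_cor, numeric(1))

range(coph_cors)                               # spread among trials
best_tree <- trials[[which.max(coph_cors)]]$tree
```

When the range of coefficients is wide, the choice of the final tree matters more, which is the motivation for the consensus-based options described above.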
We recommend using the "iterative_consensus_tree" method, as all the branches of this tree will always reflect the majority decision among many randomized versions of the distance matrix. This method is inspired by Dapporto et al. (2015), which also used the majority decision among many randomized versions of the distance matrix, but it expands it to reconstruct the entire topology of the tree iteratively.
We do not recommend using the basic consensus method because in many contexts it provides inconsistent results, with a meaningless tree topology and a very low cophenetic correlation coefficient.
For a fast exploration of the tree, we recommend using the best method, which only selects the tree with the highest cophenetic correlation coefficient among all randomized versions of the distance matrix.
Author
Boris Leroy (leroy.boris@gmail.com), Pierre Denelle (pierre.denelle@gmail.com) and Maxime Lenormand (maxime.lenormand@inrae.fr)
Examples
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)
dissim <- dissimilarity(comat, metric = "Simpson")
# User-defined number of clusters
tree1 <- hclu_hierarclust(dissim,
                          n_clust = 5)
#> Building the iterative hierarchical consensus tree... Note that this process can take time especially if you have a lot of sites.
#>
#> Final tree has a 0.4777 cophenetic correlation coefficient with the initial dissimilarity
#> matrix
#> Determining the cut height to reach 5 groups...
#> --> 0.109375
tree1
#> Clustering results for algorithm : hclu_hierarclust
#> (hierarchical clustering based on a dissimilarity matrix)
#> - Number of sites: 20
#> - Name of dissimilarity metric: Simpson
#> - Tree construction method: average
#> - Randomization of the dissimilarity matrix: yes, number of trials 100
#> - Method to compute the final tree: Iterative consensus hierarchical tree
#> - Cophenetic correlation coefficient: 0.478
#> - Number of clusters requested by the user: 5
#> Clustering results:
#> - Number of partitions: 1
#> - Number of clusters: 5
#> - Height of cut of the hierarchical tree: 0.109
plot(tree1)
str(tree1)
#> List of 6
#> $ name : chr "hclu_hierarclust"
#> $ args :List of 14
#> ..$ index : chr "Simpson"
#> ..$ method : chr "average"
#> ..$ randomize : logi TRUE
#> ..$ n_runs : num 100
#> ..$ optimal_tree_method: chr "iterative_consensus_tree"
#> ..$ keep_trials : logi FALSE
#> ..$ n_clust : num 5
#> ..$ cut_height : NULL
#> ..$ find_h : logi TRUE
#> ..$ h_max : num 1
#> ..$ h_min : num 0
#> ..$ consensus_p : num 0.5
#> ..$ verbose : logi TRUE
#> ..$ dynamic_tree_cut : logi FALSE
#> $ inputs :List of 7
#> ..$ bipartite : logi FALSE
#> ..$ weight : logi TRUE
#> ..$ pairwise : logi TRUE
#> ..$ pairwise_metric: chr "Simpson"
#> ..$ dissimilarity : logi TRUE
#> ..$ nb_sites : int 20
#> ..$ hierarchical : logi FALSE
#> $ algorithm :List of 6
#> ..$ final.tree :List of 5
#> .. ..- attr(*, "class")= chr "hclust"
#> ..$ final.tree.coph.cor: num 0.478
#> ..$ final.tree.msd : num 0.00245
#> ..$ output_n_clust : int 5
#> ..$ output_cut_height : Named num 0.109
#> .. ..- attr(*, "names")= chr "k_5"
#> ..$ trials : chr "Trials not stored in output"
#> $ clusters :'data.frame': 20 obs. of 2 variables:
#> ..$ ID : chr [1:20] "Site1" "Site10" "Site11" "Site12" ...
#> ..$ K_5: chr [1:20] "1" "2" "3" "4" ...
#> $ cluster_info:'data.frame': 1 obs. of 4 variables:
#> ..$ partition_name : chr "K_5"
#> ..$ n_clust : int 5
#> ..$ requested_n_clust: num 5
#> ..$ output_cut_height: num 0.109
#> - attr(*, "class")= chr [1:2] "bioregion.clusters" "list"
tree1$clusters
#> ID K_5
#> Site1 Site1 1
#> Site10 Site10 2
#> Site11 Site11 3
#> Site12 Site12 4
#> Site13 Site13 4
#> Site14 Site14 5
#> Site15 Site15 2
#> Site16 Site16 4
#> Site17 Site17 1
#> Site18 Site18 3
#> Site19 Site19 4
#> Site2 Site2 4
#> Site20 Site20 4
#> Site3 Site3 1
#> Site4 Site4 5
#> Site5 Site5 4
#> Site6 Site6 2
#> Site7 Site7 2
#> Site8 Site8 1
#> Site9 Site9 4
# User-defined height cut
# Only one height
tree2 <- hclu_hierarclust(dissim,
                          cut_height = .05)
#> Building the iterative hierarchical consensus tree... Note that this process can take time especially if you have a lot of sites.
#>
#> Final tree has a 0.5232 cophenetic correlation coefficient with the initial dissimilarity
#> matrix
tree2
#> Clustering results for algorithm : hclu_hierarclust
#> (hierarchical clustering based on a dissimilarity matrix)
#> - Number of sites: 20
#> - Name of dissimilarity metric: Simpson
#> - Tree construction method: average
#> - Randomization of the dissimilarity matrix: yes, number of trials 100
#> - Method to compute the final tree: Iterative consensus hierarchical tree
#> - Cophenetic correlation coefficient: 0.523
#> - Heights of cut requested by the user: 0.05
#> Clustering results:
#> - Number of partitions: 1
#> - Number of clusters: 14
#> - Height of cut of the hierarchical tree: 0.05
tree2$clusters
#> ID K_14
#> 1 Site1 1
#> 2 Site10 2
#> 3 Site11 3
#> 4 Site12 4
#> 5 Site13 4
#> 6 Site14 5
#> 7 Site15 6
#> 8 Site16 7
#> 9 Site17 8
#> 10 Site18 9
#> 11 Site19 10
#> 12 Site2 10
#> 13 Site20 11
#> 14 Site3 12
#> 15 Site4 13
#> 16 Site5 4
#> 17 Site6 14
#> 18 Site7 2
#> 19 Site8 8
#> 20 Site9 4
# Multiple heights
tree3 <- hclu_hierarclust(dissim,
                          cut_height = c(.05, .15, .25))
#> Building the iterative hierarchical consensus tree... Note that this process can take time especially if you have a lot of sites.
#>
#> Final tree has a 0.5232 cophenetic correlation coefficient with the initial dissimilarity
#> matrix
tree3$clusters # Mind the order of height cuts: from deep to shallow cuts
#> ID K_1_1 K_1_2 K_14
#> Site1 Site1 1 1 1
#> Site10 Site10 1 1 2
#> Site11 Site11 1 1 3
#> Site12 Site12 1 1 4
#> Site13 Site13 1 1 4
#> Site14 Site14 1 1 5
#> Site15 Site15 1 1 6
#> Site16 Site16 1 1 7
#> Site17 Site17 1 1 8
#> Site18 Site18 1 1 9
#> Site19 Site19 1 1 10
#> Site2 Site2 1 1 10
#> Site20 Site20 1 1 11
#> Site3 Site3 1 1 12
#> Site4 Site4 1 1 13
#> Site5 Site5 1 1 4
#> Site6 Site6 1 1 14
#> Site7 Site7 1 1 2
#> Site8 Site8 1 1 8
#> Site9 Site9 1 1 4
# Info on each partition can be found in table cluster_info
tree3$cluster_info
#> partition_name n_clust requested_cut_height
#> h_0.25 K_1_1 1 0.25
#> h_0.15 K_1_2 1 0.15
#> h_0.05 K_14 14 0.05
plot(tree3)
# Recut the tree afterwards
tree3.1 <- cut_tree(tree3, n = 5)
#> Determining the cut height to reach 5 groups...
#> --> 0.109375
# Make multiple cuts
tree4 <- hclu_hierarclust(dissim,
                          n_clust = 1:19)
#> Building the iterative hierarchical consensus tree... Note that this process can take time especially if you have a lot of sites.
#>
#> Final tree has a 0.5232 cophenetic correlation coefficient with the initial dissimilarity
#> matrix
#> Warning: The requested number of cluster could not be found
#> for k = 7. Closest number found: 6
#> Warning: The requested number of cluster could not be found
#> for k = 12. Closest number found: 11
#> Warning: The requested number of cluster could not be found
#> for k = 15. Closest number found: 14
#> Warning: The requested number of cluster could not be found
#> for k = 19. Closest number found: 18
# Change the method to get the final tree
tree5 <- hclu_hierarclust(dissim,
                          optimal_tree_method = "best",
                          n_clust = 10)
#> Randomizing the dissimilarity matrix with 100 trials
#> -- range of cophenetic correlation coefficients among trials: 0.3954 - 0.5233
#>
#> Final tree has a 0.5233 cophenetic correlation coefficient with the initial dissimilarity
#> matrix
#> Determining the cut height to reach 10 groups...
#> --> 0.0703125