This function performs non-hierarchical clustering based on dissimilarity using partitioning around medoids (PAM).
Arguments
- dissimilarity
The output object from dissimilarity() or similarity_to_dissimilarity(), or a dist object. If a data.frame is used, the first two columns should represent pairs of sites (or any pair of nodes), and the subsequent column(s) should contain the dissimilarity indices.
- index
The name or number of the dissimilarity column to use. By default, the third column name of dissimilarity is used.
- seed
A value for the random number generator (NULL for random by default).
- n_clust
An integer vector or a single integer value specifying the requested number(s) of clusters.
- variant
A character string specifying the PAM variant to use. Defaults to "faster". Available options are "original", "o_1", "o_2", "f_3", "f_4", "f_5", or "faster". See pam for more details.
- nstart
An integer specifying the number of random starts for the PAM algorithm. Defaults to 1 (for the "faster" variant).
- cluster_only
A boolean specifying whether only the clustering results should be returned from the pam function. Setting this to TRUE makes the function more efficient.
- algorithm_in_output
A boolean indicating whether the original output of pam should be included in the result. Defaults to TRUE (see Value).
- ...
Additional arguments to pass to pam() (see pam).
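The data.frame input layout described for the dissimilarity argument can be sketched as follows. This is only an illustration under stated assumptions: the toy matrix, the column names Site1, Site2 and Manhattan, and the choice of Manhattan distance are all made up for the example, and the final call to nhclu_pam() is left commented out because it requires the bioregion package to be loaded.

```r
# Toy site-by-species abundance matrix (made-up data).
set.seed(42)
comat <- matrix(rpois(5 * 6, lambda = 3), nrow = 5,
                dimnames = list(paste0("Site", 1:5), paste0("Sp", 1:6)))

# Pairwise dissimilarities between sites, as a full square matrix.
dm <- as.matrix(dist(comat, method = "manhattan"))

# Keep each pair of sites once (lower triangle) and flatten it to long format:
# two columns naming the pair, then one column holding the index.
idx <- which(lower.tri(dm), arr.ind = TRUE)
pairs <- data.frame(Site1 = rownames(dm)[idx[, "col"]],
                    Site2 = rownames(dm)[idx[, "row"]],
                    Manhattan = dm[idx])

head(pairs)

# With bioregion loaded, a data.frame of this shape could then be passed on
# (illustrative call, not run here):
# nhclu_pam(pairs, n_clust = 2, index = "Manhattan", seed = 1)
```

Selecting the column by name through index is shown because a data.frame may carry several dissimilarity columns after the first two.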
Value
A list of class bioregion.clusters with five components:
- name: A character string containing the name of the algorithm.
- args: A list of input arguments as provided by the user.
- inputs: A list of characteristics of the clustering process.
- algorithm: A list of all objects associated with the clustering procedure, such as original cluster objects (only if algorithm_in_output = TRUE).
- clusters: A data.frame containing the clustering results.
If algorithm_in_output = TRUE, the algorithm slot includes the output of pam.
Details
This method partitions the data into the chosen number of clusters based on the input dissimilarity matrix. It is more robust than k-means because it minimizes the sum of dissimilarities between cluster centers (medoids) and points assigned to the cluster. In contrast, k-means minimizes the sum of squared Euclidean distances, which makes it unsuitable for dissimilarity matrices that are not based on Euclidean distances.
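The quantity minimized by PAM can be recomputed by hand on a small example. The sketch below calls cluster::pam() directly rather than nhclu_pam(), uses made-up abundance data, and picks Manhattan distance and k = 3 purely for illustration; it is not meant to reproduce what nhclu_pam() does internally.

```r
library(cluster)  # provides pam()

# Toy site-by-species abundance matrix (made-up data).
set.seed(1)
comat <- matrix(rpois(12 * 8, lambda = 4), nrow = 12,
                dimnames = list(paste0("Site", 1:12), paste0("Sp", 1:8)))

# A non-Euclidean dissimilarity between sites.
d <- dist(comat, method = "manhattan")

# Partition around 3 medoids, working from the dissimilarities only.
fit <- pam(d, k = 3, diss = TRUE)

# Recompute the cost PAM works on: for every site, the dissimilarity to the
# medoid of the cluster it was assigned to, summed over all sites.
dm <- as.matrix(d)
medoid_of_site <- fit$id.med[fit$clustering]
total_cost <- sum(dm[cbind(seq_len(nrow(dm)), medoid_of_site)])

fit$medoids            # medoids are actual sites, not averaged centroids
table(fit$clustering)  # cluster sizes
total_cost
```

Because medoids are observed sites, the cost only ever reads entries of the dissimilarity matrix, which is why any dissimilarity index can be used.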
References
Kaufman L & Rousseeuw PJ (2009) Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.
See also
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a4_2_non_hierarchical_clustering.html.
Associated functions: nhclu_clara, nhclu_clarans, nhclu_dbscan, nhclu_kmeans, nhclu_affprop
Author
Boris Leroy (leroy.boris@gmail.com)
Pierre Denelle (pierre.denelle@gmail.com)
Maxime Lenormand (maxime.lenormand@inrae.fr)
Examples
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)
comnet <- mat_to_net(comat)
dissim <- dissimilarity(comat, metric = "all")
clust1 <- nhclu_pam(dissim, n_clust = 2:10, index = "Simpson")
clust2 <- nhclu_pam(dissim, n_clust = 2:15, index = "Simpson")
bioregionalization_metrics(clust2, dissimilarity = dissim,
eval_metric = "pc_distance")
#> Computing similarity-based metrics...
#> - pc_distance OK
#> Partition metrics:
#> - 14 partition(s) evaluated
#> - Range of clusters explored: from 2 to 15
#> - Requested metric(s): pc_distance
#> - Metric summary:
#> pc_distance
#> Min 0.4310532
#> Mean 0.7450280
#> Max 0.9755307
#>
#> Access the data.frame of metrics with your_object$evaluation_df
bioregionalization_metrics(clust2, net = comnet, species_col = "Node2",
site_col = "Node1", eval_metric = "avg_endemism")
#> Computing composition-based metrics...
#> - avg_endemism OK
#> Partition metrics:
#> - 14 partition(s) evaluated
#> - Range of clusters explored: from 2 to 15
#> - Requested metric(s): avg_endemism
#> - Metric summary:
#> avg_endemism
#> Min 0
#> Mean 0
#> Max 0
#>
#> Access the data.frame of metrics with your_object$evaluation_df
#> Details of endemism % for each partition are available in
#> your_object$endemism_results