This function performs non-hierarchical clustering on the basis of dissimilarity, using partitioning around medoids (PAM).

Usage

nhclu_pam(
  dissimilarity,
  index = names(dissimilarity)[3],
  seed = NULL,
  n_clust = c(1, 2, 3),
  variant = "faster",
  nstart = 1,
  cluster_only = FALSE,
  algorithm_in_output = TRUE,
  ...
)

Arguments

dissimilarity

the output object from dissimilarity() or similarity_to_dissimilarity(), or a dist object. If a data.frame is used, the first two columns represent pairs of sites (or any pair of nodes), and the next column(s) are the dissimilarity indices.

index

name or number of the dissimilarity column to use. By default, the third column name of dissimilarity is used.

seed

seed for the random number generator (NULL by default, i.e. no fixed seed).

n_clust

an integer vector or a single integer value specifying the requested number(s) of clusters.

variant

a character string specifying the variant of pam to use, by default faster. Available options are original, o_1, o_2, f_3, f_4, f_5 or faster. See pam for more details.

nstart

an integer specifying the number of random starts for the pam algorithm. By default, 1 (for the faster variant).

cluster_only

a boolean specifying if only the clustering should be returned from the pam function (more efficient).

algorithm_in_output

a boolean indicating if the original output of pam should be returned in the output (TRUE by default, see Value).

...

further arguments to be passed to pam() (see pam).
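As an illustration of the seed argument, the sketch below runs the same clustering twice with an identical seed and compares the resulting partitions. The toy matrix mirrors the one used in the Examples section. The seed values (42 and 123) and the object names run_a and run_b are arbitrary choices made for this sketch. It assumes that fixing seed makes the random starts of pam repeatable, which is the purpose of the argument.

```r
library(bioregion)

# Toy site-by-species matrix, as in the Examples section
set.seed(42)  # only fixes the toy data, not the clustering
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)
dissim <- dissimilarity(comat, metric = "all")

# Two runs with the same seed (arbitrary value chosen for illustration)
run_a <- nhclu_pam(dissim, index = "Simpson", n_clust = 3, seed = 123)
run_b <- nhclu_pam(dissim, index = "Simpson", n_clust = 3, seed = 123)

# Compare the two partitions
same_partition <- identical(run_a$clusters, run_b$clusters)
same_partition
```

Leaving seed = NULL keeps the random starts unconstrained, so repeated runs may differ when several partitions are almost equally good.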

Value

A list of class bioregion.clusters with five slots:

  1. name: character containing the name of the algorithm

  2. args: list of input arguments as provided by the user

  3. inputs: list of characteristics of the clustering process

  4. algorithm: list of all objects associated with the clustering procedure, such as original cluster objects

  5. clusters: data.frame containing the clustering results

In the algorithm slot, if algorithm_in_output = TRUE, users can find the output of pam.
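The slots listed above can be inspected with the usual list accessors. The following is a minimal sketch that reuses the toy data from the Examples section; the seed value and the object name clust are arbitrary. The internal layout of the algorithm slot is not described on this page, so the sketch only prints its top level with str() rather than assuming particular element names.

```r
library(bioregion)

set.seed(1)  # arbitrary seed so that the toy data are reproducible
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)
dissim <- dissimilarity(comat, metric = "all")

clust <- nhclu_pam(dissim, n_clust = 2:4, index = "Simpson")

class(clust)          # expected to include "bioregion.clusters"
names(clust)          # the slots described under Value
clust$name            # name of the algorithm
head(clust$clusters)  # data.frame with the clustering results

# Top-level view of the objects kept when algorithm_in_output = TRUE
str(clust$algorithm, max.level = 1)
```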

Details

This method partitions the data into the chosen number of clusters on the basis of the input dissimilarity matrix. It is more robust than k-means because it minimizes the sum of dissimilarities between cluster centres (the medoids) and the points assigned to each cluster, whereas k-means minimizes the sum of squared Euclidean distances. Consequently, k-means cannot be applied directly to the input dissimilarity matrix if the distances are not Euclidean.
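To make the quantity minimized by partitioning around medoids concrete, the sketch below calls cluster::pam() directly on a small hand-made dissimilarity matrix and recomputes the objective by hand. The five sites and their dissimilarity values are hypothetical and chosen only for illustration. This is not the internal code of nhclu_pam(), only an instance of the same technique.

```r
library(cluster)  # recommended package shipped with R; provides pam()

# Hypothetical symmetric dissimilarity matrix for five sites
sites <- paste0("Site", 1:5)
d <- matrix(c(0.0, 0.1, 0.2, 0.9, 0.8,
              0.1, 0.0, 0.1, 0.8, 0.9,
              0.2, 0.1, 0.0, 0.9, 0.9,
              0.9, 0.8, 0.9, 0.0, 0.1,
              0.8, 0.9, 0.9, 0.1, 0.0),
            nrow = 5, dimnames = list(sites, sites))

# pam() works on the dissimilarities themselves; no coordinates are needed
fit <- pam(as.dist(d), k = 2)

# Medoids are actual sites rather than averaged positions
sites[fit$id.med]

# Objective: sum of dissimilarities between each site and the medoid
# of the cluster it was assigned to
medoid_idx <- fit$id.med[fit$clustering]
objective <- sum(d[cbind(seq_along(sites), medoid_idx)])
objective
```

Because each cluster centre is one of the observed sites, the objective only requires entries of the dissimilarity matrix, which is why the method applies to non-Euclidean dissimilarities.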

References

Kaufman L & Rousseeuw PJ (2009) Finding groups in data: An introduction to cluster analysis. John Wiley & Sons.

See also

Author

Boris Leroy (leroy.boris@gmail.com)
Pierre Denelle (pierre.denelle@gmail.com)
Maxime Lenormand (maxime.lenormand@inrae.fr)

Examples

comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)

comnet <- mat_to_net(comat)
dissim <- dissimilarity(comat, metric = "all")

clust1 <- nhclu_pam(dissim, n_clust = 2:10, index = "Simpson")
clust2 <- nhclu_pam(dissim, n_clust = 2:15, index = "Simpson")
bioregionalization_metrics(clust2, dissimilarity = dissim,
                           eval_metric = "pc_distance")
#> Computing similarity-based metrics...
#>   - pc_distance OK
#> Partition metrics:
#>  - 14  partition(s) evaluated
#>  - Range of clusters explored: from  2  to  15 
#>  - Requested metric(s):  pc_distance 
#>  - Metric summary:
#>      pc_distance
#> Min    0.4383977
#> Mean   0.7543747
#> Max    0.9694532
#> 
#> Access the data.frame of metrics with your_object$evaluation_df
bioregionalization_metrics(clust2, net = comnet, species_col = "Node2",
                           site_col = "Node1", eval_metric = "avg_endemism")
#> Computing composition-based metrics...
#>   - avg_endemism OK
#> Partition metrics:
#>  - 14  partition(s) evaluated
#>  - Range of clusters explored: from  2  to  15 
#>  - Requested metric(s):  avg_endemism 
#>  - Metric summary:
#>      avg_endemism
#> Min   0.000000000
#> Mean  0.001428571
#> Max   0.020000000
#> 
#> Access the data.frame of metrics with your_object$evaluation_df
#> Details of endemism % for each partition are available in 
#>         your_object$endemism_results