**Hierarchical clustering** is an effective method for exploratory data analysis that builds a hierarchy of clusters based on the similarity of the samples in a dataset.

The idea behind hierarchical clustering is very intuitive. Let's assume we are analyzing a dataset containing *n* samples. We start by assigning each sample to a cluster containing only one element (thus, we have *n* starting clusters). In the first step of the clustering procedure, the two most similar clusters are merged together, leaving *n-1* clusters (more precisely, *n-2* individual clusters and *1* merged cluster). The method then proceeds as before: find the next two closest clusters, merge them, and repeat. In the end, starting with *n* samples in the dataset, there will always be *n-1* merges.
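As a quick sanity check of the merge count, the sketch below (assuming a small toy matrix of random data) clusters 10 samples with base R and counts the merge steps.

```r
# toy example: 10 samples described by 5 random features
set.seed(1)
mat <- matrix(rnorm(50), nrow = 10)

# agglomerative hierarchical clustering on the Euclidean distances
hc <- hclust(dist(mat))

# with n = 10 samples, the procedure performs exactly n - 1 = 9 merges
nrow(hc$merge)
```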

In R, a hierarchical clustering analysis may be easily performed by calling the **hclust** function.

```r
# in this example, my_eset is an ExpressionSet-class object
hc <- hclust(dist(t(exprs(my_eset))))
```

This will return an *hclust-class* object, *hc*. This object contains all the information required to build the hierarchical clustering tree and is a list containing several components. In particular:

- **hc$merge** is a 2-column matrix that gives information about the merges. Each row of the matrix corresponds to a step of the clustering process, so the two indexes *idx <- hc$merge[x,]* are the indexes of the elements merged at step *x*. If the value of *idx[i]* is negative, then *idx[i]* points to the individual sample having index *-idx[i]*. If the value of *idx[i]* is positive, then *idx[i]* points to the cluster created at the earlier step *idx[i]*, whose own merge is described by the row *hc$merge[idx[i],]*.
- **hc$height** is a numeric vector containing the heights of the merges. Of course, *hc$height[j]* refers to the height of merge *j*, which links together the elements indicated in *hc$merge[j,]*.
- **hc$labels** contains the labels of the samples.
- **hc$order** is another vector, suggesting an order for the samples so that there won't be intersections while drawing the hierarchical clustering tree.
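A quick look at these components on a toy dataset (random data, hypothetical sample names) may help:

```r
# toy dataset: 10 samples with 5 random features each
set.seed(1)
mat <- matrix(rnorm(50), nrow = 10)
rownames(mat) <- paste0("sample_", 1:10)
hc <- hclust(dist(mat))

hc$merge[1, ]  # two negative values: the first merge always joins two individual samples
hc$height      # one height per merge, non-decreasing for the default complete linkage
hc$labels      # the row names of the input matrix
hc$order       # a permutation of 1:10, the left-to-right order of the leaves
```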

We can plot a simple, basic hierarchical clustering tree using the *plot* command. Alternatively, we can use the *colorhcplot* package, available on CRAN. The function *colorhcplot* requires two arguments: an *hclust-class* object (**hc**) and a factor (**fac**) describing the sample groups within the dataset.

```r
# standard Cluster Dendrogram
plot(hc)

# color Cluster Dendrogram
install.packages("colorhcplot")
library(colorhcplot)
colorhcplot(hc, fac)
```

Features of *colorhcplot* include:

- sample labels are colored according to their group;
- a legend helps identify the groups;
- homogeneous clusters (i.e., clusters whose samples all belong to the same group) are connected by colored lines.

To draw a color hierarchical clustering tree, *colorhcplot* performs the following operations.

It starts by re-ordering all samples and merges according to the *hc$order* vector.

```r
new_labels <- hc$labels[hc$order]
new_fac <- fac[hc$order]

# flip the signs so that only former sample indexes are positive
new_merge <- (-1) * hc$merge
for (n in 1:nrow(new_merge)) {
  for (k in 1:ncol(new_merge)) {
    if (new_merge[n, k] > 0) {
      # replace each sample index with its position in the plotting order
      new_merge[n, k] <- which(hc$order == new_merge[n, k])
    }
  }
}
# restore the original sign convention
new_merge <- (-1) * new_merge
```

The next step is to merge clusters in a bottom-up fashion. The function goes through the *new_merge* matrix and starts drawing the tree. At the same time, it keeps track of the middle point of each cluster-merging segment. As stated before, negative values in the *new_merge* matrix point to individual clusters while positive values point to a merge completed before.

```r
# assuming a suitable plot has already been initialized
middle_point <- c()
for (i in 1:nrow(new_merge)) {
  # define x0 and x1 of each merge/segment in a 2-element vector
  x01 <- c()
  for (j in c(1, 2)) {
    if (new_merge[i, j] < 0) {
      # negative value: an individual sample, plotted at its own position
      x01[j] <- (-1) * new_merge[i, j]
    } else {
      # positive value: the middle point of a previous merge
      x01[j] <- middle_point[new_merge[i, j]]
    }
  }
  # keep track of the middle point of this merge
  middle_point <- c(middle_point, (x01[1] + x01[2]) / 2)
  # draw the horizontal segment
  segments(x01[1], hc$height[i], x01[2], hc$height[i], lwd = 3)
}
```

And now we can easily draw the vertical lines.

```r
# the default hang value in an hclust plot is 0.1, i.e. 10% of the
# total height of the plot; hang may easily be adjusted
hang <- 0.1 * (max(hc$height) - min(hc$height))
for (i in 1:nrow(new_merge)) {
  for (j in c(1, 2)) {
    if ((x <- new_merge[i, j]) < 0) {
      # individual sample: draw the hang below the merge height
      segments((-1) * x, hc$height[i], (-1) * x, hc$height[i] - hang, lwd = 3)
    } else {
      # previous merge: connect the two merge heights
      segments(middle_point[x], hc$height[i], middle_point[x], hc$height[x], lwd = 3)
    }
  }
}
```

The hierarchical clustering tree is complete. We can modify the code above to introduce colors.
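One possible approach (a sketch, not necessarily *colorhcplot*'s actual implementation) is to track, for each merge, whether the resulting cluster is homogeneous; the resulting vector can then be used to pick a group color, or a default black, when calling *segments*. The group assignment below is hypothetical.

```r
# toy data: 8 samples, and a hypothetical two-group factor
set.seed(1)
mat <- matrix(rnorm(40), nrow = 8)
hc <- hclust(dist(mat))
fac <- factor(c("A", "A", "A", "A", "B", "B", "B", "B"))

# cluster_group[i] is the group index of the cluster created at merge i,
# or NA if that cluster mixes samples from different groups
cluster_group <- rep(NA_integer_, nrow(hc$merge))
for (i in 1:nrow(hc$merge)) {
  g <- sapply(hc$merge[i, ], function(v) {
    # negative: an individual sample, look up its group;
    # positive: a previous merge, reuse its (possibly NA) group
    if (v < 0) as.integer(fac[-v]) else cluster_group[v]
  })
  if (!any(is.na(g)) && g[1] == g[2]) cluster_group[i] <- g[1]
}

# mixed clusters stay NA; homogeneous ones could be drawn with, e.g.,
# rainbow(nlevels(fac))[cluster_group[i]] instead of plain black segments
cluster_group
```

Note that the final merge, which joins the whole dataset, is necessarily mixed whenever more than one group is present, so the root segments stay black.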