How far apart are two clusters of objects, when all I have are distances between objects?
Hierarchical cluster analysis is based on distances between individuals. These can be defined via Euclidean distance, Manhattan distance, a matching coefficient, and so on; I won’t get into them further here.
There is a second problem, though, which I do wish to discuss: when you carry out a hierarchical cluster analysis, you want to join the two closest clusters together at each step. But how do you work out how far apart two clusters are, when all you have are distances between individuals? Here I give some examples and, perhaps more interestingly, some visuals along with the code that makes them.
Inter-cluster distances
Some clusters to find distances between
Let’s make some random points in two dimensions that are in two clusters: 5 points each, uniformly distributed on \((0,20)^2\) and \((20,40)^2\):
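Something like this (a sketch: the random seed, the data frame names a and b, and the plot name g are my choices here, so the exact points, and the specific numbers quoted later, depend on them):

```r
library(tidyverse)
set.seed(457)
# cluster A: 5 points uniform on (0, 20) in both coordinates
a <- tibble(x = runif(5, 0, 20), y = runif(5, 0, 20))
# cluster B: 5 points uniform on (20, 40) in both coordinates
b <- tibble(x = runif(5, 20, 40), y = runif(5, 20, 40))
# plot cluster A in red and cluster B in blue, saving the plot as g
g <- ggplot(a, aes(x = x, y = y)) +
  geom_point(colour = "red") +
  geom_point(data = b, colour = "blue")
g
```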
Note that I gave this graph a name, so that I can add things to it later.
We know how to measure distances between individuals, but what about distances between clusters?
Single linkage
One way to measure the distance between two clusters is to pick a point from each cluster to represent that cluster, and use the distance between those. For example, we might find the points in cluster A and in cluster B that are closest together, and say that the distance between the two clusters is the distance between those two points.
So, we need all the distances between a point in A and a point in B. The package spatstat.geom has a function crossdist that does exactly this:
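A sketch of the call, assuming the clusters live in data frames a and b as above (crossdist takes the coordinates of the first set of points, then the coordinates of the second):

```r
library(spatstat.geom)
# matrix of distances: rows are points of A, columns are points of B
d <- crossdist(a$x, a$y, b$x, b$y)
d
```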
This is a matrix. From this, we want to find out which points are the two that have the smallest distance between them. It looks like point 1 in A and point 3 in B (the middle distance of the top row). We can use base R to verify this:
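A base-R way to do that, and then to join the two closest points on the plot, might look like this (building the two-row data frame closest out of one point from each cluster is my reconstruction of the step that produced the warning below):

```r
# row and column of the smallest entry: row = point in A, column = point in B
ij <- which(d == min(d), arr.ind = TRUE)
ij
# stack the two closest points, one from each cluster, into one data frame
closest <- bind_rows(a[ij[1, "row"], ], b[ij[1, "col"], ])
closest
# add the single-linkage segment to the saved plot
g + geom_segment(data = closest,
                 aes(x = x[1], y = y[1], xend = x[2], yend = y[2]),
                 colour = "red")
```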
Warning in geom_segment(data = closest, aes(x = x[1], y = y[1], xend = x[2], : All aesthetics have length 1, but the data has 2 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
This works, but it isn’t very elegant (or very tidyverse).
I usually use crossing for this kind of thing, but the points in A and B have both an x and a y coordinate. I use a hack: unite to combine them together into one thing, then separate after making all possible pairs:
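A sketch, with a1, b1, and pairs as my names for the intermediate results:

```r
# squash each point's coordinates into a single string like "3.1_17.2"
a %>% unite(coord_a, x, y) -> a1
b %>% unite(coord_b, x, y) -> b1
# all combinations of a point in A with a point in B, then split the
# coordinates back out into numeric columns
crossing(a1, b1) %>%
  separate(coord_a, into = c("xa", "ya"), sep = "_", convert = TRUE) %>%
  separate(coord_b, into = c("xb", "yb"), sep = "_", convert = TRUE) -> pairs
pairs
```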
The reason for the sep in the separate is that separate also counts the decimal points as separators, which I want to exclude; the only separators should be the underscores that unite introduced. The convert turns the coordinates back into numbers.
Now I find the (Euclidean) distances and then the smallest one:
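Along these lines (saving the result as a data frame called distances, which gets used again below):

```r
pairs %>%
  mutate(dist = sqrt((xa - xb)^2 + (ya - yb)^2)) -> distances
# the single-linkage distance: the closest pair of points
distances %>% slice_min(dist, n = 1) -> shortest
shortest
# the segment joining the closest pair, added to the saved plot
g + geom_segment(data = shortest,
                 aes(x = xa, y = ya, xend = xb, yend = yb),
                 colour = "red")
```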
A problem with single linkage is that two clusters are close if a point in A and a point in B happen to be close. The other red and blue points on the graph could be anywhere. You could say that this goes against two clusters being “close”. The impact in cluster analysis is that you get “stringy” clusters where single points are added on to clusters one at a time. Can we improve on that?
Complete linkage
Another way to measure the distance between two clusters is the longest distance between a point in A and a point in B. This makes two clusters close only if everything in one cluster is close to everything in the other. You could reasonably argue that this is a better idea than single linkage.
After the work we did above, this is simple to draw: take the data frame distances, find the largest distance, and add it to the plot:
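A sketch of that:

```r
# the complete-linkage distance: the farthest pair of points
distances %>% slice_max(dist, n = 1) -> longest
longest
# add that segment to the saved plot
g + geom_segment(data = longest,
                 aes(x = xa, y = ya, xend = xb, yend = yb),
                 colour = "red")
```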
Let’s cast our minds back to analysis of variance, which gives another way of thinking about distance between groups (in one dimension). Consider these data:
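One way to build them, as a tibble called d1, matching the printout that follows:

```r
d1 <- tribble(
  ~y, ~g,
  10, "A",
  11, "A",
  13, "A",
  11, "B",
  12, "B",
  14, "B"
)
d1
```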
# A tibble: 6 × 2
y g
<dbl> <chr>
1 10 A
2 11 A
3 13 A
4 11 B
5 12 B
6 14 B
The two groups here are pretty close together, relative to how spread out they are:
ggplot(d1, aes(x=g, y=y, colour=g))+geom_point()
and the analysis of variance concurs:
d1.1=aov(y~g, data=d1)
summary(d1.1)
Df Sum Sq Mean Sq F value Pr(>F)
g 1 1.500 1.500 0.643 0.468
Residuals 4 9.333 2.333
The \(F\)-value is small because the variability between groups is small compared to the variability within groups; it’s reasonable to act as if the two groups have the same mean.
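Now compare a second data set, d2, where the values in group B are 10 bigger (again built here to match the printout that follows):

```r
d2 <- tribble(
  ~y, ~g,
  10, "A",
  11, "A",
  13, "A",
  21, "B",
  22, "B",
  24, "B"
)
d2
```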
# A tibble: 6 × 2
y g
<dbl> <chr>
1 10 A
2 11 A
3 13 A
4 21 B
5 22 B
6 24 B
ggplot(d2, aes(x=g, y=y, colour=g))+geom_point()
How do within-groups and between-groups compare this time?
d2.1=aov(y~g, data=d2)
summary(d2.1)
Df Sum Sq Mean Sq F value Pr(>F)
g 1 181.50 181.50 77.79 0.000912 ***
Residuals 4 9.33 2.33
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This time the variability between groups is much larger than the variability within groups; we have (strong) evidence that the group means are different, and it makes sense to treat the groups separately.
How does that apply to cluster distances? Well, what is happening here is comparing the sum of squared distances of the observations from their own group means with the sum of squared distances from the overall mean. Let’s see:
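For the first data set, a sketch of that computation (the column names group_mean, ss_group, and so on are mine):

```r
d1 %>%
  group_by(g) %>%
  mutate(group_mean = mean(y)) %>%     # mean of each group
  ungroup() %>%
  mutate(overall_mean = mean(y)) %>%   # mean of everything combined
  summarize(ss_group = sum((y - group_mean)^2),
            ss_overall = sum((y - overall_mean)^2))
```

The sum of squared distances from the group means comes out to about 9.33, and from the overall mean about 10.8.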
One way to measure the distance between two groups (clusters) is to take the difference of these two sums. The observations will always be closer, in total, to their own group means than to the combined mean, but in this case the difference is small:
10.8-9.33
[1] 1.47
Thinking of these as clusters, these are close together and could easily be combined.
What about the two groups that look more distinct?
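The same computation, this time for d2, along with the difference:

```r
d2 %>%
  group_by(g) %>%
  mutate(group_mean = mean(y)) %>%     # mean of each group
  ungroup() %>%
  mutate(overall_mean = mean(y)) %>%   # mean of everything combined
  summarize(ss_group = sum((y - group_mean)^2),
            ss_overall = sum((y - overall_mean)^2)) %>%
  mutate(difference = ss_overall - ss_group)
```

The difference comes out to 181.5, which is exactly the between-groups sum of squares in the aov output above.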
This is much bigger, and combining these groups would not be a good idea.
For cluster analysis, these ideas are behind Ward’s method. Compare the squared distances of each point from the mean of the cluster it currently belongs to with the squared distances from the mean of the two clusters combined. If the difference between these is small, the two clusters could be combined; if not, the two clusters should be kept separate if possible.
How does this look on a picture? I did this in my lecture notes with some hairy for-loops, but I think I can do better.
Let’s first work out the means for each of x and y for my clusters:
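A sketch of one way to do it: stack the two clusters into one data frame both with a cluster column (the names here are mine), work out the cluster means and the overall mean, and then draw a segment from each point to its own cluster mean (red and blue) and to the overall mean (green):

```r
# one data frame with a label saying which cluster each point belongs to
both <- bind_rows(A = a, B = b, .id = "cluster")

# mean x and y within each cluster, and overall
cluster_means <- both %>%
  group_by(cluster) %>%
  summarize(mx = mean(x), my = mean(y))
cluster_means
overall_mean <- both %>%
  summarize(mx = mean(x), my = mean(y))
overall_mean

# attach the relevant means to each point and draw the segments
both %>%
  left_join(cluster_means, by = "cluster") %>%
  mutate(ox = overall_mean$mx, oy = overall_mean$my) %>%
  ggplot(aes(x = x, y = y, colour = cluster)) +
  geom_point() +
  geom_segment(aes(xend = mx, yend = my)) +                    # to own cluster mean
  geom_segment(aes(xend = ox, yend = oy), colour = "green") +  # to overall mean
  scale_colour_manual(values = c(A = "red", B = "blue"))
```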
The points are a lot further from the overall mean than from the group means (the green lines overall are longer than the red and blue ones), suggesting that the clusters are, in the sense of Ward’s method, a long way apart.