DANA 4840 — Distance measures (Euclidean, Pearson correlation, Gower / daisy)

Clustering uses distances (or dissimilarities) between objects. Below: scale numeric rows, Euclidean distance with dist, Pearson-style distances between row profiles, Gower with daisy on mixed data, and an optional fviz_dist heatmap. Examples use a random 15-row subset of USArrests and the flower data set from the cluster package.

set.seed() makes random sampling repeatable. sample(1:50, 15) chooses 15 row indices from USArrests (whole rows, not fifteen cells from one row). ss stores those indices; df is the subset USArrests[ss, ]. Comments after # mark each step.


Random subset of rows (USArrests)

set.seed(123)                           # Reproducible sampling
ss <- sample(1:50, 15)                  # 15 row indices, drawn without replacement
ss                                      # row numbers in USArrests
df <- USArrests[ss, ]                   # same order as ss; keep using df (without idx) for scale()
cbind(idx = ss, df)                     # display table: idx + state + data
 [1] 31 15 14  3 42 43 37 48 25 26 27  5 40 28  9

               idx Murder Assault UrbanPop Rape
New Mexico      31   11.4     285       70 32.1
Iowa            15    2.2      56       57 11.3
Indiana         14    7.2     113       65 21.0
Arizona          3    8.1     294       80 31.0
Tennessee       42   13.2     188       59 26.9
Texas           43   12.7     201       80 25.5
Oregon          37    4.9     159       67 29.3
West Virginia   48    5.7      81       39  9.3
Missouri        25    9.0     178       70 28.2
Montana         26    6.0     109       53 16.4
Nebraska        27    4.3     102       62 16.5
California       5    9.0     276       91 40.6
South Carolina  40   14.4     279       48 22.5
Nevada          28   12.2     252       81 46.0
Florida          9   15.4     335       80 31.9

The idx column is that state’s row number in the full USArrests data frame (50 states, stored in alphabetical order). Rows appear in ss order: the first printed row matches ss[1] (31 → New Mexico), the last matches ss[15] (9 → Florida). scale(df) below still uses df without idx, so only the four original variables (Murder, Assault, UrbanPop, Rape) are scaled.
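
The idx mapping can be checked directly. This is a minimal sketch that hard-codes the indices printed above instead of calling sample(), an assumption made here so the check does not depend on which sampling algorithm the installed R version uses:

```r
# Indices copied from the printed ss above (hard-coded for this check only)
ss <- c(31, 15, 14, 3, 42, 43, 37, 48, 25, 26, 27, 5, 40, 28, 9)
df <- USArrests[ss, ]

# Row names of the subset are the state names at those positions, in ss order
identical(rownames(df), rownames(USArrests)[ss])
rownames(df)[c(1, 15)]          # first and last sampled states
anyDuplicated(ss) == 0          # no index appears twice
```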


Standardize before comparing numeric profiles

scale(df) subtracts each column mean and divides by its standard deviation for these rows only, so variables on different scales are more comparable.

df.scaled <- scale(df)                  # Center and scale each column
df.scaled                               # attributes: centers and scales used
                    Murder     Assault    UrbanPop        Rape
New Mexico      0.58508090  1.02300309  0.22505574  0.61101857
Iowa           -1.70220419 -1.54760088 -0.68923319 -1.43885018
Indiana        -0.45911447 -0.90775622 -0.12659385 -0.48290177
Arizona        -0.23535832  1.12403120  0.92835492  0.50261205
Tennessee       1.03259320 -0.06585536 -0.54857336  0.09855138
Texas           0.90828422  0.08007413  0.92835492 -0.03942055
Oregon         -1.03093574 -0.39139036  0.01406598  0.33507470
West Virginia  -0.83204139 -1.26696726 -1.95517172 -1.63595295
Missouri       -0.01160217 -0.17810880  0.22505574  0.22666818
Montana        -0.75745600 -0.95265760 -0.97055287 -0.93623813
Nebraska       -1.18010651 -1.03123501 -0.33758361 -0.92638299
California     -0.01160217  0.92197499  1.70198401  1.44870532
South Carolina  1.33093473  0.95565102 -1.32220246 -0.33507470
Nevada          0.78397525  0.65256671  0.99868483  1.98088278
Florida         1.57955267  1.58427034  0.92835492  0.59130829
attr(,"scaled:center")
    Murder    Assault   UrbanPop       Rape 
  9.046667 193.866667  66.800000  25.900000 
attr(,"scaled:scale")
   Murder   Assault  UrbanPop      Rape 
 4.022236 89.084123 14.218700 10.146991 
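
To see what scale() does, the same standardization can be reproduced by hand. The sketch below assumes the subset indices printed earlier (hard-coded here) and uses sd(), which divides by n − 1; the names centers, scales and manual are illustrative only:

```r
# Rebuild the 15-row subset from the indices printed earlier
ss <- c(31, 15, 14, 3, 42, 43, 37, 48, 25, 26, 27, 5, 40, 28, 9)
df <- USArrests[ss, ]
df.scaled <- scale(df)

centers <- colMeans(df)          # per-column means of these 15 rows
scales  <- apply(df, 2, sd)      # per-column sample standard deviations

# Subtract each column mean, then divide by each column sd
manual <- sweep(sweep(as.matrix(df), 2, centers, "-"), 2, scales, "/")

all.equal(manual, df.scaled, check.attributes = FALSE)  # compares values only
round(colMeans(df.scaled), 10)   # column means after scaling
apply(df.scaled, 2, sd)          # column sds after scaling
```

After scaling, every column has mean 0 and standard deviation 1 within this subset, which is why no single variable dominates the distances that follow.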

Euclidean distance between rows (dist)

dist() computes pairwise distances between rows. The full object stores choose(15,2) = 105 distances; below is the 3×3 corner rounded to one decimal.

dist.eucl <- dist(df.scaled, method = "euclidean")
round(as.matrix(dist.eucl)[1:3, 1:3], 1)             # 3×3 corner; 105 pairs in total
           New Mexico Iowa Indiana
New Mexico        0.0  4.1     2.5
Iowa              4.1  0.0     1.8
Indiana           2.5  1.8     0.0

Between variables instead of rows: transpose first — dist(t(df.scaled)) — so each column becomes a row profile.
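
A short sketch of both points, the number of stored pairs and the transposed call, assuming the same hard-coded subset indices as printed earlier (dist.vars is an illustrative name):

```r
ss <- c(31, 15, 14, 3, 42, 43, 37, 48, 25, 26, 27, 5, 40, 28, 9)
df.scaled <- scale(USArrests[ss, ])

# Distances between rows (states): only the lower triangle is stored
dist.eucl <- dist(df.scaled, method = "euclidean")
length(dist.eucl)                 # number of stored pairs
choose(nrow(df.scaled), 2)        # 105

# Distances between variables: transpose so each variable becomes a row
dist.vars <- dist(t(df.scaled))
round(as.matrix(dist.vars), 2)    # 4 x 4 matrix, one row per variable
length(dist.vars)                 # choose(4, 2) = 6 pairs
```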

Manual check: New Mexico vs Iowa (scaled rows)

For two profiles x and y (here four standardized variables), Euclidean distance is:

\[ d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_{j=1}^{4} (x_j - y_j)^2}. \]

Using the first two rows of df.scaled in this run (New Mexico then Iowa):

Row                     Murder      Assault     UrbanPop    Rape
New Mexico              0.58508090  1.02300309  0.22505574  0.61101857
Iowa                   -1.70220419 -1.54760088 -0.68923319 -1.43885018
Difference (x_j - y_j)  2.28728509  2.57060397  0.91428893  2.04986875
Squared                 5.23167308  6.60800477  0.83592425  4.20196189

Sum of squared differences ≈ 16.877564, so \( d = \sqrt{16.877564} \approx 4.108231 \). That matches dist(df.scaled, method = "euclidean") between those two rows at full precision; round(..., 1) gives 4.1, same as the corner matrix above.

sqrt(sum((df.scaled["New Mexico", ] - df.scaled["Iowa", ])^2))
[1] 4.108231

(Rounding in the output panel may show fewer decimals; the stored value is 4.108231249.)
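
The same formula can be applied to every pair of rows and compared with dist(). This is a sketch under the assumption that the subset indices are the ones printed earlier; the explicit double loop is written for readability, not speed, and the names x and D are illustrative:

```r
ss <- c(31, 15, 14, 3, 42, 43, 37, 48, 25, 26, 27, 5, 40, 28, 9)
x <- scale(USArrests[ss, ])
n <- nrow(x)

# Fill an n x n matrix with sqrt(sum of squared differences) for each pair
D <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    D[i, j] <- sqrt(sum((x[i, ] - x[j, ])^2))
  }
}

all.equal(D, as.matrix(dist(x)), check.attributes = FALSE)
D["New Mexico", "Iowa"]           # the pair worked by hand above
```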


Pearson correlation distance between rows

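factoextra::get_dist(..., method = "pearson") builds distances from similarity of shape across variables: the distance is 1 minus the Pearson correlation between two row profiles, so high correlation gives low distance and values run from 0 (r = 1) to 2 (r = −1). The same calculation in base R is 1 − cor(t(df.scaled)); cor() correlates columns, so the transpose turns each state into a column. Setting the diagonal to zero only removes floating-point residue, since every row correlates perfectly with itself. If you do not use factoextra, the base-R version is enough for the concept.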

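dist.cor <- factoextra::get_dist(df.scaled, method = "pearson")
round(as.matrix(dist.cor)[1:3, 1:3], 1)              # 3×3 corner from factoextra
cc <- cor(t(df.scaled))                              # base R: correlations between rows
Dpear <- 1 - cc                                      # correlation -> distance
diag(Dpear) <- 0                                     # remove floating-point residue on the diagonal
round(Dpear[1:3, 1:3], 1)                            # same corner, base-R version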
           New Mexico Iowa Indiana
New Mexico        0.0  1.7     2.0
Iowa              1.7  0.0     0.3
Indiana           2.0  0.3     0.0
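
One entry of that matrix can be checked by hand in base R. The sketch assumes the subset indices printed earlier (hard-coded) and uses only cor(); the variable names x, y and r are illustrative:

```r
ss <- c(31, 15, 14, 3, 42, 43, 37, 48, 25, 26, 27, 5, 40, 28, 9)
df.scaled <- scale(USArrests[ss, ])

# Two row profiles, each a vector of four standardized values
x <- df.scaled["New Mexico", ]
y <- df.scaled["Iowa", ]

r <- cor(x, y)                    # Pearson correlation between the profiles
1 - r                             # correlation distance for this pair

# Full matrix, for comparison with the single pair
Dpear <- 1 - cor(t(df.scaled))
diag(Dpear) <- 0
Dpear["New Mexico", "Iowa"]
range(Dpear)                      # stays within [0, 2]
```

New Mexico and Iowa are far apart here because their profiles move in opposite directions around their own means, not because one state has uniformly higher values; that difference in level is what Euclidean distance picks up instead.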

Gower / daisy() on mixed flower data

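library(cluster)                        # provides daisy() and the flower data
data(flower)
head(flower, 3)                         # first three flowers
str(flower)                             # column types: factors, ordered factors, numerics

dd <- daisy(flower)                     # mixed column types, so daisy() uses Gower's coefficient
round(as.matrix(dd)[1:3, 1:3], 2)       # 3×3 corner of the dissimilarity matrix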
  V1 V2 V3 V4 V5 V6  V7 V8
1  0  1  1  4  3 15  25 15
2  1  0  0  2  1  3 150 50
3  0  1  0  3  3  1 150 50

'data.frame':	18 obs. of  8 variables:
 $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
 $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
 $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
 $ V4: Factor w/ 5 levels "1","2","3","4",..: 4 2 3 4 5 4 4 2 3 5 ...
 $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
 $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<"4"<..: 15 3 1 16 2 12 13 7 4 14 ...
 $ V7: num  25 150 150 125 20 50 40 100 25 100 ...
 $ V8: num  15 50 50 50 15 40 20 15 15 60 ...

     1    2    3
1 0.00 0.89 0.53
2 0.89 0.00 0.51
3 0.53 0.51 0.00
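
The 0.89 between flowers 1 and 2 can be reproduced by averaging one contribution per variable. The sketch below assumes daisy()'s default handling of this data set: unordered factors score 0 for a match and 1 for a mismatch, ordered factors are replaced by their integer codes, and numeric or ordinal differences are divided by the column range. gower_pair is a hypothetical helper written for this check, not part of the cluster package:

```r
library(cluster)
data(flower)

# Gower dissimilarity between rows i and j (equal weights, no missing values)
gower_pair <- function(data, i, j) {
  per_var <- sapply(data, function(col) {
    if (is.numeric(col) || is.ordered(col)) {
      v <- as.numeric(col)                  # ordered factors become integer codes
      abs(v[i] - v[j]) / diff(range(v))     # range-scaled absolute difference
    } else {
      as.numeric(col[i] != col[j])          # nominal: 1 if the levels differ
    }
  })
  mean(per_var)                             # average over the 8 variables
}

gower_pair(flower, 1, 2)
as.matrix(daisy(flower))[1, 2]              # value from daisy() for comparison
```

For flowers 1 and 2 the four unordered factors all differ, so they contribute 1 each; the remaining four variables contribute their range-scaled differences, and the mean of the eight terms is the dissimilarity.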

Visualize a distance matrix (fviz_dist)

fviz_dist() draws the distance matrix as a heatmap in which color encodes dissimilarity (with the default gradient, red is low and blue is high). It comes from factoextra, which builds the plot with ggplot2.

factoextra::fviz_dist(dist.eucl)

See also

  • Similarity for binary data — Simple Matching, Jaccard, and validating with daisy on a small example.
  • Gower distance on mixed variables — step-by-step arithmetic before using daisy in code.
  • Clustering tendency — quick checks (PCA look, Hopkins statistic, distance-matrix plot) before trusting a partition.