DANA 4840 — Worksheet 1a: Gower distance for mixed data

Study sheet for DANA 4840 — Worksheet 1a. Display math below uses MathJax (enabled on this page with math: true in the front matter).

The worksheet uses a mixed table: quantitative fields, nominal categories, and a symmetric binary flag. Gower distance is appropriate because it builds a normalized dissimilarity between 0 and 1 for each active variable and then averages those contributions (equal weight when there are no missing values).

  • For quantitative variables, the contribution is the absolute difference |x_1 − x_2| between the two rows' values, divided by the observed range (max − min) of that column, computed over all rows in the dataset.
  • For categorical and symmetric binary variables, the contribution is 0 if the two values match and 1 if they differ.
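The two rules above can be written compactly. The notation here is introduced only for illustration and is not taken from the worksheet: \(p\) is the number of active variables, \(x_{ik}\) is the value of variable \(k\) for row \(i\), and \(R_k\) is the observed range of a quantitative variable \(k\). Assuming equal weights and no missing values:

\[ d^{(k)}_{ij} = \begin{cases} \dfrac{\lvert x_{ik} - x_{jk} \rvert}{R_k} & \text{if } k \text{ is quantitative} \\[0.8em] 0 & \text{if } k \text{ is categorical or symmetric binary and } x_{ik} = x_{jk} \\[0.35em] 1 & \text{if } k \text{ is categorical or symmetric binary and } x_{ik} \neq x_{jk} \end{cases} \qquad D_{\mathrm{Gower}}(i,j) = \frac{1}{p} \sum_{k=1}^{p} d^{(k)}_{ij} \]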

Dataset

Employee    Age  Race  Height    Income  IsMale  Politics
Employee 1   22     1  Tall        0.39  TRUE    moderate
Employee 2   33     3  Short       0.34  TRUE    liberal
Employee 3   52     1  Moderate    0.51  FALSE   moderate
Employee 4   46     6  Tall        0.63  TRUE    conservative

Source (adapted): McCaffrey — example of calculating the Gower distance

Variable types used in the analysis

Variable  Type                     Reason
Age       Quantitative (interval)  Absolute difference scaled by the observed range across the four employees.
Race      Nominal categorical      Codes 1, 3, 1, 6 are labels, not a numeric scale; store as a factor in R.
Height    Categorical              Different labels contribute 1 (treated here as an unordered factor, consistent with daisy defaults).
Income    Quantitative (interval)  Same scaling rule as Age.
IsMale    Symmetric binary         Same value → 0; different value → 1.
Politics  Nominal categorical      Same category → 0; different category → 1.

(a) Gower distance between Employee 1 and Employee 2

Worksheet 1a — (a)

The dataset contains mixed data (quantitative and categorical variables). Calculate by hand the Gower distance between Employee 1 and Employee 2, showing all steps of how you arrive at your answer.

Employee 1: Age = 22, Race = 1, Height = Tall, Income = 0.39, IsMale = TRUE, Politics = moderate.
Employee 2: Age = 33, Race = 3, Height = Short, Income = 0.34, IsMale = TRUE, Politics = liberal.

Step 1: Ranges for quantitative variables

Use all four employees to set each range.

\[ \begin{aligned} R_{\mathrm{Age}} &= 52 - 22 = 30 \\[0.35em] R_{\mathrm{Income}} &= 0.63 - 0.34 = 0.29 \end{aligned} \]
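The same ranges can be checked in base R. This is a minimal sketch; the variable names (age, income, range_age, range_income) are chosen here for illustration, and the values are copied from the Dataset table.

```r
# Columns copied from the Dataset table
age    <- c(22, 33, 52, 46)
income <- c(0.39, 0.34, 0.51, 0.63)

# Observed range = max - min over all four employees
range_age    <- diff(range(age))
range_income <- diff(range(income))

range_age      # 30
range_income   # 0.29, up to floating-point rounding
```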

Step 2: Variable-by-variable contributions (Employees 1 vs 2)

Variable  Calculation                      Contribution
Age       |22 − 33| / 30 = 11/30           0.3667
Race      1 vs 3 (different categories)    1
Height    Tall vs Short                    1
Income    |0.39 − 0.34| / 0.29 = 5/29      0.1724
IsMale    TRUE vs TRUE                     0
Politics  moderate vs liberal              1
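The six contributions can also be reproduced in a few lines of base R. This is a sketch under the worksheet's assumptions (range scaling for Age and Income, 0/1 matching for everything else); contribution() is a hypothetical helper written for this sheet, not a function from any package.

```r
# Values for Employees 1 and 2, copied from the worksheet
e1 <- list(Age = 22, Race = "1", Height = "Tall",  Income = 0.39,
           IsMale = TRUE, Politics = "moderate")
e2 <- list(Age = 33, Race = "3", Height = "Short", Income = 0.34,
           IsMale = TRUE, Politics = "liberal")

# Ranges from Step 1 (only the quantitative variables have one)
ranges <- c(Age = 30, Income = 0.29)

# Hypothetical helper: contribution of one variable to the Gower distance
contribution <- function(name) {
  a <- e1[[name]]
  b <- e2[[name]]
  if (name %in% names(ranges)) {
    abs(a - b) / ranges[[name]]   # quantitative: range-scaled absolute difference
  } else {
    as.numeric(a != b)            # categorical or symmetric binary: 0 if equal, 1 if not
  }
}

contrib <- sapply(names(e1), contribution)
round(contrib, 4)
```

The rounded vector should match the Contribution column above, in the same variable order.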

Step 3: Average the six contributions

\[ \begin{aligned} D_{\mathrm{Gower}}(1,2) &= \frac{0.3667 + 1 + 1 + 0.1724 + 0 + 1}{6} \\[0.45em] &= \frac{3.5391}{6} \\[0.45em] &\approx 0.5898 \end{aligned} \]

Exact rational form (optional):

\[ D_{\mathrm{Gower}}(1,2) = \frac{3079}{5220} \approx 0.589847 \]
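For readers who want to see where the fraction comes from, the arithmetic can be carried out exactly: add the two quantitative contributions over a common denominator, add the three contributions equal to 1, and divide by 6.

\[ \begin{aligned} \frac{11}{30} + \frac{5}{29} &= \frac{11 \cdot 29 + 5 \cdot 30}{30 \cdot 29} = \frac{469}{870} \\[0.45em] \frac{469}{870} + 3 &= \frac{469 + 2610}{870} = \frac{3079}{870} \\[0.45em] D_{\mathrm{Gower}}(1,2) &= \frac{1}{6} \cdot \frac{3079}{870} = \frac{3079}{5220} \end{aligned} \]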

So Employees 1 and 2 are moderately dissimilar when all six variables enter with equal weight.

(If Height is treated as an ordered factor with Short < Moderate < Tall, daisy still assigns contribution 1 for Tall vs Short in this small dataset because that pair spans the full spread of observed ordered levels.)
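The reasoning in that note can be sketched in base R, assuming the ordinal handling amounts to replacing each level with its integer rank and then range-scaling the ranks like a quantitative variable. The names height_1_2 and height_1_3 are illustrative only.

```r
# Height with an explicit order Short < Moderate < Tall
height <- factor(c("Tall", "Short", "Moderate", "Tall"),
                 levels = c("Short", "Moderate", "Tall"),
                 ordered = TRUE)

# Integer ranks: Short = 1, Moderate = 2, Tall = 3
ranks <- as.integer(height)

# Range-scaled rank differences (assumed to mirror the ordinal treatment)
rank_range <- diff(range(ranks))
height_1_2 <- abs(ranks[1] - ranks[2]) / rank_range   # Tall vs Short
height_1_3 <- abs(ranks[1] - ranks[3]) / rank_range   # Tall vs Moderate

height_1_2   # 1
height_1_3   # 0.5
```

Under this assumption the Employee 1 vs 2 distance is unchanged, but a pair such as Tall vs Moderate would contribute 0.5 instead of 1, so other entries of the distance matrix would differ from the unordered analysis.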

(b) Check with daisy()

Worksheet 1a — (b)

Use R’s daisy() to validate that your answer in part (a) is correct.

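Build the data.frame with sensible types: Race, Height, and Politics as factor; IsMale as logical; Age and Income as numeric. With metric = "gower", daisy() range-scales numeric columns and assigns 0/1 contributions to factor columns, as in the table above. One caveat: by default daisy() treats a logical column as asymmetric binary and warns about it; passing type = list(symm = "IsMale") declares it symmetric instead. For these four employees both treatments give the same distances, because no pair of employees is FALSE/FALSE on IsMale.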

# =========================
# Worksheet 1a — (b)
# =========================
library(cluster)

employees <- data.frame(
  Age = c(22, 33, 52, 46),
  Race = factor(c(1, 3, 1, 6)),
  Height = factor(c("Tall", "Short", "Moderate", "Tall")),
  Income = c(0.39, 0.34, 0.51, 0.63),
  IsMale = c(TRUE, TRUE, FALSE, TRUE),
  Politics = factor(c("moderate", "liberal", "moderate", "conservative")),
  row.names = paste0("Employee", 1:4)
)

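# Declare IsMale as symmetric binary; by default daisy() treats a logical
# column as asymmetric binary and warns about it.
D <- as.matrix(daisy(employees, metric = "gower", type = list(symm = "IsMale")))

# Full pairwise Gower distance matrix, rounded for display
round(D, 6)

# daisy() value for Employees 1 and 2
D["Employee1", "Employee2"]

# Hand-calculated value from part (a)
3079 / 5220

# Compare with a numerical tolerance rather than ==
isTRUE(all.equal(D["Employee1", "Employee2"], 3079 / 5220))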
Output:

          Employee1 Employee2 Employee3 Employee4
Employee1  0.000000  0.589847  0.568966  0.604598
Employee2  0.589847  0.000000  0.869923  0.738889
Employee3  0.568966  0.869923  0.000000  0.768966
Employee4  0.604598  0.738889  0.768966  0.000000

[1] 0.5898467

[1] 0.5898467

[1] TRUE
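As a cross-check that does not depend on the cluster package, the whole matrix can be rebuilt in base R from the two rules used in part (a). This is a sketch, not a reimplementation of daisy(): column_contribution() and gower_by_hand are hypothetical names, every non-numeric column is treated as a plain 0/1 match, and no missing values are assumed.

```r
# Same data as in the daisy() example
employees <- data.frame(
  Age      = c(22, 33, 52, 46),
  Race     = factor(c(1, 3, 1, 6)),
  Height   = factor(c("Tall", "Short", "Moderate", "Tall")),
  Income   = c(0.39, 0.34, 0.51, 0.63),
  IsMale   = c(TRUE, TRUE, FALSE, TRUE),
  Politics = factor(c("moderate", "liberal", "moderate", "conservative")),
  row.names = paste0("Employee", 1:4)
)

# Hypothetical helper: n x n matrix of one column's pairwise contributions
column_contribution <- function(x) {
  if (is.numeric(x)) {
    abs(outer(x, x, "-")) / diff(range(x))             # range-scaled absolute difference
  } else {
    1 * outer(as.character(x), as.character(x), "!=")  # 0 if equal, 1 if different
  }
}

# Equal-weight average of the six per-column matrices
gower_by_hand <- Reduce(`+`, lapply(employees, column_contribution)) / ncol(employees)
dimnames(gower_by_hand) <- list(rownames(employees), rownames(employees))

round(gower_by_hand, 6)
```

The rounded result should agree with the daisy() matrix shown above.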

Floating-point noise can make raw == checks fail even when the hand calculation matches; isTRUE(all.equal(...)) is the reliable check in R.
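A small base-R illustration of the difference between the two kinds of check. The first pair is the standard textbook case; the second applies the same pattern to the hand calculation from part (a). The names hand and exact are illustrative.

```r
# Neither 0.1 + 0.2 nor 0.3 is stored exactly as a double
a <- 0.1 + 0.2
b <- 0.3

a == b                     # FALSE: the two doubles differ in the last bits
isTRUE(all.equal(a, b))    # TRUE: equal within all.equal()'s default tolerance

# Same pattern for the hand calculation
hand  <- (11 / 30 + 1 + 1 + 5 / 29 + 0 + 1) / 6
exact <- 3079 / 5220
isTRUE(all.equal(hand, exact))   # TRUE
```

Wrapping all.equal() in isTRUE() matters because all.equal() returns a character description of the difference, not FALSE, when the values do not match.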