Dann brown

I am a Senior Fullstack Software Developer sharpening my skills and learning something new about tech every day.

DANA 4840 — R: basic structures (refresher) and Worksheet 0a dataset

Study sheet for DANA 4840: R objects (vector, list, matrix, array, data.frame, factors) and Worksheet 0a (building the mixed dataset, reading .txt, .xlsx, and .csv, and aligning types with daisy() / Gower-style coding).

Vector

# =========================
# Vector
# =========================
myvector <- c(1, 3, 5)
myvector
str(myvector)
class(myvector)
is.vector(myvector)

[1] 1 3 5

 num [1:3] 1 3 5

[1] "numeric"

[1] TRUE

List

# =========================
# List
# =========================
a <- c(1:4)
b <- c("John", "Mary")
mylist <- list(a, b)
str(mylist)
class(mylist)
is.vector(mylist)
is.list(mylist)

List of 2
 $ : int [1:4] 1 2 3 4
 $ : chr [1:2] "John" "Mary"

[1] "list"

[1] TRUE

[1] TRUE
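The two TRUE results above reflect that a list is itself a (generic) vector in R. As a small sketch of how elements are pulled back out, the example below uses a named list; the names `ids` and `names` and the object `people` are made up for illustration and are not part of the worksheet.

```r
# A named list holding two vectors of different types and lengths
people <- list(ids = 1:4, names = c("John", "Mary"))

# [[ ]] and $ return the element itself
second_id  <- people[["ids"]][2]
first_name <- people$names[1]

# [ ] returns a shorter list, not the element inside it
sub_list <- people["ids"]

class(people[["ids"]])  # "integer"
class(sub_list)         # "list"
second_id               # 2
first_name              # "John"
```

Single brackets keep the container; double brackets (or `$`) open it.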

Matrix

# =========================
# Matrix
# =========================
mymatrix <- matrix(c(1:6), 2, 3, byrow = TRUE)
mymatrix
str(mymatrix)
class(mymatrix)
is.matrix(mymatrix)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

 int [1:2, 1:3] 1 4 2 5 3 6

[1] "matrix" "array"

[1] TRUE
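The str() output above lists the values as 1 4 2 5 3 6 even though the matrix prints as 1 2 3 / 4 5 6. That is because R stores matrices column by column, and byrow = TRUE only changes how the input is laid out. A minimal sketch (the names `m_row` and `m_col` are illustrative):

```r
# Same input values, filled by row versus by column (the default)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_col <- matrix(1:6, nrow = 2, ncol = 3)

# Underlying storage is always column-major
as.vector(m_row)  # 1 4 2 5 3 6
as.vector(m_col)  # 1 2 3 4 5 6

# Element access is [row, column]
m_row[2, 3]       # 6
m_col[2, 3]       # 6

# Filling by row is the transpose of filling a 3 x 2 matrix by column
identical(m_row, t(matrix(1:6, nrow = 3, ncol = 2)))  # TRUE
```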

Array

# =========================
# Array
# =========================
myarray <- array(c(1:12), dim = c(2, 3, 2))
myarray
str(myarray)
class(myarray)
is.matrix(myarray)
is.array(myarray)

, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

 int [1:2, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...

[1] "array"

[1] FALSE

[1] TRUE
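is.matrix() is FALSE above because the array has three dimensions; a matrix is simply the two-dimensional case. The sketch below indexes the same 2 x 3 x 2 array and extracts one slice (the names `arr` and `slice2` are illustrative):

```r
arr <- array(1:12, dim = c(2, 3, 2))

# Index order is [row, column, layer]
arr[2, 3, 1]   # 6
arr[1, 1, 2]   # 7

# Dropping the third index leaves a 2 x 3 matrix
slice2 <- arr[, , 2]
is.matrix(slice2)  # TRUE
dim(slice2)        # 2 3

# A two-dimensional array is a matrix
is.matrix(array(1:6, dim = c(2, 3)))  # TRUE
```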

Data frame

# =========================
# Data frame
# =========================
mydataframe <- data.frame(
  Gender = c("Male", "Female", "Male"),
  Age = c(22, 30, 33),
  medal = c("Gold", "Gold", "Bronze")
)
mydataframe
str(mydataframe)
class(mydataframe)
is.data.frame(mydataframe)

  Gender Age  medal
1   Male  22   Gold
2 Female  30   Gold
3   Male  33 Bronze

'data.frame':	3 obs. of  3 variables:
 $ Gender: chr  "Male" "Female" "Male"
 $ Age   : num  22 30 33
 $ medal : chr  "Gold" "Gold" "Bronze"

[1] "data.frame"

[1] TRUE

No automatic conversion to factor (characters stay as text)

# =========================
# data.frame — stringsAsFactors = FALSE
# =========================
mydataframe <- data.frame(
  Gender = c("Male", "Female", "Male"),
  Age = c(22, 30, 33),
  medal = c("Gold", "Gold", "Bronze"),
  stringsAsFactors = FALSE
)
mydataframe$medal
is.factor(mydataframe$medal)
is.vector(mydataframe$medal)

[1] "Gold"   "Gold"   "Bronze"

[1] FALSE

[1] TRUE
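For contrast, here is a short sketch of what happens with stringsAsFactors = TRUE. Since R 4.0.0 the default is FALSE, which is why the earlier data frame already showed chr columns; before 4.0.0 the default was TRUE. The object names `df_chr` and `df_fct` are illustrative.

```r
# Character column kept as text
df_chr <- data.frame(
  medal = c("Gold", "Gold", "Bronze"),
  stringsAsFactors = FALSE
)

# Character column converted to a factor on creation
df_fct <- data.frame(
  medal = c("Gold", "Gold", "Bronze"),
  stringsAsFactors = TRUE
)

class(df_chr$medal)   # "character"
class(df_fct$medal)   # "factor"
levels(df_fct$medal)  # "Bronze" "Gold"
```

Setting the argument explicitly keeps scripts behaving the same way on older and newer R versions.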

Nominal factor

With factor() and without ordered = TRUE, levels have no inherent order: categories are nominal (identity only, not “greater/lesser”). Use this for sex, country, blood type, etc.

# =========================
# Nominal factor
# =========================
myfactor <- factor(c("M", "F", "T", "O", "M"))
myfactor
str(myfactor)
class(myfactor)
is.factor(myfactor)

[1] M F T O M
Levels: F M O T

 Factor w/ 4 levels "F","M","O","T": 2 1 4 3 2

[1] "factor"

[1] TRUE
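To make "identity only" concrete, the sketch below inspects the same values: the levels are sorted alphabetically just for storage, and an order comparison between two values yields NA (with a warning, suppressed here). The name `sex` is illustrative.

```r
sex <- factor(c("M", "F", "T", "O", "M"))

levels(sex)      # "F" "M" "O" "T"  (alphabetical, not a ranking)
nlevels(sex)     # 4
table(sex)       # counts per category
as.integer(sex)  # 2 1 4 3 2  (codes index into levels)
is.ordered(sex)  # FALSE

# '<' is not meaningful for an unordered factor: result is NA
cmp <- suppressWarnings(sex[1] < sex[2])
cmp
```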

Ordinal factor (`ordered`, `levels`)

With ordered = TRUE, the factor is ordinal: levels follow the order given in levels = c(...), from lowest to highest. R stores each value as an integer code that indexes into that level sequence, so comparisons such as < between values follow it (useful for sizes, Likert scales, stages).

# =========================
# Ordinal factor
# =========================
myfactor <- factor(
  c("M", "F", "T", "O", "M"),
  ordered = TRUE,
  levels = c("O", "M", "F", "T")
)
myfactor
is.factor(myfactor)

[1] M F T O M
Levels: O < M < F < T

[1] TRUE
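A brief sketch of what the ordering buys you, using the same values and level order as above (the name `lvl` is illustrative): comparisons, max(), and sort() all follow O < M < F < T rather than the alphabet.

```r
lvl <- factor(
  c("M", "F", "T", "O", "M"),
  ordered = TRUE,
  levels = c("O", "M", "F", "T")
)

lvl[1] < lvl[2]          # TRUE, because M comes before F in levels
lvl >= "F"               # FALSE TRUE TRUE FALSE FALSE
as.character(max(lvl))   # "T"
as.integer(lvl)          # 2 3 4 1 2
as.character(sort(lvl))  # "O" "M" "M" "F" "T"
```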

Worksheet 0a — example dataset

Source: McCaffrey — Gower distance example

Age	Race	Height	Income	IsMale	Politics
22	1	Tall	0.39	TRUE	moderate
33	3	Short	0.34	TRUE	liberal
52	1	Moderate	0.51	FALSE	moderate
46	6	Tall	0.63	TRUE	conservative

(a)

Worksheet 0a — (a)
For each variable above, identify if it is a categorical variable or a quantitative one. If it is a categorical variable, further classify the variable as a nominal (or binary) or ordinal.

Quantitative vs categorical; nominal, ordinal, or binary.

Variable	Type	Notes
Age	Quantitative	Age in years (discrete numeric).
Race	Nominal categorical	Codes 1, 3, 6, … with no inherent “greater/lesser” order; distinct labels only.
Height	Ordinal categorical	Levels Short, Moderate, Tall have a natural stature order: Short < Moderate < Tall.
Income	Quantitative	Numeric values (here they look like proportions on 0–1); treat as a numeric scale, not as labels.
IsMale	Binary categorical (nominal)	Only `TRUE` / `FALSE`; two categories with no order (we do not rank sexes in a statistical sense).
Politics	Nominal categorical (typical in analyses)	liberal, moderate, conservative are labels; a left–right spectrum could be argued as ordered, but the exact order and spacing between labels are not fixed in the data, so the worksheet usually treats this as nominal unless the course specifies an explicit order.

Summary: your “numerical” variables are Age and Income (quantitative). The rest are categorical; among those, Height fits best as ordinal because of the physical ordering of categories. Race, IsMale, and Politics (as nominal) do not require an order in how this dataset is defined.

Sample files in the repo (same rows as the table): dana4840_worksheet0a.txt (tab-separated) and dana4840_worksheet0a.csv. For (d), you can generate the .xlsx in R with writexl at assets/data/dana4840_worksheet0a.xlsx, or create / export from Excel into that folder.

(b)

Worksheet 0a — (b)
Use R to create each variable, making sure the type matches your answer in (a). Then create a data frame in R to collectively house these variables as a data set.

Types aligned with (a): integer (Age) and double (Income); nominal factor (Race, Politics); ordered factor (Height); logical (IsMale).

# =========================
# Worksheet 0a — Part (b)
# =========================
Age <- c(22L, 33L, 52L, 46L)
Race <- factor(c(1, 3, 1, 6))
Height <- factor(
  c("Tall", "Short", "Moderate", "Tall"),
  levels = c("Short", "Moderate", "Tall"),
  ordered = TRUE
)
Income <- c(0.39, 0.34, 0.51, 0.63)
IsMale <- c(TRUE, TRUE, FALSE, TRUE)
Politics <- factor(c("moderate", "liberal", "moderate", "conservative"))

ws0a <- data.frame(Age, Race, Height, Income, IsMale, Politics)
str(ws0a)

'data.frame':	4 obs. of  6 variables:
 $ Age     : int  22 33 52 46
 $ Race    : Factor w/ 3 levels "1","3","6": 1 2 1 3
 $ Height  : Ord.factor w/ 3 levels "Short"<"Moderate"<"Tall": 3 1 2 3
 $ Income  : num  0.39 0.34 0.51 0.63
 $ IsMale  : logi  TRUE TRUE FALSE TRUE
 $ Politics: Factor w/ 3 levels "conservative","liberal",..: 3 2 3 1
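The reason the types matter is that daisy() picks a per-variable distance from each column's class. The following is a minimal sketch, assuming the cluster package is installed (it normally ships with R as a recommended package). IsMale is recoded as a two-level factor here, which is my own choice so that every categorical column is a factor; the object names `ws0a_g` and `gower_d` are illustrative.

```r
library(cluster)  # provides daisy()

ws0a_g <- data.frame(
  Age = c(22L, 33L, 52L, 46L),
  Race = factor(c(1, 3, 1, 6)),
  Height = factor(
    c("Tall", "Short", "Moderate", "Tall"),
    levels = c("Short", "Moderate", "Tall"),
    ordered = TRUE
  ),
  Income = c(0.39, 0.34, 0.51, 0.63),
  # Assumption: binary variable stored as a factor rather than logical
  IsMale = factor(c(TRUE, TRUE, FALSE, TRUE),
                  levels = c(FALSE, TRUE),
                  labels = c("Female", "Male")),
  Politics = factor(c("moderate", "liberal", "moderate", "conservative"))
)

# Gower dissimilarity: numeric columns are range-scaled, factors are
# compared by simple matching, ordered factors are treated as ordinal
gower_d <- daisy(ws0a_g, metric = "gower")

# Pairwise dissimilarities as a 4 x 4 matrix
round(as.matrix(gower_d), 3)
```

If Race were left as a number, the codes 1, 3, and 6 would be treated as distances on a scale, which is exactly what the nominal coding in (a) is meant to avoid.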

(c)

Worksheet 0a — (c)
Type the data into a text file. Use read.table() to read the contents of the text file. Is the output of read.table() a data frame or some other data structure? Do the variables match the type in part (a)? If not, how do you convert them?

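read.table() returns a data.frame. Here the TXT file is created in R with writeLines() and then read with read.table(). After reading, the types only partly match (a): Age and Income are already numeric and IsMale is logical, but Race comes in as integer and Height and Politics as character, so they must be converted to factors. Recoding IsMale as a two-level factor (Female/Male) is optional and is done here for readable labels.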

# =========================
# Worksheet 0a — Part (c)
# =========================

# =========================
# Create local TXT file
# =========================

lines <- c(
  "Age\tRace\tHeight\tIncome\tIsMale\tPolitics",
  "22\t1\tTall\t0.39\tTRUE\tmoderate",
  "33\t3\tShort\t0.34\tTRUE\tliberal",
  "52\t1\tModerate\t0.51\tFALSE\tmoderate",
  "46\t6\tTall\t0.63\tTRUE\tconservative"
)

writeLines(lines, "dana4840_worksheet0a.txt")

file.exists("dana4840_worksheet0a.txt")

getwd()

# =========================
# Read TXT
# =========================

fp <- "dana4840_worksheet0a.txt"

df_txt <- read.table(
  fp,
  header = TRUE,
  sep = "\t",
  stringsAsFactors = FALSE
)

class(df_txt)

str(df_txt)

# =========================
# Convert variables
# =========================

# Nominal categorical
df_txt$Race <- factor(df_txt$Race)

# Ordinal categorical
df_txt$Height <- factor(
  df_txt$Height,
  levels = c("Short", "Moderate", "Tall"),
  ordered = TRUE
)

# Binary categorical
df_txt$IsMale <- factor(
  df_txt$IsMale,
  levels = c(FALSE, TRUE),
  labels = c("Female", "Male")
)

# Politics as nominal factor
df_txt$Politics <- factor(df_txt$Politics)

# =========================
# Final structure
# =========================

str(df_txt)

df_txt

[1] TRUE
[1] "P:/langara/term 4/dana 4840"
[1] "data.frame"
'data.frame':	4 obs. of  6 variables:
 $ Age     : int  22 33 52 46
 $ Race    : int  1 3 1 6
 $ Height  : chr  "Tall" "Short" "Moderate" "Tall"
 $ Income  : num  0.39 0.34 0.51 0.63
 $ IsMale  : logi  TRUE TRUE FALSE TRUE
 $ Politics: chr  "moderate" "liberal" "moderate" "conservative"
'data.frame':	4 obs. of  6 variables:
 $ Age     : int  22 33 52 46
 $ Race    : Factor w/ 3 levels "1","3","6": 1 2 1 3
 $ Height  : Ord.factor w/ 3 levels "Short"<"Moderate"<..: 3 1 2 3
 $ Income  : num  0.39 0.34 0.51 0.63
 $ IsMale  : Factor w/ 2 levels "Female","Male": 2 2 1 2
 $ Politics: Factor w/ 3 levels "conservative",..: 3 2 3 1
   Age Race   Height Income IsMale   Politics
1   22    1     Tall   0.39   Male   moderate
2   33    3    Short   0.34   Male    liberal
3   52    1 Moderate   0.51 Female   moderate
4   46    6     Tall   0.63   Male conservative
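As an alternative to converting after the fact, read.table() can be told the column classes up front through colClasses. The sketch below reads the same rows from an in-memory string via the text argument, so it does not depend on the file from this part; the names `txt` and `df_cc` are illustrative. Height still needs a second step because colClasses cannot express a custom level order.

```r
# Same tab-separated rows as the worksheet, held in a string
txt <- paste(
  c(
    "Age\tRace\tHeight\tIncome\tIsMale\tPolitics",
    "22\t1\tTall\t0.39\tTRUE\tmoderate",
    "33\t3\tShort\t0.34\tTRUE\tliberal",
    "52\t1\tModerate\t0.51\tFALSE\tmoderate",
    "46\t6\tTall\t0.63\tTRUE\tconservative"
  ),
  collapse = "\n"
)

# One class per column, in file order
df_cc <- read.table(
  text = txt,
  header = TRUE,
  sep = "\t",
  colClasses = c("integer", "factor", "character",
                 "numeric", "logical", "factor")
)

# Ordinal variable: set the level order explicitly
df_cc$Height <- factor(
  df_cc$Height,
  levels = c("Short", "Moderate", "Tall"),
  ordered = TRUE
)

str(df_cc)
```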

(d)

Worksheet 0a — (d)
Type the data into an Excel file. Use read_excel() in package “readxl” to read the contents of the Excel file. Is the output a data frame or some other data structure? If it is not a data frame, how do you convert it into a data frame? Do the variables match the type in part (a)?

read_excel() returns a tibble (tbl_df), a subclass of data.frame. Use as.data.frame() if you need a plain data frame. Apply the same conversions as in (c) to align with (a). The .xlsx is written in R with writexl (readxl only reads); fp <- "assets/data/dana4840_worksheet0a.xlsx" matches the blog/repo layout. Install with: install.packages(c("writexl", "readxl")).

# =========================
# Worksheet 0a — Part (d)
# =========================
# Creates assets/data/dana4840_worksheet0a.xlsx, then read_excel(fp).

# =========================
# Create Excel file (writexl)
# =========================
library(writexl)

fp <- "assets/data/dana4840_worksheet0a.xlsx"
dir.create("assets/data", recursive = TRUE, showWarnings = FALSE)

ws0a_xl <- data.frame(
  Age = c(22L, 33L, 52L, 46L),
  Race = c(1, 3, 1, 6),
  Height = c("Tall", "Short", "Moderate", "Tall"),
  Income = c(0.39, 0.34, 0.51, 0.63),
  IsMale = c(TRUE, TRUE, FALSE, TRUE),
  Politics = c("moderate", "liberal", "moderate", "conservative"),
  stringsAsFactors = FALSE
)

write_xlsx(ws0a_xl, path = fp)

file.exists(fp)

# =========================
# Read Excel (readxl)
# =========================
library(readxl)

df_xl <- read_excel(fp)

class(df_xl)

df_xl <- as.data.frame(df_xl)

str(df_xl, vec.len = 1)

# =========================
# Convert variables (same as part (c) / align with (a))
# =========================
df_xl$Race <- factor(df_xl$Race)

df_xl$Height <- factor(
  df_xl$Height,
  levels = c("Short", "Moderate", "Tall"),
  ordered = TRUE
)

df_xl$IsMale <- factor(
  df_xl$IsMale,
  levels = c(FALSE, TRUE),
  labels = c("Female", "Male")
)

df_xl$Politics <- factor(df_xl$Politics)

str(df_xl)

df_xl

[1] TRUE
[1] "tbl_df"     "tbl"        "data.frame"

'data.frame':	4 obs. of  6 variables:
 $ Age     : num 22 ...
 $ Race    : num 1 ...
 $ Height  : chr "Tall" ...
 $ Income  : num 0.39 ...
 $ IsMale  : logi TRUE ...
 $ Politics: chr "moderate" ...
'data.frame':	4 obs. of  6 variables:
 $ Age     : num  22 33 52 46
 $ Race    : Factor w/ 3 levels "1","3","6": 1 2 1 3
 $ Height  : Ord.factor w/ 3 levels "Short"<"Moderate"<..: 3 1 2 3
 $ Income  : num  0.39 0.34 0.51 0.63
 $ IsMale  : Factor w/ 2 levels "Female","Male": 2 2 1 2
 $ Politics: Factor w/ 3 levels "conservative",..: 3 2 3 1
   Age Race   Height Income IsMale   Politics
1   22    1     Tall   0.39   Male   moderate
2   33    3    Short   0.34   Male    liberal
3   52    1 Moderate   0.51 Female   moderate
4   46    6     Tall   0.63   Male conservative

(e)

Worksheet 0a — (e)
Create a CSV file using the data. Use read.csv() to read the contents of the CSV file. Is the output a data frame or some other data structure? Do the variables match the type in part (a)?

read.csv() returns a data.frame. Here the CSV is created in R with write.csv() (same rows as the table) and then read back. With stringsAsFactors = FALSE, Race is read as integer and Height and Politics as character, so the types do not match (a) until you apply the same conversions as in (c).

# =========================
# Worksheet 0a — Part (e)
# =========================

# =========================
# Create local CSV file
# =========================
ws0a_raw <- data.frame(
  Age = c(22L, 33L, 52L, 46L),
  Race = c(1, 3, 1, 6),
  Height = c("Tall", "Short", "Moderate", "Tall"),
  Income = c(0.39, 0.34, 0.51, 0.63),
  IsMale = c(TRUE, TRUE, FALSE, TRUE),
  Politics = c("moderate", "liberal", "moderate", "conservative"),
  stringsAsFactors = FALSE
)

write.csv(ws0a_raw, "dana4840_worksheet0a.csv", row.names = FALSE)

file.exists("dana4840_worksheet0a.csv")

# =========================
# Read CSV
# =========================
fp <- "dana4840_worksheet0a.csv"

df_csv <- read.csv(fp, stringsAsFactors = FALSE)

class(df_csv)

str(df_csv)

# =========================
# Convert variables (same idea as part c)
# =========================
df_csv$Race <- factor(df_csv$Race)

df_csv$Height <- factor(
  df_csv$Height,
  levels = c("Short", "Moderate", "Tall"),
  ordered = TRUE
)

df_csv$IsMale <- factor(
  df_csv$IsMale,
  levels = c(FALSE, TRUE),
  labels = c("Female", "Male")
)

df_csv$Politics <- factor(df_csv$Politics)

str(df_csv)

df_csv

[1] TRUE
[1] "data.frame"
'data.frame':	4 obs. of  6 variables:
 $ Age     : int  22 33 52 46
 $ Race    : int  1 3 1 6
 $ Height  : chr  "Tall" "Short" "Moderate" "Tall"
 $ Income  : num  0.39 0.34 0.51 0.63
 $ IsMale  : logi  TRUE TRUE FALSE TRUE
 $ Politics: chr  "moderate" "liberal" "moderate" "conservative"
'data.frame':	4 obs. of  6 variables:
 $ Age     : int  22 33 52 46
 $ Race    : Factor w/ 3 levels "1","3","6": 1 2 1 3
 $ Height  : Ord.factor w/ 3 levels "Short"<"Moderate"<..: 3 1 2 3
 $ Income  : num  0.39 0.34 0.51 0.63
 $ IsMale  : Factor w/ 2 levels "Female","Male": 2 2 1 2
 $ Politics: Factor w/ 3 levels "conservative",..: 3 2 3 1
   Age Race   Height Income IsMale   Politics
1   22    1     Tall   0.39   Male   moderate
2   33    3    Short   0.34   Male    liberal
3   52    1 Moderate   0.51 Female   moderate
4   46    6     Tall   0.63   Male conservative
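Parts (c), (d), and (e) repeat the same conversions, so they could be wrapped in one function. The helper below is a sketch only: `convert_ws0a()` is a hypothetical name, not something from the worksheet, and the as.integer() step on Age is an extra I added so that input read from Excel (where Age arrives as a double) ends up with the same type as the other two.

```r
# Hypothetical helper: apply the part (a) types to a freshly read data frame
convert_ws0a <- function(df) {
  df$Age <- as.integer(df$Age)
  df$Race <- factor(df$Race)
  df$Height <- factor(df$Height,
                      levels = c("Short", "Moderate", "Tall"),
                      ordered = TRUE)
  df$IsMale <- factor(df$IsMale,
                      levels = c(FALSE, TRUE),
                      labels = c("Female", "Male"))
  df$Politics <- factor(df$Politics)
  df
}

# Stand-in for any of df_txt, df_xl, or df_csv before conversion
raw <- data.frame(
  Age = c(22, 33, 52, 46),
  Race = c(1, 3, 1, 6),
  Height = c("Tall", "Short", "Moderate", "Tall"),
  Income = c(0.39, 0.34, 0.51, 0.63),
  IsMale = c(TRUE, TRUE, FALSE, TRUE),
  Politics = c("moderate", "liberal", "moderate", "conservative"),
  stringsAsFactors = FALSE
)

clean <- convert_ws0a(raw)

# First class of each column after conversion
col_classes <- sapply(clean, function(col) class(col)[1])
col_classes
```

With this in place, each part reduces to reading the file and calling the helper once.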

Teacher suggestion / last session: Worksheet 0a in a few lines

In class, sometimes only reading the already existing .txt and checking class() / str() / head() is shown, without recreating the file or doing all the factor conversions in the same block. The longer version above remains the reference for reproducing the whole workflow.

Run part (c) first so dana4840_worksheet0a.txt exists (or create that file by hand).

fp_txt <- "dana4840_worksheet0a.txt"
stopifnot(file.exists(fp_txt))
df0 <- read.table(fp_txt, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
class(df0)
str(df0)
head(df0)

[1] "data.frame"
'data.frame':	4 obs. of  6 variables:
 $ Age     : int  22 33 52 46
 $ Race    : int  1 3 1 6
 $ Height  : chr  "Tall" "Short" "Moderate" "Tall"
 $ Income  : num  0.39 0.34 0.51 0.63
 $ IsMale  : logi  TRUE TRUE FALSE TRUE
 $ Politics: chr  "moderate" "liberal" "moderate" "conservative"
  Age Race   Height Income IsMale   Politics
1  22    1     Tall   0.39   TRUE   moderate
2  33    3    Short   0.34   TRUE    liberal
3  52    1 Moderate   0.51  FALSE   moderate
4  46    6     Tall   0.63   TRUE conservative