DANA 4840: R basic structures (refresher) and the Worksheet 0a dataset
Study sheet for DANA 4840: R objects (vector, list, matrix, array, data.frame, factors) and Worksheet 0a (building the mixed dataset, reading .txt, .xlsx, and .csv, and aligning types with daisy() / Gower-style coding).
Vector
# =========================
# Vector
# =========================
myvector <- c(1, 3, 5)
myvector
str(myvector)
class(myvector)
is.vector(myvector)
[1] 1 3 5
num [1:3] 1 3 5
[1] "numeric"
[1] TRUE
List
# =========================
# List
# =========================
a <- c(1:4)
b <- c("John", "Mary")
mylist <- list(a, b)
str(mylist)
class(mylist)
is.vector(mylist)
is.list(mylist)
List of 2
$ : int [1:4] 1 2 3 4
$ : chr [1:2] "John" "Mary"
[1] "list"
[1] TRUE
[1] TRUE
Matrix
# =========================
# Matrix
# =========================
mymatrix <- matrix(c(1:6), 2, 3, byrow = TRUE)
mymatrix
str(mymatrix)
class(mymatrix)
is.matrix(mymatrix)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
int [1:2, 1:3] 1 4 2 5 3 6
[1] "matrix" "array"
[1] TRUE
Array
# =========================
# Array
# =========================
myarray <- array(c(1:12), dim = c(2, 3, 2))
myarray
str(myarray)
class(myarray)
is.matrix(myarray)
is.array(myarray)
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
int [1:2, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
[1] "array"
[1] FALSE
[1] TRUE
Data frame
# =========================
# Data frame
# =========================
mydataframe <- data.frame(
Gender = c("Male", "Female", "Male"),
Age = c(22, 30, 33),
medal = c("Gold", "Gold", "Bronze")
)
mydataframe
str(mydataframe)
class(mydataframe)
is.data.frame(mydataframe)
Gender Age medal
1 Male 22 Gold
2 Female 30 Gold
3 Male 33 Bronze
'data.frame': 3 obs. of 3 variables:
$ Gender: chr "Male" "Female" "Male"
$ Age : num 22 30 33
$ medal : chr "Gold" "Gold" "Bronze"
[1] "data.frame"
[1] TRUE
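No automatic conversion to factor (characters kept as text)
Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the argument below only makes that behavior explicit (and enforces it on older versions).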
# =========================
# data.frame — stringsAsFactors = FALSE
# =========================
mydataframe <- data.frame(
Gender = c("Male", "Female", "Male"),
Age = c(22, 30, 33),
medal = c("Gold", "Gold", "Bronze"),
stringsAsFactors = FALSE
)
mydataframe$medal
is.factor(mydataframe$medal)
is.vector(mydataframe$medal)
[1] "Gold" "Gold" "Bronze"
[1] FALSE
[1] TRUE
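For contrast, here is a small sketch of what the same kind of call does with stringsAsFactors = TRUE (the default before R 4.0.0). The object names df_chr and df_fac are illustrative only.

```r
# Same column built twice: once kept as text, once converted to a factor.
df_chr <- data.frame(
  medal = c("Gold", "Gold", "Bronze"),
  stringsAsFactors = FALSE
)
df_fac <- data.frame(
  medal = c("Gold", "Gold", "Bronze"),
  stringsAsFactors = TRUE
)

class(df_chr$medal)    # "character"
class(df_fac$medal)    # "factor"
levels(df_fac$medal)   # "Bronze" "Gold" (levels are sorted alphabetically)
```

Keeping text as character and converting to factor explicitly, as the worksheet code does later, makes the choice of levels and their order visible in the script.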
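Nominal factor
factor() without ordered = TRUE defines levels with no inherent order: the categories are treated as nominal (identity only, no "greater/less"). It suits sex, country, blood type, and similar variables. By default the levels are sorted alphabetically, which is why the output below lists them as F M O T.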
# =========================
# Factor nominal
# =========================
myfactor <- factor(c("M", "F", "T", "O", "M"))
myfactor
str(myfactor)
class(myfactor)
is.factor(myfactor)
[1] M F T O M
Levels: F M O T
Factor w/ 4 levels "F","M","O","T": 2 1 4 3 2
[1] "factor"
[1] TRUE
Ordinal factor (ordered, levels)
With ordered = TRUE the factor is ordinal: the levels have a logical order set by levels = c(...), listed from lowest to highest. R stores it as integer codes plus the ordered levels; comparisons such as < between levels respect that sequence (useful for sizes, Likert scales, stages).
# =========================
# Factor ordinal
# =========================
myfactor <- factor(
c("M", "F", "T", "O", "M"),
ordered = TRUE,
levels = c("O", "M", "F", "T")
)
myfactor
is.factor(myfactor)
[1] M F T O M
Levels: O < M < F < T
[1] TRUE
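The difference between the nominal and the ordinal factor shows up as soon as levels are compared. A minimal sketch, reusing the same values as above under the illustrative names sizes_nom and sizes_ord:

```r
sizes_nom <- factor(c("M", "F", "T", "O", "M"))
sizes_ord <- factor(
  c("M", "F", "T", "O", "M"),
  levels = c("O", "M", "F", "T"),
  ordered = TRUE
)

# Integer codes follow the declared level order O < M < F < T.
as.integer(sizes_ord)                # 2 3 4 1 2

# Comparisons are meaningful only for the ordered factor.
sizes_ord < "F"                      # TRUE FALSE FALSE TRUE TRUE
max(sizes_ord)                       # T

# On a nominal factor, < is not defined: R warns and returns NA.
suppressWarnings(sizes_nom < "F")    # NA NA NA NA NA
is.ordered(sizes_nom)                # FALSE
```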
Worksheet 0a — dataset
Source: McCaffrey, Gower distance example
| Age | Race | Height | Income | IsMale | Politics |
|---|---|---|---|---|---|
| 22 | 1 | Tall | 0.39 | TRUE | moderate |
| 33 | 3 | Short | 0.34 | TRUE | liberal |
| 52 | 1 | Moderate | 0.51 | FALSE | moderate |
| 46 | 6 | Tall | 0.63 | TRUE | conservative |
Worksheet 0a — (a)
For each variable above, identify if it is a categorical variable or a quantitative one. If it is a categorical variable, further classify the variable as a nominal (or binary) or ordinal.
Quantitative vs. categorical; nominal, ordinal, or binary.
| Variable | Type | Details |
|---|---|---|
| Age | Quantitative | Age in years (discrete numeric). |
| Race | Nominal categorical | Codes 1, 3, 6… have no inherent "greater/less" order; they are only distinct labels. |
| Height | Ordinal categorical | The levels Short, Moderate, Tall have a natural order by stature: Short < Moderate < Tall. |
| Income | Quantitative | Numeric values (here they look like proportions between 0 and 1); treated as a numeric scale, not as labels. |
| IsMale | Binary categorical (nominal) | Only TRUE / FALSE; two categories with no order (neither sex is "greater" than the other in a statistical sense). |
| Politics | Nominal categorical (typical in analysis) | liberal, moderate, conservative are labels; the left-right spectrum could be argued to be an order, but neither the exact order nor the distance between labels is fixed in the data, so the worksheet usually treats it as nominal unless the course imposes an explicit order. |
Summary: the numerical variables are Age and Income (quantitative). The rest are categorical; among them, Height fits best as ordinal because of the physical order of its categories. Race, IsMale, and Politics (as nominal) carry no required order in the dataset definition.
Example files in the repo (same rows as the table): dana4840_worksheet0a.txt (tab-separated) and dana4840_worksheet0a.csv. In (d) the .xlsx can be generated in R with writexl (path assets/data/dana4840_worksheet0a.xlsx) or created by hand / exported from Excel into that folder.
Worksheet 0a — (b)
Use R to create each variable, making sure the type matches your answer in (a). Then create a data frame in R to collectively house these variables as a data set.
Types aligned with (a): integer/double for the quantitative variables; nominal factor (Race, Politics); ordered factor (Height); logical (IsMale).
# =========================
# Worksheet 0a — Part (b)
# =========================
Age <- c(22L, 33L, 52L, 46L)
Race <- factor(c(1, 3, 1, 6))
Height <- factor(
c("Tall", "Short", "Moderate", "Tall"),
levels = c("Short", "Moderate", "Tall"),
ordered = TRUE
)
Income <- c(0.39, 0.34, 0.51, 0.63)
IsMale <- c(TRUE, TRUE, FALSE, TRUE)
Politics <- factor(c("moderate", "liberal", "moderate", "conservative"))
ws0a <- data.frame(Age, Race, Height, Income, IsMale, Politics)
str(ws0a)
'data.frame': 4 obs. of 6 variables:
$ Age : int 22 33 52 46
$ Race : Factor w/ 3 levels "1","3","6": 1 2 1 3
$ Height : Ord.factor w/ 3 levels "Short"<"Moderate"<"Tall": 3 1 2 3
$ Income : num 0.39 0.34 0.51 0.63
$ IsMale : logi TRUE TRUE FALSE TRUE
$ Politics: Factor w/ 3 levels "conservative","liberal",..: 3 2 3 1
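To confirm that each column's R class matches the classification in (a), the classes can be tabulated next to the intended measurement scale. This is only a sketch: the lookup vector scale_of is a hypothetical convenience, not part of the worksheet.

```r
ws0a <- data.frame(
  Age      = c(22L, 33L, 52L, 46L),
  Race     = factor(c(1, 3, 1, 6)),
  Height   = factor(c("Tall", "Short", "Moderate", "Tall"),
                    levels = c("Short", "Moderate", "Tall"), ordered = TRUE),
  Income   = c(0.39, 0.34, 0.51, 0.63),
  IsMale   = c(TRUE, TRUE, FALSE, TRUE),
  Politics = factor(c("moderate", "liberal", "moderate", "conservative"))
)

# First class of each column; an ordered factor reports "ordered" first.
col_class <- sapply(ws0a, function(col) class(col)[1])

# Hypothetical lookup from R class to the measurement scale used in (a).
scale_of <- c(
  integer = "quantitative",
  numeric = "quantitative",
  factor  = "nominal",
  ordered = "ordinal",
  logical = "binary"
)

data.frame(
  variable = names(col_class),
  r_class  = unname(col_class),
  scale    = unname(scale_of[col_class])
)
```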
Worksheet 0a — (c)
Type the data into a text file. Use read.table() to read the contents of the text file. Is the output of read.table() a data frame or some other data structure? Do the variables match the type in part (a)? If not, how do you convert them?
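read.table() returns a data.frame. Here the TXT file is written from R with writeLines() and read back with read.table(). After reading, Age, Income, and IsMale already have usable types (int, num, logi), but Race arrives as int and Height and Politics as chr, so they must be converted to match (a). IsMale is additionally recoded as a labeled two-level factor.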
# =========================
# Worksheet 0a — Part (c)
# =========================
# =========================
# Create local TXT file
# =========================
lines <- c(
"Age\tRace\tHeight\tIncome\tIsMale\tPolitics",
"22\t1\tTall\t0.39\tTRUE\tmoderate",
"33\t3\tShort\t0.34\tTRUE\tliberal",
"52\t1\tModerate\t0.51\tFALSE\tmoderate",
"46\t6\tTall\t0.63\tTRUE\tconservative"
)
writeLines(lines, "dana4840_worksheet0a.txt")
file.exists("dana4840_worksheet0a.txt")
getwd()
# =========================
# Read TXT
# =========================
fp <- "dana4840_worksheet0a.txt"
df_txt <- read.table(
fp,
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE
)
class(df_txt)
str(df_txt)
# =========================
# Convert variables
# =========================
# Nominal categorical
df_txt$Race <- factor(df_txt$Race)
# Ordinal categorical
df_txt$Height <- factor(
df_txt$Height,
levels = c("Short", "Moderate", "Tall"),
ordered = TRUE
)
# Binary categorical
df_txt$IsMale <- factor(
df_txt$IsMale,
levels = c(FALSE, TRUE),
labels = c("Female", "Male")
)
# Politics as nominal factor
df_txt$Politics <- factor(df_txt$Politics)
# =========================
# Final structure
# =========================
str(df_txt)
df_txt
[1] TRUE
[1] "P:/langara/term 4/dana 4840"
[1] "data.frame"
'data.frame': 4 obs. of 6 variables:
$ Age : int 22 33 52 46
$ Race : int 1 3 1 6
$ Height : chr "Tall" "Short" "Moderate" "Tall"
$ Income : num 0.39 0.34 0.51 0.63
$ IsMale : logi TRUE TRUE FALSE TRUE
$ Politics: chr "moderate" "liberal" "moderate" "conservative"
'data.frame': 4 obs. of 6 variables:
$ Age : int 22 33 52 46
$ Race : Factor w/ 3 levels "1","3","6": 1 2 1 3
$ Height : Ord.factor w/ 3 levels "Short"<"Moderate"<..: 3 1 2 3
$ Income : num 0.39 0.34 0.51 0.63
$ IsMale : Factor w/ 2 levels "Female","Male": 2 2 1 2
$ Politics: Factor w/ 3 levels "conservative",..: 3 2 3 1
Age Race Height Income IsMale Politics
1 22 1 Tall 0.39 Male moderate
2 33 3 Short 0.34 Male liberal
3 52 1 Moderate 0.51 Female moderate
4 46 6 Tall 0.63 Male conservative
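An alternative to converting after the fact is to declare the column types while reading, through the colClasses argument of read.table(). The sketch below writes the same rows to a temporary file (tmp and df_cc are illustrative names); an ordered factor cannot be requested through colClasses, so Height is still converted afterwards.

```r
# Write the worksheet rows to a temporary tab-separated file.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Age\tRace\tHeight\tIncome\tIsMale\tPolitics",
  "22\t1\tTall\t0.39\tTRUE\tmoderate",
  "33\t3\tShort\t0.34\tTRUE\tliberal",
  "52\t1\tModerate\t0.51\tFALSE\tmoderate",
  "46\t6\tTall\t0.63\tTRUE\tconservative"
), tmp)

# Declare the type of every column at read time.
df_cc <- read.table(
  tmp,
  header = TRUE,
  sep = "\t",
  colClasses = c(
    Age = "integer", Race = "factor", Height = "character",
    Income = "numeric", IsMale = "logical", Politics = "factor"
  )
)

# The level order for the ordinal variable still has to be set explicitly.
df_cc$Height <- factor(
  df_cc$Height,
  levels = c("Short", "Moderate", "Tall"),
  ordered = TRUE
)

str(df_cc)
unlink(tmp)
```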
Worksheet 0a — (d)
Type the data into an Excel file. Use read_excel() in package “readxl” to read the contents of the Excel file. Is the output a data frame or some other data structure? If it is not a data frame, how do you convert it into a data frame? Do the variables match the type in part (a)?
read_excel() returns a tibble (class tbl_df), which inherits from data.frame; use as.data.frame() if a plain data frame is needed. Numeric cells come back as doubles, so Age and Race are read as num rather than int. Apply the same conversions as in (c), plus as.integer() for Age, to align with (a). The .xlsx is written from R with writexl (readxl only reads); fp <- "assets/data/dana4840_worksheet0a.xlsx" matches the blog/repo layout. Installation: install.packages(c("writexl", "readxl")).
# =========================
# Worksheet 0a — Part (d)
# =========================
# Creates assets/data/dana4840_worksheet0a.xlsx, then reads it with read_excel(fp).
# =========================
# Create Excel file (writexl)
# =========================
library(writexl)
fp <- "assets/data/dana4840_worksheet0a.xlsx"
dir.create("assets/data", recursive = TRUE, showWarnings = FALSE)
ws0a_xl <- data.frame(
Age = c(22L, 33L, 52L, 46L),
Race = c(1, 3, 1, 6),
Height = c("Tall", "Short", "Moderate", "Tall"),
Income = c(0.39, 0.34, 0.51, 0.63),
IsMale = c(TRUE, TRUE, FALSE, TRUE),
Politics = c("moderate", "liberal", "moderate", "conservative"),
stringsAsFactors = FALSE
)
write_xlsx(ws0a_xl, path = fp)
file.exists(fp)
# =========================
# Read Excel (readxl)
# =========================
library(readxl)
df_xl <- read_excel(fp)
class(df_xl)
df_xl <- as.data.frame(df_xl)
str(df_xl, vec.len = 1)
# =========================
# Convert variables (same as part (c) / align with (a))
# =========================
df_xl$Age <- as.integer(df_xl$Age)  # read_excel() returns doubles; restore integer as in (b)
df_xl$Race <- factor(df_xl$Race)
df_xl$Height <- factor(
df_xl$Height,
levels = c("Short", "Moderate", "Tall"),
ordered = TRUE
)
df_xl$IsMale <- factor(
df_xl$IsMale,
levels = c(FALSE, TRUE),
labels = c("Female", "Male")
)
df_xl$Politics <- factor(df_xl$Politics)
str(df_xl)
df_xl
[1] TRUE
[1] "tbl_df" "tbl" "data.frame"
'data.frame': 4 obs. of 6 variables:
$ Age : num 22 ...
$ Race : num 1 ...
$ Height : chr "Tall" ...
$ Income : num 0.39 ...
$ IsMale : logi TRUE ...
$ Politics: chr "moderate" ...
'data.frame': 4 obs. of 6 variables:
$ Age : int 22 33 52 46
$ Race : Factor w/ 3 levels "1","3","6": 1 2 1 3
$ Height : Ord.factor w/ 3 levels "Short"<"Moderate"<..: 3 1 2 3
$ Income : num 0.39 0.34 0.51 0.63
$ IsMale : Factor w/ 2 levels "Female","Male": 2 2 1 2
$ Politics: Factor w/ 3 levels "conservative",..: 3 2 3 1
Age Race Height Income IsMale Politics
1 22 1 Tall 0.39 Male moderate
2 33 3 Short 0.34 Male liberal
3 52 1 Moderate 0.51 Female moderate
4 46 6 Tall 0.63 Male conservative
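read_excel() returns numeric cells as doubles, which is why Age shows as num right after reading. A minimal sketch of checking that a double column holds only whole numbers before converting it to integer; age_xl is a stand-in vector typed by hand, not read from the file.

```r
# Stand-in for a numeric column as returned by read_excel().
age_xl <- c(22, 33, 52, 46)

is.integer(age_xl)   # FALSE: the values are stored as double
typeof(age_xl)       # "double"

# Convert only if no fractional part would be lost.
is_whole <- all(age_xl == round(age_xl))
age_int <- if (is_whole) as.integer(age_xl) else age_xl

typeof(age_int)      # "integer"
```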
Worksheet 0a — (e)
Create a CSV file using the data. Use read.csv() to read the contents of the CSV file. Is the output a data frame or some other data structure? Do the variables match the type in part (a)?
read.csv() returns a data.frame. Here the CSV is written from R with write.csv() (same rows as the table) and then read back; with stringsAsFactors = FALSE, Race, Height, and Politics do not match (a) until the same conversions as in (c) are applied.
# =========================
# Worksheet 0a — Part (e)
# =========================
# =========================
# Create local CSV file
# =========================
ws0a_raw <- data.frame(
Age = c(22L, 33L, 52L, 46L),
Race = c(1, 3, 1, 6),
Height = c("Tall", "Short", "Moderate", "Tall"),
Income = c(0.39, 0.34, 0.51, 0.63),
IsMale = c(TRUE, TRUE, FALSE, TRUE),
Politics = c("moderate", "liberal", "moderate", "conservative"),
stringsAsFactors = FALSE
)
write.csv(ws0a_raw, "dana4840_worksheet0a.csv", row.names = FALSE)
file.exists("dana4840_worksheet0a.csv")
# =========================
# Read CSV
# =========================
fp <- "dana4840_worksheet0a.csv"
df_csv <- read.csv(fp, stringsAsFactors = FALSE)
class(df_csv)
str(df_csv)
# =========================
# Convert variables (same idea as part c)
# =========================
df_csv$Race <- factor(df_csv$Race)
df_csv$Height <- factor(
df_csv$Height,
levels = c("Short", "Moderate", "Tall"),
ordered = TRUE
)
df_csv$IsMale <- factor(
df_csv$IsMale,
levels = c(FALSE, TRUE),
labels = c("Female", "Male")
)
df_csv$Politics <- factor(df_csv$Politics)
str(df_csv)
df_csv
[1] TRUE
[1] "data.frame"
'data.frame': 4 obs. of 6 variables:
$ Age : int 22 33 52 46
$ Race : int 1 3 1 6
$ Height : chr "Tall" "Short" "Moderate" "Tall"
$ Income : num 0.39 0.34 0.51 0.63
$ IsMale : logi TRUE TRUE FALSE TRUE
$ Politics: chr "moderate" "liberal" "moderate" "conservative"
'data.frame': 4 obs. of 6 variables:
$ Age : int 22 33 52 46
$ Race : Factor w/ 3 levels "1","3","6": 1 2 1 3
$ Height : Ord.factor w/ 3 levels "Short"<"Moderate"<..: 3 1 2 3
$ Income : num 0.39 0.34 0.51 0.63
$ IsMale : Factor w/ 2 levels "Female","Male": 2 2 1 2
$ Politics: Factor w/ 3 levels "conservative",..: 3 2 3 1
Age Race Height Income IsMale Politics
1 22 1 Tall 0.39 Male moderate
2 33 3 Short 0.34 Male liberal
3 52 1 Moderate 0.51 Female moderate
4 46 6 Tall 0.63 Male conservative
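Parts (c), (d), and (e) repeat the same conversions, so they can be gathered into one function and applied to whichever data frame was read. The sketch below is one possible arrangement, not the worksheet's required answer: align_types() is a hypothetical helper, and the final daisy() call assumes the cluster package is installed (it is skipped otherwise). According to its documentation, daisy() treats factor columns as nominal and ordered columns as ordinal, which is why the types are aligned first.

```r
# Hypothetical helper: apply the conversions from (c)-(e) in one place.
align_types <- function(df) {
  df$Age <- as.integer(df$Age)
  df$Race <- factor(df$Race)
  df$Height <- factor(
    df$Height,
    levels = c("Short", "Moderate", "Tall"),
    ordered = TRUE
  )
  df$IsMale <- factor(
    df$IsMale,
    levels = c(FALSE, TRUE),
    labels = c("Female", "Male")
  )
  df$Politics <- factor(df$Politics)
  df
}

# Raw data as it would arrive from read.table(), read_excel(), or read.csv().
raw <- data.frame(
  Age      = c(22, 33, 52, 46),
  Race     = c(1, 3, 1, 6),
  Height   = c("Tall", "Short", "Moderate", "Tall"),
  Income   = c(0.39, 0.34, 0.51, 0.63),
  IsMale   = c(TRUE, TRUE, FALSE, TRUE),
  Politics = c("moderate", "liberal", "moderate", "conservative"),
  stringsAsFactors = FALSE
)

ws0a_typed <- align_types(raw)
str(ws0a_typed)

# Gower dissimilarities on the typed data, only if cluster is available.
if (requireNamespace("cluster", quietly = TRUE)) {
  d_gower <- cluster::daisy(ws0a_typed, metric = "gower")
  print(d_gower)
}
```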