library(dplyr)
library(tidyr)
library(pharmaversesdtm)
# Load all domains we'll use today
data("vs", package = "pharmaversesdtm")
data("dm", package = "pharmaversesdtm")
data("ae", package = "pharmaversesdtm")
data("ex", package = "pharmaversesdtm")
data("ds", package = "pharmaversesdtm")Day 5: Advanced Tidyverse: Pivoting & Joining
Reshaping SDTM Data for Analysis
1 Learning Objectives
By the end of Day 5, you will be able to:
- Reshape datasets using
pivot_wider()andpivot_longer() - Understand when to use Wide vs. Long formats in clinical data
- Perform advanced joins (
anti_join()) for Data Quality Checks - Create 3 analysis-ready datasets as specified in the roadmap
2 Introduction
2.1 Wide vs. Long Data
In clinical data, how you organize your tables (datasets) is very important. There are two main ways:
- Long format: Each row is a single measurement or observation. For example, if you measure blood pressure three times for a subject, you’ll have three rows for that subject. This is the format used by SDTM because it is flexible and easy to filter or summarize.
- Wide format: Each row is a subject or visit, and each measurement is in its own column. For example, you might have one row per subject, with columns for blood pressure at each visit. This format is often used for calculations or when you need to compare values side-by-side.
| Format | Structure | Use Case |
|---|---|---|
| Long | One measurement per row | Storage, filtering, SDTM standard |
| Wide | All measurements in columns | Calculations (BMI, Change from Baseline) |
2.1.1 Why does this matter?
Some analyses are easier in long format (like counting how many times a test was done), while others are easier in wide format (like calculating BMI from height and weight). Being able to switch between these formats is a key skill for clinical programmers.
SDTM stores data in long format (one row per observation). For analysis, we often need to reshape to wide format (one row per subject/visit).
3 Package Installation & Loading
4 Part 1: Pivoting Data
4.1 From Long to Wide: pivot_wider()
Vital Signs (VS) comes in long format. Each test (Height, Weight, BP) is a separate row. Let’s reshape it so each test has its own column.
# Current long format
vs %>%
filter(USUBJID == "01-701-1015" & VISIT == "BASELINE") %>%
select(USUBJID, VISIT, VSTESTCD, VSORRES, VSORRESU) %>%
head(10)# A tibble: 10 × 5
USUBJID VISIT VSTESTCD VSORRES VSORRESU
<chr> <chr> <chr> <chr> <chr>
1 01-701-1015 BASELINE DIABP 56 mmHg
2 01-701-1015 BASELINE DIABP 51 mmHg
3 01-701-1015 BASELINE DIABP 61 mmHg
4 01-701-1015 BASELINE PULSE 56 BEATS/MIN
5 01-701-1015 BASELINE PULSE 59 BEATS/MIN
6 01-701-1015 BASELINE PULSE 59 BEATS/MIN
7 01-701-1015 BASELINE SYSBP 130 mmHg
8 01-701-1015 BASELINE SYSBP 121 mmHg
9 01-701-1015 BASELINE SYSBP 131 mmHg
10 01-701-1015 BASELINE TEMP 97.2 F
# First, see what tests are available
vs %>%
distinct(VSTESTCD) %>%
print()# A tibble: 6 × 1
VSTESTCD
<chr>
1 DIABP
2 HEIGHT
3 PULSE
4 SYSBP
5 TEMP
6 WEIGHT
# Pivot to wide format
vs_wide <- vs %>%
filter(VISIT == "BASELINE") %>%
select(USUBJID, VSTESTCD, VSSTRESN) %>%
pivot_wider(
names_from = VSTESTCD,
values_from = VSSTRESN,
values_fn = mean # Handle duplicates
)
# Show available columns
cat("Columns after pivot:", paste(names(vs_wide), collapse = ", "), "\n\n")Columns after pivot: USUBJID, DIABP, PULSE, SYSBP, TEMP, WEIGHT
head(vs_wide)# A tibble: 6 × 6
USUBJID DIABP PULSE SYSBP TEMP WEIGHT
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01-701-1015 56 58 127. 36.2 54.4
2 01-701-1023 86 74.7 131. 36.3 80.3
3 01-701-1028 78.7 63.3 141. 37 99.3
4 01-701-1033 73.3 60.3 124. 37.1 88.4
5 01-701-1034 72.3 79.7 169. 36.9 62.6
6 01-701-1047 69 64 154. 36.4 67.1
4.2 Real Example: Calculate BMI
Now that we have Height and Weight side-by-side, we can easily calculate BMI.
# Check if HEIGHT and WEIGHT columns exist
if (all(c("HEIGHT", "WEIGHT") %in% names(vs_wide))) {
vs_bmi <- vs_wide %>%
filter(!is.na(HEIGHT) & !is.na(WEIGHT)) %>%
mutate(
# BMI = Weight (kg) / Height (m)^2
HEIGHT_M = HEIGHT / 100,
BMI = round(WEIGHT / (HEIGHT_M^2), 1)
) %>%
select(USUBJID, HEIGHT, WEIGHT, BMI)
head(vs_bmi)
# Summary statistics
vs_bmi %>%
summarise(
N = n(),
Mean_BMI = round(mean(BMI), 1),
Min_BMI = min(BMI),
Max_BMI = max(BMI)
)
} else {
# Fallback: Calculate Blood Pressure ratio as alternative example
cat("Note: HEIGHT/WEIGHT not available at BASELINE. Using BP data instead.\n\n")
vs_bp <- vs_wide %>%
filter(!is.na(SYSBP) & !is.na(DIABP)) %>%
mutate(
# Pulse Pressure = Systolic - Diastolic
PULSE_PRESSURE = SYSBP - DIABP,
# Mean Arterial Pressure ≈ DBP + 1/3(SBP - DBP)
MAP = round(DIABP + (PULSE_PRESSURE / 3), 1)
) %>%
select(USUBJID, SYSBP, DIABP, PULSE_PRESSURE, MAP)
head(vs_bp)
# Summary
vs_bp %>%
summarise(
N = n(),
Mean_MAP = round(mean(MAP, na.rm = TRUE), 1),
Mean_PP = round(mean(PULSE_PRESSURE, na.rm = TRUE), 1)
)
}Note: HEIGHT/WEIGHT not available at BASELINE. Using BP data instead.
# A tibble: 1 × 3
N Mean_MAP Mean_PP
<int> <dbl> <dbl>
1 253 97.3 60.3
4.3 From Wide to Long: pivot_longer()
Sometimes we need to convert wide data back to long format (often for plotting).
vs_long_again <- vs_wide %>%
select(USUBJID, DIABP, SYSBP, PULSE) %>%
pivot_longer(
cols = c(DIABP, SYSBP, PULSE),
names_to = "VSTESTCD",
values_to = "VSSTRESN"
) %>%
filter(!is.na(VSSTRESN))
head(vs_long_again, 10)# A tibble: 10 × 3
USUBJID VSTESTCD VSSTRESN
<chr> <chr> <dbl>
1 01-701-1015 DIABP 56
2 01-701-1015 SYSBP 127.
3 01-701-1015 PULSE 58
4 01-701-1023 DIABP 86
5 01-701-1023 SYSBP 131.
6 01-701-1023 PULSE 74.7
7 01-701-1028 DIABP 78.7
8 01-701-1028 SYSBP 141.
9 01-701-1028 PULSE 63.3
10 01-701-1033 DIABP 73.3
5 Part 2: Advanced Joins for QC
5.1 Using anti_join() for Quality Checks
anti_join(x, y) returns rows in x that are NOT in y. This is essential for finding missing or orphaned records.
# Subjects in DM (everyone enrolled)
dm_subjects <- dm %>% distinct(USUBJID)
# Subjects in AE (those with adverse events)
ae_subjects <- ae %>% distinct(USUBJID)
# Find subjects with NO adverse events
subjects_no_ae <- dm_subjects %>%
anti_join(ae_subjects, by = "USUBJID")
cat("Total subjects:", nrow(dm_subjects), "\n")Total subjects: 306
cat("Subjects with AEs:", nrow(ae_subjects), "\n")Subjects with AEs: 225
cat("Subjects with NO AEs:", nrow(subjects_no_ae), "\n")Subjects with NO AEs: 81
5.2 Finding Screen Failures
Screen failures are subjects in DM who:
- Have disposition (DS) record showing SCREEN FAILURE
- Often have no exposure (EX) records
# Subjects with exposure
ex_subjects <- ex %>% distinct(USUBJID)
# Screen failures: in DM but NOT in EX
potential_screen_fail <- dm_subjects %>%
anti_join(ex_subjects, by = "USUBJID")
cat("Subjects without exposure (potential screen failures):", nrow(potential_screen_fail), "\n")Subjects without exposure (potential screen failures): 52
# Verify against DS
screen_fail_ds <- ds %>%
filter(grepl("SCREEN", DSDECOD, ignore.case = TRUE)) %>%
distinct(USUBJID)
cat("Screen failures per DS:", nrow(screen_fail_ds), "\n")Screen failures per DS: 52
6 Part 3: Creating Analysis-Ready Datasets
Per the roadmap, we need to create 3 analysis-ready datasets.
6.1 Dataset 1: Wide VS with One Row per Subject/Visit
This dataset has vital signs organized as columns, useful for visit-level analysis.
# Dataset 1: Wide VS by Subject and Visit
analysis_vs_wide <- vs %>%
# Select key columns
select(USUBJID, VISIT, VISITNUM, VSTESTCD, VSSTRESN, VSSTRESU) %>%
# Keep standard tests
filter(VSTESTCD %in% c("SYSBP", "DIABP", "PULSE", "HEIGHT", "WEIGHT", "TEMP")) %>%
# Pivot to wide
pivot_wider(
id_cols = c(USUBJID, VISIT, VISITNUM),
names_from = VSTESTCD,
values_from = VSSTRESN,
values_fn = mean
) %>%
# Sort
arrange(USUBJID, VISITNUM)
cat("Dataset 1: Wide VS\n")Dataset 1: Wide VS
cat("Rows:", nrow(analysis_vs_wide), "\n")Rows: 2741
cat("Columns:", ncol(analysis_vs_wide), "\n\n")Columns: 9
head(analysis_vs_wide, 10)# A tibble: 10 × 9
USUBJID VISIT VISITNUM DIABP HEIGHT PULSE SYSBP TEMP WEIGHT
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01-701-1015 SCREENING 1 1 68 147. 61.3 136. 36.1 54.0
2 01-701-1015 SCREENING 2 2 66 NA 57.7 140 36.1 NA
3 01-701-1015 BASELINE 3 56 NA 58 127. 36.2 54.4
4 01-701-1015 AMBUL ECG PLACEME… 3.5 64.3 NA 62.3 137. 35.9 NA
5 01-701-1015 WEEK 2 4 53.3 NA 60.3 122. 36.5 53.1
6 01-701-1015 WEEK 4 5 57.3 NA 60 136. 36.1 54.0
7 01-701-1015 AMBUL ECG REMOVAL 6 62 NA 67 131. 36.4 NA
8 01-701-1015 WEEK 6 7 59.3 NA 56 141 36.3 53.1
9 01-701-1015 WEEK 8 8 66.7 NA 58 141. 36.7 53.1
10 01-701-1015 WEEK 12 9 64 NA 53 141 36.3 53.1
6.2 Dataset 2: DM + AE with Demographic Covariates
This dataset merges adverse events with demographic information for safety analysis.
# Dataset 2: AE with demographic covariates
analysis_ae_demo <- ae %>%
# Select key AE variables
select(USUBJID, AESEQ, AETERM, AEDECOD, AEBODSYS, AESEV, AESER, AESTDTC) %>%
# Join demographics
inner_join(
dm %>% select(USUBJID, AGE, AGEU, SEX, RACE, ARM, ACTARM, RFSTDTC),
by = "USUBJID"
) %>%
# Add derived columns
mutate(
# Age group
AGEGR1 = case_when(
AGE < 40 ~ "<40",
AGE >= 40 & AGE < 65 ~ "40-64",
AGE >= 65 ~ ">=65"
),
# Serious flag as factor
AESER_F = factor(AESER, levels = c("N", "Y"), labels = c("Non-Serious", "Serious"))
)
cat("Dataset 2: AE with Demographics\n")Dataset 2: AE with Demographics
cat("Rows:", nrow(analysis_ae_demo), "\n")Rows: 1191
cat("Columns:", ncol(analysis_ae_demo), "\n\n")Columns: 17
# Summary by age group and treatment
analysis_ae_demo %>%
group_by(ARM, AGEGR1) %>%
summarise(
N_Events = n(),
N_Subjects = n_distinct(USUBJID),
.groups = "drop"
) %>%
arrange(ARM, AGEGR1)# A tibble: 6 × 4
ARM AGEGR1 N_Events N_Subjects
<chr> <chr> <int> <int>
1 Placebo 40-64 64 11
2 Placebo >=65 237 58
3 Xanomeline High Dose 40-64 91 10
4 Xanomeline High Dose >=65 364 69
5 Xanomeline Low Dose 40-64 20 8
6 Xanomeline Low Dose >=65 415 69
6.3 Dataset 3: Screen Failure Report using Anti-Join
This dataset identifies subjects who were screened but never received treatment.
# Dataset 3: Screen Failure Report
screen_failure_report <- dm %>%
# Get basic demographics
select(USUBJID, SITEID, AGE, SEX, RACE, RFSTDTC, ARM) %>%
# Anti-join: subjects NOT in EX
anti_join(
ex %>% distinct(USUBJID),
by = "USUBJID"
) %>%
# Add disposition reason if available
left_join(
ds %>%
filter(DSCAT == "DISPOSITION EVENT") %>%
select(USUBJID, DSDECOD, DSTERM),
by = "USUBJID"
) %>%
# Sort by site
arrange(SITEID, USUBJID)
cat("Dataset 3: Screen Failure Report\n")Dataset 3: Screen Failure Report
cat("Rows (Screen Failures):", nrow(screen_failure_report), "\n\n")Rows (Screen Failures): 52
# Summary by site
screen_failure_report %>%
group_by(SITEID) %>%
summarise(
N = n(),
Reasons = paste(unique(na.omit(DSDECOD)), collapse = ", ")
) %>%
head(10)# A tibble: 10 × 3
SITEID N Reasons
<chr> <int> <chr>
1 701 10 SCREEN FAILURE
2 703 1 SCREEN FAILURE
3 705 5 SCREEN FAILURE
4 707 3 SCREEN FAILURE
5 708 7 SCREEN FAILURE
6 709 2 SCREEN FAILURE
7 710 7 SCREEN FAILURE
8 711 8 SCREEN FAILURE
9 715 4 SCREEN FAILURE
10 716 5 SCREEN FAILURE
7 Part 4: Multi-Domain Analysis
Let’s combine data from multiple domains to answer a clinical question:
Question: For subjects with Serious Adverse Events, what was their total drug exposure?
# Subjects with Serious AEs
sae_subjects <- ae %>%
filter(AESER == "Y") %>%
distinct(USUBJID, .keep_all = TRUE) %>%
select(USUBJID, AEDECOD)
# Add demographics
sae_analysis <- sae_subjects %>%
inner_join(
dm %>% select(USUBJID, AGE, SEX, ARM),
by = "USUBJID"
) %>%
# Add exposure summary
left_join(
ex %>%
group_by(USUBJID) %>%
summarise(
TOTAL_DOSE = sum(EXDOSE, na.rm = TRUE),
N_DOSES = n(),
FIRST_DOSE = min(EXSTDTC),
LAST_DOSE = max(EXENDTC),
.groups = "drop"
),
by = "USUBJID"
)
cat("Serious AE Analysis Dataset\n")Serious AE Analysis Dataset
cat("Subjects with SAEs:", nrow(sae_analysis), "\n\n")Subjects with SAEs: 3
# Display
sae_analysis %>%
select(USUBJID, AEDECOD, ARM, TOTAL_DOSE, N_DOSES) %>%
head(10)# A tibble: 3 × 5
USUBJID AEDECOD ARM TOTAL_DOSE N_DOSES
<chr> <chr> <chr> <dbl> <int>
1 01-709-1424 SYNCOPE Xano… 54 1
2 01-718-1170 SYNCOPE Xano… 108 2
3 01-718-1371 PARTIAL SEIZURES WITH SECONDARY GENERALI… Xano… 135 2
8 🎯 Practice Exercise
8.1 Your Turn: Create a Lab Analysis Dataset
Create a wide dataset from LB (Laboratory) data with:
- One row per subject and visit
- Columns for ALT, AST, BILI (Liver Function Tests)
- Joined with DM for treatment arm
data("lb", package = "pharmaversesdtm")
# TODO: Complete the following
lb_wide <- lb %>%
# Step 1: Filter for liver function tests
filter(LBTESTCD %in% c("ALT", "AST", "BILI")) %>%
# Step 2: Select key columns
select(USUBJID, VISIT, VISITNUM, LBTESTCD, LBSTRESN) %>%
# Step 3: Pivot wider
pivot_wider(
# Fill in parameters...
) %>%
# Step 4: Join with DM to get ARM
inner_join(
# Fill in...
)
# Check your results
head(lb_wide)9 Deliverable Summary
Today you completed the following:
| Task | Status |
|---|---|
| Pivoted VS from long to wide format | ✓ Done |
| Calculated BMI using pivoted data | ✓ Done |
| Dataset 1: Wide VS per Subject/Visit | ✓ Done |
| Dataset 2: DM + AE with covariates | ✓ Done |
| Dataset 3: Screen Failure Report | ✓ Done |
| Multi-domain SAE analysis | ✓ Done |
10 Key Takeaways
- Pivot Wider: Use when you need variables side-by-side for calculations.
- Pivot Longer: Use when you need to group by a variable for plotting.
- Anti-joins: Essential for finding missing, orphaned, or excluded records.
- Multi-domain Analysis: Real analysis requires combining multiple SDTM domains.
11 Resources
12 What’s Next?
In Day 6, we will introduce sdtm.oak concepts for transforming raw data to SDTM, including:
- Simulating raw EDC data
- Algorithm-based transformations
- Creating a complete SDTM LB domain with unit standardization