Day 5: Advanced Tidyverse: Pivoting & Joining

Reshaping SDTM Data for Analysis

← Back to Roadmap

1 Learning Objectives

By the end of Day 5, you will be able to:

Reshape datasets using pivot_wider() and pivot_longer()
Understand when to use Wide vs. Long formats in clinical data
Perform advanced joins (anti_join()) for Data Quality Checks
Create 3 analysis-ready datasets as specified in the roadmap

2 Introduction

2.1 Wide vs. Long Data

In clinical data, how you organize your tables (datasets) is very important. There are two main ways:

Long format: Each row is a single measurement or observation. For example, if you measure blood pressure three times for a subject, you’ll have three rows for that subject. This is the format used by SDTM because it is flexible and easy to filter or summarize.
Wide format: Each row is a subject or visit, and each measurement is in its own column. For example, you might have one row per subject, with columns for blood pressure at each visit. This format is often used for calculations or when you need to compare values side-by-side.

Format	Structure	Use Case
Long	One measurement per row	Storage, filtering, SDTM standard
Wide	All measurements in columns	Calculations (BMI, Change from Baseline)

2.1.1 Why does this matter?

Some analyses are easier in long format (like counting how many times a test was done), while others are easier in wide format (like calculating BMI from height and weight). Being able to switch between these formats is a key skill for clinical programmers.

SDTM stores data in long format (one row per observation). For analysis, we often need to reshape to wide format (one row per subject/visit).

3 Package Installation & Loading

library(dplyr)
library(tidyr)
library(pharmaversesdtm)

# Load all domains we'll use today
data("vs", package = "pharmaversesdtm")
data("dm", package = "pharmaversesdtm")
data("ae", package = "pharmaversesdtm")
data("ex", package = "pharmaversesdtm")
data("ds", package = "pharmaversesdtm")

4 Part 1: Pivoting Data

4.1 From Long to Wide: `pivot_wider()`

Vital Signs (VS) comes in long format. Each test (Height, Weight, BP) is a separate row. Let’s reshape it so each test has its own column.

# Current long format
vs %>%
  filter(USUBJID == "01-701-1015" & VISIT == "BASELINE") %>%
  select(USUBJID, VISIT, VSTESTCD, VSORRES, VSORRESU) %>%
  head(10)

# A tibble: 10 × 5
   USUBJID     VISIT    VSTESTCD VSORRES VSORRESU 
   <chr>       <chr>    <chr>    <chr>   <chr>    
 1 01-701-1015 BASELINE DIABP    56      mmHg     
 2 01-701-1015 BASELINE DIABP    51      mmHg     
 3 01-701-1015 BASELINE DIABP    61      mmHg     
 4 01-701-1015 BASELINE PULSE    56      BEATS/MIN
 5 01-701-1015 BASELINE PULSE    59      BEATS/MIN
 6 01-701-1015 BASELINE PULSE    59      BEATS/MIN
 7 01-701-1015 BASELINE SYSBP    130     mmHg     
 8 01-701-1015 BASELINE SYSBP    121     mmHg     
 9 01-701-1015 BASELINE SYSBP    131     mmHg     
10 01-701-1015 BASELINE TEMP     97.2    F

# First, see what tests are available
vs %>%
  distinct(VSTESTCD) %>%
  print()

# A tibble: 6 × 1
  VSTESTCD
  <chr>   
1 DIABP   
2 HEIGHT  
3 PULSE   
4 SYSBP   
5 TEMP    
6 WEIGHT

# Pivot to wide format
vs_wide <- vs %>%
  filter(VISIT == "BASELINE") %>%
  select(USUBJID, VSTESTCD, VSSTRESN) %>%
  pivot_wider(
    names_from = VSTESTCD,
    values_from = VSSTRESN,
    values_fn = mean  # Handle duplicates
  )

# Show available columns
cat("Columns after pivot:", paste(names(vs_wide), collapse = ", "), "\n\n")

Columns after pivot: USUBJID, DIABP, PULSE, SYSBP, TEMP, WEIGHT

head(vs_wide)

# A tibble: 6 × 6
  USUBJID     DIABP PULSE SYSBP  TEMP WEIGHT
  <chr>       <dbl> <dbl> <dbl> <dbl>  <dbl>
1 01-701-1015  56    58    127.  36.2   54.4
2 01-701-1023  86    74.7  131.  36.3   80.3
3 01-701-1028  78.7  63.3  141.  37     99.3
4 01-701-1033  73.3  60.3  124.  37.1   88.4
5 01-701-1034  72.3  79.7  169.  36.9   62.6
6 01-701-1047  69    64    154.  36.4   67.1

4.2 Real Example: Calculate BMI

Now that we have Height and Weight side-by-side, we can easily calculate BMI.

# Check if HEIGHT and WEIGHT columns exist
if (all(c("HEIGHT", "WEIGHT") %in% names(vs_wide))) {
  vs_bmi <- vs_wide %>%
    filter(!is.na(HEIGHT) & !is.na(WEIGHT)) %>%
    mutate(
      # BMI = Weight (kg) / Height (m)^2
      HEIGHT_M = HEIGHT / 100,
      BMI = round(WEIGHT / (HEIGHT_M^2), 1)
    ) %>%
    select(USUBJID, HEIGHT, WEIGHT, BMI)
  
  head(vs_bmi)
  
  # Summary statistics
  vs_bmi %>%
    summarise(
      N = n(),
      Mean_BMI = round(mean(BMI), 1),
      Min_BMI = min(BMI),
      Max_BMI = max(BMI)
    )
} else {
  # Fallback: Calculate Blood Pressure ratio as alternative example
  cat("Note: HEIGHT/WEIGHT not available at BASELINE. Using BP data instead.\n\n")
  
  vs_bp <- vs_wide %>%
    filter(!is.na(SYSBP) & !is.na(DIABP)) %>%
    mutate(
      # Pulse Pressure = Systolic - Diastolic
      PULSE_PRESSURE = SYSBP - DIABP,
      # Mean Arterial Pressure ≈ DBP + 1/3(SBP - DBP)
      MAP = round(DIABP + (PULSE_PRESSURE / 3), 1)
    ) %>%
    select(USUBJID, SYSBP, DIABP, PULSE_PRESSURE, MAP)
  
  head(vs_bp)
  
  # Summary
  vs_bp %>%
    summarise(
      N = n(),
      Mean_MAP = round(mean(MAP, na.rm = TRUE), 1),
      Mean_PP = round(mean(PULSE_PRESSURE, na.rm = TRUE), 1)
    )
}

Note: HEIGHT/WEIGHT not available at BASELINE. Using BP data instead.

# A tibble: 1 × 3
      N Mean_MAP Mean_PP
  <int>    <dbl>   <dbl>
1   253     97.3    60.3

4.3 From Wide to Long: `pivot_longer()`

Sometimes we need to convert wide data back to long format (often for plotting).

vs_long_again <- vs_wide %>%
  select(USUBJID, DIABP, SYSBP, PULSE) %>%
  pivot_longer(
    cols = c(DIABP, SYSBP, PULSE),
    names_to = "VSTESTCD",
    values_to = "VSSTRESN"
  ) %>%
  filter(!is.na(VSSTRESN))

head(vs_long_again, 10)

# A tibble: 10 × 3
   USUBJID     VSTESTCD VSSTRESN
   <chr>       <chr>       <dbl>
 1 01-701-1015 DIABP        56  
 2 01-701-1015 SYSBP       127. 
 3 01-701-1015 PULSE        58  
 4 01-701-1023 DIABP        86  
 5 01-701-1023 SYSBP       131. 
 6 01-701-1023 PULSE        74.7
 7 01-701-1028 DIABP        78.7
 8 01-701-1028 SYSBP       141. 
 9 01-701-1028 PULSE        63.3
10 01-701-1033 DIABP        73.3

5 Part 2: Advanced Joins for QC

5.1 Using `anti_join()` for Quality Checks

anti_join(x, y) returns rows in x that are NOT in y. This is essential for finding missing or orphaned records.

# Subjects in DM (everyone enrolled)
dm_subjects <- dm %>% distinct(USUBJID)

# Subjects in AE (those with adverse events)
ae_subjects <- ae %>% distinct(USUBJID)

# Find subjects with NO adverse events
subjects_no_ae <- dm_subjects %>%
  anti_join(ae_subjects, by = "USUBJID")

cat("Total subjects:", nrow(dm_subjects), "\n")

Total subjects: 306

cat("Subjects with AEs:", nrow(ae_subjects), "\n")

Subjects with AEs: 225

cat("Subjects with NO AEs:", nrow(subjects_no_ae), "\n")

Subjects with NO AEs: 81

5.2 Finding Screen Failures

Screen failures are subjects in DM who:

Have disposition (DS) record showing SCREEN FAILURE
Often have no exposure (EX) records

# Subjects with exposure
ex_subjects <- ex %>% distinct(USUBJID)

# Screen failures: in DM but NOT in EX
potential_screen_fail <- dm_subjects %>%
  anti_join(ex_subjects, by = "USUBJID")

cat("Subjects without exposure (potential screen failures):", nrow(potential_screen_fail), "\n")

Subjects without exposure (potential screen failures): 52

# Verify against DS
screen_fail_ds <- ds %>%
  filter(grepl("SCREEN", DSDECOD, ignore.case = TRUE)) %>%
  distinct(USUBJID)

cat("Screen failures per DS:", nrow(screen_fail_ds), "\n")

Screen failures per DS: 52

6 Part 3: Creating Analysis-Ready Datasets

Per the roadmap, we need to create 3 analysis-ready datasets.

6.1 Dataset 1: Wide VS with One Row per Subject/Visit

This dataset has vital signs organized as columns, useful for visit-level analysis.

# Dataset 1: Wide VS by Subject and Visit
analysis_vs_wide <- vs %>%
  # Select key columns
  select(USUBJID, VISIT, VISITNUM, VSTESTCD, VSSTRESN, VSSTRESU) %>%
  # Keep standard tests
  filter(VSTESTCD %in% c("SYSBP", "DIABP", "PULSE", "HEIGHT", "WEIGHT", "TEMP")) %>%
  # Pivot to wide
  pivot_wider(
    id_cols = c(USUBJID, VISIT, VISITNUM),
    names_from = VSTESTCD,
    values_from = VSSTRESN,
    values_fn = mean
  ) %>%
  # Sort
  arrange(USUBJID, VISITNUM)

cat("Dataset 1: Wide VS\n")

Dataset 1: Wide VS

cat("Rows:", nrow(analysis_vs_wide), "\n")

Rows: 2741

cat("Columns:", ncol(analysis_vs_wide), "\n\n")

Columns: 9

head(analysis_vs_wide, 10)

# A tibble: 10 × 9
   USUBJID     VISIT              VISITNUM DIABP HEIGHT PULSE SYSBP  TEMP WEIGHT
   <chr>       <chr>                 <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 01-701-1015 SCREENING 1             1    68     147.  61.3  136.  36.1   54.0
 2 01-701-1015 SCREENING 2             2    66      NA   57.7  140   36.1   NA  
 3 01-701-1015 BASELINE                3    56      NA   58    127.  36.2   54.4
 4 01-701-1015 AMBUL ECG PLACEME…      3.5  64.3    NA   62.3  137.  35.9   NA  
 5 01-701-1015 WEEK 2                  4    53.3    NA   60.3  122.  36.5   53.1
 6 01-701-1015 WEEK 4                  5    57.3    NA   60    136.  36.1   54.0
 7 01-701-1015 AMBUL ECG REMOVAL       6    62      NA   67    131.  36.4   NA  
 8 01-701-1015 WEEK 6                  7    59.3    NA   56    141   36.3   53.1
 9 01-701-1015 WEEK 8                  8    66.7    NA   58    141.  36.7   53.1
10 01-701-1015 WEEK 12                 9    64      NA   53    141   36.3   53.1

6.2 Dataset 2: DM + AE with Demographic Covariates

This dataset merges adverse events with demographic information for safety analysis.

# Dataset 2: AE with demographic covariates
analysis_ae_demo <- ae %>%
  # Select key AE variables
  select(USUBJID, AESEQ, AETERM, AEDECOD, AEBODSYS, AESEV, AESER, AESTDTC) %>%
  # Join demographics
  inner_join(
    dm %>% select(USUBJID, AGE, AGEU, SEX, RACE, ARM, ACTARM, RFSTDTC),
    by = "USUBJID"
  ) %>%
  # Add derived columns
  mutate(
    # Age group
    AGEGR1 = case_when(
      AGE < 40 ~ "<40",
      AGE >= 40 & AGE < 65 ~ "40-64",
      AGE >= 65 ~ ">=65"
    ),
    # Serious flag as factor
    AESER_F = factor(AESER, levels = c("N", "Y"), labels = c("Non-Serious", "Serious"))
  )

cat("Dataset 2: AE with Demographics\n")

Dataset 2: AE with Demographics

cat("Rows:", nrow(analysis_ae_demo), "\n")

Rows: 1191

cat("Columns:", ncol(analysis_ae_demo), "\n\n")

Columns: 17

# Summary by age group and treatment
analysis_ae_demo %>%
  group_by(ARM, AGEGR1) %>%
  summarise(
    N_Events = n(),
    N_Subjects = n_distinct(USUBJID),
    .groups = "drop"
  ) %>%
  arrange(ARM, AGEGR1)

# A tibble: 6 × 4
  ARM                  AGEGR1 N_Events N_Subjects
  <chr>                <chr>     <int>      <int>
1 Placebo              40-64        64         11
2 Placebo              >=65        237         58
3 Xanomeline High Dose 40-64        91         10
4 Xanomeline High Dose >=65        364         69
5 Xanomeline Low Dose  40-64        20          8
6 Xanomeline Low Dose  >=65        415         69

6.3 Dataset 3: Screen Failure Report using Anti-Join

This dataset identifies subjects who were screened but never received treatment.

# Dataset 3: Screen Failure Report
screen_failure_report <- dm %>%
  # Get basic demographics
  select(USUBJID, SITEID, AGE, SEX, RACE, RFSTDTC, ARM) %>%
  # Anti-join: subjects NOT in EX
  anti_join(
    ex %>% distinct(USUBJID),
    by = "USUBJID"
  ) %>%
  # Add disposition reason if available
  left_join(
    ds %>%
      filter(DSCAT == "DISPOSITION EVENT") %>%
      select(USUBJID, DSDECOD, DSTERM),
    by = "USUBJID"
  ) %>%
  # Sort by site
  arrange(SITEID, USUBJID)

cat("Dataset 3: Screen Failure Report\n")

Dataset 3: Screen Failure Report

cat("Rows (Screen Failures):", nrow(screen_failure_report), "\n\n")

Rows (Screen Failures): 52

# Summary by site
screen_failure_report %>%
  group_by(SITEID) %>%
  summarise(
    N = n(),
    Reasons = paste(unique(na.omit(DSDECOD)), collapse = ", ")
  ) %>%
  head(10)

# A tibble: 10 × 3
   SITEID     N Reasons       
   <chr>  <int> <chr>         
 1 701       10 SCREEN FAILURE
 2 703        1 SCREEN FAILURE
 3 705        5 SCREEN FAILURE
 4 707        3 SCREEN FAILURE
 5 708        7 SCREEN FAILURE
 6 709        2 SCREEN FAILURE
 7 710        7 SCREEN FAILURE
 8 711        8 SCREEN FAILURE
 9 715        4 SCREEN FAILURE
10 716        5 SCREEN FAILURE

7 Part 4: Multi-Domain Analysis

Let’s combine data from multiple domains to answer a clinical question:

Question: For subjects with Serious Adverse Events, what was their total drug exposure?

# Subjects with Serious AEs
sae_subjects <- ae %>%
  filter(AESER == "Y") %>%
  distinct(USUBJID, .keep_all = TRUE) %>%
  select(USUBJID, AEDECOD)

# Add demographics
sae_analysis <- sae_subjects %>%
  inner_join(
    dm %>% select(USUBJID, AGE, SEX, ARM),
    by = "USUBJID"
  ) %>%
  # Add exposure summary
  left_join(
    ex %>%
      group_by(USUBJID) %>%
      summarise(
        TOTAL_DOSE = sum(EXDOSE, na.rm = TRUE),
        N_DOSES = n(),
        FIRST_DOSE = min(EXSTDTC),
        LAST_DOSE = max(EXENDTC),
        .groups = "drop"
      ),
    by = "USUBJID"
  )

cat("Serious AE Analysis Dataset\n")

Serious AE Analysis Dataset

cat("Subjects with SAEs:", nrow(sae_analysis), "\n\n")

Subjects with SAEs: 3

# Display
sae_analysis %>%
  select(USUBJID, AEDECOD, ARM, TOTAL_DOSE, N_DOSES) %>%
  head(10)

# A tibble: 3 × 5
  USUBJID     AEDECOD                                   ARM   TOTAL_DOSE N_DOSES
  <chr>       <chr>                                     <chr>      <dbl>   <int>
1 01-709-1424 SYNCOPE                                   Xano…         54       1
2 01-718-1170 SYNCOPE                                   Xano…        108       2
3 01-718-1371 PARTIAL SEIZURES WITH SECONDARY GENERALI… Xano…        135       2

8 🎯 Practice Exercise

8.1 Your Turn: Create a Lab Analysis Dataset

Create a wide dataset from LB (Laboratory) data with:

One row per subject and visit
Columns for ALT, AST, BILI (Liver Function Tests)
Joined with DM for treatment arm

data("lb", package = "pharmaversesdtm")

# TODO: Complete the following
lb_wide <- lb %>%
  # Step 1: Filter for liver function tests
  filter(LBTESTCD %in% c("ALT", "AST", "BILI")) %>%
  # Step 2: Select key columns
  select(USUBJID, VISIT, VISITNUM, LBTESTCD, LBSTRESN) %>%
  # Step 3: Pivot wider
  pivot_wider(
    # Fill in parameters...
  ) %>%
  # Step 4: Join with DM to get ARM
  inner_join(
    # Fill in...
  )

# Check your results
head(lb_wide)

9 Deliverable Summary

Today you completed the following:

Task	Status
Pivoted VS from long to wide format	✓ Done
Calculated BMI using pivoted data	✓ Done
Dataset 1: Wide VS per Subject/Visit	✓ Done
Dataset 2: DM + AE with covariates	✓ Done
Dataset 3: Screen Failure Report	✓ Done
Multi-domain SAE analysis	✓ Done

10 Key Takeaways

Pivot Wider: Use when you need variables side-by-side for calculations.
Pivot Longer: Use when you need to group by a variable for plotting.
Anti-joins: Essential for finding missing, orphaned, or excluded records.
Multi-domain Analysis: Real analysis requires combining multiple SDTM domains.

11 Resources

12 What’s Next?

In Day 6, we will introduce sdtm.oak concepts for transforming raw data to SDTM, including:

Simulating raw EDC data
Algorithm-based transformations
Creating a complete SDTM LB domain with unit standardization