30 Days of Pharmaverse
  • Week 1: SDTM Fundamentals
  • Week 2: Production SDTM
  • Week 3: ADaM Deep Dive
  • Week 4: Tables, Listings and Figures
  1. Day 2: SDTM Domain Structure & Tidyverse Mastery
  • Day 1: Environment Setup & First SDTM Code
  • Day 2: SDTM Domain Structure & Tidyverse Mastery
  • Day 3: Controlled Terminology & MedDRA Coding
  • Day 4: Clinical Date Derivations with lubridate
  • Day 5: Advanced Tidyverse: Pivoting & Joining
  • Day 6: Introduction to sdtm.oak
  • Day 7: Week 1 Capstone: End-to-End SDTM Script

On this page

  • 1 Learning Objectives
  • 2 Introduction
  • 3 Package Installation & Loading
    • 3.1 Required Packages
    • 3.2 Load Packages
    • 3.3 Load SDTM Datasets
  • 4 SDTM Domain Classes
    • 4.1 The Four Domain Classes
    • 4.2 Special Purpose Class: DM (Demographics)
    • 4.3 Events Class: AE (Adverse Events)
    • 4.4 Interventions Class: EX (Exposure)
    • 4.5 Findings Class: LB (Laboratory)
  • 5 SDTM Variable Naming Conventions
    • 5.1 The Two-Character Domain Prefix
    • 5.2 Common Variable Suffixes
    • 5.3 Understanding the Pattern: Findings Class
  • 6 Tidyverse Mastery: 10 Essential Transformations
    • 6.1 Transformation 1: filter() with Multiple Conditions
    • 6.2 Transformation 2: filter() with %in% for Multiple Values
    • 6.3 Transformation 3: select() with Helper Functions
    • 6.4 Transformation 4: mutate() with case_when()
    • 6.5 Transformation 5: mutate() with across()
    • 6.6 Transformation 6: group_by() + summarise() - AE Summary by SOC
    • 6.7 Transformation 7: group_by() + summarise() - By Treatment Arm
    • 6.8 Transformation 8: pivot_wider() - Reshape Lab Data
    • 6.9 Transformation 9: pivot_longer() - Reshape Wide to Long
    • 6.10 Transformation 10: left_join() + anti_join() - Data Quality Check
  • 7 Advanced: Combining Multiple Operations
  • 8 Day 2 Deliverable Summary
  • 9 Key Takeaways
  • 10 Resources
  • 11 What’s Next?

Day 2: SDTM Domain Structure & Tidyverse Mastery

Understanding SDTM Domain Classes Through Code

← Back to Roadmap

1 Learning Objectives

By the end of Day 2, you will be able to:

  1. Understand the four SDTM domain classes and their purposes
  2. Recognize and interpret SDTM variable naming conventions
  3. Apply advanced tidyverse operations to clinical data
  4. Create meaningful summaries of clinical trial data
  5. Work fluently with dplyr and tidyr on pharmaverse datasets

2 Introduction

Day 2 builds on the foundations from Day 1. Now that you can load and explore SDTM datasets, we’ll dive deeper into:

  • SDTM Domain Classes: Understanding why domains are organized the way they are
  • Variable Naming Conventions: Decoding the patterns like --TESTCD, --ORRES, --DY
  • Tidyverse Mastery: This is 50% of what you’ll do daily as a clinical programmer

The pharmaverse packages are built on tidyverse principles, so mastering dplyr and tidyr is essential for working with admiral, sdtm.oak, and other clinical programming tools.


3 Package Installation & Loading

3.1 Required Packages

Package Purpose
dplyr Data manipulation (filter, select, mutate, etc.)
tidyr Data reshaping (pivot_wider, pivot_longer, etc.)
pharmaversesdtm Example SDTM datasets for learning

3.2 Load Packages

# Load required packages
library(dplyr)
library(tidyr)
library(pharmaversesdtm)

3.3 Load SDTM Datasets

We’ll work with multiple domains today to understand their structures and relationships.

# Load all domains we'll use today
data("dm", package = "pharmaversesdtm")
data("ae", package = "pharmaversesdtm")
data("ex", package = "pharmaversesdtm")
data("lb", package = "pharmaversesdtm")
data("vs", package = "pharmaversesdtm")
data("cm", package = "pharmaversesdtm")
data("ds", package = "pharmaversesdtm")

4 SDTM Domain Classes

SDTM organizes clinical data into four domain classes, each designed for a specific type of information. Understanding these classes helps you know what to expect from each domain. Here’s a simple breakdown:

4.1 The Four Domain Classes

Class Purpose Example Domains Key Characteristics
Special Purpose Subject-level information DM, CO, SE, SV Usually one row per subject
Events Discrete occurrences AE, DS, MH, DV Multiple rows per subject, start/end
Interventions Treatments administered EX, CM, EC, SU Dosing info, frequencies, routes
Findings Collected measurements/tests LB, VS, EG, PE Test codes, results, units, standards

4.1.1 What do these mean?

  • Special Purpose: These domains (like DM for Demographics) are the backbone of your study. They contain key information about each subject, such as their ID, age, sex, and treatment group. Usually, there is one row per subject.
  • Events: These domains (like AE for Adverse Events) record things that happen to a subject, such as getting sick or dropping out of the study. There can be many events per subject, so you’ll see multiple rows for the same person.
  • Interventions: These domains (like EX for Exposure) track what is done to a subject, such as giving them a drug or a procedure. Each row is a record of a treatment or intervention.
  • Findings: These domains (like LB for Laboratory) store results of tests or measurements taken from the subject, such as blood pressure or lab results. Each row is a single measurement.

4.2 Special Purpose Class: DM (Demographics)

The Demographics domain is the anchor of all SDTM data. It contains one row per subject and provides the reference information that other domains link to. For example, if you want to know a subject’s age or which treatment group they are in, you look in DM.

# DM structure - note the one row per subject
dm %>%
  dplyr::select(STUDYID, USUBJID, SUBJID, RFSTDTC, RFENDTC, ARM, ACTARM, AGE, SEX, RACE) %>%
  utils::head(5)
# A tibble: 5 × 10
  STUDYID      USUBJID     SUBJID RFSTDTC RFENDTC ARM   ACTARM   AGE SEX   RACE 
  <chr>        <chr>       <chr>  <chr>   <chr>   <chr> <chr>  <dbl> <chr> <chr>
1 CDISCPILOT01 01-701-1015 1015   2014-0… 2014-0… Plac… Place…    63 F     WHITE
2 CDISCPILOT01 01-701-1023 1023   2012-0… 2012-0… Plac… Place…    64 M     WHITE
3 CDISCPILOT01 01-701-1028 1028   2013-0… 2014-0… Xano… Xanom…    71 M     WHITE
4 CDISCPILOT01 01-701-1033 1033   2014-0… 2014-0… Xano… Xanom…    74 M     WHITE
5 CDISCPILOT01 01-701-1034 1034   2014-0… 2014-1… Xano… Xanom…    77 F     WHITE
NoteKey DM Variables
  • STUDYID: Study Identifier (the name/code of the study)
  • USUBJID: Unique Subject Identifier (a unique code for each subject, used to link all domains)
  • RFSTDTC: Reference Start Date/Time (usually the date of first dose)
  • RFENDTC: Reference End Date/Time (when the subject finished or left the study)
  • ARM/ACTARM: Planned/Actual Treatment Arm (which group the subject was assigned to)

4.3 Events Class: AE (Adverse Events)

Events domains capture discrete occurrences that happen to subjects. Each event has a start and potentially an end. For example, if a subject gets a headache during the study, it is recorded as an adverse event (AE).

# AE structure - multiple rows per subject
ae %>%
  dplyr::select(USUBJID, AETERM, AEDECOD, AEBODSYS, AESEV, AESER, AESTDTC, AEENDTC) %>%
  utils::head(10)
# A tibble: 10 × 8
   USUBJID     AETERM               AEDECOD AEBODSYS AESEV AESER AESTDTC AEENDTC
   <chr>       <chr>                <chr>   <chr>    <chr> <chr> <chr>   <chr>  
 1 01-701-1015 APPLICATION SITE ER… APPLIC… GENERAL… MILD  N     2014-0… <NA>   
 2 01-701-1015 APPLICATION SITE PR… APPLIC… GENERAL… MILD  N     2014-0… <NA>   
 3 01-701-1015 DIARRHOEA            DIARRH… GASTROI… MILD  N     2014-0… 2014-0…
 4 01-701-1023 ATRIOVENTRICULAR BL… ATRIOV… CARDIAC… MILD  N     2012-0… <NA>   
 5 01-701-1023 ERYTHEMA             ERYTHE… SKIN AN… MILD  N     2012-0… 2012-0…
 6 01-701-1023 ERYTHEMA             ERYTHE… SKIN AN… MODE… N     2012-0… <NA>   
 7 01-701-1023 ERYTHEMA             ERYTHE… SKIN AN… MILD  N     2012-0… 2012-0…
 8 01-701-1028 APPLICATION SITE ER… APPLIC… GENERAL… MILD  N     2013-0… <NA>   
 9 01-701-1028 APPLICATION SITE PR… APPLIC… GENERAL… MILD  N     2013-0… <NA>   
10 01-701-1034 APPLICATION SITE PR… APPLIC… GENERAL… MILD  N     2014-0… <NA>   
NoteKey AE Variables
  • AETERM: The verbatim term reported by the investigator (e.g., “Headache”)
  • AEDECOD: The standardized (coded) term (e.g., “HEADACHE”)
  • AEBODSYS: The body system affected (e.g., “Nervous system disorders”)
  • AESEV: Severity (e.g., Mild, Moderate, Severe)
  • AESER: Seriousness (Yes/No)
  • AESTDTC/AEENDTC: Start and end dates of the event

4.4 Interventions Class: EX (Exposure)

Interventions domains track treatments given to subjects. They include dosing information, routes, and frequencies. For example, every time a subject receives a dose of study drug, it is recorded in EX.

# EX structure - dosing records
ex %>%
  dplyr::select(USUBJID, EXTRT, EXDOSE, EXDOSU, EXROUTE, EXSTDTC, EXENDTC) %>%
  utils::head(10)
# A tibble: 10 × 7
   USUBJID     EXTRT      EXDOSE EXDOSU EXROUTE     EXSTDTC    EXENDTC   
   <chr>       <chr>       <dbl> <chr>  <chr>       <chr>      <chr>     
 1 01-701-1015 PLACEBO         0 mg     TRANSDERMAL 2014-01-02 2014-01-16
 2 01-701-1015 PLACEBO         0 mg     TRANSDERMAL 2014-01-17 2014-06-18
 3 01-701-1015 PLACEBO         0 mg     TRANSDERMAL 2014-06-19 2014-07-02
 4 01-701-1023 PLACEBO         0 mg     TRANSDERMAL 2012-08-05 2012-08-27
 5 01-701-1023 PLACEBO         0 mg     TRANSDERMAL 2012-08-28 2012-09-01
 6 01-701-1028 XANOMELINE     54 mg     TRANSDERMAL 2013-07-19 2013-08-01
 7 01-701-1028 XANOMELINE     81 mg     TRANSDERMAL 2013-08-02 2014-01-06
 8 01-701-1028 XANOMELINE     54 mg     TRANSDERMAL 2014-01-07 2014-01-14
 9 01-701-1033 XANOMELINE     54 mg     TRANSDERMAL 2014-03-18 2014-03-31
10 01-701-1034 XANOMELINE     54 mg     TRANSDERMAL 2014-07-01 2014-07-15
NoteKey EX Variables
  • EXTRT: Name of the treatment (e.g., “Drug A”)
  • EXDOSE: Dose amount
  • EXDOSU: Dose units (e.g., mg)
  • EXROUTE: Route of administration (e.g., Oral)
  • EXSTDTC/EXENDTC: Start and end dates of dosing

4.5 Findings Class: LB (Laboratory)

Findings domains contain collected measurements. They have a standardized structure with test codes, results, units, and reference ranges. For example, each blood test result is a row in LB.

# LB structure - lab results with standardization
lb %>%
  dplyr::select(USUBJID, LBTESTCD, LBTEST, LBORRES, LBORRESU, LBSTRESN, LBSTRESU) %>%
  utils::head(10)
# A tibble: 10 × 7
   USUBJID     LBTESTCD LBTEST  LBORRES LBORRESU LBSTRESN LBSTRESU
   <chr>       <chr>    <chr>   <chr>   <chr>       <dbl> <chr>   
 1 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
 2 01-701-1015 ALB      Albumin 3.9     g/dL           39 g/L     
 3 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
 4 01-701-1015 ALB      Albumin 3.7     g/dL           37 g/L     
 5 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
 6 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
 7 01-701-1015 ALB      Albumin 3.7     g/dL           37 g/L     
 8 01-701-1015 ALB      Albumin 3.7     g/dL           37 g/L     
 9 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
10 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
NoteKey LB Variables
  • LBTESTCD: Short test code (e.g., “GLUC”)
  • LBTEST: Full test name (e.g., “Glucose”)
  • LBORRES: Original result (as collected)
  • LBORRESU: Original result units (e.g., mg/dL)
  • LBSTRESN: Standardized numeric result
  • LBSTRESU: Standardized units (e.g., mmol/L)

5 SDTM Variable Naming Conventions

SDTM uses consistent naming patterns across all domains. Understanding these patterns helps you quickly identify variable purposes without documentation.

5.1 The Two-Character Domain Prefix

Every domain-specific variable starts with a two-character prefix that identifies the domain:

Prefix Domain Example Variables
DM Demographics DMDTC, DMDY
AE Adverse Events AESTDTC, AEENDTC, AESEV
EX Exposure EXSTDTC, EXDOSE, EXROUTE
LB Laboratory LBTESTCD, LBORRES, LBSTRESN
VS Vital Signs VSTESTCD, VSORRES, VSSTRESN
CM Concomitant Meds CMTRT, CMDOSE, CMSTDTC

5.2 Common Variable Suffixes

The suffix tells you what type of information the variable contains:

Suffix Meaning Examples
--TERM Reported term (verbatim) AETERM, CMTRT
--DECOD Decoded/standardized term AEDECOD, CMDECOD
--TESTCD Test code (short) LBTESTCD, VSTESTCD
--TEST Test name (full) LBTEST, VSTEST
--ORRES Original result (as collected) LBORRES, VSORRES
--ORRESU Original result units LBORRESU, VSORRESU
--STRESN Standardized result (numeric) LBSTRESN, VSSTRESN
--STRESU Standardized result units LBSTRESU, VSSTRESU
--STDTC Start date/time AESTDTC, EXSTDTC, CMSTDTC
--ENDTC End date/time AEENDTC, EXENDTC, CMENDTC
--DY Study day AESTDY, AEENDY, EXSTDY
--SEV Severity AESEV
--SER Serious (Y/N) AESER
--REL Relationship/Causality AEREL

5.3 Understanding the Pattern: Findings Class

Let’s examine how the naming convention works in the Findings class (LB, VS):

# The Findings class pattern in LB
lb %>%
  dplyr::select(
    USUBJID,
    LBTESTCD,   # Test code (short identifier)
    LBTEST,     # Test name (full description)
    LBORRES,    # Original result (as collected)
    LBORRESU,   # Original units
    LBSTRESC,   # Standardized result (character)
    LBSTRESN,   # Standardized result (numeric)
    LBSTRESU    # Standardized units
  ) %>%
  dplyr::filter(LBTESTCD == "ALT") %>%
  utils::head(5)
# A tibble: 5 × 8
  USUBJID     LBTESTCD LBTEST        LBORRES LBORRESU LBSTRESC LBSTRESN LBSTRESU
  <chr>       <chr>    <chr>         <chr>   <chr>    <chr>       <dbl> <chr>   
1 01-701-1015 ALT      Alanine Amin… 27      U/L      27             27 U/L     
2 01-701-1015 ALT      Alanine Amin… 41      U/L      41             41 U/L     
3 01-701-1015 ALT      Alanine Amin… 18      U/L      18             18 U/L     
4 01-701-1015 ALT      Alanine Amin… 26      U/L      26             26 U/L     
5 01-701-1015 ALT      Alanine Amin… 22      U/L      22             22 U/L     
TipWhy Two Result Variables?
  • ORRES/ORRESU: Original result as collected (might be “< 5”, “>100”, or character values)
  • STRESN/STRESU: Standardized numeric result for analysis (handles unit conversions)

This separation preserves the original data while providing analysis-ready values.


6 Tidyverse Mastery: 10 Essential Transformations

The deliverable for Day 2 is to demonstrate 10 tidyverse transformations. Let’s work through them systematically.

6.1 Transformation 1: filter() with Multiple Conditions

Filter AEs that are both serious AND severe.

# Complex filtering: Serious AND Severe AEs
serious_severe_ae <- ae %>%
  dplyr::filter(AESER == "Y" & AESEV == "SEVERE")

cat("Total AEs:", nrow(ae), "\n")
Total AEs: 1191 
cat("Serious + Severe AEs:", nrow(serious_severe_ae), "\n")
Serious + Severe AEs: 2 
# Preview
serious_severe_ae %>%
  dplyr::select(USUBJID, AEDECOD, AESEV, AESER) %>%
  utils::head()
# A tibble: 2 × 4
  USUBJID     AEDECOD                                        AESEV  AESER
  <chr>       <chr>                                          <chr>  <chr>
1 01-718-1170 SYNCOPE                                        SEVERE Y    
2 01-718-1371 PARTIAL SEIZURES WITH SECONDARY GENERALISATION SEVERE Y    

6.2 Transformation 2: filter() with %in% for Multiple Values

Filter specific test codes from the lab data.

# Filter for liver function tests
liver_tests <- c("ALT", "AST", "BILI", "ALP")

lft_data <- lb %>%
  dplyr::filter(LBTESTCD %in% liver_tests)

cat("Total LB records:", nrow(lb), "\n")
Total LB records: 59580 
cat("Liver function tests:", nrow(lft_data), "\n")
Liver function tests: 7266 
# Count by test
lft_data %>%
  dplyr::count(LBTESTCD, LBTEST)
# A tibble: 4 × 3
  LBTESTCD LBTEST                         n
  <chr>    <chr>                      <int>
1 ALP      Alkaline Phosphatase        1824
2 ALT      Alanine Aminotransferase    1814
3 AST      Aspartate Aminotransferase  1814
4 BILI     Bilirubin                   1814

6.3 Transformation 3: select() with Helper Functions

Use select helpers to choose columns by pattern.

# Select all date columns from AE
ae_dates <- ae %>%
  dplyr::select(USUBJID, dplyr::ends_with("DTC"))

names(ae_dates)
[1] "USUBJID" "AEDTC"   "AESTDTC" "AEENDTC"
# Select all result columns from LB
lb_results <- lb %>%
  dplyr::select(USUBJID, LBTESTCD, dplyr::contains("RES"))

names(lb_results)
[1] "USUBJID"  "LBTESTCD" "LBORRES"  "LBORRESU" "LBSTRESC" "LBSTRESN" "LBSTRESU"

6.4 Transformation 4: mutate() with case_when()

Create a severity score based on AE severity.

# Create numeric severity score
ae_scored <- ae %>%
  dplyr::mutate(
    AESEV_N = dplyr::case_when(
      AESEV == "MILD"     ~ 1L,
      AESEV == "MODERATE" ~ 2L,
      AESEV == "SEVERE"   ~ 3L,
      TRUE                ~ NA_integer_
    )
  )

# Check the mapping
ae_scored %>%
  dplyr::distinct(AESEV, AESEV_N) %>%
  dplyr::arrange(AESEV_N)
# A tibble: 3 × 2
  AESEV    AESEV_N
  <chr>      <int>
1 MILD           1
2 MODERATE       2
3 SEVERE         3

6.5 Transformation 5: mutate() with across()

Apply transformations to multiple columns at once.

# Convert all character date columns to uppercase (demonstration)
ae_upper <- ae %>%
  dplyr::mutate(
    dplyr::across(
      dplyr::where(is.character),
      ~ toupper(.x)
    )
  )

# This is useful for standardizing text fields
ae_upper %>%
  dplyr::select(USUBJID, AETERM, AEDECOD) %>%
  utils::head(3)
# A tibble: 3 × 3
  USUBJID     AETERM                    AEDECOD                  
  <chr>       <chr>                     <chr>                    
1 01-701-1015 APPLICATION SITE ERYTHEMA APPLICATION SITE ERYTHEMA
2 01-701-1015 APPLICATION SITE PRURITUS APPLICATION SITE PRURITUS
3 01-701-1015 DIARRHOEA                 DIARRHOEA                

6.6 Transformation 6: group_by() + summarise() - AE Summary by SOC

This is the key deliverable: Create an AE summary by System Organ Class.

# AE Summary by System Organ Class (SOC)
ae_soc_summary <- ae %>%
  dplyr::group_by(AEBODSYS) %>%
  dplyr::summarise(
    n_events = dplyr::n(),
    n_subjects = dplyr::n_distinct(USUBJID),
    n_serious = sum(AESER == "Y", na.rm = TRUE),
    n_severe = sum(AESEV == "SEVERE", na.rm = TRUE),
    .groups = "drop"
  ) %>%
  dplyr::arrange(dplyr::desc(n_events))

# Display the summary
ae_soc_summary
# A tibble: 23 × 5
   AEBODSYS                               n_events n_subjects n_serious n_severe
   <chr>                                     <int>      <int>     <int>    <int>
 1 GENERAL DISORDERS AND ADMINISTRATION …      292        108         0       10
 2 SKIN AND SUBCUTANEOUS TISSUE DISORDERS      276        105         0        6
 3 NERVOUS SYSTEM DISORDERS                    101         59         3       10
 4 CARDIAC DISORDERS                            91         44         0        4
 5 GASTROINTESTINAL DISORDERS                   87         53         0        2
 6 INFECTIONS AND INFESTATIONS                  73         39         0        1
 7 RESPIRATORY, THORACIC AND MEDIASTINAL…       53         30         0        0
 8 INVESTIGATIONS                               40         23         0        0
 9 PSYCHIATRIC DISORDERS                        40         29         0        2
10 INJURY, POISONING AND PROCEDURAL COMP…       29         14         0        3
# ℹ 13 more rows

6.7 Transformation 7: group_by() + summarise() - By Treatment Arm

Summarize by treatment arm using joined data.

# First, join AE with DM to get treatment arm
ae_with_arm <- ae %>%
  dplyr::inner_join(
    dm %>% dplyr::select(USUBJID, ARM),
    by = "USUBJID"
  )

# Now summarize by treatment arm
ae_arm_summary <- ae_with_arm %>%
  dplyr::group_by(ARM) %>%
  dplyr::summarise(
    n_subjects_with_ae = dplyr::n_distinct(USUBJID),
    total_aes = dplyr::n(),
    mean_aes_per_subject = round(dplyr::n() / dplyr::n_distinct(USUBJID), 1),
    .groups = "drop"
  )

ae_arm_summary
# A tibble: 3 × 4
  ARM                  n_subjects_with_ae total_aes mean_aes_per_subject
  <chr>                             <int>     <int>                <dbl>
1 Placebo                              69       301                  4.4
2 Xanomeline High Dose                 79       455                  5.8
3 Xanomeline Low Dose                  77       435                  5.6

6.8 Transformation 8: pivot_wider() - Reshape Lab Data

Convert long lab data to wide format (one row per subject/visit with columns for each test).

# Pivot LB data to wide format
lb_wide <- lb %>%
  dplyr::filter(LBTESTCD %in% c("ALT", "AST", "BILI")) %>%
  dplyr::select(USUBJID, VISIT, LBTESTCD, LBSTRESN) %>%
  tidyr::pivot_wider(
    names_from = LBTESTCD,
    values_from = LBSTRESN,
    values_fn = mean  # Handle multiple values per cell
  )

utils::head(lb_wide, 10)
# A tibble: 10 × 5
   USUBJID     VISIT         ALT   AST  BILI
   <chr>       <chr>       <dbl> <dbl> <dbl>
 1 01-701-1015 SCREENING 1    27    40 10.3 
 2 01-701-1015 WEEK 2         41    33  8.55
 3 01-701-1015 WEEK 4         18    21  6.84
 4 01-701-1015 WEEK 6         26    26  5.13
 5 01-701-1015 WEEK 8         22    21  8.55
 6 01-701-1015 WEEK 12        27    22  6.84
 7 01-701-1015 WEEK 16        17    23  6.84
 8 01-701-1015 WEEK 20        21    19  6.84
 9 01-701-1015 WEEK 24        23    23  6.84
10 01-701-1015 WEEK 26        23    19  5.13

6.9 Transformation 9: pivot_longer() - Reshape Wide to Long

Convert wide data back to long format (useful for plotting).

# Pivot back to long format
lb_long <- lb_wide %>%
  tidyr::pivot_longer(
    cols = c(ALT, AST, BILI),
    names_to = "LBTESTCD",
    values_to = "LBSTRESN"
  ) %>%
  dplyr::filter(!is.na(LBSTRESN))

utils::head(lb_long, 10)
# A tibble: 10 × 4
   USUBJID     VISIT       LBTESTCD LBSTRESN
   <chr>       <chr>       <chr>       <dbl>
 1 01-701-1015 SCREENING 1 ALT         27   
 2 01-701-1015 SCREENING 1 AST         40   
 3 01-701-1015 SCREENING 1 BILI        10.3 
 4 01-701-1015 WEEK 2      ALT         41   
 5 01-701-1015 WEEK 2      AST         33   
 6 01-701-1015 WEEK 2      BILI         8.55
 7 01-701-1015 WEEK 4      ALT         18   
 8 01-701-1015 WEEK 4      AST         21   
 9 01-701-1015 WEEK 4      BILI         6.84
10 01-701-1015 WEEK 6      ALT         26   

6.10 Transformation 10: left_join() + anti_join() - Data Quality Check

Find subjects with exposure but no adverse events (data quality check).

# Subjects who were dosed (in EX)
subjects_dosed <- ex %>%
  dplyr::distinct(USUBJID)

# Subjects with AEs
subjects_with_ae <- ae %>%
  dplyr::distinct(USUBJID)

# Find dosed subjects with NO AEs
dosed_no_ae <- subjects_dosed %>%
  dplyr::anti_join(subjects_with_ae, by = "USUBJID")

cat("Subjects dosed:", nrow(subjects_dosed), "\n")
Subjects dosed: 254 
cat("Subjects with AEs:", nrow(subjects_with_ae), "\n")
Subjects with AEs: 225 
cat("Dosed subjects with NO AEs:", nrow(dosed_no_ae), "\n")
Dosed subjects with NO AEs: 29 

7 Advanced: Combining Multiple Operations

Real clinical programming often chains many operations together. Here’s a complete example:

# Complete analysis: AE summary by SOC and Treatment Arm
complete_ae_summary <- ae %>%
  # Join with demographics
  dplyr::inner_join(
    dm %>% dplyr::select(USUBJID, ARM, AGE, SEX),
    by = "USUBJID"
  ) %>%
  # Create severity score
  dplyr::mutate(
    AESEV_N = dplyr::case_when(
      AESEV == "MILD"     ~ 1L,
      AESEV == "MODERATE" ~ 2L,
      AESEV == "SEVERE"   ~ 3L,
      TRUE                ~ NA_integer_
    )
  ) %>%
  # Group and summarize
  dplyr::group_by(ARM, AEBODSYS) %>%
  dplyr::summarise(
    n_events = dplyr::n(),
    n_subjects = dplyr::n_distinct(USUBJID),
    n_serious = sum(AESER == "Y", na.rm = TRUE),
    mean_severity = round(mean(AESEV_N, na.rm = TRUE), 2),
    .groups = "drop"
  ) %>%
  # Filter to most common SOCs
  dplyr::filter(n_events >= 5) %>%
  # Sort
dplyr::arrange(ARM, dplyr::desc(n_events))

# Display top results
utils::head(complete_ae_summary, 15)
# A tibble: 15 × 6
   ARM     AEBODSYS                  n_events n_subjects n_serious mean_severity
   <chr>   <chr>                        <int>      <int>     <int>         <dbl>
 1 Placebo GENERAL DISORDERS AND AD…       48         21         0          1.25
 2 Placebo SKIN AND SUBCUTANEOUS TI…       47         21         0          1.4 
 3 Placebo INFECTIONS AND INFESTATI…       35         16         0          1.2 
 4 Placebo CARDIAC DISORDERS               27         13         0          1.41
 5 Placebo GASTROINTESTINAL DISORDE…       26         17         0          1.15
 6 Placebo INVESTIGATIONS                  19         10         0          1.16
 7 Placebo NERVOUS SYSTEM DISORDERS        16         12         0          1.38
 8 Placebo RESPIRATORY, THORACIC AN…       15         10         0          1.07
 9 Placebo PSYCHIATRIC DISORDERS           14         10         0          1.43
10 Placebo INJURY, POISONING AND PR…        9          4         0          1.44
11 Placebo EYE DISORDERS                    8          4         0          1.38
12 Placebo METABOLISM AND NUTRITION…        8          6         0          1.25
13 Placebo MUSCULOSKELETAL AND CONN…        8          5         0          1.5 
14 Placebo VASCULAR DISORDERS               7          3         0          1.43
15 Placebo RENAL AND URINARY DISORD…        5          4         0          1.2 

8 Day 2 Deliverable Summary

Today you completed the following:

Task Status
Understood SDTM domain classes Done
Learned variable naming conventions Done
Transformation 1: filter() with multiple conditions Done
Transformation 2: filter() with %in% Done
Transformation 3: select() with helper functions Done
Transformation 4: mutate() with case_when() Done
Transformation 5: mutate() with across() Done
Transformation 6: group_by + summarise (AE by SOC) Done
Transformation 7: Summarize by treatment arm Done
Transformation 8: pivot_wider() Done
Transformation 9: pivot_longer() Done
Transformation 10: anti_join() for data QC Done

9 Key Takeaways

  1. SDTM has four domain classes: Special Purpose, Events, Interventions, Findings
  2. Variable naming is systematic: Prefix (domain) + suffix (content type)
  3. --TESTCD vs --TEST: Short code vs full name
  4. --ORRES vs --STRESN: Original vs standardized results
  5. Tidyverse is foundational: 50% of clinical programming is dplyr/tidyr operations

10 Resources

  • CDISC SDTM Implementation Guide v3.4 - Official SDTM documentation
  • tidyverse.org - Complete tidyverse documentation
  • pharmaverse blog: ‘The pharmaverse (hi)story’ - Background on pharmaverse development
  • dplyr cheatsheet - Quick reference for dplyr

11 What’s Next?

In Day 3, we’ll explore:

  • Controlled Terminology and why it matters for submissions
  • MedDRA coding and creating lookup tables
  • CDISC CT concepts: AESEV, AEREL, AESER values
  • How metacore + metatools automate CT validation

 

30 Days of Pharmaverse  ·  Disclaimer  ·  Indraneel Chakraborty  ·  © 2026