Day 2: SDTM Domain Structure & Tidyverse Mastery

Understanding SDTM Domain Classes Through Code

1 Learning Objectives

By the end of Day 2, you will be able to:

Understand the four SDTM domain classes and their purposes
Recognize and interpret SDTM variable naming conventions
Apply advanced tidyverse operations to clinical data
Create meaningful summaries of clinical trial data
Work fluently with dplyr and tidyr on pharmaverse datasets

2 Introduction

Day 2 builds on the foundations from Day 1. Now that you can load and explore SDTM datasets, we’ll dive deeper into:

SDTM Domain Classes: Understanding why domains are organized the way they are
Variable Naming Conventions: Decoding the patterns like --TESTCD, --ORRES, --DY
Tidyverse Mastery: This is 50% of what you’ll do daily as a clinical programmer

The pharmaverse packages are built on tidyverse principles, so mastering dplyr and tidyr is essential for working with admiral, sdtm.oak, and other clinical programming tools.

3 Package Installation & Loading

3.1 Required Packages

Package	Purpose
`dplyr`	Data manipulation (filter, select, mutate, etc.)
`tidyr`	Data reshaping (pivot_wider, pivot_longer, etc.)
`pharmaversesdtm`	Example SDTM datasets for learning

3.2 Load Packages

# Load required packages
library(dplyr)
library(tidyr)
library(pharmaversesdtm)

3.3 Load SDTM Datasets

We’ll work with multiple domains today to understand their structures and relationships.

# Load all domains we'll use today
data("dm", package = "pharmaversesdtm")
data("ae", package = "pharmaversesdtm")
data("ex", package = "pharmaversesdtm")
data("lb", package = "pharmaversesdtm")
data("vs", package = "pharmaversesdtm")
data("cm", package = "pharmaversesdtm")
data("ds", package = "pharmaversesdtm")

4 SDTM Domain Classes

SDTM organizes clinical data into four domain classes, each designed for a specific type of information. Understanding these classes helps you know what to expect from each domain. Here’s a simple breakdown:

4.1 The Four Domain Classes

Class	Purpose	Example Domains	Key Characteristics
Special Purpose	Subject-level information	DM, CO, SE, SV	Usually one row per subject
Events	Discrete occurrences	AE, DS, MH, DV	Multiple rows per subject, start/end
Interventions	Treatments administered	EX, CM, EC, SU	Dosing info, frequencies, routes
Findings	Collected measurements/tests	LB, VS, EG, PE	Test codes, results, units, standards

4.1.1 What do these mean?

Special Purpose: These domains (like DM for Demographics) are the backbone of your study. They contain key information about each subject, such as their ID, age, sex, and treatment group. Usually, there is one row per subject.
Events: These domains (like AE for Adverse Events) record things that happen to a subject, such as getting sick or dropping out of the study. There can be many events per subject, so you’ll see multiple rows for the same person.
Interventions: These domains (like EX for Exposure) track what is done to a subject, such as giving them a drug or a procedure. Each row is a record of a treatment or intervention.
Findings: These domains (like LB for Laboratory) store results of tests or measurements taken from the subject, such as blood pressure or lab results. Each row is a single measurement.

4.2 Special Purpose Class: DM (Demographics)

The Demographics domain is the anchor of all SDTM data. It contains one row per subject and provides the reference information that other domains link to. For example, if you want to know a subject’s age or which treatment group they are in, you look in DM.

# DM structure - note the one row per subject
dm %>%
  dplyr::select(STUDYID, USUBJID, SUBJID, RFSTDTC, RFENDTC, ARM, ACTARM, AGE, SEX, RACE) %>%
  utils::head(5)

# A tibble: 5 × 10
  STUDYID      USUBJID     SUBJID RFSTDTC RFENDTC ARM   ACTARM   AGE SEX   RACE 
  <chr>        <chr>       <chr>  <chr>   <chr>   <chr> <chr>  <dbl> <chr> <chr>
1 CDISCPILOT01 01-701-1015 1015   2014-0… 2014-0… Plac… Place…    63 F     WHITE
2 CDISCPILOT01 01-701-1023 1023   2012-0… 2012-0… Plac… Place…    64 M     WHITE
3 CDISCPILOT01 01-701-1028 1028   2013-0… 2014-0… Xano… Xanom…    71 M     WHITE
4 CDISCPILOT01 01-701-1033 1033   2014-0… 2014-0… Xano… Xanom…    74 M     WHITE
5 CDISCPILOT01 01-701-1034 1034   2014-0… 2014-1… Xano… Xanom…    77 F     WHITE

Key DM Variables

STUDYID: Study Identifier (the name/code of the study)
USUBJID: Unique Subject Identifier (a unique code for each subject, used to link all domains)
RFSTDTC: Reference Start Date/Time (usually the date of first dose)
RFENDTC: Reference End Date/Time (when the subject finished or left the study)
ARM/ACTARM: Planned/Actual Treatment Arm (which group the subject was assigned to)

4.3 Events Class: AE (Adverse Events)

Events domains capture discrete occurrences that happen to subjects. Each event has a start and potentially an end. For example, if a subject gets a headache during the study, it is recorded as an adverse event (AE).

# AE structure - multiple rows per subject
ae %>%
  dplyr::select(USUBJID, AETERM, AEDECOD, AEBODSYS, AESEV, AESER, AESTDTC, AEENDTC) %>%
  utils::head(10)

# A tibble: 10 × 8
   USUBJID     AETERM               AEDECOD AEBODSYS AESEV AESER AESTDTC AEENDTC
   <chr>       <chr>                <chr>   <chr>    <chr> <chr> <chr>   <chr>  
 1 01-701-1015 APPLICATION SITE ER… APPLIC… GENERAL… MILD  N     2014-0… <NA>   
 2 01-701-1015 APPLICATION SITE PR… APPLIC… GENERAL… MILD  N     2014-0… <NA>   
 3 01-701-1015 DIARRHOEA            DIARRH… GASTROI… MILD  N     2014-0… 2014-0…
 4 01-701-1023 ATRIOVENTRICULAR BL… ATRIOV… CARDIAC… MILD  N     2012-0… <NA>   
 5 01-701-1023 ERYTHEMA             ERYTHE… SKIN AN… MILD  N     2012-0… 2012-0…
 6 01-701-1023 ERYTHEMA             ERYTHE… SKIN AN… MODE… N     2012-0… <NA>   
 7 01-701-1023 ERYTHEMA             ERYTHE… SKIN AN… MILD  N     2012-0… 2012-0…
 8 01-701-1028 APPLICATION SITE ER… APPLIC… GENERAL… MILD  N     2013-0… <NA>   
 9 01-701-1028 APPLICATION SITE PR… APPLIC… GENERAL… MILD  N     2013-0… <NA>   
10 01-701-1034 APPLICATION SITE PR… APPLIC… GENERAL… MILD  N     2014-0… <NA>

Key AE Variables

AETERM: The verbatim term reported by the investigator (e.g., “Headache”)
AEDECOD: The standardized (coded) term (e.g., “HEADACHE”)
AEBODSYS: The body system affected (e.g., “Nervous system disorders”)
AESEV: Severity (e.g., Mild, Moderate, Severe)
AESER: Seriousness (Yes/No)
AESTDTC/AEENDTC: Start and end dates of the event

4.4 Interventions Class: EX (Exposure)

Interventions domains track treatments given to subjects. They include dosing information, routes, and frequencies. For example, every time a subject receives a dose of study drug, it is recorded in EX.

# EX structure - dosing records
ex %>%
  dplyr::select(USUBJID, EXTRT, EXDOSE, EXDOSU, EXROUTE, EXSTDTC, EXENDTC) %>%
  utils::head(10)

# A tibble: 10 × 7
   USUBJID     EXTRT      EXDOSE EXDOSU EXROUTE     EXSTDTC    EXENDTC   
   <chr>       <chr>       <dbl> <chr>  <chr>       <chr>      <chr>     
 1 01-701-1015 PLACEBO         0 mg     TRANSDERMAL 2014-01-02 2014-01-16
 2 01-701-1015 PLACEBO         0 mg     TRANSDERMAL 2014-01-17 2014-06-18
 3 01-701-1015 PLACEBO         0 mg     TRANSDERMAL 2014-06-19 2014-07-02
 4 01-701-1023 PLACEBO         0 mg     TRANSDERMAL 2012-08-05 2012-08-27
 5 01-701-1023 PLACEBO         0 mg     TRANSDERMAL 2012-08-28 2012-09-01
 6 01-701-1028 XANOMELINE     54 mg     TRANSDERMAL 2013-07-19 2013-08-01
 7 01-701-1028 XANOMELINE     81 mg     TRANSDERMAL 2013-08-02 2014-01-06
 8 01-701-1028 XANOMELINE     54 mg     TRANSDERMAL 2014-01-07 2014-01-14
 9 01-701-1033 XANOMELINE     54 mg     TRANSDERMAL 2014-03-18 2014-03-31
10 01-701-1034 XANOMELINE     54 mg     TRANSDERMAL 2014-07-01 2014-07-15

Key EX Variables

EXTRT: Name of the treatment (e.g., “Drug A”)
EXDOSE: Dose amount
EXDOSU: Dose units (e.g., mg)
EXROUTE: Route of administration (e.g., Oral)
EXSTDTC/EXENDTC: Start and end dates of dosing

4.5 Findings Class: LB (Laboratory)

Findings domains contain collected measurements. They have a standardized structure with test codes, results, units, and reference ranges. For example, each blood test result is a row in LB.

# LB structure - lab results with standardization
lb %>%
  dplyr::select(USUBJID, LBTESTCD, LBTEST, LBORRES, LBORRESU, LBSTRESN, LBSTRESU) %>%
  utils::head(10)

# A tibble: 10 × 7
   USUBJID     LBTESTCD LBTEST  LBORRES LBORRESU LBSTRESN LBSTRESU
   <chr>       <chr>    <chr>   <chr>   <chr>       <dbl> <chr>   
 1 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
 2 01-701-1015 ALB      Albumin 3.9     g/dL           39 g/L     
 3 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
 4 01-701-1015 ALB      Albumin 3.7     g/dL           37 g/L     
 5 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
 6 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
 7 01-701-1015 ALB      Albumin 3.7     g/dL           37 g/L     
 8 01-701-1015 ALB      Albumin 3.7     g/dL           37 g/L     
 9 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L     
10 01-701-1015 ALB      Albumin 3.8     g/dL           38 g/L

Key LB Variables

LBTESTCD: Short test code (e.g., “GLUC”)
LBTEST: Full test name (e.g., “Glucose”)
LBORRES: Original result (as collected)
LBORRESU: Original result units (e.g., mg/dL)
LBSTRESN: Standardized numeric result
LBSTRESU: Standardized units (e.g., mmol/L)

5 SDTM Variable Naming Conventions

SDTM uses consistent naming patterns across all domains. Understanding these patterns helps you quickly identify variable purposes without documentation.

5.1 The Two-Character Domain Prefix

Every domain-specific variable starts with a two-character prefix that identifies the domain:

Prefix	Domain	Example Variables
DM	Demographics	DMDTC, DMDY
AE	Adverse Events	AESTDTC, AEENDTC, AESEV
EX	Exposure	EXSTDTC, EXDOSE, EXROUTE
LB	Laboratory	LBTESTCD, LBORRES, LBSTRESN
VS	Vital Signs	VSTESTCD, VSORRES, VSSTRESN
CM	Concomitant Meds	CMTRT, CMDOSE, CMSTDTC

5.2 Common Variable Suffixes

The suffix tells you what type of information the variable contains:

Suffix	Meaning	Examples
`--TERM`	Reported term (verbatim)	AETERM, CMTRT
`--DECOD`	Decoded/standardized term	AEDECOD, CMDECOD
`--TESTCD`	Test code (short)	LBTESTCD, VSTESTCD
`--TEST`	Test name (full)	LBTEST, VSTEST
`--ORRES`	Original result (as collected)	LBORRES, VSORRES
`--ORRESU`	Original result units	LBORRESU, VSORRESU
`--STRESN`	Standardized result (numeric)	LBSTRESN, VSSTRESN
`--STRESU`	Standardized result units	LBSTRESU, VSSTRESU
`--STDTC`	Start date/time	AESTDTC, EXSTDTC, CMSTDTC
`--ENDTC`	End date/time	AEENDTC, EXENDTC, CMENDTC
`--DY`	Study day	AESTDY, AEENDY, EXSTDY
`--SEV`	Severity	AESEV
`--SER`	Serious (Y/N)	AESER
`--REL`	Relationship/Causality	AEREL

5.3 Understanding the Pattern: Findings Class

Let’s examine how the naming convention works in the Findings class (LB, VS):

# The Findings class pattern in LB
lb %>%
  dplyr::select(
    USUBJID,
    LBTESTCD,   # Test code (short identifier)
    LBTEST,     # Test name (full description)
    LBORRES,    # Original result (as collected)
    LBORRESU,   # Original units
    LBSTRESC,   # Standardized result (character)
    LBSTRESN,   # Standardized result (numeric)
    LBSTRESU    # Standardized units
  ) %>%
  dplyr::filter(LBTESTCD == "ALT") %>%
  utils::head(5)

# A tibble: 5 × 8
  USUBJID     LBTESTCD LBTEST        LBORRES LBORRESU LBSTRESC LBSTRESN LBSTRESU
  <chr>       <chr>    <chr>         <chr>   <chr>    <chr>       <dbl> <chr>   
1 01-701-1015 ALT      Alanine Amin… 27      U/L      27             27 U/L     
2 01-701-1015 ALT      Alanine Amin… 41      U/L      41             41 U/L     
3 01-701-1015 ALT      Alanine Amin… 18      U/L      18             18 U/L     
4 01-701-1015 ALT      Alanine Amin… 26      U/L      26             26 U/L     
5 01-701-1015 ALT      Alanine Amin… 22      U/L      22             22 U/L

Why Two Result Variables?

ORRES/ORRESU: Original result as collected (might be “< 5”, “>100”, or character values)
STRESN/STRESU: Standardized numeric result for analysis (handles unit conversions)

This separation preserves the original data while providing analysis-ready values.

6 Tidyverse Mastery: 10 Essential Transformations

The deliverable for Day 2 is to demonstrate 10 tidyverse transformations. Let’s work through them systematically.

6.1 Transformation 1: `filter()` with Multiple Conditions

Filter AEs that are both serious AND severe.

# Complex filtering: Serious AND Severe AEs
serious_severe_ae <- ae %>%
  dplyr::filter(AESER == "Y" & AESEV == "SEVERE")

cat("Total AEs:", nrow(ae), "\n")

Total AEs: 1191

cat("Serious + Severe AEs:", nrow(serious_severe_ae), "\n")

Serious + Severe AEs: 2

# Preview
serious_severe_ae %>%
  dplyr::select(USUBJID, AEDECOD, AESEV, AESER) %>%
  utils::head()

# A tibble: 2 × 4
  USUBJID     AEDECOD                                        AESEV  AESER
  <chr>       <chr>                                          <chr>  <chr>
1 01-718-1170 SYNCOPE                                        SEVERE Y    
2 01-718-1371 PARTIAL SEIZURES WITH SECONDARY GENERALISATION SEVERE Y

6.2 Transformation 2: `filter()` with `%in%` for Multiple Values

Filter specific test codes from the lab data.

# Filter for liver function tests
liver_tests <- c("ALT", "AST", "BILI", "ALP")

lft_data <- lb %>%
  dplyr::filter(LBTESTCD %in% liver_tests)

cat("Total LB records:", nrow(lb), "\n")

Total LB records: 59580

cat("Liver function tests:", nrow(lft_data), "\n")

Liver function tests: 7266

# Count by test
lft_data %>%
  dplyr::count(LBTESTCD, LBTEST)

# A tibble: 4 × 3
  LBTESTCD LBTEST                         n
  <chr>    <chr>                      <int>
1 ALP      Alkaline Phosphatase        1824
2 ALT      Alanine Aminotransferase    1814
3 AST      Aspartate Aminotransferase  1814
4 BILI     Bilirubin                   1814

6.3 Transformation 3: `select()` with Helper Functions

Use select helpers to choose columns by pattern.

# Select all date columns from AE
ae_dates <- ae %>%
  dplyr::select(USUBJID, dplyr::ends_with("DTC"))

names(ae_dates)

[1] "USUBJID" "AEDTC"   "AESTDTC" "AEENDTC"

# Select all result columns from LB
lb_results <- lb %>%
  dplyr::select(USUBJID, LBTESTCD, dplyr::contains("RES"))

names(lb_results)

[1] "USUBJID"  "LBTESTCD" "LBORRES"  "LBORRESU" "LBSTRESC" "LBSTRESN" "LBSTRESU"

6.4 Transformation 4: `mutate()` with `case_when()`

Create a severity score based on AE severity.

# Create numeric severity score
ae_scored <- ae %>%
  dplyr::mutate(
    AESEV_N = dplyr::case_when(
      AESEV == "MILD"     ~ 1L,
      AESEV == "MODERATE" ~ 2L,
      AESEV == "SEVERE"   ~ 3L,
      TRUE                ~ NA_integer_
    )
  )

# Check the mapping
ae_scored %>%
  dplyr::distinct(AESEV, AESEV_N) %>%
  dplyr::arrange(AESEV_N)

# A tibble: 3 × 2
  AESEV    AESEV_N
  <chr>      <int>
1 MILD           1
2 MODERATE       2
3 SEVERE         3

6.5 Transformation 5: `mutate()` with `across()`

Apply transformations to multiple columns at once.

# Convert all character date columns to uppercase (demonstration)
ae_upper <- ae %>%
  dplyr::mutate(
    dplyr::across(
      dplyr::where(is.character),
      ~ toupper(.x)
    )
  )

# This is useful for standardizing text fields
ae_upper %>%
  dplyr::select(USUBJID, AETERM, AEDECOD) %>%
  utils::head(3)

# A tibble: 3 × 3
  USUBJID     AETERM                    AEDECOD                  
  <chr>       <chr>                     <chr>                    
1 01-701-1015 APPLICATION SITE ERYTHEMA APPLICATION SITE ERYTHEMA
2 01-701-1015 APPLICATION SITE PRURITUS APPLICATION SITE PRURITUS
3 01-701-1015 DIARRHOEA                 DIARRHOEA

6.6 Transformation 6: `group_by()` + `summarise()` - AE Summary by SOC

This is the key deliverable: Create an AE summary by System Organ Class.

# AE Summary by System Organ Class (SOC)
ae_soc_summary <- ae %>%
  dplyr::group_by(AEBODSYS) %>%
  dplyr::summarise(
    n_events = dplyr::n(),
    n_subjects = dplyr::n_distinct(USUBJID),
    n_serious = sum(AESER == "Y", na.rm = TRUE),
    n_severe = sum(AESEV == "SEVERE", na.rm = TRUE),
    .groups = "drop"
  ) %>%
  dplyr::arrange(dplyr::desc(n_events))

# Display the summary
ae_soc_summary

# A tibble: 23 × 5
   AEBODSYS                               n_events n_subjects n_serious n_severe
   <chr>                                     <int>      <int>     <int>    <int>
 1 GENERAL DISORDERS AND ADMINISTRATION …      292        108         0       10
 2 SKIN AND SUBCUTANEOUS TISSUE DISORDERS      276        105         0        6
 3 NERVOUS SYSTEM DISORDERS                    101         59         3       10
 4 CARDIAC DISORDERS                            91         44         0        4
 5 GASTROINTESTINAL DISORDERS                   87         53         0        2
 6 INFECTIONS AND INFESTATIONS                  73         39         0        1
 7 RESPIRATORY, THORACIC AND MEDIASTINAL…       53         30         0        0
 8 INVESTIGATIONS                               40         23         0        0
 9 PSYCHIATRIC DISORDERS                        40         29         0        2
10 INJURY, POISONING AND PROCEDURAL COMP…       29         14         0        3
# ℹ 13 more rows

6.7 Transformation 7: `group_by()` + `summarise()` - By Treatment Arm

Summarize by treatment arm using joined data.

# First, join AE with DM to get treatment arm
ae_with_arm <- ae %>%
  dplyr::inner_join(
    dm %>% dplyr::select(USUBJID, ARM),
    by = "USUBJID"
  )

# Now summarize by treatment arm
ae_arm_summary <- ae_with_arm %>%
  dplyr::group_by(ARM) %>%
  dplyr::summarise(
    n_subjects_with_ae = dplyr::n_distinct(USUBJID),
    total_aes = dplyr::n(),
    mean_aes_per_subject = round(dplyr::n() / dplyr::n_distinct(USUBJID), 1),
    .groups = "drop"
  )

ae_arm_summary

# A tibble: 3 × 4
  ARM                  n_subjects_with_ae total_aes mean_aes_per_subject
  <chr>                             <int>     <int>                <dbl>
1 Placebo                              69       301                  4.4
2 Xanomeline High Dose                 79       455                  5.8
3 Xanomeline Low Dose                  77       435                  5.6

6.8 Transformation 8: `pivot_wider()` - Reshape Lab Data

Convert long lab data to wide format (one row per subject/visit with columns for each test).

# Pivot LB data to wide format
lb_wide <- lb %>%
  dplyr::filter(LBTESTCD %in% c("ALT", "AST", "BILI")) %>%
  dplyr::select(USUBJID, VISIT, LBTESTCD, LBSTRESN) %>%
  tidyr::pivot_wider(
    names_from = LBTESTCD,
    values_from = LBSTRESN,
    values_fn = mean  # Handle multiple values per cell
  )

utils::head(lb_wide, 10)

# A tibble: 10 × 5
   USUBJID     VISIT         ALT   AST  BILI
   <chr>       <chr>       <dbl> <dbl> <dbl>
 1 01-701-1015 SCREENING 1    27    40 10.3 
 2 01-701-1015 WEEK 2         41    33  8.55
 3 01-701-1015 WEEK 4         18    21  6.84
 4 01-701-1015 WEEK 6         26    26  5.13
 5 01-701-1015 WEEK 8         22    21  8.55
 6 01-701-1015 WEEK 12        27    22  6.84
 7 01-701-1015 WEEK 16        17    23  6.84
 8 01-701-1015 WEEK 20        21    19  6.84
 9 01-701-1015 WEEK 24        23    23  6.84
10 01-701-1015 WEEK 26        23    19  5.13

6.9 Transformation 9: `pivot_longer()` - Reshape Wide to Long

Convert wide data back to long format (useful for plotting).

# Pivot back to long format
lb_long <- lb_wide %>%
  tidyr::pivot_longer(
    cols = c(ALT, AST, BILI),
    names_to = "LBTESTCD",
    values_to = "LBSTRESN"
  ) %>%
  dplyr::filter(!is.na(LBSTRESN))

utils::head(lb_long, 10)

# A tibble: 10 × 4
   USUBJID     VISIT       LBTESTCD LBSTRESN
   <chr>       <chr>       <chr>       <dbl>
 1 01-701-1015 SCREENING 1 ALT         27   
 2 01-701-1015 SCREENING 1 AST         40   
 3 01-701-1015 SCREENING 1 BILI        10.3 
 4 01-701-1015 WEEK 2      ALT         41   
 5 01-701-1015 WEEK 2      AST         33   
 6 01-701-1015 WEEK 2      BILI         8.55
 7 01-701-1015 WEEK 4      ALT         18   
 8 01-701-1015 WEEK 4      AST         21   
 9 01-701-1015 WEEK 4      BILI         6.84
10 01-701-1015 WEEK 6      ALT         26

6.10 Transformation 10: `left_join()` + `anti_join()` - Data Quality Check

Find subjects with exposure but no adverse events (data quality check).

# Subjects who were dosed (in EX)
subjects_dosed <- ex %>%
  dplyr::distinct(USUBJID)

# Subjects with AEs
subjects_with_ae <- ae %>%
  dplyr::distinct(USUBJID)

# Find dosed subjects with NO AEs
dosed_no_ae <- subjects_dosed %>%
  dplyr::anti_join(subjects_with_ae, by = "USUBJID")

cat("Subjects dosed:", nrow(subjects_dosed), "\n")

Subjects dosed: 254

cat("Subjects with AEs:", nrow(subjects_with_ae), "\n")

Subjects with AEs: 225

cat("Dosed subjects with NO AEs:", nrow(dosed_no_ae), "\n")

Dosed subjects with NO AEs: 29

7 Advanced: Combining Multiple Operations

Real clinical programming often chains many operations together. Here’s a complete example:

# Complete analysis: AE summary by SOC and Treatment Arm
complete_ae_summary <- ae %>%
  # Join with demographics
  dplyr::inner_join(
    dm %>% dplyr::select(USUBJID, ARM, AGE, SEX),
    by = "USUBJID"
  ) %>%
  # Create severity score
  dplyr::mutate(
    AESEV_N = dplyr::case_when(
      AESEV == "MILD"     ~ 1L,
      AESEV == "MODERATE" ~ 2L,
      AESEV == "SEVERE"   ~ 3L,
      TRUE                ~ NA_integer_
    )
  ) %>%
  # Group and summarize
  dplyr::group_by(ARM, AEBODSYS) %>%
  dplyr::summarise(
    n_events = dplyr::n(),
    n_subjects = dplyr::n_distinct(USUBJID),
    n_serious = sum(AESER == "Y", na.rm = TRUE),
    mean_severity = round(mean(AESEV_N, na.rm = TRUE), 2),
    .groups = "drop"
  ) %>%
  # Filter to most common SOCs
  dplyr::filter(n_events >= 5) %>%
  # Sort
dplyr::arrange(ARM, dplyr::desc(n_events))

# Display top results
utils::head(complete_ae_summary, 15)

# A tibble: 15 × 6
   ARM     AEBODSYS                  n_events n_subjects n_serious mean_severity
   <chr>   <chr>                        <int>      <int>     <int>         <dbl>
 1 Placebo GENERAL DISORDERS AND AD…       48         21         0          1.25
 2 Placebo SKIN AND SUBCUTANEOUS TI…       47         21         0          1.4 
 3 Placebo INFECTIONS AND INFESTATI…       35         16         0          1.2 
 4 Placebo CARDIAC DISORDERS               27         13         0          1.41
 5 Placebo GASTROINTESTINAL DISORDE…       26         17         0          1.15
 6 Placebo INVESTIGATIONS                  19         10         0          1.16
 7 Placebo NERVOUS SYSTEM DISORDERS        16         12         0          1.38
 8 Placebo RESPIRATORY, THORACIC AN…       15         10         0          1.07
 9 Placebo PSYCHIATRIC DISORDERS           14         10         0          1.43
10 Placebo INJURY, POISONING AND PR…        9          4         0          1.44
11 Placebo EYE DISORDERS                    8          4         0          1.38
12 Placebo METABOLISM AND NUTRITION…        8          6         0          1.25
13 Placebo MUSCULOSKELETAL AND CONN…        8          5         0          1.5 
14 Placebo VASCULAR DISORDERS               7          3         0          1.43
15 Placebo RENAL AND URINARY DISORD…        5          4         0          1.2

8 Day 2 Deliverable Summary

Today you completed the following:

Task	Status
Understood SDTM domain classes	Done
Learned variable naming conventions	Done
Transformation 1: filter() with multiple conditions	Done
Transformation 2: filter() with %in%	Done
Transformation 3: select() with helper functions	Done
Transformation 4: mutate() with case_when()	Done
Transformation 5: mutate() with across()	Done
Transformation 6: group_by + summarise (AE by SOC)	Done
Transformation 7: Summarize by treatment arm	Done
Transformation 8: pivot_wider()	Done
Transformation 9: pivot_longer()	Done
Transformation 10: anti_join() for data QC	Done

9 Key Takeaways

SDTM has four domain classes: Special Purpose, Events, Interventions, Findings
Variable naming is systematic: Prefix (domain) + suffix (content type)
--TESTCD vs --TEST: Short code vs full name
--ORRES vs --STRESN: Original vs standardized results
Tidyverse is foundational: 50% of clinical programming is dplyr/tidyr operations

10 Resources

CDISC SDTM Implementation Guide v3.4 - Official SDTM documentation
tidyverse.org - Complete tidyverse documentation
pharmaverse blog: ‘The pharmaverse (hi)story’ - Background on pharmaverse development
dplyr cheatsheet - Quick reference for dplyr

11 What’s Next?

In Day 3, we’ll explore:

Controlled Terminology and why it matters for submissions
MedDRA coding and creating lookup tables
CDISC CT concepts: AESEV, AEREL, AESER values
How metacore + metatools automate CT validation