30 Days of Pharmaverse

Day 7: Week 1 Capstone: End-to-End SDTM Script

Build DM, AE, EX from Scratch with xportr


1 Learning Objectives

By the end of Day 7, you will be able to:

  1. Integrate all skills from Days 1-6 into a single workflow
  2. Produce valid SDTM domains (DM, AE, EX) from simulated raw data
  3. Work with 20+ subjects across multiple sites
  4. Apply metadata (Labels, Types) using xportr
  5. Export submission-ready SAS Transport files (.xpt)

2 Capstone Project Overview

Goal: Use everything from Days 1-6 to build three core SDTM domains from scratch, starting from simulated raw clinical trial data. Each domain plays a distinct role:

  • DM (Demographics): This domain contains one row per subject and includes basic information like subject ID, age, sex, race, and treatment group. It is the foundation for linking all other domains.
  • AE (Adverse Events): This domain records all adverse events experienced by subjects. There can be multiple rows per subject, one for each event. Each row includes details like the event term, severity, seriousness, and dates.
  • EX (Exposure): This domain tracks dosing information, that is, when and how much of the study drug each subject received. There can be multiple rows per subject, one for each dose.
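The row structure described above can be checked directly. The sketch below uses small hypothetical data frames (dm_toy and ae_toy are illustration names, not the capstone datasets) and base R only: DM must have no repeated USUBJID, AE may have any number of rows per subject, and every AE row must link back to a DM subject.

```r
# Toy data (hypothetical values, not the capstone output)
dm_toy <- data.frame(
  USUBJID = c("S-101-001", "S-101-002", "S-102-003"),
  SEX = c("F", "M", "M")
)
ae_toy <- data.frame(
  USUBJID = c("S-101-001", "S-101-001", "S-102-003"),
  AETERM = c("Headache", "Nausea", "Rash")
)

# DM: exactly one row per subject, so no USUBJID may repeat
dm_one_row_per_subject <- !any(duplicated(dm_toy$USUBJID))

# AE: zero, one, or many rows per subject (counted in DM subject order)
ae_rows_per_subject <- table(factor(ae_toy$USUBJID, levels = dm_toy$USUBJID))

# Every AE record must link back to a subject in DM
ae_links_to_dm <- all(ae_toy$USUBJID %in% dm_toy$USUBJID)

print(ae_rows_per_subject)
```

Here the second subject has no adverse events, which is valid: AE and EX hang off DM through USUBJID, not the other way around.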

2.1 The Pipeline

The process for building SDTM domains follows a clear pipeline:

Raw Data → Transform → Validate → Format → Export (.xpt)
  1. Raw Data: Start with unprocessed data collected from the study (simulated here).
  2. Transform: Clean and organize the data into SDTM structure using R and tidyverse tools.
  3. Validate: Check that your SDTM domains follow CDISC rules and are internally consistent.
  4. Format: Apply metadata (labels, types) and prepare for export.
  5. Export: Save the final datasets as SAS Transport files (.xpt), which are required for regulatory submission.

2.1.1 Why is this important?

This workflow mirrors the core steps of real clinical programming work. Practicing it end to end prepares you to build submission-style datasets for other studies.


3 Part 1: Generate Simulated Raw Data (20+ Subjects)

3.1 Raw Demographics

library(dplyr)
library(tidyr)
library(lubridate)
library(xportr)

set.seed(42)  # For reproducibility

# Generate 25 subjects across 3 sites
n_subjects <- 25

raw_dm <- tibble(
  Site = sample(c("101", "102", "103"), n_subjects, replace = TRUE, prob = c(0.4, 0.35, 0.25)),
  Subject = sprintf("%03d", 1:n_subjects)
) %>%
  mutate(
    # Demographics
    Sex = sample(c("Male", "Female"), n_subjects, replace = TRUE),
    Race = sample(c("White", "Black", "Asian", "Other"), n_subjects, replace = TRUE, 
                  prob = c(0.6, 0.2, 0.15, 0.05)),
    Ethnicity = sample(c("Hispanic", "Not Hispanic"), n_subjects, replace = TRUE, 
                       prob = c(0.15, 0.85)),
    
    # Birth years 1950-2000 (ages roughly 23-74 at first dose in 2024)
    BirthYear = sample(1950:2000, n_subjects, replace = TRUE),
    BirthDate = paste0(BirthYear, "-", 
                       sprintf("%02d", sample(1:12, n_subjects, replace = TRUE)), "-",
                       sprintf("%02d", sample(1:28, n_subjects, replace = TRUE))),
    
    # Treatment assignment
    Arm = sample(c("Placebo", "Active 10mg", "Active 20mg"), n_subjects, replace = TRUE,
                 prob = c(0.33, 0.33, 0.34)),
    
    # First dose date (Jan-Mar 2024)
    FirstDoseDate = as.Date("2024-01-01") + sample(0:89, n_subjects, replace = TRUE)
  ) %>%
  select(-BirthYear)

cat("Raw Demographics: ", nrow(raw_dm), "subjects\n\n")
Raw Demographics:  25 subjects
head(raw_dm, 10)
# A tibble: 10 × 8
   Site  Subject Sex    Race  Ethnicity    BirthDate  Arm         FirstDoseDate
   <chr> <chr>   <chr>  <chr> <chr>        <chr>      <chr>       <date>       
 1 103   001     Female White Not Hispanic 1967-09-22 Active 20mg 2024-02-23   
 2 103   002     Male   White Not Hispanic 1973-10-17 Active 10mg 2024-03-30   
 3 101   003     Male   White Not Hispanic 1998-08-18 Placebo     2024-01-31   
 4 103   004     Male   Black Not Hispanic 1967-07-02 Active 10mg 2024-02-12   
 5 102   005     Male   White Not Hispanic 1954-06-06 Active 20mg 2024-02-21   
 6 102   006     Female Black Not Hispanic 1995-01-22 Active 20mg 2024-03-21   
 7 102   007     Female Black Not Hispanic 1989-05-06 Active 10mg 2024-02-28   
 8 101   008     Female White Not Hispanic 1989-09-06 Active 10mg 2024-01-27   
 9 102   009     Female White Not Hispanic 1970-07-20 Active 20mg 2024-01-30   
10 102   010     Male   White Not Hispanic 1985-10-28 Active 20mg 2024-02-21   

3.2 Raw Adverse Events

# Common adverse events with shortened SOC names for reference
# (Part 3 re-derives AEBODSYS from Term and does not use this SOC column)
ae_terms <- tribble(
  ~Term,                   ~SOC,
  "Headache",              "Nervous system disorders",
  "Nausea",                "Gastrointestinal disorders",
  "Fatigue",               "General disorders",
  "Dizziness",             "Nervous system disorders",
  "Diarrhea",              "Gastrointestinal disorders",
  "Rash",                  "Skin disorders",
  "Back pain",             "Musculoskeletal disorders",
  "Insomnia",              "Psychiatric disorders",
  "Cough",                 "Respiratory disorders",
  "Upper respiratory infection", "Infections"
)

# Generate 3 AEs for each of a random 80% of subjects (the rest have none)
subjects_with_ae <- sample(raw_dm$Subject, size = round(n_subjects * 0.8))

raw_ae <- tibble(
  Subject = rep(subjects_with_ae, each = 3)
) %>%
  slice_sample(n = 60) %>%  # Shuffles row order; all 60 records (20 subjects x 3) are kept
  mutate(
    Site = raw_dm$Site[match(Subject, raw_dm$Subject)],
    AE_Num = row_number(),
    Term = sample(ae_terms$Term, n(), replace = TRUE),
    Severity = sample(c("Mild", "Moderate", "Severe"), n(), replace = TRUE, 
                      prob = c(0.6, 0.3, 0.1)),
    Serious = sample(c("No", "Yes"), n(), replace = TRUE, prob = c(0.92, 0.08)),
    Related = sample(c("Not Related", "Possibly Related", "Probably Related"), n(), 
                     replace = TRUE, prob = c(0.5, 0.35, 0.15)),
    # Start 1-60 days after first dose
    StartDay = sample(1:60, n(), replace = TRUE)
  ) %>%
  left_join(
    raw_dm %>% select(Subject, FirstDoseDate),
    by = "Subject"
  ) %>%
  mutate(
    StartDate = as.character(FirstDoseDate + StartDay),
    # End date: 1-14 days after start, or ongoing
    EndDate = ifelse(
      sample(c(TRUE, FALSE), n(), replace = TRUE, prob = c(0.85, 0.15)),
      as.character(as.Date(StartDate) + sample(1:14, n(), replace = TRUE)),
      NA_character_
    )
  ) %>%
  select(Site, Subject, Term, StartDate, EndDate, Severity, Serious, Related)

cat("Raw Adverse Events:", nrow(raw_ae), "records\n\n")
Raw Adverse Events: 60 records
head(raw_ae, 10)
# A tibble: 10 × 8
   Site  Subject Term                 StartDate EndDate Severity Serious Related
   <chr> <chr>   <chr>                <chr>     <chr>   <chr>    <chr>   <chr>  
 1 102   011     Diarrhea             2024-05-… 2024-0… Mild     Yes     Probab…
 2 101   025     Diarrhea             2024-01-… 2024-0… Mild     No      Not Re…
 3 103   013     Rash                 2024-04-… 2024-0… Moderate No      Possib…
 4 101   025     Cough                2024-02-… 2024-0… Mild     No      Not Re…
 5 103   001     Upper respiratory i… 2024-02-… 2024-0… Mild     No      Possib…
 6 102   015     Insomnia             2024-03-… 2024-0… Severe   No      Not Re…
 7 102   012     Nausea               2024-04-… 2024-0… Severe   No      Probab…
 8 101   022     Back pain            2024-02-… 2024-0… Mild     No      Possib…
 9 102   006     Back pain            2024-05-… 2024-0… Mild     No      Not Re…
10 102   010     Diarrhea             2024-04-… 2024-0… Mild     No      Possib…
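The ae_terms table above pairs each term with a SOC. One way to use such a lookup is a left join on the term, followed by a check for terms that found no match. This is a minimal sketch with a hypothetical three-row lookup (soc_lookup) and a deliberately unmapped term ("Vertigo"); the SOC names are the shortened ones from the table above, not full MedDRA names.

```r
library(dplyr)

# Hypothetical lookup, a subset of the term-to-SOC pairs shown above
soc_lookup <- tibble(
  Term = c("Headache", "Nausea", "Rash"),
  SOC = c("Nervous system disorders", "Gastrointestinal disorders", "Skin disorders")
)

# Raw terms to code; "Vertigo" is intentionally absent from the lookup
raw_terms <- tibble(Term = c("Nausea", "Rash", "Vertigo"))

# Attach the SOC; unmatched terms get NA instead of being dropped
coded <- raw_terms %>%
  left_join(soc_lookup, by = "Term")

# List terms that still need a mapping
unmapped <- coded %>%
  filter(is.na(SOC)) %>%
  pull(Term)

print(coded)
print(unmapped)
```

A join keeps the mapping in one table that can be reviewed or replaced, and the NA check surfaces uncoded terms rather than silently assigning them to a catch-all category.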

3.3 Raw Exposure

# Generate exposure records (weekly dosing)
raw_ex <- raw_dm %>%
  filter(!is.na(FirstDoseDate)) %>%
  select(Site, Subject, Arm, FirstDoseDate) %>%
  # Create 8 weeks of dosing per subject
  crossing(Week = 0:7) %>%
  mutate(
    DoseDate = FirstDoseDate + (Week * 7),
    Dose = case_when(
      Arm == "Placebo" ~ 0,
      Arm == "Active 10mg" ~ 10,
      Arm == "Active 20mg" ~ 20
    ),
    DoseUnit = "mg",
    Route = "ORAL"
  ) %>%
  # Some subjects discontinue early
  group_by(Subject) %>%
  filter(row_number() <= sample(4:8, 1)) %>%
  ungroup() %>%
  select(Site, Subject, DoseDate, Dose, DoseUnit, Route)

cat("Raw Exposure:", nrow(raw_ex), "records\n\n")
Raw Exposure: 147 records
head(raw_ex, 10)
# A tibble: 10 × 6
   Site  Subject DoseDate    Dose DoseUnit Route
   <chr> <chr>   <date>     <dbl> <chr>    <chr>
 1 101   003     2024-01-31     0 mg       ORAL 
 2 101   003     2024-02-07     0 mg       ORAL 
 3 101   003     2024-02-14     0 mg       ORAL 
 4 101   003     2024-02-21     0 mg       ORAL 
 5 101   003     2024-02-28     0 mg       ORAL 
 6 101   003     2024-03-06     0 mg       ORAL 
 7 101   003     2024-03-13     0 mg       ORAL 
 8 101   008     2024-01-27    10 mg       ORAL 
 9 101   008     2024-02-03    10 mg       ORAL 
10 101   008     2024-02-10    10 mg       ORAL 

4 Part 2: Create SDTM DM Domain

# SDTM DM: Demographics (one row per subject)
sdtm_dm <- raw_dm %>%
  mutate(
    # Standard variables - HARDCODE algorithm
    STUDYID = "CAPSTONE-01",
    DOMAIN = "DM",
    
    # Subject identifiers - ASSIGN algorithm
    USUBJID = paste(STUDYID, Site, Subject, sep = "-"),
    SUBJID = Subject,
    SITEID = Site,
    
    # Demographics - MAP to CT
    SEX = case_when(
      Sex == "Male" ~ "M",
      Sex == "Female" ~ "F"
    ),
    RACE = toupper(Race),  # Simplified: CDISC CT uses e.g. "BLACK OR AFRICAN AMERICAN"
    ETHNIC = case_when(
      Ethnicity == "Hispanic" ~ "HISPANIC OR LATINO",
      TRUE ~ "NOT HISPANIC OR LATINO"
    ),
    
    # Dates - DERIVE
    RFSTDTC = as.character(FirstDoseDate),
    RFENDTC = NA_character_,  # Would come from disposition
    BRTHDTC = BirthDate,
    
    # Age calculation
    AGE = as.integer(floor(interval(ymd(BirthDate), ymd(FirstDoseDate)) / years(1))),
    AGEU = "YEARS",
    
    # Treatment
    ARM = Arm,
    ACTARM = Arm,  # Assuming planned = actual
    ARMCD = case_when(
      Arm == "Placebo" ~ "PBO",
      Arm == "Active 10mg" ~ "ACT10",
      Arm == "Active 20mg" ~ "ACT20"
    ),
    ACTARMCD = ARMCD
  ) %>%
  select(
    STUDYID, DOMAIN, USUBJID, SUBJID, SITEID,
    RFSTDTC, RFENDTC, BRTHDTC,
    AGE, AGEU, SEX, RACE, ETHNIC,
    ARM, ARMCD, ACTARM, ACTARMCD
  )

cat("SDTM DM Domain\n")
SDTM DM Domain
cat("Subjects:", nrow(sdtm_dm), "\n")
Subjects: 25 
cat("Variables:", ncol(sdtm_dm), "\n\n")
Variables: 17 
head(sdtm_dm, 10)
# A tibble: 10 × 17
   STUDYID     DOMAIN USUBJID  SUBJID SITEID RFSTDTC RFENDTC BRTHDTC   AGE AGEU 
   <chr>       <chr>  <chr>    <chr>  <chr>  <chr>   <chr>   <chr>   <int> <chr>
 1 CAPSTONE-01 DM     CAPSTON… 001    103    2024-0… <NA>    1967-0…    56 YEARS
 2 CAPSTONE-01 DM     CAPSTON… 002    103    2024-0… <NA>    1973-1…    50 YEARS
 3 CAPSTONE-01 DM     CAPSTON… 003    101    2024-0… <NA>    1998-0…    25 YEARS
 4 CAPSTONE-01 DM     CAPSTON… 004    103    2024-0… <NA>    1967-0…    56 YEARS
 5 CAPSTONE-01 DM     CAPSTON… 005    102    2024-0… <NA>    1954-0…    69 YEARS
 6 CAPSTONE-01 DM     CAPSTON… 006    102    2024-0… <NA>    1995-0…    29 YEARS
 7 CAPSTONE-01 DM     CAPSTON… 007    102    2024-0… <NA>    1989-0…    34 YEARS
 8 CAPSTONE-01 DM     CAPSTON… 008    101    2024-0… <NA>    1989-0…    34 YEARS
 9 CAPSTONE-01 DM     CAPSTON… 009    102    2024-0… <NA>    1970-0…    53 YEARS
10 CAPSTONE-01 DM     CAPSTON… 010    102    2024-0… <NA>    1985-1…    38 YEARS
# ℹ 7 more variables: SEX <chr>, RACE <chr>, ETHNIC <chr>, ARM <chr>,
#   ARMCD <chr>, ACTARM <chr>, ACTARMCD <chr>
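The AGE derivation above counts completed years between BRTHDTC and RFSTDTC. The sketch below wraps the same lubridate expression in a helper (age_in_years is a name chosen here for illustration) and evaluates it around a birthday to show that the age only increases on the anniversary itself. The first call uses the birth date and first dose date of subject 003 from the raw data.

```r
library(lubridate)

# Hypothetical helper: completed years between two ISO 8601 dates
age_in_years <- function(birth_date, reference_date) {
  as.integer(floor(interval(ymd(birth_date), ymd(reference_date)) / years(1)))
}

age_subject_003 <- age_in_years("1998-08-18", "2024-01-31")  # birthday not yet reached in 2024
age_day_before  <- age_in_years("1998-08-18", "2024-08-17")  # one day before the 26th birthday
age_on_birthday <- age_in_years("1998-08-18", "2024-08-18")  # the 26th birthday

print(c(age_subject_003, age_day_before, age_on_birthday))
```

Dividing an interval by years(1) respects calendar anniversaries, which a plain "days / 365.25" calculation does not guarantee near a birthday.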

5 Part 3: Create SDTM AE Domain

# Helper: derive study day (from Day 4)
derive_study_day <- function(event_date, reference_date) {
  diff_days <- as.numeric(event_date - reference_date)
  ifelse(diff_days >= 0, diff_days + 1, diff_days)
}

# SDTM AE: Adverse Events
sdtm_ae <- raw_ae %>%
  mutate(
    STUDYID = "CAPSTONE-01",
    DOMAIN = "AE",
    USUBJID = paste(STUDYID, Site, Subject, sep = "-")
  ) %>%
  # Add reference date from DM
  left_join(
    sdtm_dm %>% select(USUBJID, RFSTDTC),
    by = "USUBJID"
  ) %>%
  # Derive AE variables
  mutate(
    AETERM = Term,
    AEDECOD = toupper(Term),
    AEBODSYS = toupper(case_when(
      grepl("Headache|Dizziness", Term) ~ "NERVOUS SYSTEM DISORDERS",
      grepl("Nausea|Diarrhea", Term) ~ "GASTROINTESTINAL DISORDERS",
      grepl("Fatigue", Term) ~ "GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS",
      grepl("Rash", Term) ~ "SKIN AND SUBCUTANEOUS TISSUE DISORDERS",
      grepl("Back", Term) ~ "MUSCULOSKELETAL AND CONNECTIVE TISSUE DISORDERS",
      grepl("Insomnia", Term) ~ "PSYCHIATRIC DISORDERS",
      grepl("Cough", Term) ~ "RESPIRATORY, THORACIC AND MEDIASTINAL DISORDERS",
      grepl("infection", Term, ignore.case = TRUE) ~ "INFECTIONS AND INFESTATIONS",
      TRUE ~ "OTHER"
    )),
    
    # Dates
    AESTDTC = StartDate,
    AEENDTC = EndDate,
    
    # Study days
    AESTDY = derive_study_day(ymd(StartDate), ymd(RFSTDTC)),
    AEENDY = ifelse(!is.na(EndDate), 
                    derive_study_day(ymd(EndDate), ymd(RFSTDTC)),
                    NA_integer_),
    
    # CT mapping
    AESEV = toupper(Severity),
    AESER = ifelse(Serious == "Yes", "Y", "N"),
    AEREL = toupper(Related),
    
    # Outcome
    AEOUT = case_when(
      is.na(EndDate) ~ "NOT RECOVERED/NOT RESOLVED",
      TRUE ~ "RECOVERED/RESOLVED"
    ),
    
    # Action taken
    AEACN = "DOSE NOT CHANGED"
  ) %>%
  # Add sequence
  arrange(USUBJID, AESTDTC) %>%
  group_by(USUBJID) %>%
  mutate(AESEQ = row_number()) %>%
  ungroup() %>%
  select(
    STUDYID, DOMAIN, USUBJID, AESEQ,
    AETERM, AEDECOD, AEBODSYS,
    AESTDTC, AEENDTC, AESTDY, AEENDY,
    AESEV, AESER, AEREL, AEOUT, AEACN
  )

cat("SDTM AE Domain\n")
SDTM AE Domain
cat("Records:", nrow(sdtm_ae), "\n")
Records: 60 
cat("Subjects:", n_distinct(sdtm_ae$USUBJID), "\n")
Subjects: 20 
cat("Variables:", ncol(sdtm_ae), "\n\n")
Variables: 16 
head(sdtm_ae, 10)
# A tibble: 10 × 16
   STUDYID   DOMAIN USUBJID AESEQ AETERM AEDECOD AEBODSYS AESTDTC AEENDTC AESTDY
   <chr>     <chr>  <chr>   <int> <chr>  <chr>   <chr>    <chr>   <chr>    <dbl>
 1 CAPSTONE… AE     CAPSTO…     1 Heada… HEADAC… NERVOUS… 2024-0… 2024-0…      9
 2 CAPSTONE… AE     CAPSTO…     2 Dizzi… DIZZIN… NERVOUS… 2024-0… 2024-0…     38
 3 CAPSTONE… AE     CAPSTO…     3 Back … BACK P… MUSCULO… 2024-0… <NA>        46
 4 CAPSTONE… AE     CAPSTO…     1 Cough  COUGH   RESPIRA… 2024-0… 2024-0…     12
 5 CAPSTONE… AE     CAPSTO…     2 Dizzi… DIZZIN… NERVOUS… 2024-0… 2024-0…     17
 6 CAPSTONE… AE     CAPSTO…     3 Dizzi… DIZZIN… NERVOUS… 2024-0… 2024-0…     46
 7 CAPSTONE… AE     CAPSTO…     1 Back … BACK P… MUSCULO… 2024-0… 2024-0…     23
 8 CAPSTONE… AE     CAPSTO…     2 Nausea NAUSEA  GASTROI… 2024-0… 2024-0…     56
 9 CAPSTONE… AE     CAPSTO…     3 Upper… UPPER … INFECTI… 2024-0… 2024-0…     61
10 CAPSTONE… AE     CAPSTO…     1 Dizzi… DIZZIN… NERVOUS… 2024-0… <NA>        41
# ℹ 6 more variables: AEENDY <dbl>, AESEV <chr>, AESER <chr>, AEREL <chr>,
#   AEOUT <chr>, AEACN <chr>
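The study day helper used above has no Day 0: the reference date is Day 1 and the day before it is Day -1. The sketch below repeats the helper so it runs on its own and applies it to dates around a hypothetical reference date, including a missing date.

```r
# Same helper as in the AE code above, repeated so this block is self-contained
derive_study_day <- function(event_date, reference_date) {
  diff_days <- as.numeric(event_date - reference_date)
  ifelse(diff_days >= 0, diff_days + 1, diff_days)
}

ref <- as.Date("2024-01-31")  # hypothetical RFSTDTC
events <- as.Date(c("2024-01-29", "2024-01-30", "2024-01-31", "2024-02-01", NA))

study_days <- derive_study_day(events, ref)
print(study_days)
# -2 -1  1  2 NA
```

A missing event date propagates to a missing study day, so ongoing adverse events (no end date) get a missing AEENDY without extra handling.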

6 Part 4: Create SDTM EX Domain

# SDTM EX: Exposure
sdtm_ex <- raw_ex %>%
  mutate(
    STUDYID = "CAPSTONE-01",
    DOMAIN = "EX",
    USUBJID = paste(STUDYID, Site, Subject, sep = "-"),
    
    # Treatment
    EXTRT = case_when(
      Dose == 0 ~ "PLACEBO",
      Dose == 10 ~ "STUDY DRUG 10 MG",
      Dose == 20 ~ "STUDY DRUG 20 MG"
    ),
    
    # Dose info
    EXDOSE = Dose,
    EXDOSU = "mg",
    EXDOSFRM = "TABLET",
    EXROUTE = "ORAL",
    
    # Dates (single-day dosing)
    EXSTDTC = as.character(DoseDate),
    EXENDTC = as.character(DoseDate)
  ) %>%
  # Add reference date from DM
  left_join(
    sdtm_dm %>% select(USUBJID, RFSTDTC),
    by = "USUBJID"
  ) %>%
  mutate(
    EXSTDY = derive_study_day(ymd(DoseDate), ymd(RFSTDTC)),
    EXENDY = EXSTDY
  ) %>%
  # Add sequence
  arrange(USUBJID, EXSTDTC) %>%
  group_by(USUBJID) %>%
  mutate(EXSEQ = row_number()) %>%
  ungroup() %>%
  select(
    STUDYID, DOMAIN, USUBJID, EXSEQ,
    EXTRT, EXDOSE, EXDOSU, EXDOSFRM, EXROUTE,
    EXSTDTC, EXENDTC, EXSTDY, EXENDY
  )

cat("SDTM EX Domain\n")
SDTM EX Domain
cat("Records:", nrow(sdtm_ex), "\n")
Records: 147 
cat("Subjects:", n_distinct(sdtm_ex$USUBJID), "\n")
Subjects: 25 
cat("Variables:", ncol(sdtm_ex), "\n\n")
Variables: 13 
head(sdtm_ex, 10)
# A tibble: 10 × 13
   STUDYID     DOMAIN USUBJID EXSEQ EXTRT EXDOSE EXDOSU EXDOSFRM EXROUTE EXSTDTC
   <chr>       <chr>  <chr>   <int> <chr>  <dbl> <chr>  <chr>    <chr>   <chr>  
 1 CAPSTONE-01 EX     CAPSTO…     1 PLAC…      0 mg     TABLET   ORAL    2024-0…
 2 CAPSTONE-01 EX     CAPSTO…     2 PLAC…      0 mg     TABLET   ORAL    2024-0…
 3 CAPSTONE-01 EX     CAPSTO…     3 PLAC…      0 mg     TABLET   ORAL    2024-0…
 4 CAPSTONE-01 EX     CAPSTO…     4 PLAC…      0 mg     TABLET   ORAL    2024-0…
 5 CAPSTONE-01 EX     CAPSTO…     5 PLAC…      0 mg     TABLET   ORAL    2024-0…
 6 CAPSTONE-01 EX     CAPSTO…     6 PLAC…      0 mg     TABLET   ORAL    2024-0…
 7 CAPSTONE-01 EX     CAPSTO…     7 PLAC…      0 mg     TABLET   ORAL    2024-0…
 8 CAPSTONE-01 EX     CAPSTO…     1 STUD…     10 mg     TABLET   ORAL    2024-0…
 9 CAPSTONE-01 EX     CAPSTO…     2 STUD…     10 mg     TABLET   ORAL    2024-0…
10 CAPSTONE-01 EX     CAPSTO…     3 STUD…     10 mg     TABLET   ORAL    2024-0…
# ℹ 3 more variables: EXENDTC <chr>, EXSTDY <dbl>, EXENDY <dbl>
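The sequence variable (EXSEQ here, AESEQ in Part 3) restarts at 1 for each subject, so USUBJID plus the sequence number should identify every record. The sketch below repeats the arrange/group_by/row_number pattern on a small unsorted toy table (ex_toy, hypothetical values) and then confirms that the key pair is unique.

```r
library(dplyr)

# Toy exposure records in arbitrary order (hypothetical values)
ex_toy <- tibble(
  USUBJID = c("S-002", "S-001", "S-001", "S-002", "S-001"),
  EXSTDTC = c("2024-02-08", "2024-01-15", "2024-01-08", "2024-02-01", "2024-01-22")
)

# Sort within subject by start date, then number the rows per subject
ex_seq <- ex_toy %>%
  arrange(USUBJID, EXSTDTC) %>%
  group_by(USUBJID) %>%
  mutate(EXSEQ = row_number()) %>%
  ungroup()

# USUBJID + EXSEQ should be unique across the whole dataset
keys_unique <- !any(duplicated(ex_seq[c("USUBJID", "EXSEQ")]))

print(ex_seq)
```

Sorting the character dates works here because complete ISO 8601 dates sort chronologically as text; partial dates would need explicit handling.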

7 Part 5: Validation Checks

Before exporting, run basic consistency checks on the datasets: subject counts per domain, cross-domain subject membership, and variable types.

# Check 1: Subject counts consistency
cat("=== Subject Count Validation ===\n")
=== Subject Count Validation ===
cat("DM subjects:", nrow(sdtm_dm), "\n")
DM subjects: 25 
cat("EX subjects:", n_distinct(sdtm_ex$USUBJID), "\n")
EX subjects: 25 
cat("AE subjects:", n_distinct(sdtm_ae$USUBJID), "\n\n")
AE subjects: 20 
# Check 2: All EX subjects are in DM
ex_not_in_dm <- sdtm_ex %>%
  anti_join(sdtm_dm, by = "USUBJID") %>%
  nrow()
cat("EX subjects missing from DM:", ex_not_in_dm, "\n")
EX subjects missing from DM: 0 
# Check 3: All AE subjects are in DM
ae_not_in_dm <- sdtm_ae %>%
  anti_join(sdtm_dm, by = "USUBJID") %>%
  nrow()
cat("AE subjects missing from DM:", ae_not_in_dm, "\n\n")
AE subjects missing from DM: 0 
# Check 4: Variable types
cat("=== Variable Type Check ===\n")
=== Variable Type Check ===
cat("DM - AGE is numeric:", is.numeric(sdtm_dm$AGE), "\n")
DM - AGE is numeric: TRUE 
cat("AE - AESEQ is integer:", is.integer(sdtm_ae$AESEQ), "\n")
AE - AESEQ is integer: TRUE 
cat("EX - EXDOSE is numeric:", is.numeric(sdtm_ex$EXDOSE), "\n")
EX - EXDOSE is numeric: TRUE 
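The checks above cover subject membership and types. A few record-level checks can be added in the same spirit. The sketch below is one possible shape, not a substitute for a dedicated validation tool: check_domain is a hypothetical helper that counts duplicate USUBJID/sequence keys, blank subject identifiers, and records whose end date precedes the start date. The toy data deliberately contains one duplicate key and one reversed date pair.

```r
# Hypothetical helper: returns a named vector of finding counts (0 = clean)
check_domain <- function(df, seq_var, start_var, end_var) {
  starts <- as.Date(df[[start_var]])
  ends <- as.Date(df[[end_var]])
  c(
    duplicate_keys = sum(duplicated(df[c("USUBJID", seq_var)])),
    missing_usubjid = sum(is.na(df$USUBJID) | df$USUBJID == ""),
    end_before_start = sum(!is.na(starts) & !is.na(ends) & ends < starts)
  )
}

# Toy AE data with two planted problems (hypothetical values)
ae_toy <- data.frame(
  USUBJID = c("S-001", "S-001", "S-002"),
  AESEQ = c(1L, 1L, 1L),                                   # duplicate key for S-001
  AESTDTC = c("2024-02-01", "2024-02-10", "2024-03-05"),
  AEENDTC = c("2024-02-03", "2024-02-08", NA)              # second record ends before it starts
)

findings <- check_domain(ae_toy, "AESEQ", "AESTDTC", "AEENDTC")
print(findings)
```

Records with a missing end date are skipped by the date comparison, so ongoing events are not flagged.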

8 Part 6: Export with xportr

# Create output directory in project (relative path)
output_dir <- "output"
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# Export DM
xportr_write(sdtm_dm, file.path(output_dir, "dm.xpt"))
cat("✓ Exported: output/dm.xpt\n")
✓ Exported: output/dm.xpt
# Export AE
xportr_write(sdtm_ae, file.path(output_dir, "ae.xpt"))
cat("✓ Exported: output/ae.xpt\n")
✓ Exported: output/ae.xpt
# Export EX
xportr_write(sdtm_ex, file.path(output_dir, "ex.xpt"))
cat("✓ Exported: output/ex.xpt\n")
✓ Exported: output/ex.xpt
# Verify files exist
cat("\n=== Export Verification ===\n")

=== Export Verification ===
cat("dm.xpt:", file.size(file.path(output_dir, "dm.xpt")), "bytes\n")
dm.xpt: 6480 bytes
cat("ae.xpt:", file.size(file.path(output_dir, "ae.xpt")), "bytes\n")
ae.xpt: 17920 bytes
cat("ex.xpt:", file.size(file.path(output_dir, "ex.xpt")), "bytes\n")
ex.xpt: 19040 bytes
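The exports above write the data as-is, without variable labels. A minimal sketch of attaching labels before writing is shown below. It assumes that xportr_label() accepts a variable-level metadata data frame as its second argument and that the column names dataset, variable, and label match xportr's default option settings; check both against the installed xportr version. The spec (dm_spec) and the two-row dataset (dm_demo) are hypothetical, and the file is written to a temporary directory.

```r
library(xportr)

# Toy DM extract (values taken from two subjects shown earlier)
dm_demo <- data.frame(
  USUBJID = c("CAPSTONE-01-101-003", "CAPSTONE-01-103-001"),
  AGE = c(25L, 56L),
  SEX = c("M", "F"),
  stringsAsFactors = FALSE
)

# Hypothetical variable-level spec; column names assumed to be xportr defaults
dm_spec <- data.frame(
  dataset = "DM",
  variable = c("USUBJID", "AGE", "SEX"),
  label = c("Unique Subject Identifier", "Age", "Sex"),
  stringsAsFactors = FALSE
)

# Attach labels as column attributes, then write the transport file
dm_labelled <- xportr_label(dm_demo, dm_spec, domain = "DM")
out_path <- file.path(tempdir(), "dm.xpt")
xportr_write(dm_labelled, out_path)

print(attr(dm_labelled$AGE, "label"))
```

xportr also provides companion functions such as xportr_type() and xportr_length() that read from the same kind of spec; they are left out here to keep the sketch short.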

9 Summary Statistics

# Treatment arm distribution
cat("=== Treatment Distribution (DM) ===\n")
=== Treatment Distribution (DM) ===
sdtm_dm %>%
  count(ARM, ARMCD) %>%
  print()
# A tibble: 3 × 3
  ARM         ARMCD     n
  <chr>       <chr> <int>
1 Active 10mg ACT10    10
2 Active 20mg ACT20    10
3 Placebo     PBO       5
cat("\n=== AE Severity Distribution ===\n")

=== AE Severity Distribution ===
sdtm_ae %>%
  count(AESEV) %>%
  print()
# A tibble: 3 × 2
  AESEV        n
  <chr>    <int>
1 MILD        40
2 MODERATE    14
3 SEVERE       6
cat("\n=== Exposure Summary ===\n")

=== Exposure Summary ===
sdtm_ex %>%
  group_by(EXTRT) %>%
  summarise(
    N_Subjects = n_distinct(USUBJID),
    N_Doses = n(),
    Total_Dose = sum(EXDOSE)
  ) %>%
  print()
# A tibble: 3 × 4
  EXTRT            N_Subjects N_Doses Total_Dose
  <chr>                 <int>   <int>      <dbl>
1 PLACEBO                   5      28          0
2 STUDY DRUG 10 MG         10      62        620
3 STUDY DRUG 20 MG         10      57       1140

10 🎉 Congratulations! Week 1 Complete!

You have successfully completed the Week 1 Capstone and demonstrated:

Skill                    Application
Data Generation          Created 25 subjects with realistic demographics
SDTM Structure           Built DM (1 row/subject), AE & EX (multiple rows)
Controlled Terminology   Mapped Sex, Ethnicity, Severity, Seriousness to CT; assigned SOCs to AE terms
Date Derivations         Calculated study days using the “No Day 0” rule
Sequence Numbers         Added AESEQ, EXSEQ per subject
Validation               Cross-domain consistency checks
Export                   Wrote DM, AE, and EX to .xpt files with xportr_write()

11 Key Takeaways from Week 1

  1. CDISC Structure: SDTM organizes clinical data into standardized domains
  2. Controlled Terminology: Consistency across studies is mandatory
  3. Derivations: Study days, durations, and flags follow specific rules
  4. Traceability: Every derived variable should trace back to source
  5. Validation: Check data before export

12 What’s Next?

Week 2: Production SDTM & Validation

  • Complex domains: LB (Labs), VS (Vital Signs), DS (Disposition)
  • Industry-grade validation with sdtmchecks
  • Metadata management with metacore
  • Specification-driven workflows

 

30 Days of Pharmaverse  ·  Disclaimer  ·  Indraneel Chakraborty  ·  © 2026