Day 1: Environment Setup & First SDTM Code

SDTM Programming Walkthrough

1 Learning Objectives

By the end of Day 1, you will be able to:

Set up your R environment with essential pharmaverse packages
Load and explore SDTM datasets from pharmaversesdtm
Understand the CDISC data flow: Raw → SDTM → ADaM → TLFs
Use dplyr to explore relationships between SDTM domains
Perform basic data manipulation operations on clinical data

2 Introduction to CDISC and the Pharmaverse

2.1 What is CDISC?

CDISC (Clinical Data Interchange Standards Consortium) is a global organization that creates standards for how clinical trial data should be collected, organized, and shared. These standards are important because they:

Ensure Consistency: Data from different studies and companies can be compared easily because everyone uses the same structure and terms.
Enable Interoperability: Data can move smoothly between different software systems, making it easier for teams, sponsors, and regulators to work together.
Support Regulatory Submission: Regulatory agencies like the FDA (U.S.) and EMA (Europe) require data in CDISC formats for new drug applications. Using CDISC makes your data ready for these submissions.

2.1.1 Why does this matter?

Imagine you are reading a book, but every chapter is written in a different language or format. It would be very hard to understand the story! CDISC makes sure all the “chapters” (datasets) in a clinical trial are written in the same “language” so everyone can understand and use them.

2.2 The CDISC Data Flow

Understanding how data moves through a clinical trial is key for programmers. Here’s the typical flow:

Raw Data → SDTM → ADaM → TLFs

Raw Data: This is the data collected directly from sources like electronic data capture (EDC) systems, lab machines, or paper forms. It is often messy and not standardized.
SDTM (Study Data Tabulation Model): This is the first step of standardization. SDTM organizes the raw data into specific domains (tables) with defined variables and formats. Each domain represents a type of information, like demographics (DM), adverse events (AE), or lab results (LB).
ADaM (Analysis Data Model): These datasets are derived from SDTM and are designed to make statistical analysis easier. They include derived variables, flags, and are structured to answer specific analysis questions.
TLFs (Tables, Listings, and Figures): These are the final outputs-summaries, tables, and graphs that go into clinical study reports and regulatory submissions.

2.2.1 Example:

Suppose a subject’s blood pressure is measured during a study. The measurement is first recorded as raw data. It is then organized into the SDTM VS (Vital Signs) domain. Later, it might be used to calculate a derived variable (like average blood pressure) in ADaM, and finally, the results are summarized in a table for the study report.

Stage	Description	Example
Raw Data	Data collected from EDC systems, labs, etc.	CRF exports, lab files
SDTM	Study Data Tabulation Model - standardized clinical data	DM, AE, EX, LB domains
ADaM	Analysis Data Model - analysis-ready datasets	ADSL, ADAE, ADLB
TLFs	Tables, Listings, and Figures for submission	Demographics table, AE summary

Why does this flow matter?

Think of this pipeline like a factory assembly line. Raw data arrives in messy, inconsistent shapes - different sponsors, different EDC vendors, different column names. SDTM is the standardization step: every study’s data gets reshaped into the same structure so the FDA (and your future self) can always find the same variable in the same place. ADaM then derives analysis-ready variables from that standardized base, and TLFs are the final output - the actual tables and figures that go into your Clinical Study Report (CSR). Skipping or muddling any stage creates compliance and reproducibility risks.

2.3 What is the Pharmaverse?

The pharmaverse is a cross-industry collaboration of R packages designed for clinical reporting workflows. It provides open-source tools that map to each stage of the CDISC data flow:

SDTM creation: sdtm.oak, pharmaversesdtm (example data)
ADaM creation: admiral (the flagship package)
TLG creation: rtables, tern, tfrmt
Validation & Export: xportr, metacore, sdtmchecks

What is pharmaversesdtm?

pharmaversesdtm ships pre-built, realistic SDTM datasets based on a fictional Phase III clinical trial. These datasets follow the official CDISC SDTM Implementation Guide (SDTMIG) and are the same ones used internally by the admiral development team for testing. You can browse them at the pharmaversesdtm GitHub repository.

3 Package Installation & Loading

The following packages are essential for today’s exercises. Each serves a specific purpose in the clinical programming workflow.

3.1 Required Packages

Package	Purpose
`admiral`	ADaM dataset creation - the core pharmaverse package
`pharmaversesdtm`	Example SDTM datasets for learning and testing
`dplyr`	Data manipulation (filter, select, mutate, join, etc.)
`haven`	Read/write SAS transport files (.xpt)
`xportr`	Apply metadata and export to .xpt for FDA submission

3.2 Step 1: Install Packages (if needed)

The code below checks if each package is installed and installs it if missing. This is a safe pattern that won’t reinstall packages you already have.

How requireNamespace() works

requireNamespace("pkg", quietly = TRUE) tries to locate the package without actually loading it into your session. It returns TRUE if the package is found, FALSE if not. Wrapping it in ! (logical NOT) flips the result - so the install.packages() call inside the if block only fires when the package is not installed. The quietly = TRUE argument suppresses any console messages during the check. This is the idiomatic way to write conditional installs in R scripts. (Reference: R documentation for requireNamespace)

# Install packages if not already installed
if (!requireNamespace("admiral", quietly = TRUE)) install.packages("admiral")
if (!requireNamespace("pharmaversesdtm", quietly = TRUE)) install.packages("pharmaversesdtm")
if (!requireNamespace("xportr", quietly = TRUE)) install.packages("xportr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("haven", quietly = TRUE)) install.packages("haven")

What just happened?

R checked your local package library for each package. If a package was missing, it downloaded and installed it from CRAN. If it was already installed, nothing happened - no unnecessary downloads, no version conflicts. You can verify what’s installed by running installed.packages()[, c("Package", "Version")] in your console.

3.3 Step 2: Load Packages

Now we load the packages into our R session. The library() function makes all functions from a package available for use.

library() vs require() vs requireNamespace()

library(pkg) - loads the package and errors if it’s not found. Use this at the top of scripts/Quarto files.
require(pkg) - loads the package and warns (but doesn’t error) if not found. Commonly used inside functions.
requireNamespace(pkg) - checks existence without attaching the package to your session. Used for conditional installs (as above).

For day-to-day scripts, library() is the right choice because a hard error is better than silently continuing without a needed package. (Reference: Hadley Wickham’s R Packages - Namespaces)

# Load packages into the session
library(admiral)
library(pharmaversesdtm)
library(dplyr)
library(haven)
library(xportr)

What do these messages mean?

When you load packages, you may see messages about package versions and any function name conflicts. This is normal and helps you understand which package’s function will be used when names overlap.

What just happened?

After running the library() calls, all exported functions from these five packages are now available in your global environment. For example, you can now type admiral::derive_vars_dt() or just derive_vars_dt() directly. If you see a warning like The following object is masked from 'package:X': filter, it means two packages export a function with the same name - R will use the most recently loaded one. You can always be explicit, e.g., dplyr::filter(), to avoid ambiguity. Run search() in your console to see the full list of attached packages in the order R searches them.

4 Loading SDTM Datasets

The pharmaversesdtm package contains example SDTM datasets that mirror real clinical trial data. These are the same datasets used by production teams when developing and testing admiral.

4.1 What are SDTM Domains?

SDTM organizes clinical data into domains, each representing a specific type of information. The full SDTM standard defines 30+ domains across several categories (General Observation Classes, Special Purpose, Trial Design, and Relationship datasets). The exact count grows with each new version of the SDTMIG as CDISC adds domains for new therapeutic areas and data types.

The table below shows the most commonly used (“core”) domains - the ones you will encounter in nearly every clinical trial:

Domain Class	Domain	Description
Special Purpose	DM	Demographics - one row per subject
Events	AE	Adverse Events - multiple rows per subject
Events	DS	Disposition - study milestones
Interventions	EX	Exposure - drug dosing information
Interventions	CM	Concomitant Medications
Findings	LB	Laboratory Results
Findings	VS	Vital Signs

“Core” domains vs. the full SDTM standard

We use the term “core” here to mean the domains you’ll work with most frequently - DM, AE, EX, LB, VS, DS, CM. These appear in virtually every clinical study. However, the full SDTM Implementation Guide (SDTMIG), available from CDISC, defines many more domains, including:

Additional General Observation domains: MH (Medical History), PE (Physical Examination), EG (ECG), QS (Questionnaires), FA (Findings About), and others
Trial Design domains: TA (Trial Arms), TE (Trial Elements), TV (Trial Visits), TI (Trial Inclusion/Exclusion Criteria) - these describe the study design rather than individual subject data
Relationship datasets: RELREC (Related Records), SUPP– (Supplemental Qualifiers) - these capture relationships and non-standard variables

The SDTMIG also allows sponsors to create custom domains when no standard domain fits the data, as long as they follow the general observation class structure.

What is the SDTMIG?

The SDTM Implementation Guide (SDTMIG) is the companion document to the SDTM standard. While the SDTM model defines the conceptual framework (observation classes, general variable roles), the SDTMIG provides the practical implementation details:

Domain specifications: Which variables belong in each domain, their data types, controlled terminology, and whether they are Required, Expected, or Permissible
Assumptions and business rules: How to handle specific data scenarios (e.g., partial dates, multiple occurrences, supplemental qualifiers)
Examples: Sample data showing how real clinical data maps into SDTM structure

The current version is SDTMIG v3.4. You can obtain it from CDISC (a free CDISC account is required to access the full documentation).

Domain classes and their structure rules

The SDTMIG defines three main observation classes with strict structural rules:

Events (AE, DS, CE…): Things that happened to the subject. Each row is a distinct occurrence. Key variable: --TERM (the verbatim text) and --DECOD (the standardized dictionary term).
Interventions (EX, CM, PR…): Things that were given to or done for the subject. Each row is a treatment record. Key variable: --TRT (treatment name).
Findings (LB, VS, EG…): Measurements or observations collected from the subject. Each row is a single result. Key variables: --TESTCD (short test code) and --ORRES (original result).

The Special Purpose domains (DM, CO, SE, SV) don’t follow the three-class model; they have their own SDTMIG-specified structures.

4.2 Step 3: Load the Core SDTM Domains

Out of the 30+ domains defined in the SDTMIG, we’ll load four core domains today - the ones that appear in virtually every clinical trial and form the foundation for most safety and efficacy analyses:

How data() loads package datasets

data("dm", package = "pharmaversesdtm") reaches into the pharmaversesdtm package’s data/ folder and loads the dm object into your global environment. Without the package = argument, R searches all attached packages. After this call, dm is a plain R data.frame (or tibble) sitting in your workspace - just like one you’d import from a CSV. (Reference: R documentation - data())

# Load SDTM domains from pharmaversesdtm
# DM = Demographics (one row per subject)
data("dm", package = "pharmaversesdtm")

# AE = Adverse Events (multiple rows per subject)
data("ae", package = "pharmaversesdtm")

# EX = Exposure/Dosing (multiple rows per subject)
data("ex", package = "pharmaversesdtm")

# LB = Laboratory Results (many rows per subject)
data("lb", package = "pharmaversesdtm")

What just happened?

Four new objects - dm, ae, ex, and lb - have been created in your R global environment. You can confirm this by running ls() in the console, which lists all objects currently in memory. Each is a data.frame (with haven-style labelled attributes, since SDTM datasets mimic SAS transport format). You can inspect labels with attr(dm$ARM, "label").

5 Exploring the Data

Before working with any dataset, it’s essential to understand its structure. Let’s explore each domain using different R functions.

A note on terminology: “column names” vs. “variables”

In CDISC and pharma programming, the terms “column name” and “variable” refer to the same thing - a single field in a dataset. For example, USUBJID, AGE, AEDECOD are all variables (in CDISC terminology) that appear as columns in the data frame. If you come from a database background, you might also know them as “fields” or “attributes”. Throughout this course, we use these terms interchangeably.

Key variables: uniquely identifying each row

If you come from a database background, you know how important key fields (primary keys) are - the set of columns that uniquely identify a single row in a table. SDTM domains follow the same concept:

Domain	Key Variables (Natural Key)	What each row represents
DM	`USUBJID`	One row per subject
AE	`USUBJID`, `AESEQ`	One row per adverse event per subject
EX	`USUBJID`, `EXSEQ`	One row per dosing record per subject
LB	`USUBJID`, `LBSEQ`	One row per lab result per subject
VS	`USUBJID`, `VSSEQ`	One row per vital sign measurement per subject

The --SEQ variable (e.g., AESEQ, EXSEQ) is a sequence number assigned within each subject - it acts as a record-level identifier that, combined with USUBJID, forms the natural key for the domain. This is conceptually identical to a composite primary key in a relational database.

In addition, STUDYID + USUBJID together form a globally unique subject identifier across studies. Always verify that your key variables truly produce unique rows - duplicate keys indicate data quality issues.

5.1 Method 1: `glimpse()` - Quick Structure Overview

The glimpse() function from dplyr provides a compact view of the data structure, showing column names, types, and a preview of values.

What does glimpse() show you?

glimpse() is essentially a transposed version of print(). For each column it shows:

The column name (e.g., USUBJID)
The data type in abbreviated form: <chr> = character, <dbl> = double/numeric, <lgl> = logical, <date> = Date, <dttm> = POSIXct datetime
A sample of values from the first few rows

This is especially useful for wide datasets (many columns) where a regular print() would wrap awkwardly. (Reference: dplyr glimpse() documentation)

# Explore the Demographics (DM) domain
# This shows us all variables and their types
dplyr::glimpse(dm)

Rows: 306
Columns: 26
$ STUDYID  <chr> "CDISCPILOT01", "CDISCPILOT01", "CDISCPILOT01", "CDISCPILOT01…
$ DOMAIN   <chr> "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "…
$ USUBJID  <chr> "01-701-1015", "01-701-1023", "01-701-1028", "01-701-1033", "…
$ SUBJID   <chr> "1015", "1023", "1028", "1033", "1034", "1047", "1057", "1097…
$ RFSTDTC  <chr> "2014-01-02", "2012-08-05", "2013-07-19", "2014-03-18", "2014…
$ RFENDTC  <chr> "2014-07-02", "2012-09-02", "2014-01-14", "2014-04-14", "2014…
$ RFXSTDTC <chr> "2014-01-02", "2012-08-05", "2013-07-19", "2014-03-18", "2014…
$ RFXENDTC <chr> "2014-07-02", "2012-09-01", "2014-01-14", "2014-03-31", "2014…
$ RFICDTC  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ RFPENDTC <chr> "2014-07-02T11:45", "2013-02-18", "2014-01-14T11:10", "2014-0…
$ DTHDTC   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ DTHFL    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SITEID   <chr> "701", "701", "701", "701", "701", "701", "701", "701", "701"…
$ BRTHDTC  <chr> "1950-12-26", "1948-07-22", "1942-07-11", "1940-03-10", "1937…
$ AGE      <dbl> 63, 64, 71, 74, 77, 85, 59, 68, 81, 84, 52, 84, 81, 57, 75, 5…
$ AGEU     <chr> "YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS"…
$ SEX      <chr> "F", "M", "M", "M", "F", "F", "F", "M", "F", "M", "M", "M", "…
$ RACE     <chr> "WHITE", "WHITE", "WHITE", "WHITE", "WHITE", "WHITE", "WHITE"…
$ ETHNIC   <chr> "HISPANIC OR LATINO", "HISPANIC OR LATINO", "NOT HISPANIC OR …
$ ARMCD    <chr> "Pbo", "Pbo", "Xan_Hi", "Xan_Lo", "Xan_Hi", "Pbo", "Scrnfail"…
$ ARM      <chr> "Placebo", "Placebo", "Xanomeline High Dose", "Xanomeline Low…
$ ACTARMCD <chr> "Pbo", "Pbo", "Xan_Hi", "Xan_Lo", "Xan_Hi", "Pbo", "Scrnfail"…
$ ACTARM   <chr> "Placebo", "Placebo", "Xanomeline High Dose", "Xanomeline Low…
$ COUNTRY  <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
$ DMDTC    <chr> "2013-12-26", "2012-07-22", "2013-07-11", "2014-03-10", "2014…
$ DMDY     <dbl> -7, -14, -8, -8, -7, -21, NA, -9, -13, -7, -13, -6, -5, NA, -…

Reading the output

After running glimpse(dm), look for:

Rows: The total number of subjects. In SDTM DM, this should be one row per subject - a fundamental SDTMIG rule.
Columns: The number of variables in the domain.
Column types: Most SDTM character variables (like USUBJID, ARM, SEX) will show <chr>. Numeric variables like AGE will show <dbl>. Date/time character variables (ISO 8601 format, e.g., RFSTDTC) remain <chr> in SDTM - they get converted to proper Date objects only in ADaM.
Values: Scan the ARM column - you’ll see the treatment arm labels for this trial. Scan ACTARM to see if any subjects received something different from their planned arm.

Key DM Variables

USUBJID: Unique Subject Identifier - the key that links all domains together
RFSTDTC: Reference Start Date (usually first dose date)
RFENDTC: Reference End Date
ARM: Treatment Arm (e.g., “Placebo”, “Drug 80mg”)
ACTARM: Actual Treatment Arm

5.2 Method 2: `head()` - View First Few Rows

The head() function shows the first n rows of data (default is 6).

head() vs tail() vs slice()

head(df, n) - first n rows (default 6). Useful for a quick sanity check.
tail(df, n) - last n rows. Useful for checking if sorting worked correctly or if data was appended properly.
dplyr::slice(df, rows) - select any arbitrary rows by position. More flexible for exploring middle sections of large datasets.

For very wide datasets, even head() output can be hard to read in the console. In those cases, View(head(df)) opens the RStudio data viewer for a spreadsheet-style look. (Reference: R documentation - head())

# View the first 6 rows of Adverse Events
utils::head(ae)

# A tibble: 6 × 35
  STUDYID  DOMAIN USUBJID AESEQ AESPID AETERM AELLT AELLTCD AEDECOD AEPTCD AEHLT
  <chr>    <chr>  <chr>   <dbl> <chr>  <chr>  <chr>   <dbl> <chr>    <dbl> <chr>
1 CDISCPI… AE     01-701…     1 E07    APPLI… APPL…      NA APPLIC…     NA HLT_…
2 CDISCPI… AE     01-701…     2 E08    APPLI… APPL…      NA APPLIC…     NA HLT_…
3 CDISCPI… AE     01-701…     3 E06    DIARR… DIAR…      NA DIARRH…     NA HLT_…
4 CDISCPI… AE     01-701…     3 E10    ATRIO… AV B…      NA ATRIOV…     NA HLT_…
5 CDISCPI… AE     01-701…     1 E08    ERYTH… ERYT…      NA ERYTHE…     NA HLT_…
6 CDISCPI… AE     01-701…     2 E09    ERYTH… LOCA…      NA ERYTHE…     NA HLT_…
# ℹ 24 more variables: AEHLTCD <dbl>, AEHLGT <chr>, AEHLGTCD <dbl>,
#   AEBODSYS <chr>, AEBDSYCD <dbl>, AESOC <chr>, AESOCCD <dbl>, AESEV <chr>,
#   AESER <chr>, AEACN <chr>, AEREL <chr>, AEOUT <chr>, AESCAN <chr>,
#   AESCONG <chr>, AESDISAB <chr>, AESDTH <chr>, AESHOSP <chr>, AESLIFE <chr>,
#   AESOD <chr>, AEDTC <chr>, AESTDTC <chr>, AEENDTC <chr>, AESTDY <dbl>,
#   AEENDY <dbl>

What to look for in the AE output

Multiple rows per subject: Unlike DM, you’ll see the same USUBJID appearing more than once - each row is a different adverse event for that subject.
AEDECOD: The MedDRA-standardized Preferred Term for the adverse event. This is what gets used in safety tables.
AEBODSYS: The MedDRA System Organ Class (SOC). Used for grouping in safety summaries.
AESEV: Severity - typically MILD, MODERATE, or SEVERE.
AESER: Serious flag - Y = serious, N = non-serious. This drives regulatory safety reporting obligations.
Date columns (AESTDTC, AEENDTC): Stored as ISO 8601 character strings (e.g., "2014-01-02"). These will be converted to numeric study-day variables (AESTDY) in ADaM.

5.3 Method 3: `dim()` and `names()` - Dimensions and Column Names

Understanding dim(), nrow(), and ncol()

dim(df) returns a numeric vector of length 2: c(nrow, ncol). So dim(dm)[1] gives you the row count and dim(dm)[2] gives the column count. You can also use the dedicated nrow(df) and ncol(df) functions for readability. The cat() function here concatenates and prints the text + number together in a readable line. "\n" inserts a newline character so each result prints on its own line. (Reference: R documentation - dim())

# How many rows and columns in each domain?
cat("DM dimensions:", dim(dm), "\n")

DM dimensions: 306 26

cat("AE dimensions:", dim(ae), "\n")

AE dimensions: 1191 35

cat("EX dimensions:", dim(ex), "\n")

EX dimensions: 591 17

cat("LB dimensions:", dim(lb), "\n")

LB dimensions: 59580 23

Interpreting the dimensions

The two numbers printed for each domain are [rows] [columns]. You should notice:

DM has the fewest rows - one per subject. This is the “spine” of your trial.
AE has more rows than DM - multiple adverse events can occur per subject.
LB typically has the most rows of any domain - lab panels are collected repeatedly over many visits, creating a large number of records per subject.
The column count reflects how many SDTM variables that domain collects. LB has many columns because it captures rich measurement metadata (units, reference ranges, methods, etc.).

Compare nrow(ae) to nrow(dm) - the ratio gives you an intuitive feel for how “event-heavy” this trial is.

# What variables are in the AE domain?
names(ae)

 [1] "STUDYID"  "DOMAIN"   "USUBJID"  "AESEQ"    "AESPID"   "AETERM"  
 [7] "AELLT"    "AELLTCD"  "AEDECOD"  "AEPTCD"   "AEHLT"    "AEHLTCD" 
[13] "AEHLGT"   "AEHLGTCD" "AEBODSYS" "AEBDSYCD" "AESOC"    "AESOCCD" 
[19] "AESEV"    "AESER"    "AEACN"    "AEREL"    "AEOUT"    "AESCAN"  
[25] "AESCONG"  "AESDISAB" "AESDTH"   "AESHOSP"  "AESLIFE"  "AESOD"   
[31] "AEDTC"    "AESTDTC"  "AEENDTC"  "AESTDY"   "AEENDY"

Reading names() output

names() returns a character vector of all column names. In SDTM, variable names follow a strict two-part convention: a domain prefix (usually 2 characters, e.g., AE) followed by a core variable name suffix (e.g., TERM, DECOD, SEV). So AEDECOD = AE domain + DECOD suffix. This naming convention is defined in the SDTMIG and makes variables instantly recognizable across domains - AESTDTC, LBSTDTC, and EXSTDTC all mean “start date/time” for their respective domains.

Variables without a domain prefix (like USUBJID, STUDYID, DOMAIN) are identifier variables shared across all domains.

6 Understanding Domain Relationships

One of the most important concepts in clinical programming is understanding how domains relate to each other. USUBJID is the universal key that connects all domains - much like a foreign key in a relational database.

6.1 The USUBJID Connection

        ┌─────────────┐
        │     DM      │
        │ (1 row per  │
        │  subject)   │
        └──────┬──────┘
               │ USUBJID
    ┌──────────┼──────────┬──────────┐
    │ 1:many   │ 1:many   │ 1:many   │ 1:many
    ▼          ▼          ▼          ▼
┌────────┐  ┌───────┐  ┌───────┐  ┌────────┐
│  AE    │  │  EX   │  │  LB   │  │  VS    │
│(events)│  │(doses)│  │(labs) │  │(vitals)│
└────────┘  └───────┘  └───────┘  └────────┘

Domain relationships: one-to-one vs. one-to-many

If you come from a database background, you’ll recognize the relational patterns between SDTM domains:

Relationship	From → To	Explanation
One-to-one	DM → each subject	DM has exactly one row per `USUBJID`. It is the “master” table.
One-to-many	DM → AE	Each subject (one DM row) can have zero, one, or many AE records.
One-to-many	DM → EX	Each subject can have multiple dosing records.
One-to-many	DM → LB	Each subject can have hundreds of lab results across visits.
One-to-many	DM → VS	Each subject can have many vital sign measurements.

When you inner_join(dm, ae, by = "USUBJID"), the single DM row for a subject gets replicated for each of that subject’s AE rows - this is the standard one-to-many join behavior. Understanding this is critical for avoiding unexpected row counts after joins.

In database terms, USUBJID in DM acts as the primary key, while USUBJID in AE, EX, LB, etc. acts as a foreign key referencing back to DM.

Why USUBJID is always a character variable

USUBJID is always stored as a character (<chr>) string, never a number, even though it often contains digits. This is because it is a concatenated identifier - typically structured as "STUDYID-SITEID-SUBJECTID" (e.g., "CDISCPILOT01-01-701"). Treating it as numeric would risk stripping leading zeros, causing mismatches when joining domains. Always use == for exact string matching, never numeric comparisons. (Reference: CDISC SDTMIG - Identifier Variables, available from CDISC)

6.2 Step 4: Join DM and AE (Demographics + Adverse Events)

When analyzing adverse events, we often need demographic information like treatment arm. Let’s join these domains.

What is a join and why do we need it?

In relational data, information is split across multiple tables to avoid duplication. DM stores demographic info (one row per subject) while AE stores adverse events (many rows per subject). To analyze AEs by treatment arm, we need to combine these - that’s what a join does. We match rows from two tables based on a shared key column (here: USUBJID). The dplyr::select() before the join ensures we only bring in the specific DM columns we need, keeping the result clean. (Reference: dplyr two-table verbs documentation)

# Join Demographics (DM) with Adverse Events (AE)
# We select key DM variables and join on USUBJID

dm_ae <- dm %>%
  # Select only the DM variables we need
  dplyr::select(USUBJID, RFSTDTC, RFENDTC, ARM, AGE, SEX, RACE) %>%
  # Inner join keeps only subjects that appear in both datasets
  dplyr::inner_join(ae, by = "USUBJID")

# How many records do we have now?
cat("Number of AE records with demographic info:", nrow(dm_ae), "\n")

Number of AE records with demographic info: 1191

# Preview the joined data
utils::head(dm_ae)

# A tibble: 6 × 41
  USUBJID    RFSTDTC RFENDTC ARM     AGE SEX   RACE  STUDYID DOMAIN AESEQ AESPID
  <chr>      <chr>   <chr>   <chr> <dbl> <chr> <chr> <chr>   <chr>  <dbl> <chr> 
1 01-701-10… 2014-0… 2014-0… Plac…    63 F     WHITE CDISCP… AE         1 E07   
2 01-701-10… 2014-0… 2014-0… Plac…    63 F     WHITE CDISCP… AE         2 E08   
3 01-701-10… 2014-0… 2014-0… Plac…    63 F     WHITE CDISCP… AE         3 E06   
4 01-701-10… 2012-0… 2012-0… Plac…    64 M     WHITE CDISCP… AE         3 E10   
5 01-701-10… 2012-0… 2012-0… Plac…    64 M     WHITE CDISCP… AE         1 E08   
6 01-701-10… 2012-0… 2012-0… Plac…    64 M     WHITE CDISCP… AE         2 E09   
# ℹ 30 more variables: AETERM <chr>, AELLT <chr>, AELLTCD <dbl>, AEDECOD <chr>,
#   AEPTCD <dbl>, AEHLT <chr>, AEHLTCD <dbl>, AEHLGT <chr>, AEHLGTCD <dbl>,
#   AEBODSYS <chr>, AEBDSYCD <dbl>, AESOC <chr>, AESOCCD <dbl>, AESEV <chr>,
#   AESER <chr>, AEACN <chr>, AEREL <chr>, AEOUT <chr>, AESCAN <chr>,
#   AESCONG <chr>, AESDISAB <chr>, AESDTH <chr>, AESHOSP <chr>, AESLIFE <chr>,
#   AESOD <chr>, AEDTC <chr>, AESTDTC <chr>, AEENDTC <chr>, AESTDY <dbl>,
#   AEENDY <dbl>

What changed after the join?

Before the join, ae had its own set of columns and rows. After the join:

Row count: nrow(dm_ae) equals nrow(ae) (since every AE record should match a subject in DM for a clean trial dataset). If these numbers differed, it would flag a data quality issue - AE records for subjects not in DM.
Column count: dm_ae now has 7 columns from DM (USUBJID, RFSTDTC, RFENDTC, ARM, AGE, SEX, RACE) + all columns from AE, minus one USUBJID since the key is merged. This is the “enriched” AE dataset with demographic context.
New columns visible: In the head() output you’ll now see ARM, AGE, SEX, RACE appearing alongside AEDECOD, AESEV, etc. - this combination is what allows you to break down AEs by treatment arm in safety tables.

Why use inner_join()?

inner_join(): Keeps only rows where USUBJID exists in BOTH datasets
left_join(): Keeps all rows from the left dataset (DM), adds matching rows from right (AE)
anti_join(): Keeps rows from left where there is NO match in right (useful for finding missing data)

6.3 Step 5: Join DM and EX (Demographics + Exposure)

Exposure data tells us about drug dosing. Let’s combine this with demographics.

What is the EX domain used for?

The EX (Exposure) domain records every dose of study drug administered to a subject - the drug name, dose, dose unit, route of administration, and start/end dates. In ADaM, EX is the primary source for:

Calculating treatment-emergent flags (was the AE after the first dose?)
Deriving last dose date (used for safety follow-up windows)
Computing dose intensity and total exposure for pharmacokinetic analyses

A subject can have multiple EX rows if they received multiple doses, dose changes, or multiple study periods. (Reference: CDISC SDTMIG - EX Domain, available from CDISC)

# Join Demographics (DM) with Exposure (EX)
dm_ex <- dm %>%
  dplyr::select(USUBJID, ARM, AGE, SEX) %>%
  dplyr::inner_join(ex, by = "USUBJID")

# Preview the result
dplyr::glimpse(dm_ex)

Rows: 591
Columns: 20
$ USUBJID  <chr> "01-701-1015", "01-701-1015", "01-701-1015", "01-701-1023", "…
$ ARM      <chr> "Placebo", "Placebo", "Placebo", "Placebo", "Placebo", "Xanom…
$ AGE      <dbl> 63, 63, 63, 64, 64, 71, 71, 71, 74, 77, 77, 77, 85, 85, 68, 6…
$ SEX      <chr> "F", "F", "F", "M", "M", "M", "M", "M", "M", "F", "F", "F", "…
$ STUDYID  <chr> "CDISCPILOT01", "CDISCPILOT01", "CDISCPILOT01", "CDISCPILOT01…
$ DOMAIN   <chr> "EX", "EX", "EX", "EX", "EX", "EX", "EX", "EX", "EX", "EX", "…
$ EXSEQ    <dbl> 1, 2, 3, 1, 2, 1, 2, 3, 1, 1, 2, 3, 1, 2, 1, 2, 3, 1, 1, 2, 1…
$ EXTRT    <chr> "PLACEBO", "PLACEBO", "PLACEBO", "PLACEBO", "PLACEBO", "XANOM…
$ EXDOSE   <dbl> 0, 0, 0, 0, 0, 54, 81, 54, 54, 54, 81, 54, 0, 0, 54, 54, 54, …
$ EXDOSU   <chr> "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "…
$ EXDOSFRM <chr> "PATCH", "PATCH", "PATCH", "PATCH", "PATCH", "PATCH", "PATCH"…
$ EXDOSFRQ <chr> "QD", "QD", "QD", "QD", "QD", "QD", "QD", "QD", "QD", "QD", "…
$ EXROUTE  <chr> "TRANSDERMAL", "TRANSDERMAL", "TRANSDERMAL", "TRANSDERMAL", "…
$ VISITNUM <dbl> 3, 4, 12, 3, 4, 3, 4, 12, 3, 3, 4, 12, 3, 4, 3, 4, 12, 3, 3, …
$ VISIT    <chr> "BASELINE", "WEEK 2", "WEEK 24", "BASELINE", "WEEK 2", "BASEL…
$ VISITDY  <dbl> 1, 14, 168, 1, 14, 1, 14, 168, 1, 1, 14, 168, 1, 14, 1, 14, 1…
$ EXSTDTC  <chr> "2014-01-02", "2014-01-17", "2014-06-19", "2012-08-05", "2012…
$ EXENDTC  <chr> "2014-01-16", "2014-06-18", "2014-07-02", "2012-08-27", "2012…
$ EXSTDY   <dbl> 1, 16, 169, 1, 24, 1, 15, 173, 1, 1, 16, 171, 1, 15, 1, 16, 1…
$ EXENDY   <dbl> 15, 168, 182, 23, 28, 14, 172, 180, 14, 15, 170, 183, 14, 26,…

What changed after the DM-EX join?

glimpse(dm_ex) shows you:

Rows: Equal to nrow(ex) - one row per dosing record (not per subject). A subject receiving 10 doses contributes 10 rows.
Columns: The 4 selected DM variables (USUBJID, ARM, AGE, SEX) have been added to all EX columns. You should now see both ARM (from DM) and EXTRT (the actual drug name from EX) side by side - useful for verifying that subjects in the “Placebo” arm received the placebo drug and not an active compound.
Data types: Notice that EX dose variables like EXDOSE will be <dbl> (numeric), while most other EX variables are <chr>. The date variables EXSTDTC and EXENDTC are ISO 8601 character strings, just like in AE.

7 Hands-On: 5+ dplyr Operations

The deliverable for Day 1 requires demonstrating at least 5 dplyr operations. Let’s work through practical examples.

7.1 Operation 1: `filter()` - Subset Rows

Filter AEs to show only serious adverse events (SAEs).

How filter() works

filter() evaluates a logical condition for every row and keeps only those where the condition is TRUE. AESER == "Y" compares the value in the AESER column to the string "Y" - the double equals == is a comparison (not assignment). You can combine multiple conditions with:

& or , for AND (both must be true)
| for OR (either can be true)
! for NOT (negates the condition)

Example: filter(AESER == "Y" & AESEV == "SEVERE") for serious AND severe AEs.

(Reference: dplyr filter() documentation)

# Filter for Serious Adverse Events only
serious_aes <- ae %>%
  dplyr::filter(AESER == "Y")

cat("Total AEs:", nrow(ae), "\n")

Total AEs: 1191

cat("Serious AEs:", nrow(serious_aes), "\n")

Serious AEs: 3

What just happened?

serious_aes is a subset of ae with the exact same columns but fewer rows. The two cat() lines reveal:

How many total adverse events were recorded in the trial.
How many of those were flagged as serious (AESER == "Y").

The difference is the count of non-serious AEs. In clinical trials, serious AEs trigger expedited regulatory reporting (e.g., 7-day or 15-day safety reports to the FDA), so isolating them quickly is an everyday task for safety programmers. The percentage of serious AEs relative to total AEs is a key metric reviewed by Data Safety Monitoring Boards (DSMBs).

7.2 Operation 2: `select()` - Choose Columns

Select only the variables we need for a quick AE overview.

How select() works

select() subsets columns (variables), not rows. It takes column names (unquoted) as arguments. You can also:

Drop columns by prefixing with -: select(-STUDYID, -DOMAIN)
Rename during selection: select(Subject = USUBJID, Event = AEDECOD)
Use helper functions: select(starts_with("AE")) selects all columns beginning with “AE”; select(contains("DTC")) grabs all date columns.

select() does not change row count - it only changes column count. (Reference: dplyr select() documentation)

# Select key AE variables
ae_summary <- ae %>%
  dplyr::select(USUBJID, AEDECOD, AESEV, AESER, AESTDTC, AEENDTC)

utils::head(ae_summary)

# A tibble: 6 × 6
  USUBJID     AEDECOD                              AESEV   AESER AESTDTC AEENDTC
  <chr>       <chr>                                <chr>   <chr> <chr>   <chr>  
1 01-701-1015 APPLICATION SITE ERYTHEMA            MILD    N     2014-0… <NA>   
2 01-701-1015 APPLICATION SITE PRURITUS            MILD    N     2014-0… <NA>   
3 01-701-1015 DIARRHOEA                            MILD    N     2014-0… 2014-0…
4 01-701-1023 ATRIOVENTRICULAR BLOCK SECOND DEGREE MILD    N     2012-0… <NA>   
5 01-701-1023 ERYTHEMA                             MILD    N     2012-0… 2012-0…
6 01-701-1023 ERYTHEMA                             MODERA… N     2012-0… <NA>

What changed?

Compare ncol(ae) vs ncol(ae_summary) - the full AE domain has many more columns than the 6 we selected. The row count is identical to ae. ae_summary is a focused, readable slice of the data containing just the variables needed for a quick listing of adverse events: who experienced what, how severe, was it serious, and when did it start and end. This pattern - select only what you need - is a best practice that makes downstream code faster and easier to read.

7.3 Operation 3: `mutate()` - Create New Variables

Create a flag for treatment-related adverse events.

How mutate() works

mutate() adds new columns (or overwrites existing ones) without changing the row count. Every expression inside mutate() is evaluated for each row. dplyr::if_else(condition, true_value, false_value) is a vectorized conditional - it’s safer than base R’s ifelse() because it enforces consistent data types for both outcomes.

The | operator here means OR - AEREL == "PROBABLE" | AEREL == "POSSIBLE" is TRUE if either condition holds. You can chain multiple mutate() calls or add many variables in a single call separated by commas. (Reference: dplyr mutate() documentation)

# Create a new variable: treatment-related flag
ae_with_flag <- ae %>%
  dplyr::mutate(
    # Create a related flag (Y if related, N otherwise)
    AEREL_FLAG = dplyr::if_else(AEREL == "PROBABLE" | AEREL == "POSSIBLE", "Y", "N")
  )

# Check the distribution
table(ae_with_flag$AEREL_FLAG, useNA = "ifany")


   N    Y <NA> 
 483  704    4

What changed?

ae_with_flag has one more column than ae - the new AEREL_FLAG variable you just derived. Row count is unchanged. The table() output shows you the distribution of this new flag:

Y: AEs where the investigator assessed the event as probably or possibly related to the study drug - these are treatment-related AEs, a critical subset for benefit-risk assessment.
N: AEs not considered related to treatment.
<NA>: If any subjects had a missing AEREL value, if_else() propagates NA. The useNA = "ifany" argument in table() makes missing values visible - always check for NAs before assuming your flag is complete.

The ratio of Y to total tells you the treatment-related AE rate - a key safety metric in every clinical study report.

7.4 Operation 4: `group_by()` + `summarise()` - Aggregate Data

Count adverse events by System Organ Class (SOC).

How group_by() + summarise() works

group_by() silently marks the data frame with grouping information - it doesn’t visibly change the data. summarise() then collapses each group into a single summary row using the aggregation functions you provide:

n() counts the total number of rows in the group (= total events)
n_distinct(USUBJID) counts the number of unique subjects in the group

The difference between n() and n_distinct() is important: a subject who experiences the same SOC twice would contribute 2 to n_events but only 1 to n_subjects. Regulatory tables typically report subjects, not events, to avoid overcounting. arrange(desc(...)) then sorts from highest to lowest. (Reference: dplyr group_by() + summarise() documentation)

# Count AEs by System Organ Class
ae_by_soc <- ae %>%
  dplyr::group_by(AEBODSYS) %>%
  dplyr::summarise(
    n_events = dplyr::n(),
    n_subjects = dplyr::n_distinct(USUBJID)
  ) %>%
  dplyr::arrange(dplyr::desc(n_events))

# Show top 10 SOCs
utils::head(ae_by_soc, 10)

# A tibble: 10 × 3
   AEBODSYS                                             n_events n_subjects
   <chr>                                                   <int>      <int>
 1 GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS      292        108
 2 SKIN AND SUBCUTANEOUS TISSUE DISORDERS                    276        105
 3 NERVOUS SYSTEM DISORDERS                                  101         59
 4 CARDIAC DISORDERS                                          91         44
 5 GASTROINTESTINAL DISORDERS                                 87         53
 6 INFECTIONS AND INFESTATIONS                                73         39
 7 RESPIRATORY, THORACIC AND MEDIASTINAL DISORDERS            53         30
 8 INVESTIGATIONS                                             40         23
 9 PSYCHIATRIC DISORDERS                                      40         29
10 INJURY, POISONING AND PROCEDURAL COMPLICATIONS             29         14

What changed?

ae_by_soc is a completely transformed data frame:

Before: ae had one row per adverse event record (many rows, many columns).
After: ae_by_soc has one row per System Organ Class (few rows, 3 columns: AEBODSYS, n_events, n_subjects).

This is the power of group_by() + summarise() - it condenses detailed transactional data into an analytical summary. The result, sorted by n_events descending, gives you an instant view of which organ systems are most affected in this trial. This is the backbone of the MedDRA SOC-level adverse event summary table that appears in every Clinical Study Report.

7.5 Operation 5: `arrange()` - Sort Data

Sort AEs by subject and start date.

How arrange() works

arrange() reorders rows based on one or more columns. By default it sorts ascending (A→Z for characters, smallest→largest for numbers). Wrap a column in desc() for descending order. When you provide multiple columns (like USUBJID, AESTDTC), rows are first sorted by the first column, and ties are broken by the second column. arrange() does not change the columns or the number of rows.

In clinical programming, sorting by USUBJID and then chronologically by AESTDTC is a standard step before creating subject-level event narratives or listings. (Reference: dplyr arrange() documentation)

# Sort AEs chronologically within each subject
ae_sorted <- ae %>%
  dplyr::select(USUBJID, AEDECOD, AESTDTC, AESEV) %>%
  dplyr::arrange(USUBJID, AESTDTC)

utils::head(ae_sorted, 10)

# A tibble: 10 × 4
   USUBJID     AEDECOD                              AESTDTC    AESEV   
   <chr>       <chr>                                <chr>      <chr>   
 1 01-701-1015 APPLICATION SITE ERYTHEMA            2014-01-03 MILD    
 2 01-701-1015 APPLICATION SITE PRURITUS            2014-01-03 MILD    
 3 01-701-1015 DIARRHOEA                            2014-01-09 MILD    
 4 01-701-1023 ERYTHEMA                             2012-08-07 MILD    
 5 01-701-1023 ERYTHEMA                             2012-08-07 MODERATE
 6 01-701-1023 ERYTHEMA                             2012-08-07 MILD    
 7 01-701-1023 ATRIOVENTRICULAR BLOCK SECOND DEGREE 2012-08-26 MILD    
 8 01-701-1028 APPLICATION SITE ERYTHEMA            2013-07-21 MILD    
 9 01-701-1028 APPLICATION SITE PRURITUS            2013-08-08 MILD    
10 01-701-1034 APPLICATION SITE PRURITUS            2014-08-27 MILD

What changed?

The data has the same rows and columns as after the select(), but the order is different. In the output you should see:

All rows for a given USUBJID appear together (grouped by subject).
Within each subject, events are ordered from earliest AESTDTC to latest.
This chronological view lets you read a subject’s adverse event timeline from top to bottom - essential for writing individual patient narratives in safety reports and for detecting whether adverse events overlap or follow one another.

Note: AESTDTC is a character string in ISO 8601 format (YYYY-MM-DD). Alphabetical sorting of ISO 8601 dates happens to be chronologically correct - another reason CDISC mandates this date format.

7.6 Bonus Operation 6: `anti_join()` - Find Missing Data

Find subjects in DM who have no adverse events (useful for data QC).

How anti_join() works

anti_join(x, y, by = "key") returns all rows from x where the key value is not found in y. Think of it as a “NOT IN” filter. It is one of the most useful functions for data quality checking in clinical programming:

Find subjects in DM with no exposure records (potential protocol deviations)
Find subjects with lab results but no demographics (orphan records)
Find AE records for subjects not in DM (invalid USUBJIDs)

Only the columns from the left-hand dataset (x) are returned - no columns from y are added. (Reference: dplyr filtering joins documentation)

# Find subjects with NO adverse events
subjects_no_ae <- dm %>%
  dplyr::select(USUBJID, ARM) %>%
  dplyr::anti_join(ae, by = "USUBJID")

cat("Subjects with no AEs:", nrow(subjects_no_ae), "\n")

Subjects with no AEs: 81

What just happened?

subjects_no_ae contains only the DM subjects whose USUBJID does not appear anywhere in the AE domain. The count tells you how many enrolled subjects experienced zero adverse events during the trial. This information is used when calculating incidence rates - the denominator must include all subjects, not just those with AEs. It’s also a useful data QC check: in some trials, every subject is expected to have at least one medical history AE entry; a subject with no AE records at all might indicate a data collection gap.

8 Day 1 Deliverable Summary

Today you completed the following:

Task	Status
Installed and loaded pharmaverse packages	Done
Loaded DM, AE, EX, LB SDTM domains	Done
Explored data with glimpse(), head(), dim()	Done
Joined DM-AE (demographics + adverse events)	Done
Joined DM-EX (demographics + exposure)	Done
Demonstrated 5+ dplyr operations	Done

9 Key Takeaways

USUBJID is the universal key - It links all SDTM domains together
SDTM domains have classes - Special Purpose (DM), Events (AE), Interventions (EX), Findings (LB)
The CDISC data flow - Raw → SDTM → ADaM → TLFs
dplyr is essential - Most pharmaverse packages are built on tidyverse principles
pharmaversesdtm provides real-world example data - Use it for learning and testing

10 Resources

pharmaverse.org - The central hub for all pharmaverse packages
pharmaversesdtm GitHub - Source code and documentation
admiral ‘Get Started’ vignette - Introduction to ADaM creation
CDISC SDTM Implementation Guide - Official SDTM documentation
dplyr documentation - Complete reference for data manipulation

11 Frequently Asked Questions

What is the difference between SDTM and ADaM? When should I use each?

SDTM (Study Data Tabulation Model) is the raw-to-standardized transformation layer. Its purpose is to faithfully represent the collected data in a consistent, submission-ready format - minimal derivations, maximum traceability. Every SDTM variable should map directly back to a source CRF field or lab system.

ADaM (Analysis Data Model) is the analysis-ready layer. It derives new variables (e.g., baseline flags, analysis windows, treatment-emergent flags, study day calculations) that are needed to produce the statistical outputs in your Clinical Study Report.

Rule of thumb: If a statistician needs to compute something, it lives in ADaM. If it was collected directly from the subject, it lives in SDTM. Never perform analysis directly on raw SDTM without first deriving the needed ADaM variables - this ensures reproducibility and regulatory traceability.

Reference: CDISC ADaM Overview

Why are SDTM date variables stored as character strings instead of R Date objects?

SDTM uses ISO 8601 format for dates and times (e.g., "2014-01-02", "2014-01-02T08:30:00"). One key reason they remain character strings in SDTM is that partial dates are allowed - a subject might have an adverse event where only the year and month are known ("2014-01"), or even just the year ("2014"). R’s Date class cannot represent partial dates.

In ADaM, admiral provides dedicated functions like derive_vars_dt() and derive_vars_dtm() that convert ISO 8601 character strings to proper R Date or POSIXct objects, imputing partial dates according to CDISC rules.

Reference: admiral - Date and Time Imputation

What is the %>% (pipe) operator and should I use |> instead?

The %>% operator comes from the magrittr package (re-exported by dplyr). It passes the object on its left as the first argument to the function on its right, allowing you to chain operations into readable pipelines instead of nesting function calls.

# Without pipe (hard to read):
head(arrange(select(ae, USUBJID, AEDECOD), USUBJID), 5)

# With pipe (reads left-to-right like a sentence):
ae %>% select(USUBJID, AEDECOD) %>% arrange(USUBJID) %>% head(5)

R 4.1+ introduced the native pipe |> which works similarly but has no package dependency. For most tidyverse/pharmaverse code, both are interchangeable. The main difference is that |> requires R ≥ 4.1 and does not support the . placeholder that %>% allows. New pharmaverse code is gradually migrating to |>.

Reference: R 4.1 release notes - Native pipe

What does dplyr::n() vs dplyr::n_distinct() mean, and when should I use each?

n() counts the total number of rows in the current group. If a subject appears 5 times in the group (5 AE records), n() counts all 5.
n_distinct(USUBJID) counts the number of unique values of USUBJID in the group. If the same subject appears 5 times, n_distinct(USUBJID) counts them as 1.

In clinical reporting, regulatory guidance (ICH E9) and FDA expectations require adverse event tables to report subject counts (n_distinct), not event counts (n()), in the primary summary. Event counts are reported separately. Always ask: “Am I counting occurrences or people?”

Reference: dplyr n_distinct() documentation

What is a .xpt file and why does clinical data need to be exported as one?

A SAS Transport File (.xpt) is a format originally developed by SAS Institute for transferring datasets between different SAS versions and platforms. The FDA and EMA mandate .xpt format for electronic submissions because it is a well-documented, stable binary format that regulatory reviewers’ SAS systems can reliably read, regardless of the sponsor’s software environment.

Even though pharmaverse teams work in R, the final submission datasets must be exported as .xpt files. The haven package can read and write .xpt files in R, and xportr adds the layer of applying CDISC metadata (variable labels, lengths, formats) before export to ensure the .xpt file meets submission standards.

Reference: FDA Study Data Technical Conformance Guide

What is AEREL and what values can it take in SDTM?

AEREL is the causality assessment variable in the AE domain - it records the investigator’s judgment about whether the adverse event was caused by the study drug. The CDISC SDTMIG defines controlled terminology for this variable via the CDISC CT (Controlled Terminology) codelist C66726:

UNLIKELY - The event is unlikely to be related to the study drug
POSSIBLE - There is a reasonable possibility the event could be related
PROBABLE - The event is probably related to the study drug
DEFINITELY RELATED - The event is definitively related
NOT RELATED - No relationship to the study drug

Some sponsors use a simpler Yes/No relatedness flag (AEOUT). The exact values used in your study must match your study-specific annotated CRF and the controlled terminology published by CDISC.

Reference: CDISC Controlled Terminology - AE Causality

Why does inner_join() sometimes produce more rows than the left dataset?

This happens when the right-hand dataset has duplicate key values. For example, if a subject appears 3 times in AE (3 adverse events), and you join DM (1 row for that subject) to AE using inner_join, the result has 3 rows for that subject - the DM row is duplicated once for each matching AE row. This is expected and correct behavior for a one-to-many relationship.

If you accidentally end up with more rows than expected after a join, check for unintended duplicate keys in either dataset using:

ae %>% dplyr::count(USUBJID) %>% dplyr::filter(n > 1)

This is a common source of row inflation bugs in clinical programming and should always be verified.

Reference: dplyr Joins documentation

What is the pharmaverse and is it endorsed by regulatory agencies?

The pharmaverse is a voluntary, cross-industry consortium of pharmaceutical companies (including Roche, GSK, Pfizer, Novartis, and many others) that collaboratively develops and maintains R packages for clinical reporting. It is not an official regulatory body and its packages are not “certified” by the FDA or EMA.

However, the FDA has indicated openness to R-based submissions and the packages are built to comply with CDISC standards, which are regulatory requirements. The admiral package, in particular, generates ADaM datasets that are structurally identical to what production SAS programs would produce - it has been used in actual regulatory submissions.

The key validation requirement is that your company validates the use of any software (including R packages) in a submission context, following your own SOPs and potentially using tools like valtools or thevalidatoR.

Reference: pharmaverse.org - About

12 What’s Next?

In Day 2, we’ll dive deeper into: - SDTM domain classes and their specific structures - Advanced tidyverse operations - Variable naming conventions (–TESTCD, –ORRES, –DY patterns) - Creating data summaries by treatment arm

1 Learning Objectives

2 Introduction to CDISC and the Pharmaverse

2.1 What is CDISC?

2.1.1 Why does this matter?

2.2 The CDISC Data Flow

2.2.1 Example:

2.3 What is the Pharmaverse?

3 Package Installation & Loading

3.1 Required Packages

3.2 Step 1: Install Packages (if needed)

3.3 Step 2: Load Packages

4 Loading SDTM Datasets

4.1 What are SDTM Domains?

4.2 Step 3: Load the Core SDTM Domains

5 Exploring the Data

5.1 Method 1: glimpse() - Quick Structure Overview

5.2 Method 2: head() - View First Few Rows

5.3 Method 3: dim() and names() - Dimensions and Column Names

6 Understanding Domain Relationships

6.1 The USUBJID Connection

6.2 Step 4: Join DM and AE (Demographics + Adverse Events)

6.3 Step 5: Join DM and EX (Demographics + Exposure)

7 Hands-On: 5+ dplyr Operations

7.1 Operation 1: filter() - Subset Rows

7.2 Operation 2: select() - Choose Columns

7.3 Operation 3: mutate() - Create New Variables

7.4 Operation 4: group_by() + summarise() - Aggregate Data

7.5 Operation 5: arrange() - Sort Data

7.6 Bonus Operation 6: anti_join() - Find Missing Data

8 Day 1 Deliverable Summary

9 Key Takeaways

10 Resources

11 Frequently Asked Questions

12 What’s Next?

5.1 Method 1: `glimpse()` - Quick Structure Overview

5.2 Method 2: `head()` - View First Few Rows

5.3 Method 3: `dim()` and `names()` - Dimensions and Column Names

7.1 Operation 1: `filter()` - Subset Rows

7.2 Operation 2: `select()` - Choose Columns

7.3 Operation 3: `mutate()` - Create New Variables

7.4 Operation 4: `group_by()` + `summarise()` - Aggregate Data

7.5 Operation 5: `arrange()` - Sort Data

7.6 Bonus Operation 6: `anti_join()` - Find Missing Data