# Install packages if not already installed
if (!requireNamespace("admiral", quietly = TRUE)) install.packages("admiral")
if (!requireNamespace("pharmaversesdtm", quietly = TRUE)) install.packages("pharmaversesdtm")
if (!requireNamespace("xportr", quietly = TRUE)) install.packages("xportr")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("haven", quietly = TRUE)) install.packages("haven")Day 1: Environment Setup & First SDTM Code
SDTM Programming Walkthrough
1 Learning Objectives
By the end of Day 1, you will be able to:
- Set up your R environment with essential pharmaverse packages
- Load and explore SDTM datasets from
pharmaversesdtm - Understand the CDISC data flow: Raw → SDTM → ADaM → TLFs
- Use
dplyrto explore relationships between SDTM domains - Perform basic data manipulation operations on clinical data
2 Introduction to CDISC and the Pharmaverse
2.1 What is CDISC?
CDISC (Clinical Data Interchange Standards Consortium) is a global organization that creates standards for how clinical trial data should be collected, organized, and shared. These standards are important because they:
- Ensure Consistency: Data from different studies and companies can be compared easily because everyone uses the same structure and terms.
- Enable Interoperability: Data can move smoothly between different software systems, making it easier for teams, sponsors, and regulators to work together.
- Support Regulatory Submission: Regulatory agencies like the FDA (U.S.) and EMA (Europe) require data in CDISC formats for new drug applications. Using CDISC makes your data ready for these submissions.
2.1.1 Why does this matter?
Imagine you are reading a book, but every chapter is written in a different language or format. It would be very hard to understand the story! CDISC makes sure all the “chapters” (datasets) in a clinical trial are written in the same “language” so everyone can understand and use them.
2.2 The CDISC Data Flow
Understanding how data moves through a clinical trial is key for programmers. Here’s the typical flow:
Raw Data → SDTM → ADaM → TLFs
- Raw Data: This is the data collected directly from sources like electronic data capture (EDC) systems, lab machines, or paper forms. It is often messy and not standardized.
- SDTM (Study Data Tabulation Model): This is the first step of standardization. SDTM organizes the raw data into specific domains (tables) with defined variables and formats. Each domain represents a type of information, like demographics (DM), adverse events (AE), or lab results (LB).
- ADaM (Analysis Data Model): These datasets are derived from SDTM and are designed to make statistical analysis easier. They include derived variables, flags, and are structured to answer specific analysis questions.
- TLFs (Tables, Listings, and Figures): These are the final outputs-summaries, tables, and graphs that go into clinical study reports and regulatory submissions.
2.2.1 Example:
Suppose a subject’s blood pressure is measured during a study. The measurement is first recorded as raw data. It is then organized into the SDTM VS (Vital Signs) domain. Later, it might be used to calculate a derived variable (like average blood pressure) in ADaM, and finally, the results are summarized in a table for the study report.
| Stage | Description | Example |
|---|---|---|
| Raw Data | Data collected from EDC systems, labs, etc. | CRF exports, lab files |
| SDTM | Study Data Tabulation Model - standardized clinical data | DM, AE, EX, LB domains |
| ADaM | Analysis Data Model - analysis-ready datasets | ADSL, ADAE, ADLB |
| TLFs | Tables, Listings, and Figures for submission | Demographics table, AE summary |
Think of this pipeline like a factory assembly line. Raw data arrives in messy, inconsistent shapes - different sponsors, different EDC vendors, different column names. SDTM is the standardization step: every study’s data gets reshaped into the same structure so the FDA (and your future self) can always find the same variable in the same place. ADaM then derives analysis-ready variables from that standardized base, and TLFs are the final output - the actual tables and figures that go into your Clinical Study Report (CSR). Skipping or muddling any stage creates compliance and reproducibility risks.
2.3 What is the Pharmaverse?
The pharmaverse is a cross-industry collaboration of R packages designed for clinical reporting workflows. It provides open-source tools that map to each stage of the CDISC data flow:
- SDTM creation:
sdtm.oak,pharmaversesdtm(example data) - ADaM creation:
admiral(the flagship package) - TLG creation:
rtables,tern,tfrmt - Validation & Export:
xportr,metacore,sdtmchecks
pharmaversesdtm?
pharmaversesdtm ships pre-built, realistic SDTM datasets based on a fictional Phase III clinical trial. These datasets follow the official CDISC SDTM Implementation Guide (SDTMIG) and are the same ones used internally by the admiral development team for testing. You can browse them at the pharmaversesdtm GitHub repository.
3 Package Installation & Loading
The following packages are essential for today’s exercises. Each serves a specific purpose in the clinical programming workflow.
3.1 Required Packages
| Package | Purpose |
|---|---|
admiral |
ADaM dataset creation - the core pharmaverse package |
pharmaversesdtm |
Example SDTM datasets for learning and testing |
dplyr |
Data manipulation (filter, select, mutate, join, etc.) |
haven |
Read/write SAS transport files (.xpt) |
xportr |
Apply metadata and export to .xpt for FDA submission |
3.2 Step 1: Install Packages (if needed)
The code below checks if each package is installed and installs it if missing. This is a safe pattern that won’t reinstall packages you already have.
requireNamespace() works
requireNamespace("pkg", quietly = TRUE) tries to locate the package without actually loading it into your session. It returns TRUE if the package is found, FALSE if not. Wrapping it in ! (logical NOT) flips the result - so the install.packages() call inside the if block only fires when the package is not installed. The quietly = TRUE argument suppresses any console messages during the check. This is the idiomatic way to write conditional installs in R scripts. (Reference: R documentation for requireNamespace)
R checked your local package library for each package. If a package was missing, it downloaded and installed it from CRAN. If it was already installed, nothing happened - no unnecessary downloads, no version conflicts. You can verify what’s installed by running installed.packages()[, c("Package", "Version")] in your console.
3.3 Step 2: Load Packages
Now we load the packages into our R session. The library() function makes all functions from a package available for use.
library() vs require() vs requireNamespace()
library(pkg)- loads the package and errors if it’s not found. Use this at the top of scripts/Quarto files.
require(pkg)- loads the package and warns (but doesn’t error) if not found. Commonly used inside functions.
requireNamespace(pkg)- checks existence without attaching the package to your session. Used for conditional installs (as above).
For day-to-day scripts, library() is the right choice because a hard error is better than silently continuing without a needed package. (Reference: Hadley Wickham’s R Packages - Namespaces)
# Load packages into the session
library(admiral)
library(pharmaversesdtm)
library(dplyr)
library(haven)
library(xportr)When you load packages, you may see messages about package versions and any function name conflicts. This is normal and helps you understand which package’s function will be used when names overlap.
After running the library() calls, all exported functions from these five packages are now available in your global environment. For example, you can now type admiral::derive_vars_dt() or just derive_vars_dt() directly. If you see a warning like The following object is masked from 'package:X': filter, it means two packages export a function with the same name - R will use the most recently loaded one. You can always be explicit, e.g., dplyr::filter(), to avoid ambiguity. Run search() in your console to see the full list of attached packages in the order R searches them.
4 Loading SDTM Datasets
The pharmaversesdtm package contains example SDTM datasets that mirror real clinical trial data. These are the same datasets used by production teams when developing and testing admiral.
4.1 What are SDTM Domains?
SDTM organizes clinical data into domains, each representing a specific type of information. The full SDTM standard defines 30+ domains across several categories (General Observation Classes, Special Purpose, Trial Design, and Relationship datasets). The exact count grows with each new version of the SDTMIG as CDISC adds domains for new therapeutic areas and data types.
The table below shows the most commonly used (“core”) domains - the ones you will encounter in nearly every clinical trial:
| Domain Class | Domain | Description |
|---|---|---|
| Special Purpose | DM | Demographics - one row per subject |
| Events | AE | Adverse Events - multiple rows per subject |
| Events | DS | Disposition - study milestones |
| Interventions | EX | Exposure - drug dosing information |
| Interventions | CM | Concomitant Medications |
| Findings | LB | Laboratory Results |
| Findings | VS | Vital Signs |
We use the term “core” here to mean the domains you’ll work with most frequently - DM, AE, EX, LB, VS, DS, CM. These appear in virtually every clinical study. However, the full SDTM Implementation Guide (SDTMIG), available from CDISC, defines many more domains, including:
- Additional General Observation domains: MH (Medical History), PE (Physical Examination), EG (ECG), QS (Questionnaires), FA (Findings About), and others
- Trial Design domains: TA (Trial Arms), TE (Trial Elements), TV (Trial Visits), TI (Trial Inclusion/Exclusion Criteria) - these describe the study design rather than individual subject data
- Relationship datasets: RELREC (Related Records), SUPP– (Supplemental Qualifiers) - these capture relationships and non-standard variables
The SDTMIG also allows sponsors to create custom domains when no standard domain fits the data, as long as they follow the general observation class structure.
The SDTM Implementation Guide (SDTMIG) is the companion document to the SDTM standard. While the SDTM model defines the conceptual framework (observation classes, general variable roles), the SDTMIG provides the practical implementation details:
- Domain specifications: Which variables belong in each domain, their data types, controlled terminology, and whether they are Required, Expected, or Permissible
- Assumptions and business rules: How to handle specific data scenarios (e.g., partial dates, multiple occurrences, supplemental qualifiers)
- Examples: Sample data showing how real clinical data maps into SDTM structure
The current version is SDTMIG v3.4. You can obtain it from CDISC (a free CDISC account is required to access the full documentation).
The SDTMIG defines three main observation classes with strict structural rules:
- Events (AE, DS, CE…): Things that happened to the subject. Each row is a distinct occurrence. Key variable:
--TERM(the verbatim text) and--DECOD(the standardized dictionary term).
- Interventions (EX, CM, PR…): Things that were given to or done for the subject. Each row is a treatment record. Key variable:
--TRT(treatment name).
- Findings (LB, VS, EG…): Measurements or observations collected from the subject. Each row is a single result. Key variables:
--TESTCD(short test code) and--ORRES(original result).
The Special Purpose domains (DM, CO, SE, SV) don’t follow the three-class model; they have their own SDTMIG-specified structures.
4.2 Step 3: Load the Core SDTM Domains
Out of the 30+ domains defined in the SDTMIG, we’ll load four core domains today - the ones that appear in virtually every clinical trial and form the foundation for most safety and efficacy analyses:
data() loads package datasets
data("dm", package = "pharmaversesdtm") reaches into the pharmaversesdtm package’s data/ folder and loads the dm object into your global environment. Without the package = argument, R searches all attached packages. After this call, dm is a plain R data.frame (or tibble) sitting in your workspace - just like one you’d import from a CSV. (Reference: R documentation - data())
# Load SDTM domains from pharmaversesdtm
# DM = Demographics (one row per subject)
data("dm", package = "pharmaversesdtm")
# AE = Adverse Events (multiple rows per subject)
data("ae", package = "pharmaversesdtm")
# EX = Exposure/Dosing (multiple rows per subject)
data("ex", package = "pharmaversesdtm")
# LB = Laboratory Results (many rows per subject)
data("lb", package = "pharmaversesdtm")Four new objects - dm, ae, ex, and lb - have been created in your R global environment. You can confirm this by running ls() in the console, which lists all objects currently in memory. Each is a data.frame (with haven-style labelled attributes, since SDTM datasets mimic SAS transport format). You can inspect labels with attr(dm$ARM, "label").
5 Exploring the Data
Before working with any dataset, it’s essential to understand its structure. Let’s explore each domain using different R functions.
In CDISC and pharma programming, the terms “column name” and “variable” refer to the same thing - a single field in a dataset. For example, USUBJID, AGE, AEDECOD are all variables (in CDISC terminology) that appear as columns in the data frame. If you come from a database background, you might also know them as “fields” or “attributes”. Throughout this course, we use these terms interchangeably.
If you come from a database background, you know how important key fields (primary keys) are - the set of columns that uniquely identify a single row in a table. SDTM domains follow the same concept:
| Domain | Key Variables (Natural Key) | What each row represents |
|---|---|---|
| DM | USUBJID |
One row per subject |
| AE | USUBJID, AESEQ |
One row per adverse event per subject |
| EX | USUBJID, EXSEQ |
One row per dosing record per subject |
| LB | USUBJID, LBSEQ |
One row per lab result per subject |
| VS | USUBJID, VSSEQ |
One row per vital sign measurement per subject |
The --SEQ variable (e.g., AESEQ, EXSEQ) is a sequence number assigned within each subject - it acts as a record-level identifier that, combined with USUBJID, forms the natural key for the domain. This is conceptually identical to a composite primary key in a relational database.
In addition, STUDYID + USUBJID together form a globally unique subject identifier across studies. Always verify that your key variables truly produce unique rows - duplicate keys indicate data quality issues.
5.1 Method 1: glimpse() - Quick Structure Overview
The glimpse() function from dplyr provides a compact view of the data structure, showing column names, types, and a preview of values.
glimpse() show you?
glimpse() is essentially a transposed version of print(). For each column it shows:
- The column name (e.g.,
USUBJID)
- The data type in abbreviated form:
<chr>= character,<dbl>= double/numeric,<lgl>= logical,<date>= Date,<dttm>= POSIXct datetime
- A sample of values from the first few rows
This is especially useful for wide datasets (many columns) where a regular print() would wrap awkwardly. (Reference: dplyr glimpse() documentation)
# Explore the Demographics (DM) domain
# This shows us all variables and their types
dplyr::glimpse(dm)Rows: 306
Columns: 26
$ STUDYID <chr> "CDISCPILOT01", "CDISCPILOT01", "CDISCPILOT01", "CDISCPILOT01…
$ DOMAIN <chr> "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "…
$ USUBJID <chr> "01-701-1015", "01-701-1023", "01-701-1028", "01-701-1033", "…
$ SUBJID <chr> "1015", "1023", "1028", "1033", "1034", "1047", "1057", "1097…
$ RFSTDTC <chr> "2014-01-02", "2012-08-05", "2013-07-19", "2014-03-18", "2014…
$ RFENDTC <chr> "2014-07-02", "2012-09-02", "2014-01-14", "2014-04-14", "2014…
$ RFXSTDTC <chr> "2014-01-02", "2012-08-05", "2013-07-19", "2014-03-18", "2014…
$ RFXENDTC <chr> "2014-07-02", "2012-09-01", "2014-01-14", "2014-03-31", "2014…
$ RFICDTC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ RFPENDTC <chr> "2014-07-02T11:45", "2013-02-18", "2014-01-14T11:10", "2014-0…
$ DTHDTC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ DTHFL <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SITEID <chr> "701", "701", "701", "701", "701", "701", "701", "701", "701"…
$ BRTHDTC <chr> "1950-12-26", "1948-07-22", "1942-07-11", "1940-03-10", "1937…
$ AGE <dbl> 63, 64, 71, 74, 77, 85, 59, 68, 81, 84, 52, 84, 81, 57, 75, 5…
$ AGEU <chr> "YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS"…
$ SEX <chr> "F", "M", "M", "M", "F", "F", "F", "M", "F", "M", "M", "M", "…
$ RACE <chr> "WHITE", "WHITE", "WHITE", "WHITE", "WHITE", "WHITE", "WHITE"…
$ ETHNIC <chr> "HISPANIC OR LATINO", "HISPANIC OR LATINO", "NOT HISPANIC OR …
$ ARMCD <chr> "Pbo", "Pbo", "Xan_Hi", "Xan_Lo", "Xan_Hi", "Pbo", "Scrnfail"…
$ ARM <chr> "Placebo", "Placebo", "Xanomeline High Dose", "Xanomeline Low…
$ ACTARMCD <chr> "Pbo", "Pbo", "Xan_Hi", "Xan_Lo", "Xan_Hi", "Pbo", "Scrnfail"…
$ ACTARM <chr> "Placebo", "Placebo", "Xanomeline High Dose", "Xanomeline Low…
$ COUNTRY <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
$ DMDTC <chr> "2013-12-26", "2012-07-22", "2013-07-11", "2014-03-10", "2014…
$ DMDY <dbl> -7, -14, -8, -8, -7, -21, NA, -9, -13, -7, -13, -6, -5, NA, -…
After running glimpse(dm), look for:
- Rows: The total number of subjects. In SDTM DM, this should be one row per subject - a fundamental SDTMIG rule.
- Columns: The number of variables in the domain.
- Column types: Most SDTM character variables (like
USUBJID,ARM,SEX) will show<chr>. Numeric variables likeAGEwill show<dbl>. Date/time character variables (ISO 8601 format, e.g.,RFSTDTC) remain<chr>in SDTM - they get converted to proper Date objects only in ADaM. - Values: Scan the
ARMcolumn - you’ll see the treatment arm labels for this trial. ScanACTARMto see if any subjects received something different from their planned arm.
- USUBJID: Unique Subject Identifier - the key that links all domains together
- RFSTDTC: Reference Start Date (usually first dose date)
- RFENDTC: Reference End Date
- ARM: Treatment Arm (e.g., “Placebo”, “Drug 80mg”)
- ACTARM: Actual Treatment Arm
5.2 Method 2: head() - View First Few Rows
The head() function shows the first n rows of data (default is 6).
head() vs tail() vs slice()
head(df, n)- firstnrows (default 6). Useful for a quick sanity check.
tail(df, n)- lastnrows. Useful for checking if sorting worked correctly or if data was appended properly.
dplyr::slice(df, rows)- select any arbitrary rows by position. More flexible for exploring middle sections of large datasets.
For very wide datasets, even head() output can be hard to read in the console. In those cases, View(head(df)) opens the RStudio data viewer for a spreadsheet-style look. (Reference: R documentation - head())
# View the first 6 rows of Adverse Events
utils::head(ae)# A tibble: 6 × 35
STUDYID DOMAIN USUBJID AESEQ AESPID AETERM AELLT AELLTCD AEDECOD AEPTCD AEHLT
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 CDISCPI… AE 01-701… 1 E07 APPLI… APPL… NA APPLIC… NA HLT_…
2 CDISCPI… AE 01-701… 2 E08 APPLI… APPL… NA APPLIC… NA HLT_…
3 CDISCPI… AE 01-701… 3 E06 DIARR… DIAR… NA DIARRH… NA HLT_…
4 CDISCPI… AE 01-701… 3 E10 ATRIO… AV B… NA ATRIOV… NA HLT_…
5 CDISCPI… AE 01-701… 1 E08 ERYTH… ERYT… NA ERYTHE… NA HLT_…
6 CDISCPI… AE 01-701… 2 E09 ERYTH… LOCA… NA ERYTHE… NA HLT_…
# ℹ 24 more variables: AEHLTCD <dbl>, AEHLGT <chr>, AEHLGTCD <dbl>,
# AEBODSYS <chr>, AEBDSYCD <dbl>, AESOC <chr>, AESOCCD <dbl>, AESEV <chr>,
# AESER <chr>, AEACN <chr>, AEREL <chr>, AEOUT <chr>, AESCAN <chr>,
# AESCONG <chr>, AESDISAB <chr>, AESDTH <chr>, AESHOSP <chr>, AESLIFE <chr>,
# AESOD <chr>, AEDTC <chr>, AESTDTC <chr>, AEENDTC <chr>, AESTDY <dbl>,
# AEENDY <dbl>
- Multiple rows per subject: Unlike DM, you’ll see the same
USUBJIDappearing more than once - each row is a different adverse event for that subject. AEDECOD: The MedDRA-standardized Preferred Term for the adverse event. This is what gets used in safety tables.AEBODSYS: The MedDRA System Organ Class (SOC). Used for grouping in safety summaries.AESEV: Severity - typicallyMILD,MODERATE, orSEVERE.AESER: Serious flag -Y= serious,N= non-serious. This drives regulatory safety reporting obligations.- Date columns (
AESTDTC,AEENDTC): Stored as ISO 8601 character strings (e.g.,"2014-01-02"). These will be converted to numeric study-day variables (AESTDY) in ADaM.
5.3 Method 3: dim() and names() - Dimensions and Column Names
dim(), nrow(), and ncol()
dim(df) returns a numeric vector of length 2: c(nrow, ncol). So dim(dm)[1] gives you the row count and dim(dm)[2] gives the column count. You can also use the dedicated nrow(df) and ncol(df) functions for readability. The cat() function here concatenates and prints the text + number together in a readable line. "\n" inserts a newline character so each result prints on its own line. (Reference: R documentation - dim())
# How many rows and columns in each domain?
cat("DM dimensions:", dim(dm), "\n")DM dimensions: 306 26
cat("AE dimensions:", dim(ae), "\n")AE dimensions: 1191 35
cat("EX dimensions:", dim(ex), "\n")EX dimensions: 591 17
cat("LB dimensions:", dim(lb), "\n")LB dimensions: 59580 23
The two numbers printed for each domain are [rows] [columns]. You should notice:
- DM has the fewest rows - one per subject. This is the “spine” of your trial.
- AE has more rows than DM - multiple adverse events can occur per subject.
- LB typically has the most rows of any domain - lab panels are collected repeatedly over many visits, creating a large number of records per subject.
- The column count reflects how many SDTM variables that domain collects. LB has many columns because it captures rich measurement metadata (units, reference ranges, methods, etc.).
Compare nrow(ae) to nrow(dm) - the ratio gives you an intuitive feel for how “event-heavy” this trial is.
# What variables are in the AE domain?
names(ae) [1] "STUDYID" "DOMAIN" "USUBJID" "AESEQ" "AESPID" "AETERM"
[7] "AELLT" "AELLTCD" "AEDECOD" "AEPTCD" "AEHLT" "AEHLTCD"
[13] "AEHLGT" "AEHLGTCD" "AEBODSYS" "AEBDSYCD" "AESOC" "AESOCCD"
[19] "AESEV" "AESER" "AEACN" "AEREL" "AEOUT" "AESCAN"
[25] "AESCONG" "AESDISAB" "AESDTH" "AESHOSP" "AESLIFE" "AESOD"
[31] "AEDTC" "AESTDTC" "AEENDTC" "AESTDY" "AEENDY"
names() output
names() returns a character vector of all column names. In SDTM, variable names follow a strict two-part convention: a domain prefix (usually 2 characters, e.g., AE) followed by a core variable name suffix (e.g., TERM, DECOD, SEV). So AEDECOD = AE domain + DECOD suffix. This naming convention is defined in the SDTMIG and makes variables instantly recognizable across domains - AESTDTC, LBSTDTC, and EXSTDTC all mean “start date/time” for their respective domains.
Variables without a domain prefix (like USUBJID, STUDYID, DOMAIN) are identifier variables shared across all domains.
6 Understanding Domain Relationships
One of the most important concepts in clinical programming is understanding how domains relate to each other. USUBJID is the universal key that connects all domains - much like a foreign key in a relational database.
6.1 The USUBJID Connection
┌─────────────┐
│ DM │
│ (1 row per │
│ subject) │
└──────┬──────┘
│ USUBJID
┌──────────┼──────────┬──────────┐
│ 1:many │ 1:many │ 1:many │ 1:many
▼ ▼ ▼ ▼
┌────────┐ ┌───────┐ ┌───────┐ ┌────────┐
│ AE │ │ EX │ │ LB │ │ VS │
│(events)│ │(doses)│ │(labs) │ │(vitals)│
└────────┘ └───────┘ └───────┘ └────────┘
If you come from a database background, you’ll recognize the relational patterns between SDTM domains:
| Relationship | From → To | Explanation |
|---|---|---|
| One-to-one | DM → each subject | DM has exactly one row per USUBJID. It is the “master” table. |
| One-to-many | DM → AE | Each subject (one DM row) can have zero, one, or many AE records. |
| One-to-many | DM → EX | Each subject can have multiple dosing records. |
| One-to-many | DM → LB | Each subject can have hundreds of lab results across visits. |
| One-to-many | DM → VS | Each subject can have many vital sign measurements. |
When you inner_join(dm, ae, by = "USUBJID"), the single DM row for a subject gets replicated for each of that subject’s AE rows - this is the standard one-to-many join behavior. Understanding this is critical for avoiding unexpected row counts after joins.
In database terms, USUBJID in DM acts as the primary key, while USUBJID in AE, EX, LB, etc. acts as a foreign key referencing back to DM.
USUBJID is always stored as a character (<chr>) string, never a number, even though it often contains digits. This is because it is a concatenated identifier - typically structured as "STUDYID-SITEID-SUBJECTID" (e.g., "CDISCPILOT01-01-701"). Treating it as numeric would risk stripping leading zeros, causing mismatches when joining domains. Always use == for exact string matching, never numeric comparisons. (Reference: CDISC SDTMIG - Identifier Variables, available from CDISC)
6.2 Step 4: Join DM and AE (Demographics + Adverse Events)
When analyzing adverse events, we often need demographic information like treatment arm. Let’s join these domains.
In relational data, information is split across multiple tables to avoid duplication. DM stores demographic info (one row per subject) while AE stores adverse events (many rows per subject). To analyze AEs by treatment arm, we need to combine these - that’s what a join does. We match rows from two tables based on a shared key column (here: USUBJID). The dplyr::select() before the join ensures we only bring in the specific DM columns we need, keeping the result clean. (Reference: dplyr two-table verbs documentation)
# Join Demographics (DM) with Adverse Events (AE)
# We select key DM variables and join on USUBJID
dm_ae <- dm %>%
# Select only the DM variables we need
dplyr::select(USUBJID, RFSTDTC, RFENDTC, ARM, AGE, SEX, RACE) %>%
# Inner join keeps only subjects that appear in both datasets
dplyr::inner_join(ae, by = "USUBJID")
# How many records do we have now?
cat("Number of AE records with demographic info:", nrow(dm_ae), "\n")Number of AE records with demographic info: 1191
# Preview the joined data
utils::head(dm_ae)# A tibble: 6 × 41
USUBJID RFSTDTC RFENDTC ARM AGE SEX RACE STUDYID DOMAIN AESEQ AESPID
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
1 01-701-10… 2014-0… 2014-0… Plac… 63 F WHITE CDISCP… AE 1 E07
2 01-701-10… 2014-0… 2014-0… Plac… 63 F WHITE CDISCP… AE 2 E08
3 01-701-10… 2014-0… 2014-0… Plac… 63 F WHITE CDISCP… AE 3 E06
4 01-701-10… 2012-0… 2012-0… Plac… 64 M WHITE CDISCP… AE 3 E10
5 01-701-10… 2012-0… 2012-0… Plac… 64 M WHITE CDISCP… AE 1 E08
6 01-701-10… 2012-0… 2012-0… Plac… 64 M WHITE CDISCP… AE 2 E09
# ℹ 30 more variables: AETERM <chr>, AELLT <chr>, AELLTCD <dbl>, AEDECOD <chr>,
# AEPTCD <dbl>, AEHLT <chr>, AEHLTCD <dbl>, AEHLGT <chr>, AEHLGTCD <dbl>,
# AEBODSYS <chr>, AEBDSYCD <dbl>, AESOC <chr>, AESOCCD <dbl>, AESEV <chr>,
# AESER <chr>, AEACN <chr>, AEREL <chr>, AEOUT <chr>, AESCAN <chr>,
# AESCONG <chr>, AESDISAB <chr>, AESDTH <chr>, AESHOSP <chr>, AESLIFE <chr>,
# AESOD <chr>, AEDTC <chr>, AESTDTC <chr>, AEENDTC <chr>, AESTDY <dbl>,
# AEENDY <dbl>
Before the join, ae had its own set of columns and rows. After the join:
- Row count:
nrow(dm_ae)equalsnrow(ae)(since every AE record should match a subject in DM for a clean trial dataset). If these numbers differed, it would flag a data quality issue - AE records for subjects not in DM. - Column count:
dm_aenow has 7 columns from DM (USUBJID, RFSTDTC, RFENDTC, ARM, AGE, SEX, RACE) + all columns from AE, minus one USUBJID since the key is merged. This is the “enriched” AE dataset with demographic context. - New columns visible: In the
head()output you’ll now seeARM,AGE,SEX,RACEappearing alongsideAEDECOD,AESEV, etc. - this combination is what allows you to break down AEs by treatment arm in safety tables.
inner_join()?
- inner_join(): Keeps only rows where USUBJID exists in BOTH datasets
- left_join(): Keeps all rows from the left dataset (DM), adds matching rows from right (AE)
- anti_join(): Keeps rows from left where there is NO match in right (useful for finding missing data)
6.3 Step 5: Join DM and EX (Demographics + Exposure)
Exposure data tells us about drug dosing. Let’s combine this with demographics.
The EX (Exposure) domain records every dose of study drug administered to a subject - the drug name, dose, dose unit, route of administration, and start/end dates. In ADaM, EX is the primary source for:
- Calculating treatment-emergent flags (was the AE after the first dose?)
- Deriving last dose date (used for safety follow-up windows)
- Computing dose intensity and total exposure for pharmacokinetic analyses
A subject can have multiple EX rows if they received multiple doses, dose changes, or multiple study periods. (Reference: CDISC SDTMIG - EX Domain, available from CDISC)
# Join Demographics (DM) with Exposure (EX)
dm_ex <- dm %>%
dplyr::select(USUBJID, ARM, AGE, SEX) %>%
dplyr::inner_join(ex, by = "USUBJID")
# Preview the result
dplyr::glimpse(dm_ex)Rows: 591
Columns: 20
$ USUBJID <chr> "01-701-1015", "01-701-1015", "01-701-1015", "01-701-1023", "…
$ ARM <chr> "Placebo", "Placebo", "Placebo", "Placebo", "Placebo", "Xanom…
$ AGE <dbl> 63, 63, 63, 64, 64, 71, 71, 71, 74, 77, 77, 77, 85, 85, 68, 6…
$ SEX <chr> "F", "F", "F", "M", "M", "M", "M", "M", "M", "F", "F", "F", "…
$ STUDYID <chr> "CDISCPILOT01", "CDISCPILOT01", "CDISCPILOT01", "CDISCPILOT01…
$ DOMAIN <chr> "EX", "EX", "EX", "EX", "EX", "EX", "EX", "EX", "EX", "EX", "…
$ EXSEQ <dbl> 1, 2, 3, 1, 2, 1, 2, 3, 1, 1, 2, 3, 1, 2, 1, 2, 3, 1, 1, 2, 1…
$ EXTRT <chr> "PLACEBO", "PLACEBO", "PLACEBO", "PLACEBO", "PLACEBO", "XANOM…
$ EXDOSE <dbl> 0, 0, 0, 0, 0, 54, 81, 54, 54, 54, 81, 54, 0, 0, 54, 54, 54, …
$ EXDOSU <chr> "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "mg", "…
$ EXDOSFRM <chr> "PATCH", "PATCH", "PATCH", "PATCH", "PATCH", "PATCH", "PATCH"…
$ EXDOSFRQ <chr> "QD", "QD", "QD", "QD", "QD", "QD", "QD", "QD", "QD", "QD", "…
$ EXROUTE <chr> "TRANSDERMAL", "TRANSDERMAL", "TRANSDERMAL", "TRANSDERMAL", "…
$ VISITNUM <dbl> 3, 4, 12, 3, 4, 3, 4, 12, 3, 3, 4, 12, 3, 4, 3, 4, 12, 3, 3, …
$ VISIT <chr> "BASELINE", "WEEK 2", "WEEK 24", "BASELINE", "WEEK 2", "BASEL…
$ VISITDY <dbl> 1, 14, 168, 1, 14, 1, 14, 168, 1, 1, 14, 168, 1, 14, 1, 14, 1…
$ EXSTDTC <chr> "2014-01-02", "2014-01-17", "2014-06-19", "2012-08-05", "2012…
$ EXENDTC <chr> "2014-01-16", "2014-06-18", "2014-07-02", "2012-08-27", "2012…
$ EXSTDY <dbl> 1, 16, 169, 1, 24, 1, 15, 173, 1, 1, 16, 171, 1, 15, 1, 16, 1…
$ EXENDY <dbl> 15, 168, 182, 23, 28, 14, 172, 180, 14, 15, 170, 183, 14, 26,…
glimpse(dm_ex) shows you:
- Rows: Equal to
nrow(ex)- one row per dosing record (not per subject). A subject receiving 10 doses contributes 10 rows. - Columns: The 4 selected DM variables (USUBJID, ARM, AGE, SEX) have been added to all EX columns. You should now see both
ARM(from DM) andEXTRT(the actual drug name from EX) side by side - useful for verifying that subjects in the “Placebo” arm received the placebo drug and not an active compound. - Data types: Notice that EX dose variables like
EXDOSEwill be<dbl>(numeric), while most other EX variables are<chr>. The date variablesEXSTDTCandEXENDTCare ISO 8601 character strings, just like in AE.
7 Hands-On: 5+ dplyr Operations
The deliverable for Day 1 requires demonstrating at least 5 dplyr operations. Let’s work through practical examples.
7.1 Operation 1: filter() - Subset Rows
Filter AEs to show only serious adverse events (SAEs).
filter() works
filter() evaluates a logical condition for every row and keeps only those where the condition is TRUE. AESER == "Y" compares the value in the AESER column to the string "Y" - the double equals == is a comparison (not assignment). You can combine multiple conditions with:
&or,for AND (both must be true)|for OR (either can be true)!for NOT (negates the condition)
Example: filter(AESER == "Y" & AESEV == "SEVERE") for serious AND severe AEs.
# Filter for Serious Adverse Events only
serious_aes <- ae %>%
dplyr::filter(AESER == "Y")
cat("Total AEs:", nrow(ae), "\n")Total AEs: 1191
cat("Serious AEs:", nrow(serious_aes), "\n")Serious AEs: 3
serious_aes is a subset of ae with the exact same columns but fewer rows. The two cat() lines reveal:
- How many total adverse events were recorded in the trial.
- How many of those were flagged as serious (
AESER == "Y").
The difference is the count of non-serious AEs. In clinical trials, serious AEs trigger expedited regulatory reporting (e.g., 7-day or 15-day safety reports to the FDA), so isolating them quickly is an everyday task for safety programmers. The percentage of serious AEs relative to total AEs is a key metric reviewed by Data Safety Monitoring Boards (DSMBs).
7.2 Operation 2: select() - Choose Columns
Select only the variables we need for a quick AE overview.
select() works
select() subsets columns (variables), not rows. It takes column names (unquoted) as arguments. You can also:
- Drop columns by prefixing with
-:select(-STUDYID, -DOMAIN) - Rename during selection:
select(Subject = USUBJID, Event = AEDECOD) - Use helper functions:
select(starts_with("AE"))selects all columns beginning with “AE”;select(contains("DTC"))grabs all date columns.
select() does not change row count - it only changes column count. (Reference: dplyr select() documentation)
# Select key AE variables
ae_summary <- ae %>%
dplyr::select(USUBJID, AEDECOD, AESEV, AESER, AESTDTC, AEENDTC)
utils::head(ae_summary)# A tibble: 6 × 6
USUBJID AEDECOD AESEV AESER AESTDTC AEENDTC
<chr> <chr> <chr> <chr> <chr> <chr>
1 01-701-1015 APPLICATION SITE ERYTHEMA MILD N 2014-0… <NA>
2 01-701-1015 APPLICATION SITE PRURITUS MILD N 2014-0… <NA>
3 01-701-1015 DIARRHOEA MILD N 2014-0… 2014-0…
4 01-701-1023 ATRIOVENTRICULAR BLOCK SECOND DEGREE MILD N 2012-0… <NA>
5 01-701-1023 ERYTHEMA MILD N 2012-0… 2012-0…
6 01-701-1023 ERYTHEMA MODERA… N 2012-0… <NA>
Compare ncol(ae) vs ncol(ae_summary) - the full AE domain has many more columns than the 6 we selected. The row count is identical to ae. ae_summary is a focused, readable slice of the data containing just the variables needed for a quick listing of adverse events: who experienced what, how severe, was it serious, and when did it start and end. This pattern - select only what you need - is a best practice that makes downstream code faster and easier to read.
7.3 Operation 3: mutate() - Create New Variables
Create a flag for treatment-related adverse events.
mutate() works
mutate() adds new columns (or overwrites existing ones) without changing the row count. Every expression inside mutate() is evaluated for each row. dplyr::if_else(condition, true_value, false_value) is a vectorized conditional - it’s safer than base R’s ifelse() because it enforces consistent data types for both outcomes.
The | operator here means OR - AEREL == "PROBABLE" | AEREL == "POSSIBLE" is TRUE if either condition holds. You can chain multiple mutate() calls or add many variables in a single call separated by commas. (Reference: dplyr mutate() documentation)
# Create a new variable: treatment-related flag
ae_with_flag <- ae %>%
dplyr::mutate(
# Create a related flag (Y if related, N otherwise)
AEREL_FLAG = dplyr::if_else(AEREL == "PROBABLE" | AEREL == "POSSIBLE", "Y", "N")
)
# Check the distribution
table(ae_with_flag$AEREL_FLAG, useNA = "ifany")
N Y <NA>
483 704 4
ae_with_flag has one more column than ae - the new AEREL_FLAG variable you just derived. Row count is unchanged. The table() output shows you the distribution of this new flag:
- Y: AEs where the investigator assessed the event as probably or possibly related to the study drug - these are treatment-related AEs, a critical subset for benefit-risk assessment.
- N: AEs not considered related to treatment.
<NA>: If any subjects had a missingAERELvalue,if_else()propagatesNA. TheuseNA = "ifany"argument intable()makes missing values visible - always check for NAs before assuming your flag is complete.
The ratio of Y to total tells you the treatment-related AE rate - a key safety metric in every clinical study report.
7.4 Operation 4: group_by() + summarise() - Aggregate Data
Count adverse events by System Organ Class (SOC).
group_by() + summarise() works
group_by() silently marks the data frame with grouping information - it doesn’t visibly change the data. summarise() then collapses each group into a single summary row using the aggregation functions you provide:
n()counts the total number of rows in the group (= total events)n_distinct(USUBJID)counts the number of unique subjects in the group
The difference between n() and n_distinct() is important: a subject who experiences the same SOC twice would contribute 2 to n_events but only 1 to n_subjects. Regulatory tables typically report subjects, not events, to avoid overcounting. arrange(desc(...)) then sorts from highest to lowest. (Reference: dplyr group_by() + summarise() documentation)
# Count AEs by System Organ Class
ae_by_soc <- ae %>%
dplyr::group_by(AEBODSYS) %>%
dplyr::summarise(
n_events = dplyr::n(),
n_subjects = dplyr::n_distinct(USUBJID)
) %>%
dplyr::arrange(dplyr::desc(n_events))
# Show top 10 SOCs
utils::head(ae_by_soc, 10)# A tibble: 10 × 3
AEBODSYS n_events n_subjects
<chr> <int> <int>
1 GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS 292 108
2 SKIN AND SUBCUTANEOUS TISSUE DISORDERS 276 105
3 NERVOUS SYSTEM DISORDERS 101 59
4 CARDIAC DISORDERS 91 44
5 GASTROINTESTINAL DISORDERS 87 53
6 INFECTIONS AND INFESTATIONS 73 39
7 RESPIRATORY, THORACIC AND MEDIASTINAL DISORDERS 53 30
8 INVESTIGATIONS 40 23
9 PSYCHIATRIC DISORDERS 40 29
10 INJURY, POISONING AND PROCEDURAL COMPLICATIONS 29 14
ae_by_soc is a completely transformed data frame:
- Before:
aehad one row per adverse event record (many rows, many columns). - After:
ae_by_sochas one row per System Organ Class (few rows, 3 columns:AEBODSYS,n_events,n_subjects).
This is the power of group_by() + summarise() - it condenses detailed transactional data into an analytical summary. The result, sorted by n_events descending, gives you an instant view of which organ systems are most affected in this trial. This is the backbone of the MedDRA SOC-level adverse event summary table that appears in every Clinical Study Report.
7.5 Operation 5: arrange() - Sort Data
Sort AEs by subject and start date.
arrange() works
arrange() reorders rows based on one or more columns. By default it sorts ascending (A→Z for characters, smallest→largest for numbers). Wrap a column in desc() for descending order. When you provide multiple columns (like USUBJID, AESTDTC), rows are first sorted by the first column, and ties are broken by the second column. arrange() does not change the columns or the number of rows.
In clinical programming, sorting by USUBJID and then chronologically by AESTDTC is a standard step before creating subject-level event narratives or listings. (Reference: dplyr arrange() documentation)
# Sort AEs chronologically within each subject
ae_sorted <- ae %>%
dplyr::select(USUBJID, AEDECOD, AESTDTC, AESEV) %>%
dplyr::arrange(USUBJID, AESTDTC)
utils::head(ae_sorted, 10)# A tibble: 10 × 4
USUBJID AEDECOD AESTDTC AESEV
<chr> <chr> <chr> <chr>
1 01-701-1015 APPLICATION SITE ERYTHEMA 2014-01-03 MILD
2 01-701-1015 APPLICATION SITE PRURITUS 2014-01-03 MILD
3 01-701-1015 DIARRHOEA 2014-01-09 MILD
4 01-701-1023 ERYTHEMA 2012-08-07 MILD
5 01-701-1023 ERYTHEMA 2012-08-07 MODERATE
6 01-701-1023 ERYTHEMA 2012-08-07 MILD
7 01-701-1023 ATRIOVENTRICULAR BLOCK SECOND DEGREE 2012-08-26 MILD
8 01-701-1028 APPLICATION SITE ERYTHEMA 2013-07-21 MILD
9 01-701-1028 APPLICATION SITE PRURITUS 2013-08-08 MILD
10 01-701-1034 APPLICATION SITE PRURITUS 2014-08-27 MILD
The data has the same rows and columns as after the select(), but the order is different. In the output you should see:
- All rows for a given
USUBJIDappear together (grouped by subject). - Within each subject, events are ordered from earliest
AESTDTCto latest. - This chronological view lets you read a subject’s adverse event timeline from top to bottom - essential for writing individual patient narratives in safety reports and for detecting whether adverse events overlap or follow one another.
Note: AESTDTC is a character string in ISO 8601 format (YYYY-MM-DD). Alphabetical sorting of ISO 8601 dates happens to be chronologically correct - another reason CDISC mandates this date format.
7.6 Bonus Operation 6: anti_join() - Find Missing Data
Find subjects in DM who have no adverse events (useful for data QC).
anti_join() works
anti_join(x, y, by = "key") returns all rows from x where the key value is not found in y. Think of it as a “NOT IN” filter. It is one of the most useful functions for data quality checking in clinical programming:
- Find subjects in DM with no exposure records (potential protocol deviations)
- Find subjects with lab results but no demographics (orphan records)
- Find AE records for subjects not in DM (invalid USUBJIDs)
Only the columns from the left-hand dataset (x) are returned - no columns from y are added. (Reference: dplyr filtering joins documentation)
# Find subjects with NO adverse events
subjects_no_ae <- dm %>%
dplyr::select(USUBJID, ARM) %>%
dplyr::anti_join(ae, by = "USUBJID")
cat("Subjects with no AEs:", nrow(subjects_no_ae), "\n")Subjects with no AEs: 81
subjects_no_ae contains only the DM subjects whose USUBJID does not appear anywhere in the AE domain. The count tells you how many enrolled subjects experienced zero adverse events during the trial. This information is used when calculating incidence rates - the denominator must include all subjects, not just those with AEs. It’s also a useful data QC check: in some trials, every subject is expected to have at least one medical history AE entry; a subject with no AE records at all might indicate a data collection gap.
8 Day 1 Deliverable Summary
Today you completed the following:
| Task | Status |
|---|---|
| Installed and loaded pharmaverse packages | Done |
| Loaded DM, AE, EX, LB SDTM domains | Done |
| Explored data with glimpse(), head(), dim() | Done |
| Joined DM-AE (demographics + adverse events) | Done |
| Joined DM-EX (demographics + exposure) | Done |
| Demonstrated 5+ dplyr operations | Done |
9 Key Takeaways
- USUBJID is the universal key - It links all SDTM domains together
- SDTM domains have classes - Special Purpose (DM), Events (AE), Interventions (EX), Findings (LB)
- The CDISC data flow - Raw → SDTM → ADaM → TLFs
- dplyr is essential - Most pharmaverse packages are built on tidyverse principles
- pharmaversesdtm provides real-world example data - Use it for learning and testing
10 Resources
- pharmaverse.org - The central hub for all pharmaverse packages
- pharmaversesdtm GitHub - Source code and documentation
- admiral ‘Get Started’ vignette - Introduction to ADaM creation
- CDISC SDTM Implementation Guide - Official SDTM documentation
- dplyr documentation - Complete reference for data manipulation
11 Frequently Asked Questions
SDTM (Study Data Tabulation Model) is the raw-to-standardized transformation layer. Its purpose is to faithfully represent the collected data in a consistent, submission-ready format - minimal derivations, maximum traceability. Every SDTM variable should map directly back to a source CRF field or lab system.
ADaM (Analysis Data Model) is the analysis-ready layer. It derives new variables (e.g., baseline flags, analysis windows, treatment-emergent flags, study day calculations) that are needed to produce the statistical outputs in your Clinical Study Report.
Rule of thumb: If a statistician needs to compute something, it lives in ADaM. If it was collected directly from the subject, it lives in SDTM. Never perform analysis directly on raw SDTM without first deriving the needed ADaM variables - this ensures reproducibility and regulatory traceability.
SDTM uses ISO 8601 format for dates and times (e.g., "2014-01-02", "2014-01-02T08:30:00"). One key reason they remain character strings in SDTM is that partial dates are allowed - a subject might have an adverse event where only the year and month are known ("2014-01"), or even just the year ("2014"). R’s Date class cannot represent partial dates.
In ADaM, admiral provides dedicated functions like derive_vars_dt() and derive_vars_dtm() that convert ISO 8601 character strings to proper R Date or POSIXct objects, imputing partial dates according to CDISC rules.
%>% (pipe) operator and should I use |> instead?
The %>% operator comes from the magrittr package (re-exported by dplyr). It passes the object on its left as the first argument to the function on its right, allowing you to chain operations into readable pipelines instead of nesting function calls.
# Without pipe (hard to read):
head(arrange(select(ae, USUBJID, AEDECOD), USUBJID), 5)
# With pipe (reads left-to-right like a sentence):
ae %>% select(USUBJID, AEDECOD) %>% arrange(USUBJID) %>% head(5)R 4.1+ introduced the native pipe |> which works similarly but has no package dependency. For most tidyverse/pharmaverse code, both are interchangeable. The main difference is that |> requires R ≥ 4.1 and does not support the . placeholder that %>% allows. New pharmaverse code is gradually migrating to |>.
dplyr::n() vs dplyr::n_distinct() mean, and when should I use each?
n()counts the total number of rows in the current group. If a subject appears 5 times in the group (5 AE records),n()counts all 5.n_distinct(USUBJID)counts the number of unique values ofUSUBJIDin the group. If the same subject appears 5 times,n_distinct(USUBJID)counts them as 1.
In clinical reporting, regulatory guidance (ICH E9) and FDA expectations require adverse event tables to report subject counts (n_distinct), not event counts (n()), in the primary summary. Event counts are reported separately. Always ask: “Am I counting occurrences or people?”
.xpt file and why does clinical data need to be exported as one?
A SAS Transport File (.xpt) is a format originally developed by SAS Institute for transferring datasets between different SAS versions and platforms. The FDA and EMA mandate .xpt format for electronic submissions because it is a well-documented, stable binary format that regulatory reviewers’ SAS systems can reliably read, regardless of the sponsor’s software environment.
Even though pharmaverse teams work in R, the final submission datasets must be exported as .xpt files. The haven package can read and write .xpt files in R, and xportr adds the layer of applying CDISC metadata (variable labels, lengths, formats) before export to ensure the .xpt file meets submission standards.
AEREL and what values can it take in SDTM?
AEREL is the causality assessment variable in the AE domain - it records the investigator’s judgment about whether the adverse event was caused by the study drug. The CDISC SDTMIG defines controlled terminology for this variable via the CDISC CT (Controlled Terminology) codelist C66726:
UNLIKELY- The event is unlikely to be related to the study drugPOSSIBLE- There is a reasonable possibility the event could be relatedPROBABLE- The event is probably related to the study drugDEFINITELY RELATED- The event is definitively relatedNOT RELATED- No relationship to the study drug
Some sponsors use a simpler Yes/No relatedness flag (AEOUT). The exact values used in your study must match your study-specific annotated CRF and the controlled terminology published by CDISC.
inner_join() sometimes produce more rows than the left dataset?
This happens when the right-hand dataset has duplicate key values. For example, if a subject appears 3 times in AE (3 adverse events), and you join DM (1 row for that subject) to AE using inner_join, the result has 3 rows for that subject - the DM row is duplicated once for each matching AE row. This is expected and correct behavior for a one-to-many relationship.
If you accidentally end up with more rows than expected after a join, check for unintended duplicate keys in either dataset using:
ae %>% dplyr::count(USUBJID) %>% dplyr::filter(n > 1)This is a common source of row inflation bugs in clinical programming and should always be verified.
The pharmaverse is a voluntary, cross-industry consortium of pharmaceutical companies (including Roche, GSK, Pfizer, Novartis, and many others) that collaboratively develops and maintains R packages for clinical reporting. It is not an official regulatory body and its packages are not “certified” by the FDA or EMA.
However, the FDA has indicated openness to R-based submissions and the packages are built to comply with CDISC standards, which are regulatory requirements. The admiral package, in particular, generates ADaM datasets that are structurally identical to what production SAS programs would produce - it has been used in actual regulatory submissions.
The key validation requirement is that your company validates the use of any software (including R packages) in a submission context, following your own SOPs and potentially using tools like valtools or thevalidatoR.
12 What’s Next?
In Day 2, we’ll dive deeper into: - SDTM domain classes and their specific structures - Advanced tidyverse operations - Variable naming conventions (–TESTCD, –ORRES, –DY patterns) - Creating data summaries by treatment arm