if (!requireNamespace("dplyr", quietly = TRUE)) suppressMessages(install.packages("dplyr"))
if (!requireNamespace("lubridate", quietly = TRUE)) suppressMessages(install.packages("lubridate"))
if (!requireNamespace("pharmaversesdtm", quietly = TRUE)) suppressMessages(install.packages("pharmaversesdtm"))
if (!requireNamespace("metacore", quietly = TRUE)) suppressMessages(install.packages("metacore"))
if (!requireNamespace("metatools", quietly = TRUE)) suppressMessages(install.packages("metatools"))
if (!requireNamespace("xportr", quietly = TRUE)) suppressMessages(install.packages("xportr"))
library(dplyr)
library(lubridate)
library(pharmaversesdtm)
library(metacore)
library(metatools)
library(xportr)

Day 14: Week 2 Capstone - Metadata-Driven SDTM with metacore & xportr
Specification-Driven Workflows for Submission-Ready Data
1 Learning Objectives
By the end of Day 14 (Week 2 Capstone), you will be able to:
- Load specification objects using metacore - the Pharmaverse metadata standard
- Use metatools to apply metadata checks and select columns from specs
- Use xportr to apply labels, types, formats, lengths, and variable ordering from specs
- Build an end-to-end pipeline: raw data → SDTM → validate → export .xpt
- Appreciate why metadata-driven workflows are the future of clinical programming
- Understand everything needed before starting ADaM datasets in Week 3
2 Why Metadata-Driven Workflows?
2.1 The Problem with Manual Programming
In traditional clinical programming, programmers manually:
- Assign variable labels (label(dm$USUBJID) <- "Unique Subject Identifier")
- Set variable types (character vs numeric)
- Order variables in the correct sequence
- Set variable lengths for transport files
This is error-prone, tedious, and hard to maintain.
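For a concrete sense of the manual approach, here is a minimal base-R sketch (using attr() rather than the Hmisc-style label() call shown above; either way the metadata is hardcoded in the program):

```r
# A hand-built demographics fragment -- purely illustrative data
dm <- data.frame(
  USUBJID = c("CDISC01-101-001", "CDISC01-101-002"),
  AGE     = c(71L, 27L)
)

# Manually assign variable labels, one attr() call per variable
attr(dm$USUBJID, "label") <- "Unique Subject Identifier"
attr(dm$AGE, "label")     <- "Age"

# Every program that touches dm must repeat these assignments --
# exactly the duplication that metadata-driven workflows remove
attr(dm$USUBJID, "label")
# "Unique Subject Identifier"
```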
Instead of hardcoding metadata in your programs, you:
- Define metadata once in a specification file (Excel, or a metacore object)
- Load metadata into R using metacore
- Apply metadata to your datasets using xportr/metatools
- Validate that your datasets match the specification
This ensures:
- Consistency - All datasets follow the same rules
- Reproducibility - Metadata changes automatically propagate
- Compliance - Variable labels, types, and lengths match define.xml
3 Package Installation & Loading
3.1 Required Packages
| Package | Purpose |
|---|---|
| metacore | Load and manage dataset specifications |
| metatools | Apply metadata-based transformations and checks |
| xportr | Apply labels, types, formats, lengths; export .xpt |
| dplyr | Data manipulation |
| pharmaversesdtm | Example SDTM datasets |
4 Understanding metacore
4.1 What is metacore?
metacore is a Pharmaverse package that provides a standardized way to represent dataset specifications in R. Think of it as the bridge between your Excel specification file and your R programs.
4.2 The metacore Object Structure
A metacore object contains multiple related tables:
┌────────────────────────────────────────────────────────────┐
│                      METACORE OBJECT                       │
├────────────────────────────────────────────────────────────┤
│ ds_spec     = Dataset-level metadata                       │
│               (domain name, label, structure)              │
│ ds_vars     = Variable-level metadata                      │
│               (which variables belong to which dataset)    │
│ var_spec    = Variable specifications                      │
│               (variable name, label, type, length, format) │
│ value_spec  = Value-level metadata                         │
│               (codelist values, decode values)             │
│ derivations = Derivation metadata                          │
│               (how variables are derived)                  │
│ codelist    = Code list definitions                        │
│               (controlled terminology)                     │
│ supp        = Supplemental qualifiers                      │
│               (SUPP-- domain information)                  │
└────────────────────────────────────────────────────────────┘
4.3 Creating a metacore Object
In production, you'd load this from an Excel specification. For learning, let's build one:
Metadata components created:
Dataset specs: 3 datasets
Variable assignments: 28 variable-dataset pairs
Variable specs: 25 unique variables
# In production, you would create the metacore object like this:
mc <- metacore::metacore(
  ds_spec = ds_spec,
  ds_vars = ds_vars,
  var_spec = var_spec
)

# Or more commonly, load from a specification file:
mc <- metacore::spec_to_metacore("path/to/specs.xlsx")

The most common way to create a metacore object in production is using spec_to_metacore(), which reads from a formatted Excel specification file. This spec file is typically created by the study statistician or data standards team and contains all the metadata for every dataset and variable.
5 Using xportr: The Metadata Application Engine
5.1 What xportr Does
xportr is the workhorse for making your datasets submission-ready. It applies metadata from your specification to your actual data:
xportr: Key Functions
# A tibble: 6 × 2
Function What_It_Does
<chr> <chr>
1 xportr_type() Coerce variables to the correct type (character/numeric)
2 xportr_length() Set variable lengths for SAS transport
3 xportr_label() Apply variable labels from specification
4 xportr_order() Reorder variables to match specification
5 xportr_format() Apply SAS display formats
6 xportr_write() Export the dataset as a .xpt (SAS transport) file
5.2 Applying xportr Step by Step
Let's work through the full pipeline using the DM domain:
DM dataset before xportr:
Rows: 15
Cols: 16
Rows: 15
Columns: 16
$ STUDYID <chr> "CDISC01", "CDISC01", "CDISC01", "CDISC01", "CDISC01", "CDISC…
$ DOMAIN <chr> "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "DM", "…
$ SUBJID <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009"…
$ SITEID <chr> "101", "101", "101", "101", "102", "102", "102", "101", "103"…
$ AGE <int> 71, 27, 65, 49, 51, 60, 61, 55, 69, 29, 44, 58, 52, 64, 27
$ AGEU <chr> "YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS", "YEARS"…
$ SEX <chr> "F", "M", "F", "F", "F", "M", "M", "F", "F", "F", "F", "F", "…
$ RACE <chr> "WHITE", "BLACK OR AFRICAN AMERICAN", "BLACK OR AFRICAN AMERI…
$ ETHNIC <chr> "NOT HISPANIC OR LATINO", "NOT HISPANIC OR LATINO", "NOT HISP…
$ ARM <chr> "Placebo", "Active 20mg", "Active 20mg", "Placebo", "Placebo"…
$ ARMCD <chr> "PBO", "ACT20", "ACT20", "PBO", "PBO", "ACT10", "PBO", "ACT10…
$ ACTARM <chr> "Placebo", "Active 20mg", "Active 20mg", "Placebo", "Placebo"…
$ ACTARMCD <chr> "PBO", "ACT20", "ACT20", "PBO", "PBO", "ACT10", "PBO", "ACT10…
$ RFSTDTC <chr> "2024-02-09", "2024-01-21", "2024-02-05", "2024-02-26", "2024…
$ RFENDTC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ USUBJID <chr> "CDISC01-101-001", "CDISC01-101-002", "CDISC01-101-003", "CDI…
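The USUBJID values in the glimpse above follow the common STUDYID-SITEID-SUBJID convention. A minimal dplyr sketch of that derivation (illustrative; not necessarily the code that built this dataset):

```r
library(dplyr)

# Two illustrative raw demographics records
raw <- data.frame(
  STUDYID = c("CDISC01", "CDISC01"),
  SITEID  = c("101", "101"),
  SUBJID  = c("001", "002")
)

# Derive the unique subject identifier by concatenating the three IDs
dm <- raw %>%
  mutate(USUBJID = paste(STUDYID, SITEID, SUBJID, sep = "-"))

dm$USUBJID
# "CDISC01-101-001" "CDISC01-101-002"
```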
5.3 Step 2: Create a Specification for xportr
DM specification:
# A tibble: 16 × 5
variable type label length order
<chr> <chr> <chr> <int> <int>
1 STUDYID character Study Identifier 12 1
2 DOMAIN character Domain Abbreviation 2 2
3 USUBJID character Unique Subject Identifier 40 3
4 SUBJID character Subject Identifier for the Study 8 4
5 SITEID character Study Site Identifier 8 5
6 AGE numeric Age 8 6
7 AGEU character Age Units 6 7
8 SEX character Sex 2 8
9 RACE character Race 40 9
10 ETHNIC character Ethnicity 40 10
11 ARM character Planned Arm 40 11
12 ARMCD character Planned Arm Code 20 12
13 ACTARM character Actual Arm 40 13
14 ACTARMCD character Actual Arm Code 20 14
15 RFSTDTC character Subject Reference Start Date/Time 20 15
16 RFENDTC character Subject Reference End Date/Time 20 16
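A specification table like the one printed above can be written down directly with dplyr::tribble(). A sketch covering the first four variables (the full 16-row spec follows the same pattern; the column names match xportr's defaults):

```r
library(dplyr)

# Hand-written variable-level spec in the column layout xportr expects:
# variable, type, label, length, order
dm_spec <- tribble(
  ~variable, ~type,       ~label,                             ~length, ~order,
  "STUDYID", "character", "Study Identifier",                 12L,     1L,
  "DOMAIN",  "character", "Domain Abbreviation",              2L,      2L,
  "USUBJID", "character", "Unique Subject Identifier",        40L,     3L,
  "SUBJID",  "character", "Subject Identifier for the Study", 8L,      4L
)
```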
5.4 Step 3: Apply xportr Functions
After xportr_type():
AGE type: integer
STUDYID type: character
After xportr_label():
Variable Label
1 STUDYID Study Identifier
2 DOMAIN Domain Abbreviation
3 SUBJID Subject Identifier for the Study
4 SITEID Study Site Identifier
5 AGE Age
6 AGEU Age Units
7 SEX Sex
8 RACE Race
After xportr_order():
Variable order: STUDYID, DOMAIN, USUBJID, SUBJID, SITEID, AGE, AGEU, SEX, RACE, ETHNIC, ARM, ARMCD, ACTARM, ACTARMCD, RFSTDTC, RFENDTC
After xportr_length():
All metadata applied ✓
5.5 Step 4: Export as .xpt
Exported: output/dm.xpt
File size: 7920 bytes
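A quick sanity check after export is to read the transport file back. The haven package provides read_xpt() for this (an assumption here is that haven is installed; it is a common dependency in pharmaverse pipelines):

```r
# Round-trip check: read the exported transport file back into R
dm_check <- haven::read_xpt("output/dm.xpt")

# Labels applied by xportr_label() should survive in the .xpt file
attr(dm_check$USUBJID, "label")
```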
In practice, you'd chain all xportr functions together:
dm_final <- raw_dm %>%
  xportr_type(spec, domain = "DM") %>%
  xportr_length(spec, domain = "DM") %>%
  xportr_label(spec, domain = "DM") %>%
  xportr_order(spec, domain = "DM") %>%
  xportr_format(spec, domain = "DM") %>%
  xportr_write("output/dm.xpt")

This single pipeline takes your raw dataset and makes it submission-ready!
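Recent xportr releases also provide xportr_metadata(), which attaches the spec and domain once so they need not be repeated in every call. A sketch under that assumption, reusing the raw_dm and spec objects from the pipeline above:

```r
dm_final <- raw_dm %>%
  xportr_metadata(spec, domain = "DM") %>%  # attach spec + domain once
  xportr_type() %>%      # subsequent calls pick up the attached metadata
  xportr_length() %>%
  xportr_label() %>%
  xportr_order() %>%
  xportr_format() %>%
  xportr_write("output/dm.xpt")
```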
6 Using metatools for Metadata-Based Checks
6.1 What metatools Provides
metatools helps you work with metadata - selecting variables, checking CT compliance, and building datasets from specs:
metatools: Key Functions
# A tibble: 5 × 2
Function Purpose
<chr> <chr>
1 build_from_derived() Create a dataset shell from specification
2 check_ct_col() Check if column values match controlled terminology
3 check_variables() Verify dataset variables match specification
4 combine_supp() Combine SUPP-- with parent domain
5 drop_unspec_vars() Remove variables not in the specification
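Putting a few of these together: a sketch of a typical check sequence, assuming dm is an SDTM dataset and dm_spec is a metacore object already subset to the DM domain (e.g. via metacore::select_dataset(mc, "DM")):

```r
library(metatools)

# Remove any variables that are not in the specification
dm_clean <- drop_unspec_vars(dm, dm_spec)

# Verify the remaining variables match the spec exactly
check_variables(dm_clean, dm_spec)

# Confirm SEX values are within the controlled terminology codelist
dm_clean <- check_ct_col(dm_clean, dm_spec, SEX)
```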
7 End-to-End Capstone Pipeline
Let's build a complete pipeline that takes raw data through to validated, exported SDTM:
=== WEEK 2 CAPSTONE: END-TO-END SDTM PIPELINE ===
Step 1: Generate raw clinical data
Demographics: 20 subjects
Adverse Events: 30 records
Step 2: Transform to SDTM format
SDTM DM: 20 rows x 16 cols
SDTM AE: 30 rows x 12 cols
Step 3: Validate SDTM domains
Orphan AE records: 16
DM required variables: All present ✓
AE required variables: All present ✓
SEX controlled terminology: Valid ✓
AESEV controlled terminology: Valid ✓
AE date logic (start <= end): Valid ✓
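The orphan-record check in Step 3 can be done with dplyr::anti_join(). A minimal sketch with toy data (illustrative IDs, not the capstone dataset):

```r
library(dplyr)

dm <- data.frame(USUBJID = c("S-001", "S-002"))
ae <- data.frame(USUBJID = c("S-001", "S-003"),
                 AETERM  = c("HEADACHE", "NAUSEA"))

# AE records whose subject does not appear in DM ("orphans")
orphans <- anti_join(ae, dm, by = "USUBJID")

nrow(orphans)
# 1  (the S-003 record has no matching DM subject)
```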
Step 4: Apply metadata and export
DM metadata applied:
Variables: 16
Order: STUDYID → DOMAIN → USUBJID → SUBJID → SITEID → AGE → AGEU → SEX → RACE → ETHNIC → ARM → ARMCD → ACTARM → ACTARMCD → RFSTDTC → RFENDTC
Labels applied: 16 of 16
Step 5: Export as .xpt files
✓ Exported: output/dm.xpt - 9520 bytes
✓ Exported: output/ae.xpt - 25120 bytes
8 Week 2 Review: Everything You've Learned
=== WEEK 2 COMPLETE REVIEW ===
# A tibble: 7 × 3
Day Topic Key_Skill
<chr> <chr> <chr>
1 8 LB Domain & Unit Standardization Unit conversion, reference ranges,…
2 9 VS & Repeated Measures Multiple readings, VSPOS, VSTPT, w…
3 10 AE Domain Mastery & SAE Logic Severity vs seriousness, TEAE, SAE…
4 11 Disposition & Trial Design DS domain, EPOCH, TA/TV/TS, ADSL p…
5 12 Data Cuts with datacutr Patient-level & record-level cuts,…
6 13 SDTM Validation with sdtmchecks FDA business rules, cross-domain c…
7 14 Metadata-Driven SDTM (this capstone) metacore, metatools, xportr pipeli…
8.1 What You're Now Ready For
You now have a solid foundation in SDTM:
In Week 3, we will use admiral to build ADaM datasets (ADSL, ADAE, ADVS, ADLB) from the SDTM data you've mastered.
9 Deliverable Summary
Today you completed the following:
| Task | Status |
|---|---|
| Understood metacore specification objects | ✓ Done |
| Created variable specifications for DM and AE | ✓ Done |
| Applied xportr_type, xportr_label, xportr_order, xportr_length | ✓ Done |
| Exported submission-ready .xpt files | ✓ Done |
| Built an end-to-end pipeline: raw → SDTM → validate → export | ✓ Done |
| Reviewed all Week 2 topics | ✓ Done |
10 Key Takeaways
- Metadata-driven is the future - Define once, apply everywhere
- metacore standardizes specs - One R object for all dataset/variable metadata
- xportr applies metadata - Types, labels, lengths, ordering, and export
- metatools enables checks - Verify your data matches the specification
- The pipeline is reproducible - Same spec + same code = same output every time
- You're ready for ADaM - All SDTM fundamentals are in place
11 Resources
- metacore Documentation - Official metacore package
- metatools Documentation - Metadata utility functions
- xportr Documentation - SAS transport export
- Admiral Documentation - ADaM derivation package (Week 3!)
- Pharmaverse.org - R packages for clinical data
- FDA Data Standards Resources - FDA guidance
12 🎉 Congratulations! Week 2 Complete!
You've now mastered:
- Complex SDTM domains (LB, VS, AE, DS)
- Production workflows (data cuts, validation, metadata)
- Pharmaverse tools (datacutr, sdtmchecks, metacore, xportr)
- Regulatory requirements (SAE logic, TEAE derivation, controlled terminology)
13 Whatβs Next?
Week 3: ADaM Datasets with Admiral
- Using admiral to derive ADaM datasets from SDTM
- Building ADSL (Subject-Level Analysis Dataset)
- Creating BDS datasets: ADVS, ADLB
- Deriving baseline, change from baseline, shift tables
- ADAE creation with treatment-emergent logic