library(dplyr)
library(lubridate)
library(pharmaversesdtm)
# Load the data we'll use
data("ae", package = "pharmaversesdtm")
data("dm", package = "pharmaversesdtm")Day 4: Clinical Date Derivations with lubridate
Study Day, Duration, and Partial Date Imputation
1 Learning Objectives
By the end of Day 4, you will be able to:
- Parse ISO 8601 clinical dates using
lubridate - Calculate Study Days (
--DYvariables) adhering to CDISC rules (no Day 0) - Calculate Durations (
--DUR) between start and end dates - Handle Partial Dates (e.g., “2023-05”) using imputation strategies
- Create reusable functions for date derivations
2 Introduction
2.1 Dates in CDISC (ISO 8601)
In clinical trials, dates and times are very important. CDISC requires that all dates and times follow the ISO 8601 standard format. This makes it easy for computers and people to read and compare dates. The format looks like this: YYYY-MM-DDThh:mm:ss.
| Format | Example | Description |
|---|---|---|
| Complete date | 2024-01-15 |
Full date |
| Date + time | 2024-01-15T09:30 |
Date with time |
| Year-Month only | 2024-01 |
Partial (missing day) |
| Year only | 2024 |
Partial (month + day) |
2.1.1 Why does this matter?
If everyone uses the same date format, it’s much easier to merge, analyze, and review data. It also prevents mistakes that can happen if dates are written in different ways (like 01/02/2024- is that January 2 or February 1?).
2.2 The “No Day 0” Rule
When calculating study days (like how many days since a subject started treatment), CDISC has a special rule: There is NO Day 0. The reference date (usually the first dose) is always Day 1. Days before the reference date are negative numbers (Day -1, Day -2, etc.).
- Day 1 = Reference date (usually first dose)
- Day -1 = The day before reference date
- There is NO Day 0
2.2.1 Why is this important?
This rule keeps things clear and avoids confusion. If you see Day 1, you know it’s the first day of treatment. There’s never a Day 0 to worry about.
2.3 The Role of lubridate
Working with dates in R can be tricky, but the lubridate package (part of the tidyverse) makes it much easier. It helps you:
- Parse (read) dates from character strings
- Do date math (like adding days)
- Extract parts of a date (year, month, day)
This is very helpful for clinical data, where you often need to calculate things like study day, duration, or handle partial dates.
3 Package Installation & Loading
3.1 Required Packages
| Package | Purpose |
|---|---|
dplyr |
Data manipulation |
lubridate |
Date/time manipulation |
pharmaversesdtm |
Source data |
3.2 Load Packages
4 Part 1: Basic Date Parsing
First, let’s learn how to convert character strings (which SDTM dates always are) into actual Date objects in R.
4.1 Parsing Complete Dates
# Use ymd() for Year-Month-Day strings
date_str <- "2024-01-15"
date_obj <- ymd(date_str)
cat("Original string:", date_str, "\n")Original string: 2024-01-15
cat("Parsed date:", as.character(date_obj), "\n")Parsed date: 2024-01-15
cat("Class:", class(date_obj), "\n")Class: Date
4.2 Parsing Dates with Times
# ymd_hms() for full datetime
datetime_full <- "2024-01-15T14:30:00"
ymd_hms(datetime_full)[1] "2024-01-15 14:30:00 UTC"
# ymd_hm() when seconds are missing
datetime_short <- "2024-01-15T14:30"
ymd_hm(datetime_short)[1] "2024-01-15 14:30:00 UTC"
4.3 Extracting Date Components
sample_date <- ymd("2024-03-15")
cat("Year:", year(sample_date), "\n")Year: 2024
cat("Month:", month(sample_date), "\n")Month: 3
cat("Day:", day(sample_date), "\n")Day: 15
cat("Weekday:", wday(sample_date, label = TRUE), "\n")Weekday: 6
5 Part 2: Calculating Study Day (--DY)
The Study Day (--DY) indicates the number of days from the Reference Start Date (RFSTDTC).
5.1 The Formula
If event_date >= reference_date:
Study Day = (event_date - reference_date) + 1
If event_date < reference_date:
Study Day = (event_date - reference_date)
5.2 Creating a Reusable Function
This function can be reused across all your SDTM and ADaM programs.
#' Derive Study Day following CDISC "No Day 0" convention
#'
#' @param event_date Date of the event
#' @param reference_date Reference start date (usually RFSTDTC)
#' @return Integer study day (positive = on/after ref, negative = before ref)
derive_study_day <- function(event_date, reference_date) {
# Calculate raw difference in days
diff_days <- as.numeric(event_date - reference_date)
# Apply "No Day 0" rule
study_day <- ifelse(diff_days >= 0, diff_days + 1, diff_days)
return(as.integer(study_day))
}
# Test the function
test_ref <- ymd("2024-01-01")
test_dates <- c("2023-12-30", "2023-12-31", "2024-01-01", "2024-01-02", "2024-01-05")
data.frame(
Event_Date = test_dates,
Reference = "2024-01-01",
Study_Day = derive_study_day(ymd(test_dates), test_ref)
) Event_Date Reference Study_Day
1 2023-12-30 2024-01-01 -2
2 2023-12-31 2024-01-01 -1
3 2024-01-01 2024-01-01 1
4 2024-01-02 2024-01-01 2
5 2024-01-05 2024-01-01 5
Look at the output above: Day -2, Day -1, Day 1, Day 2, Day 5. There is no Day 0!
5.3 Applying to Real Data: Derive AESTDY
# Merge RFSTDTC from DM into AE
ae_with_ref <- ae %>%
select(USUBJID, AESEQ, AESTDTC, AEENDTC) %>%
inner_join(
dm %>% select(USUBJID, RFSTDTC),
by = "USUBJID"
) %>%
# Filter to complete dates only (length >= 10)
filter(nchar(AESTDTC) >= 10 & nchar(RFSTDTC) >= 10)
# Apply our function
ae_study_days <- ae_with_ref %>%
mutate(
AE_DT = ymd(substring(AESTDTC, 1, 10)),
REF_DT = ymd(substring(RFSTDTC, 1, 10)),
AESTDY = derive_study_day(AE_DT, REF_DT)
)
# Preview
ae_study_days %>%
select(USUBJID, AESTDTC, RFSTDTC, AESTDY) %>%
head(10)# A tibble: 10 × 4
USUBJID AESTDTC RFSTDTC AESTDY
<chr> <chr> <chr> <int>
1 01-701-1015 2014-01-03 2014-01-02 2
2 01-701-1015 2014-01-03 2014-01-02 2
3 01-701-1015 2014-01-09 2014-01-02 8
4 01-701-1023 2012-08-26 2012-08-05 22
5 01-701-1023 2012-08-07 2012-08-05 3
6 01-701-1023 2012-08-07 2012-08-05 3
7 01-701-1023 2012-08-07 2012-08-05 3
8 01-701-1028 2013-07-21 2013-07-19 3
9 01-701-1028 2013-08-08 2013-07-19 21
10 01-701-1034 2014-08-27 2014-07-01 58
6 Part 3: Calculating Duration (--DUR)
Duration is typically: \(EndDate - StartDate + 1\) (inclusive of both days).
6.1 Duration in Days
# Filter for records with both start and end dates
ae_dur <- ae %>%
select(USUBJID, AESEQ, AESTDTC, AEENDTC) %>%
filter(nchar(AESTDTC) >= 10 & nchar(AEENDTC) >= 10) %>%
mutate(
START_DT = ymd(substring(AESTDTC, 1, 10)),
END_DT = ymd(substring(AEENDTC, 1, 10)),
# Duration in days (inclusive)
AEDUR = as.integer(END_DT - START_DT) + 1L
)
ae_dur %>%
select(USUBJID, AESTDTC, AEENDTC, AEDUR) %>%
head(10)# A tibble: 10 × 4
USUBJID AESTDTC AEENDTC AEDUR
<chr> <chr> <chr> <int>
1 01-701-1015 2014-01-09 2014-01-11 3
2 01-701-1023 2012-08-07 2012-08-30 24
3 01-701-1023 2012-08-07 2012-08-30 24
4 01-701-1047 2013-02-12 2013-02-12 1
5 01-701-1047 2013-02-12 2013-02-12 1
6 01-701-1097 2014-02-20 2014-02-22 3
7 01-701-1097 2014-04-19 2014-04-22 4
8 01-701-1097 2014-04-19 2014-04-22 4
9 01-701-1097 2014-02-20 2014-02-20 1
10 01-701-1097 2014-03-21 2014-03-21 1
6.2 Duration Statistics
ae_dur %>%
summarise(
N = n(),
Mean_Duration = round(mean(AEDUR, na.rm = TRUE), 1),
Median_Duration = median(AEDUR, na.rm = TRUE),
Min = min(AEDUR, na.rm = TRUE),
Max = max(AEDUR, na.rm = TRUE)
)# A tibble: 1 × 5
N Mean_Duration Median_Duration Min Max
<int> <dbl> <dbl> <int> <int>
1 714 23.8 11 1 444
7 Part 4: Handling Partial Dates
In clinical data, dates are often partial (e.g., “2023-06” or just “2023”). We cannot calculate durations or study days without imputing these to full dates.
7.1 Common Imputation Strategies
| Date Type | Missing Parts | Start Date Imputation | End Date Imputation |
|---|---|---|---|
| Year-Month | Day | 01 (first) | Last day of month |
| Year only | Month + Day | 01-01 | 12-31 |
7.2 Reusable Imputation Functions
#' Impute partial START dates (conservative: use earliest possible date)
#'
#' @param date_str Character date string in ISO 8601 format
#' @return Character string with complete date
impute_start_date <- function(date_str) {
dplyr::case_when(
is.na(date_str) | date_str == "" ~ NA_character_,
# Only Year (YYYY) -> YYYY-01-01
nchar(date_str) == 4 ~ paste0(date_str, "-01-01"),
# Year-Month (YYYY-MM) -> YYYY-MM-01
nchar(date_str) == 7 ~ paste0(date_str, "-01"),
# Full date -> Keep first 10 characters
TRUE ~ substring(date_str, 1, 10)
)
}
#' Impute partial END dates (conservative: use latest possible date)
#'
#' @param date_str Character date string in ISO 8601 format
#' @return Character string with complete date
impute_end_date <- function(date_str) {
dplyr::case_when(
is.na(date_str) | date_str == "" ~ NA_character_,
# Only Year (YYYY) -> YYYY-12-31
nchar(date_str) == 4 ~ paste0(date_str, "-12-31"),
# Year-Month (YYYY-MM) -> Last day of that month
nchar(date_str) == 7 ~ {
# Parse as first of month, then get last day
first_of_month <- ymd(paste0(date_str, "-01"))
last_of_month <- ceiling_date(first_of_month, "month") - days(1)
as.character(last_of_month)
},
# Full date -> Keep first 10 characters
TRUE ~ substring(date_str, 1, 10)
)
}7.3 Testing the Imputation Functions
# Test cases with various partial dates
test_dates <- c("2023-05-12", "2023-05", "2023", "2023-02", NA)
data.frame(
Original = test_dates,
Imputed_Start = impute_start_date(test_dates),
Imputed_End = impute_end_date(test_dates)
) Original Imputed_Start Imputed_End
1 2023-05-12 2023-05-12 2023-05-12
2 2023-05 2023-05-01 2023-05-31
3 2023 2023-01-01 2023-12-31
4 2023-02 2023-02-01 2023-02-28
5 <NA> <NA> <NA>
- For start dates, we impute to the earliest possible date to be conservative about when something started.
- For end dates, we impute to the latest possible date to be conservative about when something ended.
- This approach ensures we don’t underestimate durations or miss events.
8 Part 5: Complete Exercise - Derive AESTDY and AEENDY
Let’s bring everything together with a complete derivation.
# Full AE date derivation workflow
ae_dates_complete <- ae %>%
select(USUBJID, AESEQ, AETERM, AESTDTC, AEENDTC) %>%
# Join reference date
inner_join(
dm %>% select(USUBJID, RFSTDTC),
by = "USUBJID"
) %>%
mutate(
# Impute partial dates
AESTDTC_IMP = impute_start_date(AESTDTC),
AEENDTC_IMP = impute_end_date(AEENDTC),
RFSTDTC_IMP = impute_start_date(RFSTDTC),
# Create imputation flags (for traceability)
AESTDTF = ifelse(nchar(AESTDTC) < 10, "D", NA_character_),
AEENDTF = ifelse(nchar(AEENDTC) < 10, "D", NA_character_),
# Parse dates
ASTDT = ymd(AESTDTC_IMP),
AENDT = ymd(AEENDTC_IMP),
RFSTDT = ymd(RFSTDTC_IMP),
# Derive study days
AESTDY = derive_study_day(ASTDT, RFSTDT),
AEENDY = derive_study_day(AENDT, RFSTDT),
# Derive duration
AEDUR = ifelse(!is.na(ASTDT) & !is.na(AENDT),
as.integer(AENDT - ASTDT) + 1L,
NA_integer_)
)
# Preview key results
ae_dates_complete %>%
select(USUBJID, AETERM, AESTDY, AEENDY, AEDUR, AESTDTF) %>%
head(15)# A tibble: 15 × 6
USUBJID AETERM AESTDY AEENDY AEDUR AESTDTF
<chr> <chr> <int> <int> <int> <chr>
1 01-701-1015 APPLICATION SITE ERYTHEMA 2 NA NA <NA>
2 01-701-1015 APPLICATION SITE PRURITUS 2 NA NA <NA>
3 01-701-1015 DIARRHOEA 8 10 3 <NA>
4 01-701-1023 ATRIOVENTRICULAR BLOCK SECOND DEGREE 22 NA NA <NA>
5 01-701-1023 ERYTHEMA 3 26 24 <NA>
6 01-701-1023 ERYTHEMA 3 NA NA <NA>
7 01-701-1023 ERYTHEMA 3 26 24 <NA>
8 01-701-1028 APPLICATION SITE ERYTHEMA 3 NA NA <NA>
9 01-701-1028 APPLICATION SITE PRURITUS 21 NA NA <NA>
10 01-701-1034 APPLICATION SITE PRURITUS 58 NA NA <NA>
11 01-701-1034 FATIGUE 125 NA NA <NA>
12 01-701-1047 BUNDLE BRANCH BLOCK LEFT 27 NA NA <NA>
13 01-701-1047 HIATUS HERNIA 1 1 1 <NA>
14 01-701-1047 HIATUS HERNIA 1 1 1 <NA>
15 01-701-1047 UPPER RESPIRATORY TRACT INFECTION 23 NA NA <NA>
8.1 Summary Statistics
ae_dates_complete %>%
summarise(
Total_Records = n(),
Records_with_AESTDY = sum(!is.na(AESTDY)),
Records_with_AEENDY = sum(!is.na(AEENDY)),
Imputed_Start = sum(!is.na(AESTDTF)),
Mean_Duration = round(mean(AEDUR, na.rm = TRUE), 1)
)# A tibble: 1 × 5
Total_Records Records_with_AESTDY Records_with_AEENDY Imputed_Start
<int> <int> <int> <int>
1 1191 1191 718 26
# ℹ 1 more variable: Mean_Duration <dbl>
9 🎯 Practice Exercise
9.1 Your Turn: Apply to Exposure (EX) Domain
Using the functions we created, derive study days for the Exposure domain.
data("ex", package = "pharmaversesdtm")
# TODO: Complete the following
ex_dates <- ex %>%
select(USUBJID, EXSEQ, EXSTDTC, EXENDTC) %>%
inner_join(
dm %>% select(USUBJID, RFSTDTC),
by = "USUBJID"
) %>%
mutate(
# Step 1: Impute dates
# EXSTDTC_IMP = ???
# EXENDTC_IMP = ???
# Step 2: Parse dates
# EXSTDT = ???
# EXENDT = ???
# Step 3: Derive study days
# EXSTDY = ???
# EXENDY = ???
)
# Check your results
head(ex_dates)10 Deliverable Summary
Today you completed the following:
| Task | Status |
|---|---|
Parsed ISO 8601 dates with lubridate |
✓ Done |
Created derive_study_day() function |
✓ Done |
Created impute_start_date() function |
✓ Done |
Created impute_end_date() function |
✓ Done |
| Derived AESTDY, AEENDY, AEDUR from AE | ✓ Done |
11 Key Takeaways
- ISO 8601 (
YYYY-MM-DD) is the standard format. - No Day 0: Day -1 is immediately followed by Day 1.
- Reusable Functions: Create functions for derivations you’ll use repeatedly.
- Imputation Flags: Use
--DTFvariables to indicate imputed dates. - Documentation: Comment your logic for traceability.
12 Resources
13 What’s Next?
In Day 5, we will move to Advanced Tidyverse operations:
- Reshaping data with
pivot_wider()andpivot_longer() - Creating analysis-ready datasets
- Performing complex joins for QC