Day 4: Clinical Date Derivations with lubridate

Study Day, Duration, and Partial Date Imputation

1 Learning Objectives

By the end of Day 4, you will be able to:

Parse ISO 8601 clinical dates using lubridate
Calculate Study Days (--DY variables) adhering to CDISC rules (no Day 0)
Calculate Durations (--DUR) between start and end dates
Handle Partial Dates (e.g., “2023-05”) using imputation strategies
Create reusable functions for date derivations

2 Introduction

2.1 Dates in CDISC (ISO 8601)

In clinical trials, dates and times are very important. CDISC requires that all dates and times follow the ISO 8601 standard format. This makes it easy for computers and people to read and compare dates. The format looks like this: YYYY-MM-DDThh:mm:ss.

Format	Example	Description
Complete date	`2024-01-15`	Full date
Date + time	`2024-01-15T09:30`	Date with time
Year-Month only	`2024-01`	Partial (missing day)
Year only	`2024`	Partial (month + day)

2.1.1 Why does this matter?

If everyone uses the same date format, it’s much easier to merge, analyze, and review data. It also prevents mistakes that can happen if dates are written in different ways (like 01/02/2024- is that January 2 or February 1?).

2.2 The “No Day 0” Rule

When calculating study days (like how many days since a subject started treatment), CDISC has a special rule: There is NO Day 0. The reference date (usually the first dose) is always Day 1. Days before the reference date are negative numbers (Day -1, Day -2, etc.).

Day 1 = Reference date (usually first dose)
Day -1 = The day before reference date
There is NO Day 0

2.2.1 Why is this important?

This rule keeps things clear and avoids confusion. If you see Day 1, you know it’s the first day of treatment. There’s never a Day 0 to worry about.

2.3 The Role of `lubridate`

Working with dates in R can be tricky, but the lubridate package (part of the tidyverse) makes it much easier. It helps you:

Parse (read) dates from character strings
Do date math (like adding days)
Extract parts of a date (year, month, day)

This is very helpful for clinical data, where you often need to calculate things like study day, duration, or handle partial dates.

3 Package Installation & Loading

3.1 Required Packages

Package	Purpose
`dplyr`	Data manipulation
`lubridate`	Date/time manipulation
`pharmaversesdtm`	Source data

3.2 Load Packages

library(dplyr)
library(lubridate)
library(pharmaversesdtm)

# Load the data we'll use
data("ae", package = "pharmaversesdtm")
data("dm", package = "pharmaversesdtm")

4 Part 1: Basic Date Parsing

First, let’s learn how to convert character strings (which SDTM dates always are) into actual Date objects in R.

4.1 Parsing Complete Dates

# Use ymd() for Year-Month-Day strings
date_str <- "2024-01-15"
date_obj <- ymd(date_str)

cat("Original string:", date_str, "\n")

Original string: 2024-01-15

cat("Parsed date:", as.character(date_obj), "\n")

Parsed date: 2024-01-15

cat("Class:", class(date_obj), "\n")

Class: Date

4.2 Parsing Dates with Times

# ymd_hms() for full datetime
datetime_full <- "2024-01-15T14:30:00"
ymd_hms(datetime_full)

[1] "2024-01-15 14:30:00 UTC"

# ymd_hm() when seconds are missing
datetime_short <- "2024-01-15T14:30"
ymd_hm(datetime_short)

[1] "2024-01-15 14:30:00 UTC"

4.3 Extracting Date Components

sample_date <- ymd("2024-03-15")

cat("Year:", year(sample_date), "\n")

Year: 2024

cat("Month:", month(sample_date), "\n")

Month: 3

cat("Day:", day(sample_date), "\n")

Day: 15

cat("Weekday:", wday(sample_date, label = TRUE), "\n")

Weekday: 6

5 Part 2: Calculating Study Day (`--DY`)

The Study Day (--DY) indicates the number of days from the Reference Start Date (RFSTDTC).

5.1 The Formula

If event_date >= reference_date:
    Study Day = (event_date - reference_date) + 1

If event_date < reference_date:
    Study Day = (event_date - reference_date)

5.2 Creating a Reusable Function

This function can be reused across all your SDTM and ADaM programs.

#' Derive Study Day following CDISC "No Day 0" convention
#' 
#' @param event_date Date of the event
#' @param reference_date Reference start date (usually RFSTDTC)
#' @return Integer study day (positive = on/after ref, negative = before ref)
derive_study_day <- function(event_date, reference_date) {
  # Calculate raw difference in days
  diff_days <- as.numeric(event_date - reference_date)
  
  # Apply "No Day 0" rule
  study_day <- ifelse(diff_days >= 0, diff_days + 1, diff_days)
  
  return(as.integer(study_day))
}

# Test the function
test_ref <- ymd("2024-01-01")
test_dates <- c("2023-12-30", "2023-12-31", "2024-01-01", "2024-01-02", "2024-01-05")

data.frame(
  Event_Date = test_dates,
  Reference = "2024-01-01",
  Study_Day = derive_study_day(ymd(test_dates), test_ref)
)

  Event_Date  Reference Study_Day
1 2023-12-30 2024-01-01        -2
2 2023-12-31 2024-01-01        -1
3 2024-01-01 2024-01-01         1
4 2024-01-02 2024-01-01         2
5 2024-01-05 2024-01-01         5

Notice the “No Day 0”

Look at the output above: Day -2, Day -1, Day 1, Day 2, Day 5. There is no Day 0!

5.3 Applying to Real Data: Derive `AESTDY`

# Merge RFSTDTC from DM into AE
ae_with_ref <- ae %>%
  select(USUBJID, AESEQ, AESTDTC, AEENDTC) %>%
  inner_join(
    dm %>% select(USUBJID, RFSTDTC),
    by = "USUBJID"
  ) %>%
  # Filter to complete dates only (length >= 10)
  filter(nchar(AESTDTC) >= 10 & nchar(RFSTDTC) >= 10)

# Apply our function
ae_study_days <- ae_with_ref %>%
  mutate(
    AE_DT = ymd(substring(AESTDTC, 1, 10)),
    REF_DT = ymd(substring(RFSTDTC, 1, 10)),
    AESTDY = derive_study_day(AE_DT, REF_DT)
  )

# Preview
ae_study_days %>%
  select(USUBJID, AESTDTC, RFSTDTC, AESTDY) %>%
  head(10)

# A tibble: 10 × 4
   USUBJID     AESTDTC    RFSTDTC    AESTDY
   <chr>       <chr>      <chr>       <int>
 1 01-701-1015 2014-01-03 2014-01-02      2
 2 01-701-1015 2014-01-03 2014-01-02      2
 3 01-701-1015 2014-01-09 2014-01-02      8
 4 01-701-1023 2012-08-26 2012-08-05     22
 5 01-701-1023 2012-08-07 2012-08-05      3
 6 01-701-1023 2012-08-07 2012-08-05      3
 7 01-701-1023 2012-08-07 2012-08-05      3
 8 01-701-1028 2013-07-21 2013-07-19      3
 9 01-701-1028 2013-08-08 2013-07-19     21
10 01-701-1034 2014-08-27 2014-07-01     58

6 Part 3: Calculating Duration (`--DUR`)

Duration is typically: \(EndDate - StartDate + 1\) (inclusive of both days).

6.1 Duration in Days

# Filter for records with both start and end dates
ae_dur <- ae %>%
  select(USUBJID, AESEQ, AESTDTC, AEENDTC) %>%
  filter(nchar(AESTDTC) >= 10 & nchar(AEENDTC) >= 10) %>%
  mutate(
    START_DT = ymd(substring(AESTDTC, 1, 10)),
    END_DT = ymd(substring(AEENDTC, 1, 10)),
    
    # Duration in days (inclusive)
    AEDUR = as.integer(END_DT - START_DT) + 1L
  )

ae_dur %>%
  select(USUBJID, AESTDTC, AEENDTC, AEDUR) %>%
  head(10)

# A tibble: 10 × 4
   USUBJID     AESTDTC    AEENDTC    AEDUR
   <chr>       <chr>      <chr>      <int>
 1 01-701-1015 2014-01-09 2014-01-11     3
 2 01-701-1023 2012-08-07 2012-08-30    24
 3 01-701-1023 2012-08-07 2012-08-30    24
 4 01-701-1047 2013-02-12 2013-02-12     1
 5 01-701-1047 2013-02-12 2013-02-12     1
 6 01-701-1097 2014-02-20 2014-02-22     3
 7 01-701-1097 2014-04-19 2014-04-22     4
 8 01-701-1097 2014-04-19 2014-04-22     4
 9 01-701-1097 2014-02-20 2014-02-20     1
10 01-701-1097 2014-03-21 2014-03-21     1

6.2 Duration Statistics

ae_dur %>%
  summarise(
    N = n(),
    Mean_Duration = round(mean(AEDUR, na.rm = TRUE), 1),
    Median_Duration = median(AEDUR, na.rm = TRUE),
    Min = min(AEDUR, na.rm = TRUE),
    Max = max(AEDUR, na.rm = TRUE)
  )

# A tibble: 1 × 5
      N Mean_Duration Median_Duration   Min   Max
  <int>         <dbl>           <dbl> <int> <int>
1   714          23.8              11     1   444

7 Part 4: Handling Partial Dates

In clinical data, dates are often partial (e.g., “2023-06” or just “2023”). We cannot calculate durations or study days without imputing these to full dates.

7.1 Common Imputation Strategies

Date Type	Missing Parts	Start Date Imputation	End Date Imputation
Year-Month	Day	01 (first)	Last day of month
Year only	Month + Day	01-01	12-31

7.2 Reusable Imputation Functions

#' Impute partial START dates (conservative: use earliest possible date)
#' 
#' @param date_str Character date string in ISO 8601 format
#' @return Character string with complete date
impute_start_date <- function(date_str) {
  dplyr::case_when(
    is.na(date_str) | date_str == "" ~ NA_character_,
    # Only Year (YYYY) -> YYYY-01-01
    nchar(date_str) == 4 ~ paste0(date_str, "-01-01"),
    # Year-Month (YYYY-MM) -> YYYY-MM-01
    nchar(date_str) == 7 ~ paste0(date_str, "-01"),
    # Full date -> Keep first 10 characters
    TRUE ~ substring(date_str, 1, 10)
  )
}

#' Impute partial END dates (conservative: use latest possible date)
#' 
#' @param date_str Character date string in ISO 8601 format
#' @return Character string with complete date
impute_end_date <- function(date_str) {
  dplyr::case_when(
    is.na(date_str) | date_str == "" ~ NA_character_,
    # Only Year (YYYY) -> YYYY-12-31
    nchar(date_str) == 4 ~ paste0(date_str, "-12-31"),
    # Year-Month (YYYY-MM) -> Last day of that month
    nchar(date_str) == 7 ~ {
      # Parse as first of month, then get last day
      first_of_month <- ymd(paste0(date_str, "-01"))
      last_of_month <- ceiling_date(first_of_month, "month") - days(1)
      as.character(last_of_month)
    },
    # Full date -> Keep first 10 characters
    TRUE ~ substring(date_str, 1, 10)
  )
}

7.3 Testing the Imputation Functions

# Test cases with various partial dates
test_dates <- c("2023-05-12", "2023-05", "2023", "2023-02", NA)

data.frame(
  Original = test_dates,
  Imputed_Start = impute_start_date(test_dates),
  Imputed_End = impute_end_date(test_dates)
)

    Original Imputed_Start Imputed_End
1 2023-05-12    2023-05-12  2023-05-12
2    2023-05    2023-05-01  2023-05-31
3       2023    2023-01-01  2023-12-31
4    2023-02    2023-02-01  2023-02-28
5       <NA>          <NA>        <NA>

Why Different Imputation Rules?

For start dates, we impute to the earliest possible date to be conservative about when something started.
For end dates, we impute to the latest possible date to be conservative about when something ended.
This approach ensures we don’t underestimate durations or miss events.

8 Part 5: Complete Exercise - Derive AESTDY and AEENDY

Let’s bring everything together with a complete derivation.

# Full AE date derivation workflow
ae_dates_complete <- ae %>%
  select(USUBJID, AESEQ, AETERM, AESTDTC, AEENDTC) %>%
  # Join reference date
  inner_join(
    dm %>% select(USUBJID, RFSTDTC),
    by = "USUBJID"
  ) %>%
  mutate(
    # Impute partial dates
    AESTDTC_IMP = impute_start_date(AESTDTC),
    AEENDTC_IMP = impute_end_date(AEENDTC),
    RFSTDTC_IMP = impute_start_date(RFSTDTC),
    
    # Create imputation flags (for traceability)
    AESTDTF = ifelse(nchar(AESTDTC) < 10, "D", NA_character_),
    AEENDTF = ifelse(nchar(AEENDTC) < 10, "D", NA_character_),
    
    # Parse dates
    ASTDT = ymd(AESTDTC_IMP),
    AENDT = ymd(AEENDTC_IMP),
    RFSTDT = ymd(RFSTDTC_IMP),
    
    # Derive study days
    AESTDY = derive_study_day(ASTDT, RFSTDT),
    AEENDY = derive_study_day(AENDT, RFSTDT),
    
    # Derive duration
    AEDUR = ifelse(!is.na(ASTDT) & !is.na(AENDT),
                   as.integer(AENDT - ASTDT) + 1L,
                   NA_integer_)
  )

# Preview key results
ae_dates_complete %>%
  select(USUBJID, AETERM, AESTDY, AEENDY, AEDUR, AESTDTF) %>%
  head(15)

# A tibble: 15 × 6
   USUBJID     AETERM                               AESTDY AEENDY AEDUR AESTDTF
   <chr>       <chr>                                 <int>  <int> <int> <chr>  
 1 01-701-1015 APPLICATION SITE ERYTHEMA                 2     NA    NA <NA>   
 2 01-701-1015 APPLICATION SITE PRURITUS                 2     NA    NA <NA>   
 3 01-701-1015 DIARRHOEA                                 8     10     3 <NA>   
 4 01-701-1023 ATRIOVENTRICULAR BLOCK SECOND DEGREE     22     NA    NA <NA>   
 5 01-701-1023 ERYTHEMA                                  3     26    24 <NA>   
 6 01-701-1023 ERYTHEMA                                  3     NA    NA <NA>   
 7 01-701-1023 ERYTHEMA                                  3     26    24 <NA>   
 8 01-701-1028 APPLICATION SITE ERYTHEMA                 3     NA    NA <NA>   
 9 01-701-1028 APPLICATION SITE PRURITUS                21     NA    NA <NA>   
10 01-701-1034 APPLICATION SITE PRURITUS                58     NA    NA <NA>   
11 01-701-1034 FATIGUE                                 125     NA    NA <NA>   
12 01-701-1047 BUNDLE BRANCH BLOCK LEFT                 27     NA    NA <NA>   
13 01-701-1047 HIATUS HERNIA                             1      1     1 <NA>   
14 01-701-1047 HIATUS HERNIA                             1      1     1 <NA>   
15 01-701-1047 UPPER RESPIRATORY TRACT INFECTION        23     NA    NA <NA>

8.1 Summary Statistics

ae_dates_complete %>%
  summarise(
    Total_Records = n(),
    Records_with_AESTDY = sum(!is.na(AESTDY)),
    Records_with_AEENDY = sum(!is.na(AEENDY)),
    Imputed_Start = sum(!is.na(AESTDTF)),
    Mean_Duration = round(mean(AEDUR, na.rm = TRUE), 1)
  )

# A tibble: 1 × 5
  Total_Records Records_with_AESTDY Records_with_AEENDY Imputed_Start
          <int>               <int>               <int>         <int>
1          1191                1191                 718            26
# ℹ 1 more variable: Mean_Duration <dbl>

9 🎯 Practice Exercise

9.1 Your Turn: Apply to Exposure (EX) Domain

Using the functions we created, derive study days for the Exposure domain.

data("ex", package = "pharmaversesdtm")

# TODO: Complete the following
ex_dates <- ex %>%
  select(USUBJID, EXSEQ, EXSTDTC, EXENDTC) %>%
  inner_join(
    dm %>% select(USUBJID, RFSTDTC),
    by = "USUBJID"
  ) %>%
  mutate(
    # Step 1: Impute dates
    # EXSTDTC_IMP = ???
    # EXENDTC_IMP = ???
    
    # Step 2: Parse dates
    # EXSTDT = ???
    # EXENDT = ???
    
    # Step 3: Derive study days
    # EXSTDY = ???
    # EXENDY = ???
  )

# Check your results
head(ex_dates)

10 Deliverable Summary

Today you completed the following:

Task	Status
Parsed ISO 8601 dates with `lubridate`	✓ Done
Created `derive_study_day()` function	✓ Done
Created `impute_start_date()` function	✓ Done
Created `impute_end_date()` function	✓ Done
Derived AESTDY, AEENDY, AEDUR from AE	✓ Done

11 Key Takeaways

ISO 8601 (YYYY-MM-DD) is the standard format.
No Day 0: Day -1 is immediately followed by Day 1.
Reusable Functions: Create functions for derivations you’ll use repeatedly.
Imputation Flags: Use --DTF variables to indicate imputed dates.
Documentation: Comment your logic for traceability.

12 Resources

13 What’s Next?

In Day 5, we will move to Advanced Tidyverse operations:

Reshaping data with pivot_wider() and pivot_longer()
Creating analysis-ready datasets
Performing complex joins for QC

1 Learning Objectives

2 Introduction

2.1 Dates in CDISC (ISO 8601)

2.1.1 Why does this matter?

2.2 The “No Day 0” Rule

2.2.1 Why is this important?

2.3 The Role of lubridate

3 Package Installation & Loading

3.1 Required Packages

3.2 Load Packages

4 Part 1: Basic Date Parsing

4.1 Parsing Complete Dates

4.2 Parsing Dates with Times

4.3 Extracting Date Components

5 Part 2: Calculating Study Day (--DY)

5.1 The Formula

5.2 Creating a Reusable Function

5.3 Applying to Real Data: Derive AESTDY

6 Part 3: Calculating Duration (--DUR)

6.1 Duration in Days

6.2 Duration Statistics

7 Part 4: Handling Partial Dates

7.1 Common Imputation Strategies

7.2 Reusable Imputation Functions

7.3 Testing the Imputation Functions

8 Part 5: Complete Exercise - Derive AESTDY and AEENDY

8.1 Summary Statistics

9 🎯 Practice Exercise

9.1 Your Turn: Apply to Exposure (EX) Domain

10 Deliverable Summary

11 Key Takeaways

12 Resources

13 What’s Next?

2.3 The Role of `lubridate`

5 Part 2: Calculating Study Day (`--DY`)

5.3 Applying to Real Data: Derive `AESTDY`

6 Part 3: Calculating Duration (`--DUR`)