30 Days of Pharmaverse


Day 12: Data Cuts with datacutr

Applying Clinical Cutoff Dates for Interim & Final Analyses

1 Learning Objectives

By the end of Day 12, you will be able to:

  1. Explain what a clinical data cut is and why it’s essential for submissions
  2. Understand the difference between patient-level and record-level data cuts
  3. Use the datacutr package to apply cutoff dates across multiple SDTM domains
  4. Handle edge cases: ongoing AEs, partial dates, and records at the cutoff boundary
  5. Build a reproducible data cut pipeline that works across DM, AE, LB, and VS domains

2 What is a Data Cut?

2.1 The Problem

Clinical trials collect data continuously over months or years. But at some point, you need to freeze the data for analysis - this is the data cutoff.

Timeline of a clinical trial:
────────────────────────────────────────────────────────────────────────
  First Subject In          Interim Analysis         Final Analysis
       │                         │                        │
       ▼                         ▼                        ▼
  ────────────────────────────────────────────────────────────────────
  ██████████████████████████│ DATA CUT │████████████████│ DATA CUT │
                             (Cutoff 1)                  (Cutoff 2)
                                 │                           │
                                 ▼                           ▼
                        Only data ON/BEFORE        Only data ON/BEFORE
                        this date included         this date included
Important: Why Data Cuts Matter
  • Regulatory submissions require data frozen at a specific cutoff date
  • Interim analyses (e.g., for a Data Safety Monitoring Board) need a clean snapshot
  • Data integrity - analyses must be reproducible based on the same data
  • Blinding - in ongoing studies, only data up to the cutoff should be analyzed

2.2 Types of Data Cuts

Cut Type          | What It Means                                           | Example
Patient-level cut | Exclude entire subjects enrolled after cutoff           | Subject consented on 2024-07-01, cutoff is 2024-06-15 → exclude entire subject
Record-level cut  | Keep the subject but remove records after cutoff        | Subject enrolled 2024-01-01, AE started 2024-07-01, cutoff 2024-06-15 → keep subject, remove this AE
Hybrid            | Patient-level for some domains, record-level for others | Common in practice
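
The two rules in the table reduce to a pair of date comparisons. The following is a minimal base R sketch that replays the table's two examples; `classify_cut()` is a hypothetical helper written for illustration (it is not a datacutr function), and the record date used in the first example is an assumed value.

```r
# Hypothetical helper: classify one record against a cutoff date.
#   subject_start: date the subject entered the study
#   record_date:   date of the individual record (e.g., AE start)
classify_cut <- function(subject_start, record_date, cutoff) {
  if (subject_start > cutoff) {
    "exclude subject"            # patient-level cut
  } else if (record_date > cutoff) {
    "keep subject, drop record"  # record-level cut
  } else {
    "keep record"
  }
}

cutoff <- as.Date("2024-06-15")

# Table example 1: subject consented after the cutoff
# (the record date 2024-07-10 is an assumed value for illustration)
res_patient <- classify_cut(as.Date("2024-07-01"), as.Date("2024-07-10"), cutoff)

# Table example 2: subject enrolled before the cutoff, AE started after it
res_record <- classify_cut(as.Date("2024-01-01"), as.Date("2024-07-01"), cutoff)

print(c(res_patient, res_record))
```

A hybrid strategy simply applies the first branch to some domains and the second branch to others.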

3 Package Installation & Loading

if (!requireNamespace("dplyr", quietly = TRUE)) suppressMessages(install.packages("dplyr"))
if (!requireNamespace("lubridate", quietly = TRUE)) suppressMessages(install.packages("lubridate"))
if (!requireNamespace("tidyr", quietly = TRUE)) suppressMessages(install.packages("tidyr"))
if (!requireNamespace("stringr", quietly = TRUE)) suppressMessages(install.packages("stringr"))

# datacutr is the Pharmaverse package for data cuts
if (!requireNamespace("datacutr", quietly = TRUE)) suppressMessages(install.packages("datacutr"))

library(dplyr)
library(lubridate)
library(tidyr)
library(stringr)
library(datacutr)

4 Understanding datacutr

4.1 What is datacutr?

datacutr is a Pharmaverse package designed to apply clinical data cutoff dates to SDTM datasets in a standardized, reproducible way. It handles the complex logic of:

  • Identifying which date variable to use for each domain
  • Applying patient-level vs. record-level cuts
  • Handling special cases (ongoing events, missing dates)

4.2 How datacutr Works

The package uses a datacut metadata approach:

  1. You define a patient-level cutoff date for each subject
  2. You specify which date variable to cut on for each domain
  3. datacutr applies the cut rules and returns clean datasets
# What functions does datacutr provide?
cat("Key datacutr functions:\n\n")
Key datacutr functions:
cat("1. create_dcut()        - Create the patient-level datacut dataset\n")
1. create_dcut()        - Create the patient-level datacut dataset
cat("2. date_cut()           - Flag records for removal (adds DCUT_TEMP_REMOVE)\n")
2. date_cut()           - Flag records for removal (adds DCUT_TEMP_REMOVE)
cat("3. apply_cut()          - Remove flagged records\n")
3. apply_cut()          - Remove flagged records
cat("4. special_dm_cut()     - Special handling for DM domain\n")
4. special_dm_cut()     - Special handling for DM domain
cat("5. impute_sdtm()        - Impute partial dates before cutting\n")
5. impute_sdtm()        - Impute partial dates before cutting
cat("6. impute_dcutdtc()     - Impute partial datacut dates\n")
6. impute_dcutdtc()     - Impute partial datacut dates
cat("7. pt_cut()             - Patient-level cut\n")
7. pt_cut()             - Patient-level cut
cat("8. process_cut()        - Wrapper function for entire workflow\n")
8. process_cut()        - Wrapper function for entire workflow
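
The flag-then-remove pattern behind `date_cut()` and `apply_cut()` can be sketched in base R on toy data. This is only an illustration of the idea, not the package's implementation; the column names `DCUTDT` and `REMOVE` and the toy subjects are assumptions made for this example.

```r
# Toy AE data: two subjects, three records
ae_toy <- data.frame(
  USUBJID = c("S1", "S1", "S2"),
  AESTDTC = c("2024-05-01", "2024-07-01", "2024-06-15"),
  stringsAsFactors = FALSE
)

# Toy datacut dataset: one cutoff date per subject
dcut_toy <- data.frame(
  USUBJID = c("S1", "S2"),
  DCUTDT  = as.Date(c("2024-06-15", "2024-06-15")),
  stringsAsFactors = FALSE
)

# Step 1: flag records dated after the subject's cutoff
flagged <- merge(ae_toy, dcut_toy, by = "USUBJID", all.x = TRUE)
flagged$REMOVE <- ifelse(as.Date(flagged$AESTDTC) > flagged$DCUTDT, "Y", NA)

# Step 2: drop the flagged rows and the helper columns
ae_cut_toy <- flagged[is.na(flagged$REMOVE), c("USUBJID", "AESTDTC")]
print(ae_cut_toy)
```

Separating the two steps means the flagged dataset can be reviewed, or written to a log, before anything is deleted.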

5 Setting Up Sample Data for Data Cuts

Let’s create a realistic multi-domain SDTM dataset to practice data cuts on:

set.seed(42)

# ---- Study Parameters ----
STUDYID <- "CDISC01"
cutoff_date <- as.Date("2024-06-15")

n_subjects <- 10
subjects <- paste0("CDISC01-001-", sprintf("%03d", 1:n_subjects))

# ---- DM Domain ----
dm_sample <- tibble(
  STUDYID  = STUDYID,
  DOMAIN   = "DM",
  USUBJID  = subjects,
  RFSTDTC  = as.character(as.Date("2024-01-01") + sample(0:120, n_subjects, replace = TRUE)),
  RFENDTC  = NA_character_,
  DTHDTC   = NA_character_,  # Required by special_dm_cut()
  ARM      = sample(c("Placebo", "Active 10mg", "Active 20mg"), n_subjects, replace = TRUE),
  SEX      = sample(c("M", "F"), n_subjects, replace = TRUE),
  AGE      = sample(30:70, n_subjects, replace = TRUE),
  AGEU     = "YEARS"
) %>%
  mutate(
    # Some subjects started AFTER the cutoff (to test patient-level cuts)
    RFSTDTC = case_when(
      row_number() == 9 ~ as.character(cutoff_date + 5),
      row_number() == 10 ~ as.character(cutoff_date + 15),
      TRUE ~ RFSTDTC
    )
  )

cat("DM domain:", nrow(dm_sample), "subjects\n")
DM domain: 10 subjects
cat("Cutoff date:", as.character(cutoff_date), "\n\n")
Cutoff date: 2024-06-15 
# Which subjects started after cutoff?
dm_sample %>%
  mutate(AFTER_CUTOFF = ymd(RFSTDTC) > cutoff_date) %>%
  select(USUBJID, RFSTDTC, ARM, AFTER_CUTOFF)
# A tibble: 10 × 4
   USUBJID         RFSTDTC    ARM         AFTER_CUTOFF
   <chr>           <chr>      <chr>       <lgl>       
 1 CDISC01-001-001 2024-02-18 Active 20mg FALSE       
 2 CDISC01-001-002 2024-04-10 Placebo     FALSE       
 3 CDISC01-001-003 2024-03-05 Placebo     FALSE       
 4 CDISC01-001-004 2024-01-25 Active 10mg FALSE       
 5 CDISC01-001-005 2024-03-14 Active 10mg FALSE       
 6 CDISC01-001-006 2024-04-09 Active 10mg FALSE       
 7 CDISC01-001-007 2024-01-18 Active 20mg FALSE       
 8 CDISC01-001-008 2024-02-18 Active 20mg FALSE       
 9 CDISC01-001-009 2024-06-20 Placebo     TRUE        
10 CDISC01-001-010 2024-06-30 Placebo     TRUE        
# ---- AE Domain ----
ae_sample <- tibble(
  STUDYID = STUDYID,
  DOMAIN  = "AE",
  USUBJID = sample(subjects[1:8], 20, replace = TRUE),
  AETERM  = sample(c("Headache", "Nausea", "Fatigue", "Dizziness", "Rash"), 20, replace = TRUE),
  AESEV   = sample(c("MILD", "MODERATE", "SEVERE"), 20, replace = TRUE, prob = c(0.6, 0.3, 0.1)),
  AESER   = sample(c("Y", "N"), 20, replace = TRUE, prob = c(0.1, 0.9))
) %>%
  mutate(AEDECOD = AETERM)

# Add dates - some AEs span across or start after cutoff
ae_sample <- ae_sample %>%
  left_join(dm_sample %>% select(USUBJID, RFSTDTC), by = "USUBJID") %>%
  rowwise() %>%
  mutate(
    start_offset = sample(1:180, 1),
    ae_start = ymd(RFSTDTC) + start_offset,
    AESTDTC = as.character(ae_start),
    # Some AEs are ongoing (no end date), some end after cutoff
    ae_end_offset = sample(c(NA, 3, 7, 14, 30, 60, 90), 1),
    AEENDTC = if_else(!is.na(ae_end_offset),
                      as.character(ae_start + ae_end_offset),
                      NA_character_)
  ) %>%
  ungroup() %>%
  group_by(USUBJID) %>%
  mutate(AESEQ = row_number()) %>%
  ungroup() %>%
  select(STUDYID, DOMAIN, USUBJID, AESEQ, AETERM, AEDECOD, AESEV, AESER,
         AESTDTC, AEENDTC)

cat("AE domain:", nrow(ae_sample), "records\n\n")
AE domain: 20 records
# Show which AEs cross the cutoff boundary
ae_sample %>%
  mutate(
    START_VS_CUT = case_when(
      ymd(AESTDTC) > cutoff_date ~ "AFTER cutoff",
      ymd(AESTDTC) <= cutoff_date ~ "BEFORE cutoff",
      TRUE ~ "UNKNOWN"
    ),
    END_VS_CUT = case_when(
      is.na(AEENDTC) ~ "ONGOING",
      ymd(AEENDTC) > cutoff_date ~ "AFTER cutoff",
      ymd(AEENDTC) <= cutoff_date ~ "BEFORE cutoff",
      TRUE ~ "UNKNOWN"
    )
  ) %>%
  count(START_VS_CUT, END_VS_CUT, name = "Count")
# A tibble: 5 × 3
  START_VS_CUT  END_VS_CUT    Count
  <chr>         <chr>         <int>
1 AFTER cutoff  AFTER cutoff      9
2 AFTER cutoff  ONGOING           1
3 BEFORE cutoff AFTER cutoff      4
4 BEFORE cutoff BEFORE cutoff     5
5 BEFORE cutoff ONGOING           1
# ---- LB Domain (simplified) ----
visits <- tibble(
  VISITNUM = c(1, 2, 3, 4, 5),
  VISIT    = c("BASELINE", "WEEK 4", "WEEK 8", "WEEK 12", "WEEK 16")
)

lb_sample <- expand_grid(
  USUBJID = subjects[1:8],
  tibble(
    LBTESTCD = c("ALT", "CREAT", "GLUC"),
    LBTEST   = c("Alanine Aminotransferase", "Creatinine", "Glucose")
  ),
  visits
) %>%
  left_join(dm_sample %>% select(USUBJID, RFSTDTC), by = "USUBJID") %>%
  mutate(
    STUDYID  = STUDYID,
    DOMAIN   = "LB",
    visit_offset = case_when(
      VISITNUM == 1 ~ 0,
      VISITNUM == 2 ~ 28,
      VISITNUM == 3 ~ 56,
      VISITNUM == 4 ~ 84,
      VISITNUM == 5 ~ 112
    ),
    LBDTC = as.character(ymd(RFSTDTC) + visit_offset),
    LBSTRESN = round(rnorm(n(), 50, 15), 1),
    LBSTRESU = "U/L"  # Simplification: one unit for every test; real CREAT and GLUC results use other units
  ) %>%
  group_by(USUBJID) %>%
  mutate(LBSEQ = row_number()) %>%
  ungroup() %>%
  select(STUDYID, DOMAIN, USUBJID, LBSEQ, LBTESTCD, LBTEST,
         LBSTRESN, LBSTRESU, VISITNUM, VISIT, LBDTC)

cat("LB domain:", nrow(lb_sample), "records\n")
LB domain: 120 records
# How many LB records are after cutoff?
cat("LB records after cutoff:", 
    sum(ymd(lb_sample$LBDTC) > cutoff_date, na.rm = TRUE), "\n")
LB records after cutoff: 18 

6 Applying Data Cuts Manually (Understanding the Logic)

Before using datacutr, let’s understand the logic by implementing cuts manually:

6.1 Patient-Level Cut

# ---- Patient-Level Cut ----
# Remove subjects whose reference start date (RFSTDTC, often the first dose date) is AFTER the cutoff

# Step 1: Identify subjects to keep
subjects_to_keep <- dm_sample %>%
  filter(ymd(RFSTDTC) <= cutoff_date) %>%
  pull(USUBJID)

cat("Patient-level cut:\n")
Patient-level cut:
cat("  Subjects before cut:", n_distinct(dm_sample$USUBJID), "\n")
  Subjects before cut: 10 
cat("  Subjects after cut:", length(subjects_to_keep), "\n")
  Subjects after cut: 8 
cat("  Subjects removed:", n_distinct(dm_sample$USUBJID) - length(subjects_to_keep), "\n\n")
  Subjects removed: 2 
# Step 2: Apply to all domains
dm_cut <- dm_sample %>% filter(USUBJID %in% subjects_to_keep)
ae_patient_cut <- ae_sample %>% filter(USUBJID %in% subjects_to_keep)
lb_patient_cut <- lb_sample %>% filter(USUBJID %in% subjects_to_keep)

cat("Records after patient-level cut:\n")
Records after patient-level cut:
cat("  DM:", nrow(dm_cut), "\n")
  DM: 8 
cat("  AE:", nrow(ae_patient_cut), "(from", nrow(ae_sample), ")\n")
  AE: 20 (from 20 )
cat("  LB:", nrow(lb_patient_cut), "(from", nrow(lb_sample), ")\n")
  LB: 120 (from 120 )

6.2 Record-Level Cut

# ---- Record-Level Cut ----
# Keep subjects, but remove individual records after cutoff

# AE: Remove AEs that STARTED after cutoff
ae_record_cut <- ae_patient_cut %>%
  filter(ymd(AESTDTC) <= cutoff_date)

cat("AE Record-level cut:\n")
AE Record-level cut:
cat("  Records before:", nrow(ae_patient_cut), "\n")
  Records before: 20 
cat("  Records after:", nrow(ae_record_cut), "\n")
  Records after: 10 
cat("  Records removed:", nrow(ae_patient_cut) - nrow(ae_record_cut), "\n\n")
  Records removed: 10 
# LB: Remove lab results collected after cutoff
lb_record_cut <- lb_patient_cut %>%
  filter(ymd(LBDTC) <= cutoff_date)

cat("LB Record-level cut:\n")
LB Record-level cut:
cat("  Records before:", nrow(lb_patient_cut), "\n")
  Records before: 120 
cat("  Records after:", nrow(lb_record_cut), "\n")
  Records after: 102 
cat("  Records removed:", nrow(lb_patient_cut) - nrow(lb_record_cut), "\n")
  Records removed: 18 
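
The same record-level rule carries over to VS, cutting on the collection date `VSDTC`. The sample data in this lesson has no VS domain, so the sketch below builds a small toy data frame; its values are assumptions made for illustration.

```r
cutoff_date <- as.Date("2024-06-15")

# Toy VS records (assumed values)
vs_toy <- data.frame(
  USUBJID  = c("CDISC01-001-001", "CDISC01-001-001", "CDISC01-001-002"),
  VSTESTCD = c("SYSBP", "SYSBP", "SYSBP"),
  VSSTRESN = c(120, 124, 118),
  VSDTC    = c("2024-06-01", "2024-06-29", "2024-06-15"),
  stringsAsFactors = FALSE
)

# Keep measurements collected on or before the cutoff
vs_record_cut <- vs_toy[as.Date(vs_toy$VSDTC) <= cutoff_date, ]

cat("VS records before:", nrow(vs_toy), " after:", nrow(vs_record_cut), "\n")
```

The record dated exactly on the cutoff is retained because the comparison is `<=`.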
Warning: What About Ongoing AEs at Cutoff?

An AE that started before the cutoff but has no end date (ongoing) needs special handling:

  • Keep the AE record (it started before cutoff)
  • Set AEENDTC to the cutoff date (or leave it blank depending on convention)
  • Flag it as “ongoing at data cutoff”

This is where datacutr helps: it standardizes the flag-and-remove step across domains, so conventions such as end-date handling can be applied consistently on top of it.

6.3 Handling Ongoing AEs at Cutoff

# Handle AEs that span the cutoff boundary
ae_final_cut <- ae_patient_cut %>%
  mutate(
    ae_start = ymd(AESTDTC),
    ae_end   = ymd(AEENDTC)
  ) %>%
  # Keep AEs that started on or before cutoff
  filter(ae_start <= cutoff_date) %>%
  mutate(
    # If AE ends AFTER cutoff, truncate the end date to cutoff
    AEENDTC_ORIG = AEENDTC,
    AEENDTC = case_when(
      # AE ends after cutoff → truncate
      !is.na(ae_end) & ae_end > cutoff_date ~ as.character(cutoff_date),
      # AE is ongoing → leave as is (still ongoing at cutoff)
      is.na(ae_end) ~ NA_character_,
      # AE ended before cutoff → keep original
      TRUE ~ AEENDTC
    ),
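    # Flags for records modified by the cut. Note: in datacutr,
    # DCUT_TEMP_DTHCHANGE marks death-related changes in DM; the name is
    # reused here only to illustrate an "end date modified" flag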
    DCUT_TEMP_REMOVE = "N",
    DCUT_TEMP_DTHCHANGE = if_else(
      !is.na(AEENDTC_ORIG) & AEENDTC != AEENDTC_ORIG,
      "Y", "N"
    )
  ) %>%
  select(-ae_start, -ae_end, -AEENDTC_ORIG)

cat("AE handling at cutoff boundary:\n")
AE handling at cutoff boundary:
cat("  Records kept:", nrow(ae_final_cut), "\n")
  Records kept: 10 
cat("  Records with modified end dates:", 
    sum(ae_final_cut$DCUT_TEMP_DTHCHANGE == "Y", na.rm = TRUE), "\n")
  Records with modified end dates: 4 
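
The callout in section 6.2 also suggests flagging events that are still ongoing at the cutoff. A minimal base R sketch of such a flag is shown here; the variable name `ONGO_AT_CUT` is hypothetical (it is neither an SDTM variable nor something datacutr creates), and the toy records are assumed values.

```r
cutoff_date <- as.Date("2024-06-15")

ae_toy <- data.frame(
  AESEQ   = 1:4,
  AESTDTC = c("2024-05-01", "2024-05-10", "2024-06-01", "2024-07-01"),
  AEENDTC = c("2024-05-05", NA, "2024-06-30", NA),
  stringsAsFactors = FALSE
)

# Keep AEs that started on or before the cutoff
kept <- ae_toy[as.Date(ae_toy$AESTDTC) <= cutoff_date, ]

# Ongoing at cutoff: no end date, or an end date after the cutoff
ae_end <- as.Date(kept$AEENDTC)
kept$ONGO_AT_CUT <- ifelse(is.na(ae_end) | ae_end > cutoff_date, "Y", "N")

print(kept)
```

Keeping the flag in a separate column leaves the original `AEENDTC` untouched, which suits studies whose convention is not to truncate end dates.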

7 Using datacutr

Now let’s use datacutr to do this properly:

7.1 Step 1: Create the Datacut Dataset

# The datacut dataset defines per-patient cutoff dates
# In simple cases, all subjects have the same cutoff date
# In complex cases (e.g., rolling enrollment), dates may differ

# special_dm_cut() needs USUBJID and the cut variable (DCUTDTM) in the dcut
# dataset, and DTHDTC in DM; RFSTDTC and DTHDTC are carried along here for reference
dcut <- dm_sample %>%
  select(USUBJID, RFSTDTC, DTHDTC) %>%
  mutate(
    DCUTDTC = as.character(cutoff_date),  # Same cutoff for all
    DCUTDTM = as.POSIXct(paste0(cutoff_date, " 23:59:59"))
  )

cat("Datacut metadata:\n")
Datacut metadata:
print(dcut)
# A tibble: 10 × 5
   USUBJID         RFSTDTC    DTHDTC DCUTDTC    DCUTDTM            
   <chr>           <chr>      <chr>  <chr>      <dttm>             
 1 CDISC01-001-001 2024-02-18 <NA>   2024-06-15 2024-06-15 23:59:59
 2 CDISC01-001-002 2024-04-10 <NA>   2024-06-15 2024-06-15 23:59:59
 3 CDISC01-001-003 2024-03-05 <NA>   2024-06-15 2024-06-15 23:59:59
 4 CDISC01-001-004 2024-01-25 <NA>   2024-06-15 2024-06-15 23:59:59
 5 CDISC01-001-005 2024-03-14 <NA>   2024-06-15 2024-06-15 23:59:59
 6 CDISC01-001-006 2024-04-09 <NA>   2024-06-15 2024-06-15 23:59:59
 7 CDISC01-001-007 2024-01-18 <NA>   2024-06-15 2024-06-15 23:59:59
 8 CDISC01-001-008 2024-02-18 <NA>   2024-06-15 2024-06-15 23:59:59
 9 CDISC01-001-009 2024-06-20 <NA>   2024-06-15 2024-06-15 23:59:59
10 CDISC01-001-010 2024-06-30 <NA>   2024-06-15 2024-06-15 23:59:59
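
Setting `DCUTDTM` to 23:59:59 matters for records that carry a time component and fall on the cutoff day itself. The base R sketch below compares one assumed timestamp against a midnight cutoff and an end-of-day cutoff; the explicit `tz = "UTC"` is an assumption made so the comparison does not depend on the local time zone.

```r
# A record timestamped at noon on the cutoff date (assumed value)
record_dtm <- as.POSIXct("2024-06-15 12:00:00", tz = "UTC")

cut_midnight   <- as.POSIXct("2024-06-15 00:00:00", tz = "UTC")
cut_end_of_day <- as.POSIXct("2024-06-15 23:59:59", tz = "UTC")

removed_midnight   <- record_dtm > cut_midnight    # TRUE: record would be cut
removed_end_of_day <- record_dtm > cut_end_of_day  # FALSE: record is retained

print(c(midnight = removed_midnight, end_of_day = removed_end_of_day))
```

With the end-of-day cutoff, everything collected on the cutoff date stays in the analysis, which matches the "on or before" rule used in the manual cuts.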

7.2 Step 2: Define the Cut Strategy

# datacutr uses a two-step process:
# 1. date_cut() flags records (adds DCUT_TEMP_REMOVE)
# 2. apply_cut() removes flagged records

# For DM: special_dm_cut() handles patient-level cut based on RFSTDTC
cat("Cut strategy:\n")
Cut strategy:
cat("  DM: Patient-level cut on RFSTDTC (first dose date)\n")
  DM: Patient-level cut on RFSTDTC (first dose date)
cat("  AE: Record-level cut on AESTDTC (AE start date)\n")
  AE: Record-level cut on AESTDTC (AE start date)
cat("  LB: Record-level cut on LBDTC (lab collection date)\n\n")
  LB: Record-level cut on LBDTC (lab collection date)

7.3 Step 3: Apply the Cut Using datacutr Functions

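# ---- Step 3a: DM Domain ----
# special_dm_cut() flags DM rows rather than filtering them. It joins the cut
# variable from dcut by USUBJID and adds DCUT_TEMP_REMOVE and DCUT_TEMP_DTHCHANGE.
# dcut here lists all 10 subjects, so no subject is dropped at this step.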
dm_after_cut <- special_dm_cut(
  dataset_dm = dm_sample,
  dataset_cut = dcut
)
[1] "At least 1 patient with missing datacut date, all records will be kept."
cat("DM after cut:", nrow(dm_after_cut), "subjects (from", nrow(dm_sample), ")\n")
DM after cut: 10 subjects (from 10 )
# ---- Step 3b: AE Domain ----
# date_cut handles record-level cuts and adds DCUT_TEMP_REMOVE flag
# First, we need to know which subjects passed the DM cut
subjects_after_dm_cut <- dm_after_cut$USUBJID

ae_for_cut <- ae_sample %>%
  filter(USUBJID %in% subjects_after_dm_cut)

ae_after_cut_temp <- date_cut(
  dataset_sdtm = ae_for_cut,
  sdtm_date_var = AESTDTC,
  dataset_cut = dcut,
  cut_var = DCUTDTM
)
[1] "At least 1 patient with missing datacut date, all records will be kept."
# Apply the cut using apply_cut which removes flagged records
ae_after_cut <- apply_cut(
  dsin = ae_after_cut_temp,
  dcutvar = DCUT_TEMP_REMOVE,
  dthchangevar = DCUT_TEMP_DTHCHANGE
)

cat("AE after cut:", nrow(ae_after_cut), "records (from", nrow(ae_sample), ")\n")
AE after cut: 10 records (from 20 )
# ---- Step 3c: LB Domain ----
lb_for_cut <- lb_sample %>%
  filter(USUBJID %in% subjects_after_dm_cut)

lb_after_cut_temp <- date_cut(
  dataset_sdtm = lb_for_cut,
  sdtm_date_var = LBDTC,
  dataset_cut = dcut,
  cut_var = DCUTDTM
)
[1] "At least 1 patient with missing datacut date, all records will be kept."
# Apply the cut using apply_cut which removes flagged records
lb_after_cut <- apply_cut(
  dsin = lb_after_cut_temp,
  dcutvar = DCUT_TEMP_REMOVE,
  dthchangevar = DCUT_TEMP_DTHCHANGE
)

cat("LB after cut:", nrow(lb_after_cut), "records (from", nrow(lb_sample), ")\n")
LB after cut: 102 records (from 120 )

7.4 Step 4: Summary of Data Cut

cat("=== DATA CUT SUMMARY ===\n")
=== DATA CUT SUMMARY ===
cat("Cutoff date:", as.character(cutoff_date), "\n\n")
Cutoff date: 2024-06-15 
summary_table <- tibble::tribble(
  ~Domain, ~Before_Cut, ~After_Cut, ~Removed,
  "DM", nrow(dm_sample), nrow(dm_after_cut), nrow(dm_sample) - nrow(dm_after_cut),
  "AE", nrow(ae_sample), nrow(ae_after_cut), nrow(ae_sample) - nrow(ae_after_cut),
  "LB", nrow(lb_sample), nrow(lb_after_cut), nrow(lb_sample) - nrow(lb_after_cut)
)

print(summary_table)
# A tibble: 3 × 4
  Domain Before_Cut After_Cut Removed
  <chr>       <int>     <int>   <int>
1 DM             10        10       0
2 AE             20        10      10
3 LB            120       102      18
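
DM stays at 10 subjects in this summary, whereas the manual patient-level cut in section 6.1 kept 8. The likely reason is that the dcut dataset built in Step 1 has a row for every subject, including the two who started after the cutoff. The sketch below shows, in base R, how restricting the datacut dataset to eligible subjects produces a patient-level cut. It assumes that subjects absent from the datacut dataset are the ones to flag for removal, which is how I understand datacutr's patient-level logic; check the package documentation before relying on it. The `*_toy` names and the `REMOVE` column are hypothetical.

```r
cutoff_date <- as.Date("2024-06-15")

# Three subjects taken from the DM listing earlier in this lesson
dm_toy <- data.frame(
  USUBJID = c("CDISC01-001-008", "CDISC01-001-009", "CDISC01-001-010"),
  RFSTDTC = c("2024-02-18", "2024-06-20", "2024-06-30"),
  stringsAsFactors = FALSE
)

# Datacut dataset restricted to subjects who started on or before the cutoff
eligible <- as.Date(dm_toy$RFSTDTC) <= cutoff_date
dcut_toy <- data.frame(
  USUBJID = dm_toy$USUBJID[eligible],
  DCUTDTC = as.character(cutoff_date),
  stringsAsFactors = FALSE
)

# Left join: subjects without a datacut row get NA and are flagged
dm_flagged <- merge(dm_toy, dcut_toy, by = "USUBJID", all.x = TRUE)
dm_flagged$REMOVE <- ifelse(is.na(dm_flagged$DCUTDTC), "Y", "N")
dm_kept <- dm_flagged[dm_flagged$REMOVE == "N", ]

print(dm_flagged[, c("USUBJID", "RFSTDTC", "REMOVE")])
```

In a real study the eligible set would normally come from disposition data (for example, via `create_dcut()`), not from a filter on RFSTDTC.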

8 Real-World Data Cut Considerations

8.1 Partial Dates

# In real data, you'll encounter partial dates
cat("Common partial date scenarios:\n\n")
Common partial date scenarios:
partial_examples <- tibble::tribble(
  ~AESTDTC,       ~Problem,                          ~Solution,
  "2024-06",       "Day missing",                    "Impute to 1st or last of month",
  "2024",          "Month and day missing",          "Impute based on SAP rules",
  "",              "Completely missing",             "Flag for medical review",
  "2024-06-15",    "Complete date",                  "No imputation needed",
  "2024-06-UN",    "Day unknown (UN notation)",      "Impute per convention"
)

print(partial_examples)
# A tibble: 5 × 3
  AESTDTC      Problem                   Solution                      
  <chr>        <chr>                     <chr>                         
1 "2024-06"    Day missing               Impute to 1st or last of month
2 "2024"       Month and day missing     Impute based on SAP rules     
3 ""           Completely missing        Flag for medical review       
4 "2024-06-15" Complete date             No imputation needed          
5 "2024-06-UN" Day unknown (UN notation) Impute per convention         
Tip: datacutr and Partial Dates

The datacutr::impute_sdtm() function can handle partial date imputation before applying the cut. Common imputation strategies:

  • First of the month: Impute a missing day to the 1st - this maximizes the chance that the record falls on or before the cutoff and is kept (the conservative choice for safety data)
  • Last of the month: Impute a missing day to the last day - this makes exclusion more likely; whether that is conservative depends on the analysis
  • SAP-driven: Always follow the Statistical Analysis Plan’s imputation rules
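
To make the first-versus-last choice concrete, here is a minimal base R sketch that imputes the partial dates from the table above both ways. `impute_partial()` is a hypothetical helper, not the datacutr function, and treating a trailing "-UN" as a missing component is an assumption made for this example.

```r
# Hypothetical helper: impute partial ISO 8601 dates (YYYY or YYYY-MM) to the
# first or last day of the missing period. Empty or unrecognised values
# are returned as NA so they can be routed to review.
impute_partial <- function(dtc, method = c("first", "last")) {
  method <- match.arg(method)
  out <- rep(as.Date(NA), length(dtc))
  for (i in seq_along(dtc)) {
    if (is.na(dtc[i])) next
    x <- sub("(-UN)+$", "", dtc[i])  # assumption: "-UN" means component unknown
    if (grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)) {
      out[i] <- as.Date(x)
    } else if (grepl("^[0-9]{4}-[0-9]{2}$", x)) {
      first <- as.Date(paste0(x, "-01"))
      # Last day of month = first day of the next month minus one day
      last <- seq(first, by = "month", length.out = 2)[2] - 1
      out[i] <- if (method == "first") first else last
    } else if (grepl("^[0-9]{4}$", x)) {
      out[i] <- as.Date(paste0(x, if (method == "first") "-01-01" else "-12-31"))
    }
  }
  out
}

dtc <- c("2024-06", "2024", "", "2024-06-15", "2024-06-UN")
first_imp <- impute_partial(dtc, "first")
last_imp  <- impute_partial(dtc, "last")

cutoff_date <- as.Date("2024-06-15")
print(data.frame(
  AESTDTC    = dtc,
  first      = first_imp,
  last       = last_imp,
  kept_first = first_imp <= cutoff_date,
  kept_last  = last_imp <= cutoff_date
))
```

For "2024-06", first-of-month imputation keeps the record under a 2024-06-15 cutoff while last-of-month imputation removes it, which is why the rule belongs in the SAP.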

8.2 Multiple Cutoff Dates

# In some studies, cutoff dates differ by subject (rolling enrollment)
# or by analysis (interim vs final)

cat("Example: Different cutoffs per analysis:\n\n")
Example: Different cutoffs per analysis:
analysis_cuts <- tibble::tribble(
  ~Analysis,          ~Cutoff_Date,     ~Purpose,
  "DSMB Review 1",    "2024-03-31",     "Safety review at 50% enrollment",
  "Interim Analysis",  "2024-06-15",     "Pre-specified interim for efficacy",
  "Final Analysis",    "2024-12-31",     "Primary analysis for submission"
)

print(analysis_cuts)
# A tibble: 3 × 3
  Analysis         Cutoff_Date Purpose                           
  <chr>            <chr>       <chr>                             
1 DSMB Review 1    2024-03-31  Safety review at 50% enrollment   
2 Interim Analysis 2024-06-15  Pre-specified interim for efficacy
3 Final Analysis   2024-12-31  Primary analysis for submission   
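
Because each analysis has its own cutoff, the same source data yields a different cut dataset per analysis. The base R sketch below counts how many records survive each of the three cutoffs listed above; the AE start dates and the `analysis_cuts_toy` name are assumptions made for illustration.

```r
analysis_cuts_toy <- data.frame(
  Analysis    = c("DSMB Review 1", "Interim Analysis", "Final Analysis"),
  Cutoff_Date = as.Date(c("2024-03-31", "2024-06-15", "2024-12-31")),
  stringsAsFactors = FALSE
)

# Assumed AE start dates for one toy study
ae_start <- as.Date(c("2024-02-10", "2024-04-05", "2024-06-15", "2024-09-01"))

# Count records on or before each analysis cutoff
analysis_cuts_toy$AE_Kept <- vapply(
  seq_len(nrow(analysis_cuts_toy)),
  function(i) sum(ae_start <= analysis_cuts_toy$Cutoff_Date[i]),
  integer(1)
)

print(analysis_cuts_toy)
```

Always cutting from the same uncut source, instead of re-cutting an earlier cut, keeps each analysis reproducible on its own.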

9 Best Practices for Data Cuts

9.1 DO

cat("=== DATA CUT BEST PRACTICES ===\n\n")
=== DATA CUT BEST PRACTICES ===
best_practices <- tibble::tribble(
  ~Rule,                                              ~Why,
  "Document the cutoff date in TS domain",            "Traceability for regulators",
  "Apply cuts BEFORE creating ADaM datasets",         "ADaM should be based on cut data",
  "Keep both pre-cut and post-cut datasets",          "Audit trail and reproducibility",
  "Handle ongoing events explicitly",                 "AEs without end dates need special logic",
  "Use datacutr for consistency",                     "Avoids manual errors across domains",
  "Log all records removed by the cut",               "Quality control and transparency",
  "Validate record counts before and after",          "Ensure no data is accidentally lost"
)

print(best_practices)
# A tibble: 7 × 2
  Rule                                     Why                                  
  <chr>                                    <chr>                                
1 Document the cutoff date in TS domain    Traceability for regulators          
2 Apply cuts BEFORE creating ADaM datasets ADaM should be based on cut data     
3 Keep both pre-cut and post-cut datasets  Audit trail and reproducibility      
4 Handle ongoing events explicitly         AEs without end dates need special l…
5 Use datacutr for consistency             Avoids manual errors across domains  
6 Log all records removed by the cut       Quality control and transparency     
7 Validate record counts before and after  Ensure no data is accidentally lost  
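
Two of these rules, logging removed records and validating counts, are easy to automate. Below is a minimal base R sketch on toy LB data; the `REASON` column and the `keys()` helper are hypothetical names chosen for this example.

```r
cutoff_date <- as.Date("2024-06-15")

lb_toy <- data.frame(
  USUBJID = c("S1", "S1", "S2", "S2"),
  LBSEQ   = c(1, 2, 1, 2),
  LBDTC   = c("2024-05-01", "2024-06-20", "2024-06-15", "2024-07-02"),
  stringsAsFactors = FALSE
)

# Split into kept and removed records, and log why each was removed
after_cut  <- as.Date(lb_toy$LBDTC) > cutoff_date
lb_kept    <- lb_toy[!after_cut, ]
lb_removed <- lb_toy[after_cut, ]
lb_removed$REASON <- paste("LBDTC after cutoff", cutoff_date)

# Reconciliation: every input record is either kept or logged
stopifnot(nrow(lb_kept) + nrow(lb_removed) == nrow(lb_toy))

# No record key appears in both sets
keys <- function(d) paste(d$USUBJID, d$LBSEQ)
stopifnot(length(intersect(keys(lb_kept), keys(lb_removed))) == 0)

print(lb_removed)
```

If either `stopifnot()` fails, the pipeline stops before any downstream dataset is built from an inconsistent cut.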

9.2 The Data Cut in the Submission Package

cat("Where the data cut fits in the submission process:\n\n")
Where the data cut fits in the submission process:
cat("1. Raw data collected continuously\n")
1. Raw data collected continuously
cat("2. DATABASE LOCK (data cleaning complete)\n")
2. DATABASE LOCK (data cleaning complete)
cat("3. SDTM datasets created\n")
3. SDTM datasets created
cat("4. DATA CUT applied to the SDTM datasets (datacutr)\n")
4. DATA CUT applied to the SDTM datasets (datacutr)
cat("5. ADaM datasets derived from SDTM\n")
5. ADaM datasets derived from SDTM
cat("6. TFLs (Tables, Figures, Listings) generated from ADaM\n")
6. TFLs (Tables, Figures, Listings) generated from ADaM
cat("7. Clinical Study Report written\n")
7. Clinical Study Report written
cat("8. Submission package assembled (eCTD)\n\n")
8. Submission package assembled (eCTD)
cat("The cutoff date is recorded in:\n")
The cutoff date is recorded in:
cat("  - TS domain (TSPARMCD = 'DCUTDTC')\n")
  - TS domain (TSPARMCD = 'DCUTDTC')
cat("  - Study report\n")
  - Study report
cat("  - define.xml\n")
  - define.xml
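
A TS record carrying the cutoff date might look like the sketch below. Only `TSPARMCD = 'DCUTDTC'` comes from this lesson; the `TSPARM` label, the `TSSEQ` value, and the reduced set of columns are assumptions that should be checked against current CDISC controlled terminology and your study's TS specification.

```r
# Minimal, illustrative TS row for the data cutoff date
ts_dcut <- data.frame(
  STUDYID  = "CDISC01",
  DOMAIN   = "TS",
  TSSEQ    = 1,                   # assumed sequence number
  TSPARMCD = "DCUTDTC",
  TSPARM   = "Data Cutoff Date",  # assumed label; verify against CDISC CT
  TSVAL    = "2024-06-15",        # ISO 8601 date, same value as cutoff_date
  stringsAsFactors = FALSE
)

print(ts_dcut)
```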

10 Deliverable Summary

Today you completed the following:

Task                                                               | Status
Understood what data cuts are and why they matter                  | ✓ Done
Implemented patient-level and record-level cuts manually           | ✓ Done
Handled ongoing AEs at the cutoff boundary                         | ✓ Done
Used datacutr functions: special_dm_cut(), date_cut(), apply_cut() | ✓ Done
Learned about partial date imputation                              | ✓ Done
Reviewed best practices for data cuts                              | ✓ Done

11 Key Takeaways

  1. Data cuts freeze the data - Only data on or before the cutoff is included in the analysis
  2. Patient-level cuts exclude entire subjects; record-level cuts exclude individual records
  3. Ongoing events need special handling - AEs that span the cutoff require careful logic
  4. datacutr standardizes the process - Avoids manual errors and ensures reproducibility
  5. Cuts happen BEFORE ADaM - The cut SDTM data is the foundation for all downstream datasets
  6. Document everything - The cutoff date must be recorded in TS and the study report

12 Resources

  • datacutr Documentation - Official datacutr package documentation
  • datacutr GitHub - Source code and examples
  • CDISC Implementation Guide - Data Cuts - SDTM guidance
  • Pharmaverse.org - R packages for clinical data
  • ICH E9(R1) - Estimands and data handling

13 What’s Next?

In Day 13, we will focus on SDTM Validation with sdtmchecks:

  • Why validation is essential before creating ADaM datasets
  • Running FDA business rules against your SDTM domains
  • Understanding validation reports and resolving findings
  • Common SDTM issues and how to fix them

 

30 Days of Pharmaverse  ·  Disclaimer  ·  Indraneel Chakraborty  ·  © 2026