library(dplyr)
library(pharmaversesdtm)Day 3: Controlled Terminology & MedDRA Coding
Building AE Codelists in R
1 Learning Objectives
By the end of Day 3, you will be able to:
- Understand the importance of Controlled Terminology (CT) in regulatory submissions
- Create lookup tables (codelists) in R for standardized coding
- Perform MedDRA coding operations (mapping Preferred Terms to System Organ Classes)
- Use
left_join()effectively to apply standardized codes to clinical data - Validate that all terms have been correctly coded
2 Introduction
2.1 Why Controlled Terminology Matters
In clinical trials, it is extremely important that everyone uses the same words to describe the same things. This is called Controlled Terminology (CT). Regulatory agencies like the FDA require this because they need to compare data from many different studies and sponsors. If everyone used their own words, it would be impossible to analyze the data together.
2.1.1 Example:
Suppose one doctor writes βMildβ, another writes βModβ, and a third writes β1β to describe the severity of an event. Without CT, these would all be treated as different, even though they mean the same thing. With CDISC CT, everyone must use the same set of allowed values, like βMILDβ, βMODERATEβ, βSEVEREβ.
2.1.2 Why is this important?
Using CT makes your data clear, consistent, and ready for regulatory review. It also helps avoid mistakes and confusion when combining data from different sources.
2.2 The Role of MedDRA
MedDRA (Medical Dictionary for Regulatory Activities) is the international standard dictionary for coding medical terms in clinical trials. It is especially important for coding adverse events (AEs). MedDRA organizes medical terms into a hierarchy, from very broad categories down to very specific terms:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SOC β System Organ Class (e.g., "Gastrointestinal disorders") β
βββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β HLGT β High Level Group Term β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β HLT β High Level Term β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β PT β Preferred Term (e.g., "Nausea") β Used in AEDECOD β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β LLT β Lowest Level Term (e.g., "Feeling queasy") β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
In SDTM:
- AEDECOD = Preferred Term (PT)
- AEBODSYS = System Organ Class (SOC)
3 Package Installation & Loading
3.1 Required Packages
| Package | Purpose |
|---|---|
dplyr |
Data manipulation (joins, mutations) |
pharmaversesdtm |
Source of AE data |
3.2 Load Packages
3.3 Load AE Data
data("ae", package = "pharmaversesdtm")
# Preview the variables involved in coding
ae %>%
select(USUBJID, AETERM, AEDECOD, AEBODSYS) %>%
head()# A tibble: 6 Γ 4
USUBJID AETERM AEDECOD AEBODSYS
<chr> <chr> <chr> <chr>
1 01-701-1015 APPLICATION SITE ERYTHEMA APPLICATION SITE ERβ¦ GENERALβ¦
2 01-701-1015 APPLICATION SITE PRURITUS APPLICATION SITE PRβ¦ GENERALβ¦
3 01-701-1015 DIARRHOEA DIARRHOEA GASTROIβ¦
4 01-701-1023 ATRIOVENTRICULAR BLOCK SECOND DEGREE ATRIOVENTRICULAR BLβ¦ CARDIACβ¦
5 01-701-1023 ERYTHEMA ERYTHEMA SKIN ANβ¦
6 01-701-1023 ERYTHEMA ERYTHEMA SKIN ANβ¦
4 Part 1: Building a Complete MedDRA Lookup Table
In a real environment, you would have access to the full MedDRA dictionary files. For this exercise, we will create a complete lookup table covering all terms in our dataset.
4.1 Step 1: Identify All Unique Preferred Terms
First, letβs see all unique Preferred Terms (AEDECOD) and their corresponding SOCs in our dataset.
# Get all unique PT-SOC combinations from the source data
unique_terms <- ae %>%
distinct(AEDECOD, AEBODSYS) %>%
arrange(AEBODSYS, AEDECOD)
cat("Total unique Preferred Terms:", nrow(unique_terms), "\n\n")Total unique Preferred Terms: 242
print(unique_terms)# A tibble: 242 Γ 2
AEDECOD AEBODSYS
<chr> <chr>
1 ATRIAL FIBRILLATION CARDIAC DISORDERS
2 ATRIAL FLUTTER CARDIAC DISORDERS
3 ATRIAL HYPERTROPHY CARDIAC DISORDERS
4 ATRIOVENTRICULAR BLOCK FIRST DEGREE CARDIAC DISORDERS
5 ATRIOVENTRICULAR BLOCK SECOND DEGREE CARDIAC DISORDERS
6 BRADYCARDIA CARDIAC DISORDERS
7 BUNDLE BRANCH BLOCK LEFT CARDIAC DISORDERS
8 BUNDLE BRANCH BLOCK RIGHT CARDIAC DISORDERS
9 CARDIAC DISORDER CARDIAC DISORDERS
10 CARDIAC FAILURE CONGESTIVE CARDIAC DISORDERS
# βΉ 232 more rows
4.2 Step 2: Create the Complete Lookup Table
Now we create a comprehensive lookup table that maps all Preferred Terms to their SOCs. This is the dictionary we would use in production.
# Create a complete MedDRA-style lookup table
# In production, this comes from official MedDRA files
meddra_lookup <- tribble(
~AEDECOD, ~SOC_CODED,
# CARDIAC DISORDERS
"ATRIAL FIBRILLATION", "CARDIAC DISORDERS",
"ATRIAL FLUTTER", "CARDIAC DISORDERS",
"BUNDLE BRANCH BLOCK LEFT", "CARDIAC DISORDERS",
"BUNDLE BRANCH BLOCK RIGHT", "CARDIAC DISORDERS",
"MYOCARDIAL INFARCTION", "CARDIAC DISORDERS",
"SINUS BRADYCARDIA", "CARDIAC DISORDERS",
"SINUS TACHYCARDIA", "CARDIAC DISORDERS",
# GASTROINTESTINAL DISORDERS
"ABDOMINAL DISCOMFORT", "GASTROINTESTINAL DISORDERS",
"ABDOMINAL PAIN", "GASTROINTESTINAL DISORDERS",
"CONSTIPATION", "GASTROINTESTINAL DISORDERS",
"DIARRHOEA", "GASTROINTESTINAL DISORDERS",
"DYSPEPSIA", "GASTROINTESTINAL DISORDERS",
"FLATULENCE", "GASTROINTESTINAL DISORDERS",
"NAUSEA", "GASTROINTESTINAL DISORDERS",
"VOMITING", "GASTROINTESTINAL DISORDERS",
# GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS
"APPLICATION SITE DERMATITIS", "GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS",
"APPLICATION SITE ERYTHEMA", "GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS",
"APPLICATION SITE IRRITATION", "GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS",
"APPLICATION SITE PRURITUS", "GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS",
"FATIGUE", "GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS",
"OEDEMA PERIPHERAL", "GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS",
"PYREXIA", "GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS",
# INFECTIONS AND INFESTATIONS
"BRONCHITIS", "INFECTIONS AND INFESTATIONS",
"GASTROENTERITIS", "INFECTIONS AND INFESTATIONS",
"NASOPHARYNGITIS", "INFECTIONS AND INFESTATIONS",
"UPPER RESPIRATORY TRACT INFECTION", "INFECTIONS AND INFESTATIONS",
"URINARY TRACT INFECTION", "INFECTIONS AND INFESTATIONS",
# METABOLISM AND NUTRITION DISORDERS
"DECREASED APPETITE", "METABOLISM AND NUTRITION DISORDERS",
"HYPERCHOLESTEROLAEMIA", "METABOLISM AND NUTRITION DISORDERS",
"HYPERGLYCAEMIA", "METABOLISM AND NUTRITION DISORDERS",
"HYPOGLYCAEMIA", "METABOLISM AND NUTRITION DISORDERS",
# MUSCULOSKELETAL AND CONNECTIVE TISSUE DISORDERS
"ARTHRALGIA", "MUSCULOSKELETAL AND CONNECTIVE TISSUE DISORDERS",
"BACK PAIN", "MUSCULOSKELETAL AND CONNECTIVE TISSUE DISORDERS",
"MUSCLE SPASMS", "MUSCULOSKELETAL AND CONNECTIVE TISSUE DISORDERS",
"MYALGIA", "MUSCULOSKELETAL AND CONNECTIVE TISSUE DISORDERS",
"PAIN IN EXTREMITY", "MUSCULOSKELETAL AND CONNECTIVE TISSUE DISORDERS",
# NERVOUS SYSTEM DISORDERS
"DIZZINESS", "NERVOUS SYSTEM DISORDERS",
"HEADACHE", "NERVOUS SYSTEM DISORDERS",
"LETHARGY", "NERVOUS SYSTEM DISORDERS",
"MIGRAINE", "NERVOUS SYSTEM DISORDERS",
"PARAESTHESIA", "NERVOUS SYSTEM DISORDERS",
"SYNCOPE", "NERVOUS SYSTEM DISORDERS",
"TREMOR", "NERVOUS SYSTEM DISORDERS",
# PSYCHIATRIC DISORDERS
"ANXIETY", "PSYCHIATRIC DISORDERS",
"DEPRESSION", "PSYCHIATRIC DISORDERS",
"INSOMNIA", "PSYCHIATRIC DISORDERS",
# RESPIRATORY, THORACIC AND MEDIASTINAL DISORDERS
"COUGH", "RESPIRATORY, THORACIC AND MEDIASTINAL DISORDERS",
"DYSPNOEA", "RESPIRATORY, THORACIC AND MEDIASTINAL DISORDERS",
"NASAL CONGESTION", "RESPIRATORY, THORACIC AND MEDIASTINAL DISORDERS",
"OROPHARYNGEAL PAIN", "RESPIRATORY, THORACIC AND MEDIASTINAL DISORDERS",
# SKIN AND SUBCUTANEOUS TISSUE DISORDERS
"DERMATITIS CONTACT", "SKIN AND SUBCUTANEOUS TISSUE DISORDERS",
"ERYTHEMA", "SKIN AND SUBCUTANEOUS TISSUE DISORDERS",
"HYPERHIDROSIS", "SKIN AND SUBCUTANEOUS TISSUE DISORDERS",
"PRURITUS", "SKIN AND SUBCUTANEOUS TISSUE DISORDERS",
"RASH", "SKIN AND SUBCUTANEOUS TISSUE DISORDERS",
"URTICARIA", "SKIN AND SUBCUTANEOUS TISSUE DISORDERS",
# VASCULAR DISORDERS
"HYPERTENSION", "VASCULAR DISORDERS",
"HYPOTENSION", "VASCULAR DISORDERS",
"HOT FLUSH", "VASCULAR DISORDERS"
)
cat("Lookup table contains", nrow(meddra_lookup), "term mappings\n")Lookup table contains 59 term mappings
5 Part 2: Applying Codes with Joins
Now, we will simulate a scenario where we have the terms but missing the Body System (SOC). We will use our lookup table to fill it in.
5.1 Step 3: Simulate Uncoded Data
Letβs create a dataset that has AEDECOD but is missing AEBODSYS.
# Create uncoded dataset by dropping AEBODSYS
ae_uncoded <- ae %>%
select(USUBJID, AESEQ, AETERM, AEDECOD, AESEV, AEREL, AESER)
cat("Uncoded dataset:", nrow(ae_uncoded), "records\n")Uncoded dataset: 1191 records
head(ae_uncoded)# A tibble: 6 Γ 7
USUBJID AESEQ AETERM AEDECOD AESEV AEREL AESER
<chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 01-701-1015 1 APPLICATION SITE ERYTHEMA APPLIC⦠MILD PROB⦠N
2 01-701-1015 2 APPLICATION SITE PRURITUS APPLIC⦠MILD PROB⦠N
3 01-701-1015 3 DIARRHOEA DIARRH⦠MILD REMO⦠N
4 01-701-1023 3 ATRIOVENTRICULAR BLOCK SECOND DEG⦠ATRIOV⦠MILD POSS⦠N
5 01-701-1023 1 ERYTHEMA ERYTHE⦠MILD POSS⦠N
6 01-701-1023 2 ERYTHEMA ERYTHE⦠MODE⦠PROB⦠N
5.2 Step 4: Apply Coding via Left Join
We use left_join() to attach the SOC information from our lookup table to the clinical data based on the matching Preferred Term (AEDECOD).
# Join the lookup table to the data
ae_coded <- ae_uncoded %>%
left_join(meddra_lookup, by = "AEDECOD")
# Check the results
ae_coded %>%
select(USUBJID, AEDECOD, SOC_CODED) %>%
head(10)# A tibble: 10 Γ 3
USUBJID AEDECOD SOC_CODED
<chr> <chr> <chr>
1 01-701-1015 APPLICATION SITE ERYTHEMA GENERAL DISORDERS AND ADMINβ¦
2 01-701-1015 APPLICATION SITE PRURITUS GENERAL DISORDERS AND ADMINβ¦
3 01-701-1015 DIARRHOEA GASTROINTESTINAL DISORDERS
4 01-701-1023 ATRIOVENTRICULAR BLOCK SECOND DEGREE <NA>
5 01-701-1023 ERYTHEMA SKIN AND SUBCUTANEOUS TISSUβ¦
6 01-701-1023 ERYTHEMA SKIN AND SUBCUTANEOUS TISSUβ¦
7 01-701-1023 ERYTHEMA SKIN AND SUBCUTANEOUS TISSUβ¦
8 01-701-1028 APPLICATION SITE ERYTHEMA GENERAL DISORDERS AND ADMINβ¦
9 01-701-1028 APPLICATION SITE PRURITUS GENERAL DISORDERS AND ADMINβ¦
10 01-701-1034 APPLICATION SITE PRURITUS GENERAL DISORDERS AND ADMINβ¦
We use left_join because we want to keep all rows from our clinical data (ae_uncoded), even if a matching code isnβt found in the dictionary. If we used inner_join, we would lose data for any term not in our lookup table!
6 Part 3: Validating the Coding
After applying codes, it is critical to verify that:
- Every term got a code (no
NAvalues in the new column). - The coding is accurate.
6.1 Step 5: Generate Validation Report
Letβs create a comprehensive validation report.
# Count coded vs uncoded
validation_summary <- ae_coded %>%
summarise(
Total_Records = n(),
Coded = sum(!is.na(SOC_CODED)),
Uncoded = sum(is.na(SOC_CODED)),
Pct_Coded = round(100 * Coded / Total_Records, 1)
)
print(validation_summary)# A tibble: 1 Γ 4
Total_Records Coded Uncoded Pct_Coded
<int> <int> <int> <dbl>
1 1191 753 438 63.2
# Identify which terms are still uncoded
uncoded_terms <- ae_coded %>%
filter(is.na(SOC_CODED)) %>%
distinct(AEDECOD) %>%
arrange(AEDECOD)
if (nrow(uncoded_terms) > 0) {
cat("β οΈ WARNING: The following", nrow(uncoded_terms), "terms need to be added to the lookup table:\n\n")
print(uncoded_terms)
} else {
cat("β
SUCCESS: All terms were successfully coded!\n")
}β οΈ WARNING: The following 191 terms need to be added to the lookup table:
# A tibble: 191 Γ 1
AEDECOD
<chr>
1 ACROCHORDON EXCISION
2 ACTINIC KERATOSIS
3 AGITATION
4 ALCOHOL USE
5 ALLERGIC GRANULOMATOUS ANGIITIS
6 ALOPECIA
7 AMNESIA
8 APPLICATION SITE BLEEDING
9 APPLICATION SITE DESQUAMATION
10 APPLICATION SITE DISCHARGE
# βΉ 181 more rows
6.2 Step 6: Compare Against Original
Letβs verify our coding matches the original data.
# Join original SOC for comparison
coding_check <- ae %>%
select(USUBJID, AESEQ, AEDECOD, AEBODSYS_ORIG = AEBODSYS) %>%
inner_join(
ae_coded %>% select(USUBJID, AESEQ, SOC_CODED),
by = c("USUBJID", "AESEQ")
) %>%
mutate(
Match = AEBODSYS_ORIG == SOC_CODED
)
# Summary
cat("Coding accuracy check:\n")Coding accuracy check:
cat("Matching:", sum(coding_check$Match, na.rm = TRUE), "\n")Matching: 753
cat("Mismatched:", sum(!coding_check$Match, na.rm = TRUE), "\n")Mismatched: 0
7 Part 4: Other CDISC Controlled Terminology
Beyond MedDRA, we must ensure other variables match CDISC standards.
7.1 Severity (AESEV)
Valid CDISC CT values: βMILDβ, βMODERATEβ, βSEVEREβ
# Check for non-standard values
ae %>%
count(AESEV, name = "Count") %>%
mutate(
Is_Standard = AESEV %in% c("MILD", "MODERATE", "SEVERE"),
Status = ifelse(Is_Standard, "β Valid", "β Invalid")
)# A tibble: 3 Γ 4
AESEV Count Is_Standard Status
<chr> <int> <lgl> <chr>
1 MILD 770 TRUE β Valid
2 MODERATE 378 TRUE β Valid
3 SEVERE 43 TRUE β Valid
7.2 Seriousness (AESER)
Valid CDISC CT values: βYβ, βNβ
ae %>%
count(AESER, name = "Count") %>%
mutate(
Is_Standard = AESER %in% c("Y", "N"),
Status = ifelse(Is_Standard, "β Valid", "β Invalid")
)# A tibble: 2 Γ 4
AESER Count Is_Standard Status
<chr> <int> <lgl> <chr>
1 N 1188 TRUE β Valid
2 Y 3 TRUE β Valid
7.3 Causality (AEREL)
Causality can vary by study, but common values include:
ae %>%
count(AEREL, name = "Count") %>%
arrange(desc(Count))# A tibble: 5 Γ 2
AEREL Count
<chr> <int>
1 PROBABLE 361
2 POSSIBLE 343
3 NONE 322
4 REMOTE 161
5 <NA> 4
8 π― Practice Exercise
8.1 Your Turn: Complete the Missing Mappings
If the validation above showed uncoded terms, add them to the lookup table below and re-run the validation.
# TODO: Add any missing terms from the validation report
additional_terms <- tribble(
~AEDECOD, ~SOC_CODED,
# Add your mappings here:
# "TERM_NAME", "SYSTEM ORGAN CLASS",
)
# Combine with original lookup
complete_lookup <- bind_rows(meddra_lookup, additional_terms)
# Re-apply coding
ae_recoded <- ae_uncoded %>%
left_join(complete_lookup, by = "AEDECOD")
# Check for remaining uncoded
ae_recoded %>%
filter(is.na(SOC_CODED)) %>%
distinct(AEDECOD)9 Deliverable Summary
Today you completed the following:
| Task | Status |
|---|---|
| Created MedDRA lookup table with 50+ terms | β Done |
Applied coding using left_join() |
β Done |
| Generated validation report (coded vs uncoded) | β Done |
| Verified coding accuracy against source | β Done |
| Checked CDISC CT for AESEV, AESER, AEREL | β Done |
10 Key Takeaways
- Consistency is Mandatory: Regulatory submissions require strict adherence to Controlled Terminology (CT).
- MedDRA Hierarchy: PT (Preferred Term) β SOC (System Organ Class) is the key mapping for AE data.
- Joins for Coding:
left_join()is the primary tool for applying codes from a dictionary to a dataset. - Validation is Critical: Always check for
NAvalues after a join to identify terms that didnβt match. - Document Everything: Keep track of which version of MedDRA youβre using.
11 Resources
- CDISC Controlled Terminology Browser - Browse official values
- MedDRA Hierarchy Guide - Understanding the coding levels
- Metacore Documentation - R package for managing metadata and CT
12 Whatβs Next?
In Day 4, we will focus on Clinical Date Derivations. You will learn:
- How to parse ISO 8601 dates with
lubridate - Calculating Study Days (
--DY) with the βNo Day 0β rule - Handling partial dates (e.g., β2023-05β)
- Creating reusable derivation functions