| Title: | Privacy-Preserving Data Anonymization |
|---|---|
| Description: | Tools for anonymizing sensitive patient and research data. Helps protect privacy while keeping data useful for analysis. Anonymizes IDs, names, dates, locations, and ages while maintaining referential integrity. Methods based on: Sweeney (2002) <doi:10.1142/S0218488502001648>, Dwork et al. (2006) <doi:10.1007/11681878_14>, El Emam et al. (2011) <doi:10.1371/journal.pone.0028071>, Fung et al. (2010) <doi:10.1145/1749603.1749605>. |
| Authors: | Vikrant Dev Rathore [aut, cre] |
| Maintainer: | Vikrant Dev Rathore <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.1 |
| Built: | 2026-05-24 09:26:19 UTC |
| Source: | https://github.com/vikrant31/privacyr |
Groups ages into buckets for privacy protection. Default uses 10-year buckets (0-9, 10-19, etc.) which are useful for research. Ages 90+ are grouped together.

    anonymize_age(x, method = c("10year", "hipaa"), custom_buckets = NULL)

| Argument | Description |
|---|---|
| x | A numeric vector of ages to anonymize. |
| method | Character string specifying the bucketing method. "10year" (default) uses 10-year buckets: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90+. "hipaa" uses HIPAA-compliant buckets: 0-17, 18-64, 65-89, 90+. |
| custom_buckets | Optional named numeric vector for custom buckets. Format: c("0-9" = 9, "10-19" = 19, "20-29" = 29, "90+" = Inf) |

Returns a character vector of age buckets.

    ages <- c(25, 45, 67, 92, 15, 78)
    anonymize_age(ages)                    # Uses 10-year buckets by default
    anonymize_age(ages, method = "hipaa")  # Use HIPAA buckets

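The bucketing described above can be approximated in base R with cut(). This is a minimal sketch, not privacyr's implementation: bucket_ages() is a hypothetical helper, and it assumes that each value in a custom_buckets-style vector is the inclusive upper bound of its bucket, which is how the documented format reads.

```r
# Hypothetical helper (not part of privacyr): assign each age to a labelled bucket.
# Assumption: `upper` is a named numeric vector whose values are inclusive
# upper bounds, e.g. c("0-17" = 17, "18-64" = 64, "65-89" = 89, "90+" = Inf).
bucket_ages <- function(x, upper) {
  upper <- sort(upper)
  # floor() keeps a fractional age such as 9.5 in the "0-9" bucket
  as.character(cut(floor(x), breaks = c(-Inf, upper), labels = names(upper)))
}

# The two documented schemes, written as upper-bound vectors
ten_year <- c(seq(9, 89, by = 10), Inf)
names(ten_year) <- c(paste0(seq(0, 80, by = 10), "-", seq(9, 89, by = 10)), "90+")
hipaa <- c("0-17" = 17, "18-64" = 64, "65-89" = 89, "90+" = Inf)

ages <- c(25, 45, 67, 92, 15, 78)
bucket_ages(ages, ten_year)
# "20-29" "40-49" "60-69" "90+" "10-19" "70-79"
bucket_ages(ages, hipaa)
# "18-64" "18-64" "65-89" "90+" "0-17" "65-89"
```

Because the buckets are defined by upper bounds alone, any value below the first bound (including an invalid negative age) lands in the first bucket, so input validation would have to happen separately.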
Main function to anonymize patient data in a data frame or data.table. Automatically detects and anonymizes columns based on data types and naming patterns, or you can manually specify columns. Different datasets get different anonymized values for better privacy.

    anonymize_dataframe(
      data,
      id_cols = NULL,
      name_cols = NULL,
      date_cols = NULL,
      location_cols = NULL,
      age_cols = NULL,
      auto_detect = TRUE,
      detect_by_type = TRUE,
      date_method = "shift",
      date_granularity = "month",
      location_method = "generalize",
      age_method = "10year",
      use_uuid = TRUE,
      seed = NULL,
      dataset_specific = TRUE
    )

| Argument | Description |
|---|---|
| data | A data frame or data.table containing patient data. |
| id_cols | Character vector of column names containing patient IDs. |
| name_cols | Character vector of column names containing patient names. |
| date_cols | Character vector of column names containing dates. |
| location_cols | Character vector of column names containing locations. |
| age_cols | Character vector of column names containing ages. |
| auto_detect | Logical, if TRUE (default), automatically detects columns based on data types and common naming patterns. |
| detect_by_type | Logical, if TRUE (default), detects columns by their R data types (Date, character, etc.) in addition to name patterns. |
| date_method | Method for date anonymization: "shift" or "round" (default: "shift"). Use "round" to enable granularity options including "month_year" (YYYYMM format). |
| date_granularity | For date rounding (when date_method = "round"): "day", "week", "month", "month_year" (returns YYYYMM format, e.g., "202005"), "quarter", or "year" (default: "month"). |
| location_method | Method for location anonymization: "remove" or "generalize" (default: "generalize"). |
| age_method | Method for age anonymization: "10year" (default) uses 10-year buckets (0-9, 10-19, 20-29, ..., 80-89, 90+) for better research utility, or "hipaa" for HIPAA-compliant buckets (0-17, 18-64, 65-89, 90+). |
| use_uuid | Logical, if TRUE uses short UUIDs for IDs, names, and locations instead of sequential identifiers (default: TRUE). Dates and ages are not affected. |
| seed | An optional seed for reproducible anonymization. With the default dataset_specific = TRUE, different datasets still get different anonymized values even with the same seed. |
| dataset_specific | Logical, if TRUE (default), generates dataset-specific seeds so different datasets get different anonymized values. |

Returns a data frame with anonymized patient data (preserves data.table class if input was data.table).

    # Basic usage with auto-detection
    patient_data <- data.frame(
      patient_id = c("P001", "P002", "P003"),
      name = c("John Doe", "Jane Smith", "Bob Johnson"),
      dob = as.Date(c("1980-01-15", "1975-03-20", "1990-06-10")),
      location = c("New York, NY", "Los Angeles, CA", "Chicago, IL"),
      diagnosis = c("A", "B", "A")
    )
    anonymize_dataframe(patient_data, seed = 123)

    # With month_year date granularity (YYYYMM format)
    anonymize_dataframe(patient_data, date_method = "round", date_granularity = "month_year")

    # Works with data.table
    if (requireNamespace("data.table", quietly = TRUE)) {
      dt <- data.table::as.data.table(patient_data)
      anonymize_dataframe(dt)
    }

    # With UUID anonymization (default)
    anonymize_dataframe(patient_data, seed = 123)

    # Without UUID (sequential IDs)
    anonymize_dataframe(patient_data, use_uuid = FALSE, seed = 123)

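To make the idea of detection by name pattern and by data type concrete, here is a rough base R sketch. It is not privacyr's detection logic: detect_columns() is a hypothetical helper, and every regular expression in it is an assumption chosen for illustration, since the manual does not list the patterns the package actually uses.

```r
# Hypothetical detector (not privacyr's actual rules): guess column roles
# from lower-cased column names and from R classes. All patterns are assumptions.
detect_columns <- function(data) {
  nms <- tolower(names(data))
  # Type-based detection: Date and POSIXct columns count as dates
  is_date <- vapply(data, function(col) inherits(col, c("Date", "POSIXct")), logical(1))
  pick <- function(hit) names(data)[hit]
  list(
    id_cols       = pick(grepl("(^|_)id$", nms)),
    name_cols     = pick(grepl("name", nms, fixed = TRUE)),
    date_cols     = pick(is_date | grepl("date|dob", nms)),
    location_cols = pick(grepl("location|address|city", nms)),
    age_cols      = pick(grepl("(^|_)age$", nms))
  )
}

patient_data <- data.frame(
  patient_id = c("P001", "P002", "P003"),
  name = c("John Doe", "Jane Smith", "Bob Johnson"),
  dob = as.Date(c("1980-01-15", "1975-03-20", "1990-06-10")),
  location = c("New York, NY", "Los Angeles, CA", "Chicago, IL"),
  diagnosis = c("A", "B", "A")
)

detected <- detect_columns(patient_data)
detected$id_cols    # "patient_id"
detected$date_cols  # "dob"
detected$age_cols   # character(0)
```

Heuristics like these can misfire (the "name" pattern would also catch a column such as hospital_name), which is one reason to pass id_cols, name_cols, and the other column arguments explicitly when the column roles are known.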
Anonymizes dates by shifting them by a random offset or rounding to a specified granularity. Shifting preserves relative time differences.

    anonymize_dates(
      x,
      method = c("shift", "round"),
      days_shift = NULL,
      granularity = "month",
      seed = NULL
    )

| Argument | Description |
|---|---|
| x | A vector of dates (Date, POSIXct, or character that can be coerced to Date). |
| method | Character string specifying anonymization method: "shift" (default) shifts all dates by a random offset, "round" rounds dates to specified granularity. |
| days_shift | For "shift" method: number of days to shift (default: NULL, which picks a random offset between -365 and 365). |
| granularity | For "round" method: "day", "week", "month", "month_year", "quarter", or "year" (default: "month"). "month_year" returns character strings in "YYYYMM" format (e.g., "202005" for May 2020). |
| seed | An optional seed for reproducible anonymization. |

Returns a Date vector of anonymized dates (or a character vector for "month_year" granularity).

    dates <- as.Date(c("2020-01-15", "2020-03-20", "2020-06-10"))
    anonymize_dates(dates, method = "shift", seed = 123)
    anonymize_dates(dates, method = "round", granularity = "month")
    anonymize_dates(dates, method = "round", granularity = "month_year")

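The claim that shifting preserves relative time differences follows from applying one offset to the whole vector. The base R sketch below illustrates that and the "YYYYMM" format; it does not call privacyr, and representing a month-rounded date as the first day of its month is an assumption made here for illustration.

```r
dates <- as.Date(c("2020-01-15", "2020-03-20", "2020-06-10"))

# Shift: draw ONE offset and add it to every date, so the gaps are unchanged.
set.seed(123)
offset <- sample(-365:365, 1)
shifted <- dates + offset
all(diff(shifted) == diff(dates))
# TRUE

# Round to month (assumption: shown here as the first day of the month).
as.Date(format(dates, "%Y-%m-01"))
# "2020-01-01" "2020-03-01" "2020-06-01"

# "month_year"-style output: a YYYYMM character string.
format(dates, "%Y%m")
# "202001" "202003" "202006"
```

Drawing a separate offset per date would hide more but would also destroy the intervals between events, which is the property the shift method is documented to keep.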
Replaces patient identifiers with anonymized versions while maintaining referential integrity (same IDs get the same anonymized value).

    anonymize_id(x, prefix = "ID", seed = NULL, use_uuid = TRUE)

| Argument | Description |
|---|---|
| x | A vector of identifiers to anonymize (character, numeric, or factor). |
| prefix | A character string to prefix anonymized IDs (default: "ID"). |
| seed | An optional seed for reproducible anonymization. |
| use_uuid | Logical, if TRUE uses short UUIDs instead of sequential IDs (default: TRUE). |

Returns a character vector of anonymized identifiers.

    ids <- c("P001", "P002", "P003", "P001")
    anonymize_id(ids)
    anonymize_id(ids, prefix = "PAT", seed = 123)
    anonymize_id(ids, use_uuid = FALSE, seed = 123)  # Use sequential IDs

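Referential integrity here means the replacement depends only on the original value, so repeated values map to the same label. A minimal base R sketch of the sequential variant follows. pseudonymize() is a hypothetical helper, not privacyr's code, and the "ID_001" label format is an assumption; the UUID variant is not shown.

```r
# Hypothetical helper (not part of privacyr): one label per distinct value.
pseudonymize <- function(x, prefix = "ID", seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x <- as.character(x)
  keys <- unique(x)
  # Shuffle the distinct values so labels do not reveal order of first appearance
  keys <- keys[sample.int(length(keys))]
  # match() looks each value up in the shuffled key table
  sprintf("%s_%03d", prefix, match(x, keys))
}

ids <- c("P001", "P002", "P003", "P001")
anon <- pseudonymize(ids, seed = 123)

anon[1] == anon[4]    # TRUE: the repeated "P001" gets the same label
length(unique(anon))  # 3: one label per distinct original ID
```

The same lookup-table approach applies to names and locations, since those functions document the same guarantee.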
Anonymizes geographic locations by removing them or replacing with generic labels. Maintains referential integrity (same locations get the same value).

    anonymize_locations(
      x,
      method = c("remove", "generalize"),
      prefix = "Location",
      seed = NULL,
      use_uuid = TRUE
    )

| Argument | Description |
|---|---|
| x | A character vector of locations to anonymize. |
| method | Character string specifying anonymization method: "remove" (default) removes location information, "generalize" replaces with generic location labels. |
| prefix | For "generalize" method: prefix for generic locations (default: "Location"). |
| seed | An optional seed for reproducible anonymization. |
| use_uuid | Logical, if TRUE uses short UUIDs instead of sequential IDs (default: TRUE). Only applies when method = "generalize". |

Returns a character vector of anonymized locations.

    locations <- c("New York, NY", "Los Angeles, CA", "Chicago, IL")
    anonymize_locations(locations, method = "remove")
    anonymize_locations(locations, method = "generalize", seed = 123)
    anonymize_locations(locations, method = "generalize", use_uuid = FALSE, seed = 123)  # Use sequential IDs

Replaces patient names with anonymized identifiers while maintaining referential integrity (same names get the same anonymized value).

    anonymize_names(x, prefix = "Patient", seed = NULL, use_uuid = TRUE)

| Argument | Description |
|---|---|
| x | A character vector of names to anonymize. |
| prefix | A character string to prefix anonymized names (default: "Patient"). |
| seed | An optional seed for reproducible anonymization. |
| use_uuid | Logical, if TRUE uses short UUIDs instead of sequential IDs (default: TRUE). |

Returns a character vector of anonymized names.

    names <- c("John Doe", "Jane Smith", "Bob Johnson")
    anonymize_names(names)
    anonymize_names(names, prefix = "PAT", seed = 123)
    anonymize_names(names, use_uuid = FALSE, seed = 123)  # Use sequential IDs