An Introduction to Data Privacy in Practice

Torus Talk - MacEwan University, March 2026

Katie Burak

About Me

Katie Burak
Assistant Professor of Teaching, Department of Statistics, UBC https://katieburak.github.io/

https://katieburak.github.io/torus-talk-2026/

What Are You Comfortable Sharing?

  • Your favourite type of music
  • Your Instagram likes and follows
  • Your e-mail
  • Your name and DOB
  • Your GPS location throughout the day
  • Your browsing history
  • Your private messages/DMs
  • Which of these data would you feel comfortable sharing with an app?
  • What if it combined two or three pieces of information?

  • Even something as simple as your Facebook “likes” can reveal a lot more than you think…
  • Researchers at Cambridge showed that algorithms could predict:
    • Sexual orientation with up to 88% accuracy
    • Race with 95% accuracy
    • Political affiliation with 85% accuracy
  • All from analyzing the pages and posts you “liked” (no profile bio or messages needed)!

https://www.cam.ac.uk/research/news/digital-records-could-expose-intimate-details-and-personality-traits-of-millions

What Happens to Your Data?

Every time you use an app, visit a website, click on a link, fill out a survey or even just scroll on your device, your data is being:

  • Collected
  • Analyzed
  • Shared or Sold

Why Does This Matter?

  • You may be targeted with ads, content and potentially misinformation
  • You could be judged or profiled based on your data
  • You don’t always know who has your data (or what they’re doing with it)

Source: CBC News

Personally Identifiable Information (PII)

  • PII refers to any data that can be used to identify a specific individual.
  • Direct identifiers: These clearly and uniquely point to a person.
    • Examples: name, social security number, patient ID
  • Indirect identifiers: These don’t identify someone on their own, but could when combined.
    • Examples: age, DOB, postal code, race, sex

Personal Data

Data can be identifiable when:

  • They contain directly identifying information
  • It’s possible to single out an individual
  • It’s possible to infer information about an individual based on information in your dataset
  • It’s possible to link records relating to an individual
  • De-identification is still reversible

Scenario: Can This Data Identify You?

A fitness app shares anonymized data with researchers. The dataset includes:

  • Step count per day
  • General location (postal code)
  • Age
  • Time of day the user exercises
  • Health conditions

Separately, a publicly available dataset includes information from a local running club: names, age groups and 5K race times.

The Mosaic Effect

  • The “Mosaic Effect” happens when separate pieces of data, which alone don’t identify anyone, are combined from different sources to reveal personal information or identify an individual.

  • In 2000, 87% of the United States population was found to be identifiable using a combination of their ZIP code, gender and date of birth.

https://dataprivacylab.org/projects/identifiability/paper1.pdf

Pseudonymization and Anonymization

  • Pseudonymization and anonymization are techniques to de-identify personal data
  • Goal: reduce linkability of data to individuals
  • We will now define each of these terms

Pseudonymization

  • Reduces linkability of data to individuals
  • Data cannot identify individuals without additional information
  • Often done by replacing direct identifiers with pseudonyms
  • Link between real identifiers and pseudonyms is stored separately
  • Re-identification remains possible!

Anonymization

  • Data are anonymized when no individual is identifiable (directly or indirectly)
  • This applies even to the data controller
  • Fully anonymized data are no longer personal data
  • Anonymization is difficult to achieve in practice

Identifiability Spectrum

  • Identifiability is a spectrum
  • More de-identified data = closer to anonymized
  • Lower identifiability = lower re-identification risk

https://www.kdnuggets.com/2020/08/anonymous-anonymized-data.html

De-identification Techniques

First, let’s generate some (synthetic) data we can use to help illustrate these concepts.

library(tidyverse)

df <- tibble(
  name = c("Ada Lovelace", "Sophie Germain", "Carl Gauss", "Leonhard Euler"),
  age = c(36, 55, 61, 76),          
  height_cm = c(160, 173, 182, 185)  
)

df
# A tibble: 4 × 3
  name             age height_cm
  <chr>          <dbl>     <dbl>
1 Ada Lovelace      36       160
2 Sophie Germain    55       173
3 Carl Gauss        61       182
4 Leonhard Euler    76       185

Suppression

  • Remove entire variables, values or records
  • Used to eliminate highly identifying or unnecessary data
  • Examples:
    • Names, contact details, social security numbers
    • GPS metadata, IP addresses, neuroimaging facial features
    • Outliers or unique participants

Suppression Example

df_suppressed <- df |>
  select(-name)

df_suppressed
# A tibble: 4 × 2
    age height_cm
  <dbl>     <dbl>
1    36       160
2    55       173
3    61       182
4    76       185

Generalization

  • Reduces detail or granularity in the data
  • Makes individuals harder to single out
  • Examples:
    • Convert date of birth to age, or group into ranges
    • Replace address with town or region
    • Recategorize rare labels into “other” or “missing”

Here we will show an example of generalization on the age column:

df_generalized <- df |>
  mutate(age_group = case_when(
    age < 60 ~ "under 60",
    TRUE     ~ "60+"
  )) |>
  select(-age)

df_generalized
# A tibble: 4 × 3
  name           height_cm age_group
  <chr>              <dbl> <chr>    
1 Ada Lovelace         160 under 60 
2 Sophie Germain       173 under 60 
3 Carl Gauss           182 60+      
4 Leonhard Euler       185 60+      

Replacement

  • Swap identifying info with less informative alternatives
  • Examples:
    • Use pseudonyms for names (with securely stored keyfile)
    • Replace with placeholders (e.g., “[redacted]”)
    • Rounding numeric values

Creating Pseudonyms

  • Pseudonyms should reveal nothing about the subject
  • Good pseudonyms:
    • Are random or meaningless strings/numbers
    • Are securely managed (e.g., encrypted keyfile)
  • Can be generated using tools in Excel, R, Python, SPSS
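As a sketch in R (the `make_pseudonyms` helper and the 8-character length are our own illustrative choices, not a standard), random alphanumeric pseudonyms could be generated like this:

```r
library(tidyverse)

set.seed(123)

# Hypothetical helper: random, meaningless pseudonyms.
# Note: sampling with replacement can collide, so for real use
# you would check uniqueness and store the keyfile securely.
make_pseudonyms <- function(n, length = 8) {
  chars <- c(LETTERS, 0:9)
  vapply(seq_len(n), function(i) {
    paste0(sample(chars, length, replace = TRUE), collapse = "")
  }, character(1))
}

make_pseudonyms(4)
```

The key point is that, unlike sequential IDs, these strings reveal nothing about row order or the subject.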

Replacement with Pseudonyms

df_pseudonymized <- df |>
  mutate(pseudonym = paste0("ID", row_number())) |>
  select(pseudonym, everything(), -name)

df_pseudonymized
# A tibble: 4 × 3
  pseudonym   age height_cm
  <chr>     <dbl>     <dbl>
1 ID1          36       160
2 ID2          55       173
3 ID3          61       182
4 ID4          76       185

Hashing

  • Hashing converts names into fixed-length, irreversible strings.
  • Unlike pseudonyms, hashed values cannot be easily reversed.
  • In R, we can use the digest package (and its digest() function) to hash.

library(digest) 

df_hashed <- df |>
  rowwise() |>
  mutate(name_hash = digest(name)) |>
  select(name_hash, everything(), -name)

df_hashed
# A tibble: 4 × 3
# Rowwise: 
  name_hash                          age height_cm
  <chr>                            <dbl>     <dbl>
1 a137c27c4882477833f58a18f1f91a7a    36       160
2 8a772a499a9d3dcea329a7ce037c953f    55       173
3 35d593f80c5060f502f5cdbfc5876a16    61       182
4 e14f21d48648eca198760d84bed1734b    76       185

Top- and Bottom-Coding

  • Limits extreme values in quantitative data
  • Recode all values above or below a threshold
  • Example: all incomes above $150,000 become $150,000
  • Preserves much of the dataset, but distorts distribution tails

Top-coding example

  • Suppose 6ft (182.88cm) is considered our maximum height threshold:

df_top_coded <- df |>
  mutate(height_cm = if_else(height_cm > 182.88, 182.88, height_cm))

df_top_coded
# A tibble: 4 × 3
  name             age height_cm
  <chr>          <dbl>     <dbl>
1 Ada Lovelace      36      160 
2 Sophie Germain    55      173 
3 Carl Gauss        61      182 
4 Leonhard Euler    76      183.

Adding Noise

  • Introduces randomness to protect sensitive info
  • Examples:
    • Add a small random amount to numeric values
    • Blur images or alter voices

Adding Noise to Height

set.seed(200) 

df_noisy <- df |>
  mutate(height_cm_noisy = height_cm + rnorm(n(), mean = 0, sd = 2)) |>
    select(-height_cm)

df_noisy
# A tibble: 4 × 3
  name             age height_cm_noisy
  <chr>          <dbl>           <dbl>
1 Ada Lovelace      36            160.
2 Sophie Germain    55            173.
3 Carl Gauss        61            183.
4 Leonhard Euler    76            186.

Permutation

  • Swap values between individuals
  • Makes linking variables across a record more difficult
  • Maintains distributions, but breaks correlations
  • Can limit the types of analyses possible

Permutation of Height Values

set.seed(200)

df_permuted <- df |>
  mutate(height_cm_permuted = sample(height_cm)) |>
    select(-height_cm)

df_permuted
# A tibble: 4 × 3
  name             age height_cm_permuted
  <chr>          <dbl>              <dbl>
1 Ada Lovelace      36                173
2 Sophie Germain    55                185
3 Carl Gauss        61                160
4 Leonhard Euler    76                182

Privacy vs. Utility Tradeoff

https://www.researchgate.net/figure/Trade-off-between-privacy-level-and-utility-level-of-data_fig1_357987903

Case Study: Brogan Inc. and NIHB Data

  • The Non-Insured Health Benefits (NIHB) database contains sensitive health data on First Nations use of services like prescriptions, dental care, and medical devices.
  • In 2001, Health Canada began releasing de-identified NIHB pharmacy claims data to Brogan Inc. (a private health consulting firm).
  • Though personal identifiers were removed, community identifiers remained, and First Nations were not informed until 2007.
  • Brogan sold the data to pharmaceutical companies for commercial research and marketing.
  • Health Canada justified the release by claiming no privacy interests remained since personally identifying information had been removed.

Kukutai, T., & Taylor, J. (2016). Indigenous data sovereignty: Toward an agenda. ANU press.

Discussion

  • What are the limits of simply removing names and IDs from a dataset?
  • How can we measure whether a dataset is truly “safe” to release?
  • Should de-identified data still require community consent before being shared or sold?

Why basic de-identification isn’t always enough

  • Individuals can often be re-identified using other information.

  • As datasets become more detailed and linkable, privacy risks increase.

  • De-identification may protect individuals but not groups; even “anonymous” data can reveal sensitive patterns about communities.

  • More advanced statistical methods are often needed to ensure meaningful de-identification while preserving data utility.

Statistical approaches to de-identification



  • \(k\)-anonymity
  • \(l\)-diversity
  • Differential privacy (advanced)

Overview of privacy models

  • \(k\)-anonymity and \(l\)-diversity are statistical approaches that quantify the level of identifiability within a tabular dataset.
  • They focus on how variables combined can lead to identification.
  • They work best on relatively large datasets, where enough observations are present to preserve useful detail while still protecting privacy.

Identifiers, Quasi-Identifiers, and Sensitive Attributes

Privacy models distinguish between three types of variables:

  • Identifiers: Direct identifiers such as names, student numbers, email addresses.

  • Quasi-Identifiers: Indirect identifiers that can lead to identification when combined with other quasi-identifiers or external data.

    • Examples: age, sex, place of residence, physical characteristics, timestamps, etc.
  • Sensitive Attributes: Variables of interest that need protection and cannot be altered as they are key outcomes.

    • Examples: Medical condition, Income, etc.
  • Correctly categorizing variables into identifiers, quasi-identifiers, and sensitive attributes is crucial in determining how to de-identify your dataset effectively.

\(k\)-anonymity

  • A data set is \(k\)-anonymous if each observation cannot be distinguished from at least \(k-1\) other observations based on the quasi-identifiers.
  • This can be achieved through generalization, suppression and sometimes top- or bottom-coding of data values.
  • Applying \(k\)-anonymity makes it more difficult for an attacker to single out or re-identify specific individuals.
  • It also helps reduce the risk of the mosaic effect, where combining data points could lead to identification.

Making a data set \(k\)-anonymous

  1. Identify variables as identifiers, quasi-identifiers and sensitive attributes.
  2. Choose a value for \(k\).
  3. Aggregate or transform the data so each combination of quasi-identifiers occurs at least \(k\) times.
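Step 3 can be checked directly: count how often each quasi-identifier combination occurs and take the minimum. A minimal sketch in R (the `k_of` helper is our own, not a standard function):

```r
library(tidyverse)

# k achieved by a dataset = size of its smallest quasi-identifier group
k_of <- function(data, ...) {
  data |>
    count(...) |>   # rows per quasi-identifier combination
    pull(n) |>
    min()
}

df_general <- tibble(
  age_range = c("30–39", "30–39", "40–49", "40–49"),
  city      = c("Calgary", "Calgary", "Calgary", "Calgary")
)

k_of(df_general, age_range, city)  # each combination occurs twice, so k = 2
```

If `k_of()` returns less than the chosen \(k\), more generalization or suppression is needed.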

Choosing \(k\)

  • There is no single correct value for \(k\)!
  • Higher \(k\) increases privacy, but reduces data detail and utility.
  • The choice depends on promises made to data subjects and acceptable risk levels.

Source: k2view.com

Example data

  • Age and city are quasi-identifiers, and salary is considered a sensitive attribute.

Age  City       Salary
38   Calgary    90,000–99,999
37   Toronto    90,000–99,999
31   Vancouver  80,000–89,999
48   Calgary    110,000–119,999
39   Vancouver  110,000–119,999
37   Calgary    90,000–99,999
34   Toronto    90,000–99,999
33   Vancouver  80,000–89,999
32   Toronto    100,000–109,999
45   Calgary    90,000–99,999

\(k=2\)

Age Range  City       Salary Range
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
40–49      Calgary    110,000–119,999
30–39      Vancouver  110,000–119,999
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
30–39      Toronto    100,000–109,999
40–49      Calgary    90,000–99,999

Given the data and treating condition as the sensitive attribute, which field(s) could you generalize to help achieve \(k = 3\) anonymity?

Age  ZIP Code  Condition
29   13053     Flu
27   13068     Flu
28   13068     Cold
45   14853     Diabetes
46   14853     Diabetes
47   14853     Cancer
  • A. Generalize Age into age ranges (e.g., 20–29, 40–49)
  • B. Suppress Condition entirely
  • C. Generalize ZIP Code to first 3 digits (e.g., 130, 148)
  • D. Generalize Age into age ranges (e.g., 20–29, 40–49) and ZIP code to first 3 digits (e.g., 130, 148)
  • E. It’s already \(k=3\) anonymous

Age    ZIP Code  Condition
20–29  130       Flu
20–29  130       Flu
20–29  130       Cold
40–49  148       Diabetes
40–49  148       Diabetes
40–49  148       Cancer
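The generalization shown above (option D) can be reproduced in R; this is a sketch using the slide’s toy data:

```r
library(tidyverse)

df_health <- tibble(
  age       = c(29, 27, 28, 45, 46, 47),
  zip       = c("13053", "13068", "13068", "14853", "14853", "14853"),
  condition = c("Flu", "Flu", "Cold", "Diabetes", "Diabetes", "Cancer")
)

df_k3 <- df_health |>
  mutate(
    age_range = if_else(age < 40, "20–29", "40–49"),  # generalize age
    zip3      = str_sub(zip, 1, 3)                    # first 3 ZIP digits
  ) |>
  select(age_range, zip3, condition)

# Every (age_range, zip3) combination now occurs 3 times: k = 3
df_k3 |> count(age_range, zip3)
```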

\(l\)-diversity

  • But what if all individuals within a group share the same sensitive value?

Age Range  ZIP  Condition
20–29      130  Flu
20–29      130  Flu
20–29      130  Flu
  • \(l\)-diversity is an extension of \(k\)-anonymity that ensures sufficient variation in a sensitive attribute.

  • Although these data are \(2\)-anonymous, we can still infer that any 30–39-year-old from Calgary who participated earns between $90,000 and $99,999.

Age Range  City       Salary Range
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
40–49      Calgary    110,000–119,999
30–39      Vancouver  110,000–119,999
30–39      Calgary    90,000–99,999
30–39      Toronto    90,000–99,999
30–39      Vancouver  80,000–89,999
30–39      Toronto    100,000–109,999
40–49      Calgary    90,000–99,999

\(l\)-diversity

  • The approach requires at least \(l\) different values for the sensitive attribute within each combination of quasi-identifiers.
  • Again, there is no perfect value for \(l\) (typically \(1 < l \leq k\)).

  • With \(l=2\), for each combination of Age Range and City there are at least 2 distinct Salary Ranges.

Age Range  City     Salary Range
30–39      -        90,000–99,999
30–39      -        90,000–99,999
30–39      -        80,000–89,999
40–49      Calgary  110,000–119,999
30–39      -        110,000–119,999
30–39      -        90,000–99,999
30–39      -        90,000–99,999
30–39      -        80,000–89,999
30–39      -        100,000–109,999
40–49      Calgary  90,000–99,999
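\(l\) can be checked the same way \(k\) was: for each quasi-identifier group, count the distinct sensitive values and take the minimum. A minimal sketch (the `l_of` helper is our own illustration):

```r
library(tidyverse)

# l achieved = fewest distinct sensitive values in any
# quasi-identifier group
l_of <- function(data, sensitive, ...) {
  data |>
    group_by(...) |>
    summarise(l = n_distinct({{ sensitive }}), .groups = "drop") |>
    pull(l) |>
    min()
}

df_grouped <- tibble(
  age_range = c("40–49", "40–49", "30–39", "30–39"),
  salary    = c("110,000–119,999", "90,000–99,999",
                "90,000–99,999",   "90,000–99,999")
)

# The 30–39 group has only one salary range, so l = 1:
l_of(df_grouped, salary, age_range)
```

A result below the target \(l\) signals that some group’s sensitive attribute is still fully predictable.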

There are still issues…

  • Even though the data are de-identified, some sensitive patterns can still leak through.

  • In the example we discussed, both individuals are grouped into the same age range and city.

  • While they are in different salary ranges and exact values are hidden, the range is still quite narrow.

  • Due to the similarity of the salary ranges, one can still infer that both individuals earn between $90,000 and $119,999.

Age Range  City     Salary Range
40–49      Calgary  110,000–119,999
40–49      Calgary  90,000–99,999

Differential privacy

  • So, we may need more sophisticated tools to privatize our data…
  • Differential privacy is a mathematical approach to protecting privacy
  • It ensures algorithm results are nearly the same whether one person’s data is included or not
  • Differential privacy makes it hard to tell whether any individual’s data are in the dataset, which protects individuals’ information (even with unusual or unique data)
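One common mechanism (not shown on the slide) adds Laplace noise, scaled to sensitivity/\(\epsilon\), to a query result. A minimal sketch in base R; `laplace_noise` is our own helper, since base R has no built-in Laplace sampler:

```r
set.seed(200)

# Laplace(0, scale) noise as the difference of two exponentials
laplace_noise <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

true_count  <- 42   # e.g., number of participants with some condition
epsilon     <- 1    # privacy budget: smaller = more privacy, more noise
sensitivity <- 1    # adding/removing one person changes a count by at most 1

noisy_count <- true_count + laplace_noise(1, scale = sensitivity / epsilon)
noisy_count
```

The released value stays close to the true count, but the presence or absence of any single individual has only a bounded effect on its distribution.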

Differential Privacy Example


Source: https://medium.com/data-science/a-differential-privacy-example-for-beginners-ef3c23f69401

Open Science



  • Open science is about making scientific research, data, and dissemination accessible to all.
  • It promotes transparency, collaboration, and innovation in research.
  • Includes open access publications, open data and open tools.

What are Open Data?



  • Open data refers to freely accessible, online data that can be used, reused, and shared with proper attribution given to the original source (FOSTER).
  • Sharing and reusing open data helps make research more transparent and reproducible.
  • Ethical considerations mean that not all data can (or should) be fully open (e.g., personal or sensitive data).

Why Open Data Matters

  • Reproducibility: Enables verification and replication of research.
  • Efficiency: Saves time and resources by reducing redundant data collection.
  • Collaboration: Allows researchers to combine datasets for new insights.
  • Innovation: Drives new discoveries and applications across disciplines.

Data Ownership & Licensing

  • Understanding data ownership is essential before sharing or licensing data.
  • Ownership depends on factors like:
    • Who collected or created the data.
    • Institutional policies and employment contracts.
    • Funding agency requirements and agreements.
    • The nature of the data - personal data may be subject to privacy laws (refer back to the Amazon example).

Balancing Openness & Ethics



  • Open science supports open data whenever ethically appropriate.
  • Some data must remain restricted due to privacy, security, or legal constraints.
  • Best practices help balance openness with responsibility.
  • Open data and open science movements often overlook marginalized individuals’ rights and interests (e.g., Indigenous data).
  • The goal: “As open as possible, as closed as necessary.”

Key Takeaways

  • Data exists on a spectrum of identifiability
  • Even seemingly anonymous data can often be re-identified (e.g., mosaic effect)
  • Quasi-identifiers can lead to re-identification if not protected
  • Choosing privacy parameters involves balancing risk and data utility
  • Responsible data handling requires both technical skill and ethical awareness

Attribution





Thank you! Questions?