
PII Detection and BIO Synthetic Data Generation

This repository fine-tunes a state-of-the-art PII detection system and improves its performance with synthetic PII data generation.

Introduction  •  Highlights  •  Synthetic PII Data  •  PII Entity Detection Systems  •  Issues

Introduction

Personally identifiable information (PII) is sensitive data that can be used to identify, locate, or contact an individual. PII entity detection systems identify, categorize, and redact sensitive information in unstructured text. Improving PII detection systems helps maintain the privacy and security of individuals, comply with legal and regulatory requirements, and prevent identity theft, fraud, and other types of harm. Figure 1 provides an example of PII entities tagged in the inside-outside-beginning (IOB) format.


Figure 1: Example of PII Data in IOB Format [Source].
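As a minimal illustration of the format (the sentence and entity names below are invented for illustration; they mirror the student-essay style labels used in the competition):

tokens = ["My", "name", "is", "John", "Doe", "and", "I", "study", "online", "."]
labels = ["O",  "O",    "O",  "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O", "O", "O", "O"]

Each entity begins with a B- tag, continues with I- tags, and every non-PII token is tagged O.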

The work in this repository was developed during the Kaggle competition The Learning Agency Lab - PII Data Detection. The techniques described here produce solutions in the top 1% of that competition.

Highlights

  1. Synthetic PII datasets with BIO formatting.
    • Meta-Llama-3-8B-Instruct is used to generate synthetic essays.
    • Prompting best practices from Prompt Engineering, including the use of PII placeholders instead of putting PII data directly into a prompt.
    • Faker to create custom PII data that is injected into unstructured text.
    • Decoupling the LLM-generated, domain-specific text (containing PII placeholders) from the Faker PII data enables efficient experimentation when creating synthetic PII datasets.
  2. PII Entity Detection Systems

Synthetic PII Data

The IOB format, also commonly referred to as the BIO format, is a common tagging scheme for tokens in chunking tasks such as Named Entity Recognition (NER). Creating labeled BIO datasets can be time- and labor-intensive for domain-specific applications. An alternative approach is to synthetically generate PII datasets that closely represent your real-life application. Refer to the gen-data directory for code to create domain-specific PII data. The files below are executed sequentially because each represents a different step in the synthetic PII data creation process.

Synthetic PII data was created using Faker and custom functions. This data is placed into the LLM-generated unstructured text created in the next step.
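A minimal sketch of this step using Faker's built-in providers (the entity names and record structure are illustrative, not the exact code in gen-data):

from faker import Faker

fake = Faker()

def make_pii_record() -> dict:
    # One synthetic identity, keyed by the placeholder names used in the prompts.
    return {
        "FULL_NAME": fake.name(),
        "PHONE_NUM": fake.phone_number(),
        "EMAIL": fake.email(),
        "STREET_ADDRESS": fake.address().replace("\n", ", "),
    }

# Generate one record per synthetic essay created in the next step.
records = [make_pii_record() for _ in range(1_000)]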

Generative LLMs (e.g., Llama3) were used to generate unstructured text that resembles domain-specific text; in this repository the data mimics essays from students in an online course. Refer to the various prompts for examples used in this work.

NOTICE: A helpful insight learned during this work was to prompt the LLM to create placeholders for the PII data; the data from Step #1 is then injected into the placeholders. The PII placeholder approach provided the following anecdotal benefits:

  • Certain LLM models (both open and closed source) may refuse to generate text that mentions PII due to censorship.
  • Placeholders are easy to locate using simple techniques (e.g., regex).
  • LLMs hallucinated less when writing placeholders than when writing PII data directly.

Two examples of prompting strategies are given below, demonstrating the difference between prompting with PII directly (Example 1) and with placeholders (Example 2).

Example 1: LLM Prompting with Direct PII Data Injection

Model Prompt:
Write a sentence introducing yourself and include only the personal information provided below:
  - FULL_NAME: John Doe
  - PHONE_NUM: 555-123-4567

Model Response:
My name is John Doe and you can reach me at 555-125-4567 or by email at john.doe@gmail.com.

This prompting technique routinely introduced mistakes from the LLMs that were difficult to detect programmatically and led to mis-labeling of PII in the BIO format. Example 1 asks the model to insert PII data directly into the text; notice the errors:

  • The PHONE_NUM was off by a single digit (123 vs. 125). This could occur with names, phone numbers, street addresses, or any of the PII entities.
  • The LLM would inject extra PII that was not asked for, such as the john.doe@gmail.com email.

Example 2: LLM Prompting with PII PlaceHolders

Model Prompt:
Write a sentence introducing yourself and include only your personal information using the placeholders provided below:
  - {FULL_NAME}: First and Last Name
  - {PHONE_NUM}: Personal phone number

Model Response:
My name is {FULL_NAME} and you can reach me at {PHONE_NUM}.

With Example 2 the LLM had no opportunity to transpose digits in the PHONE_NUM or alter the FULL_NAME. It also appeared to incorporate extra PII entities less frequently.
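A minimal sketch of the placeholder-generation step using the transformers text-generation pipeline with chat-style input (the model ID is real, but the prompt wording and parameters are illustrative and a recent transformers version is assumed):

import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You write short essays for an online course."},
    {"role": "user", "content": (
        "Write a short essay introducing yourself. For personal information, "
        "use only these placeholders: {FULL_NAME}, {PHONE_NUM}."
    )},
]

# Chat-style inputs are templated automatically; the generated assistant
# message is the last element of the returned conversation.
out = generator(messages, max_new_tokens=512, do_sample=True)
essay = out[0]["generated_text"][-1]["content"]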

The above two steps decoupled the PII data from the domain-specific text generation. In Step #3, the PII data from Step #1 is inserted into the LLM-generated text from Step #2. This is useful because you can easily experiment with different combinations of PII data and domain-specific generated text.
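A sketch of the injection step, assuming the curly-brace placeholders from Example 2 (the regex and helper function are illustrative):

import re

PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")

def inject_pii(text: str, record: dict) -> str:
    # Replace each {PLACEHOLDER} with its Faker-generated value.
    return PLACEHOLDER.sub(lambda m: record[m.group(1)], text)

essay = "My name is {FULL_NAME} and you can reach me at {PHONE_NUM}."
record = {"FULL_NAME": "John Doe", "PHONE_NUM": "555-123-4567"}
print(inject_pii(essay, record))
# -> My name is John Doe and you can reach me at 555-123-4567.

Because the substitution positions are known exactly, BIO labels for the injected tokens can be assigned programmatically at the same time.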

PII Entity Detection Systems

The best performing model for PII entity detection was Microsoft's DeBERTa-V3 (Decoding-enhanced BERT with Disentangled Attention) model. This model consistently performs well on encoder-model tasks such as named entity recognition (NER), question answering, and classification.

A good starting point for training a DeBERTa-V3 model is the Baseline DeBERTa-V3 Fine-Tuning module. In this module, a custom Hugging Face Trainer was created to train with either focal loss or cross-entropy (CE) loss to account for class imbalance.

from types import SimpleNamespace

from torch.nn import CrossEntropyLoss
from transformers import Trainer

# FocalLoss is a custom loss implementation defined elsewhere in this repository.

class CustomTrainer(Trainer):
    def __init__(
            self,
            focal_loss_info: SimpleNamespace,
            *args,
            class_weights=None,
            **kwargs):
        super().__init__(*args, **kwargs)
        # Assuming class_weights is a Tensor of weights for each class
        self.class_weights = class_weights
        self.focal_loss_info = focal_loss_info

    def compute_loss(self, model, inputs, return_outputs=False):
        # Extract labels
        labels = inputs.pop("labels")

        # Forward pass
        outputs = model(**inputs)
        logits = outputs.logits

        # Loss calculation
        if self.focal_loss_info.apply:
            loss_fct = FocalLoss(alpha=5, gamma=2, reduction='mean')
            loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                            labels.view(-1))
        else:
            loss_fct = CrossEntropyLoss(weight=self.class_weights)
            if self.label_smoother is not None:
                # Labels were popped from inputs above, so the smoother
                # receives them directly instead of reading inputs["labels"].
                loss = self.label_smoother(outputs, labels)
            else:
                loss = loss_fct(logits.view(-1, self.model.config.num_labels),
                                labels.view(-1))
        return (loss, outputs) if return_outputs else loss
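A sketch of how this trainer might be instantiated (the model, datasets, and hyperparameters are assumed to be defined elsewhere; values are illustrative):

from types import SimpleNamespace
from transformers import TrainingArguments

trainer = CustomTrainer(
    focal_loss_info=SimpleNamespace(apply=True),  # False falls back to CE loss
    model=model,                  # a token-classification DeBERTa-V3 (assumed defined)
    args=TrainingArguments(output_dir="./deberta-v3-pii"),
    train_dataset=train_ds,       # tokenized BIO dataset (assumed defined)
    eval_dataset=val_ds,
    class_weights=None,           # only used on the CE-loss path
)
trainer.train()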

Further tips and tricks for fine-tuning PII detection systems, contained in the training directory, include:

  • Masked Language Modeling (MLM) with the Hugging Face Trainer for Domain Adaptation can utilize unlabeled datasets to expose a model to domain-specific language patterns and terminology. Fine-tuning a model that underwent additional pre-training on the target domain, beginning from an initial checkpoint tailored to the task and data distribution at hand, typically yields better performance than fine-tuning from a generic checkpoint [Sources: 1, 2]. A sketch of this step follows this list.
  • Weights and Biases was used for experiment tracking in this source code. The link below is an excellent reference for setting up W&B.
  • Single or Dual GPU Training: three modules were prepared for experimenting with fine-tuning a model on single or dual GPUs, balancing token lengths, model size, and training time.
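A minimal sketch of the MLM domain-adaptation step referenced in the list above (the model name and hyperparameters are illustrative; tokenized_ds is an assumed, already-tokenized unlabeled dataset):

from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Randomly mask 15% of tokens; the model learns to reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./mlm-domain-adapt", num_train_epochs=3),
    train_dataset=tokenized_ds,
    data_collator=collator,
)
trainer.train()
# The resulting checkpoint then initializes the downstream NER fine-tune.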

NOTE: The workflow presented here can be adapted for many Hugging Face deep learning applications, not just LLMs.

Issues

This repository will be maintained on a best-effort basis. If you face any issue or want to make improvements, please raise an issue or submit a pull request. 😃
