Privacy Shield

Design decisions made while developing a tool to redact PII from text and documents

Linux Penguin

This document outlines the design decisions I make as I flesh out the Privacy Shield API and its functionality.

Privacy Shield will allow developers to redact personally identifiable information (PII) from text supplied through one of the API endpoints.

Developers can use the API service to redact PII from log files, chat logs, and clinical correspondence, among other use cases.

Privacy Shield will use Amazon Comprehend to detect PII. It will also apply text compaction techniques to reduce the amount of text sent to Comprehend, thereby cutting your Comprehend spend by up to 70%.
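As a rough illustration of the detect-then-redact flow, here is a minimal Python sketch. It assumes boto3's Comprehend client and its detect_pii_entities call; the function names, the eu-west-2 region and the [TYPE] placeholder format are my own choices for the example, not a settled part of the Privacy Shield design.

```python
from typing import Dict, List


def detect_pii(text: str, region: str = "eu-west-2") -> List[Dict]:
    """Ask Amazon Comprehend for the PII entities in `text`.

    The region default is an assumption made for this example.
    """
    import boto3  # imported lazily so redact() can be used without AWS access

    client = boto3.client("comprehend", region_name=region)
    response = client.detect_pii_entities(Text=text, LanguageCode="en")
    return response["Entities"]


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [TYPE] placeholder."""
    # Work from the end of the text backwards so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:start] + f"[{entity['Type']}]" + text[end:]
    return text
```

Each entity returned by Comprehend carries a Type, a Score, and BeginOffset/EndOffset positions, so redact() can be exercised with hand-written entities while the AWS side is still being wired up.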

The Privacy Shield software will be a single binary that organisations can deploy within their existing infrastructure. I’ll provide documentation on how to set up the software with common web servers such as IIS and Caddy.

Text Token Reduction

Amazon Comprehend charges by the number of characters in the submitted text, with a minimum charge of 300 characters per request. If the text supplied to Privacy Shield is shorter than this minimum, compaction cannot lower the cost, so the entire text is sent for analysis.

If the text being analysed exceeds the minimum length, compaction will be applied to reduce the number of text tokens needing to be processed.
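The decision above can be sketched in a few lines of Python. The 300-character minimum comes from the description above; the 100-character billing unit is my assumption about how Comprehend rounds request sizes, and both function names are hypothetical.

```python
import math

UNIT_SIZE = 100     # characters per billable unit (assumption, check current pricing)
MINIMUM_UNITS = 3   # the 300-character minimum charge per request


def billable_units(char_count: int) -> int:
    """Estimate how many units a request of `char_count` characters is billed as."""
    return max(MINIMUM_UNITS, math.ceil(char_count / UNIT_SIZE))


def should_compact(text: str) -> bool:
    """Compaction only pays off once the text is longer than the minimum charge."""
    return len(text) > UNIT_SIZE * MINIMUM_UNITS
```

Comparing billable_units() before and after compaction gives a simple way to measure how much each reduction step actually saves.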

Removing Normal Words

The first step in the text reduction is to remove all common words. To achieve this, we can use a spell checker’s dictionary of common words to remove most words used in the English language.
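A minimal sketch of this step, assuming the dictionary has already been loaded into a set of lowercase words. The tiny COMMON_WORDS set below is only a stand-in for a real spell-checker word list, and remove_common_words is a hypothetical name.

```python
# Stand-in for a spell checker's dictionary (assumption for this example).
COMMON_WORDS = {"the", "patient", "was", "seen", "by", "on", "at", "clinic"}

PUNCTUATION = ".,;:!?()\"'"


def remove_common_words(text: str, common_words: set) -> str:
    """Drop every whitespace-separated token that is a known common word."""
    kept = []
    for token in text.split():
        # Compare without surrounding punctuation and without regard to case.
        word = token.strip(PUNCTUATION).lower()
        if word and word in common_words:
            continue
        kept.append(token)
    return " ".join(kept)
```

One thing to watch with this approach: words that double as names (Mark, Rose, Grace) probably need to be excluded from the dictionary so they still reach Comprehend.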

Removing the Majority of Names

If your organisation, such as a healthcare provider, maintains a database of people, you’ll already have a large cohort of forenames and surnames that can be plugged into the system. Using these provided names, we can then further reduce the amount of text that Comprehend needs to process.
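As a sketch of how a supplied name list could be applied, the hypothetical function below pulls known names out of the text and returns them separately, so they can be redacted locally without being sent to Comprehend. Matching on whole whitespace-separated tokens is a simplifying assumption.

```python
from typing import Iterable, List, Tuple

PUNCTUATION = ".,;:!?()\"'"


def strip_known_names(text: str, known_names: Iterable[str]) -> Tuple[str, List[str]]:
    """Return the text without known names, plus the names that were found."""
    names = {name.lower() for name in known_names}
    remaining, found = [], []
    for token in text.split():
        # Ignore surrounding punctuation and case when comparing.
        bare = token.strip(PUNCTUATION)
        if bare.lower() in names:
            found.append(bare)
        else:
            remaining.append(token)
    return " ".join(remaining), found
```

Holding the names in a set keeps each lookup cheap even when the database contributes hundreds of thousands of forenames and surnames.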

Removing NHS Numbers

For clinical use cases, we can detect each sequence of ten digits up front and run it through the Modulus 11 check-digit algorithm to see whether it is a valid NHS number. If the check passes, we can remove the number from the text that Comprehend needs to process.
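Here is a sketch of the Modulus 11 check in Python. The first nine digits are weighted 10 down to 2, and the tenth digit is the check digit. The function names are hypothetical, and matching only unbroken runs of ten digits (no spaces or dashes) is a simplifying assumption.

```python
import re
from typing import List, Tuple

# Exactly ten digits, not embedded in a longer run of digits.
TEN_DIGITS = re.compile(r"(?<!\d)\d{10}(?!\d)")


def is_valid_nhs_number(candidate: str) -> bool:
    """Apply the Modulus 11 check to a ten-digit string."""
    if len(candidate) != 10 or not (candidate.isascii() and candidate.isdigit()):
        return False
    digits = [int(c) for c in candidate]
    # Weight the first nine digits 10, 9, ..., 2 and sum the products.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        return False  # a check digit of 10 is never issued
    return check == digits[9]


def remove_nhs_numbers(text: str) -> Tuple[str, List[str]]:
    """Strip valid NHS numbers from the text and return them separately."""
    found: List[str] = []

    def _drop(match: "re.Match") -> str:
        if is_valid_nhs_number(match.group()):
            found.append(match.group())
            return ""
        return match.group()

    return TEN_DIGITS.sub(_drop, text), found
```

Ten-digit sequences that fail the check are left in place, so Comprehend still gets the chance to classify them as phone numbers or other identifiers.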

Remembering Results From Comprehend for Future Documents