PII Data Anonymization

Nishtha Jain & Carl Brenssell | Data Science
March 9, 2023

tl;dr

This post explains one of the ways we tackle data privacy at Spoke: how we redact personally identifiable information (PII) from our users’ data before passing it to any in-house models or pre-trained LLMs – in addition to encrypting all user data to the highest available security standard.

Our process involves first anonymizing/pseudonymizing all message data, then generating summaries and outcomes from that redacted data, and finally de-anonymizing the output for the end user’s eyes only. For this specific task we leverage, among other models, Presidio – a data protection and de-identification SDK by Microsoft.

Objectives & User Value

At Spoke, we pragmatically combine different technologies across a wide range of NLP tasks, with a long-term focus on working with resource-efficient, unbiased LLMs and developing task-specific models in-house. Our summarization is currently powered by a combination of fine-tuned pre-trained language models, self-hosted open-source technology, and custom models trained in-house (e.g. for Named Entity Recognition, PII Detection, Data Pseudonymization, Question Identification, and Semantic Search).

We believe that in a space where the core technology is becoming more and more commoditized, it is still possible and crucial to differentiate. In our view, differentiation and user value mainly come from building with a clear focus on data privacy, responsible, human-centred AI, and augmentation instead of automation. Building trust with users will be paramount, and security- and user-experience-enhancing data pre- and post-processing will play a crucial role. (We’ll try to keep it slightly lighter on the hyphens from here on out… 😉)

Spoke’s Slack Summarization App generates powerful summaries for Slack threads using the latest technology in AI and NLP. Spoke does not proactively pull any user-identifiable information (only anonymous Slack user IDs), but people naturally disclose personally identifiable information (PII) in Slack conversations, so it is of the utmost importance to us that we use and store this data responsibly. Therefore, one crucial step we take in preserving data privacy at Spoke is to redact all PII from our user data before passing it (as training data) to in-house models or to pre-trained LLMs such as Luminous or GPT-3.5 to generate summaries.

Redaction needs to happen on two levels. On the one hand, there is explicit PII such as a tagged user, which we can predictably pseudonymize (e.g. “@Carl” → “James”), as sketched below. On the other hand, there are more implicit mentions of e.g. names, emails or links within a message – a more complex task and the topic of this post.
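For the explicit case, a minimal sketch could look like the following – the mention pattern, placeholder pool, and function names are purely illustrative, not our production code. The key idea is that the same user ID always receives the same pseudonym, so the mapping can later be inverted to de-anonymize the summary:

```python
import re

# Illustrative placeholder pool and mention pattern - not our production code.
PLACEHOLDER_NAMES = ["James", "Mary", "Robert", "Patricia", "John"]

def pseudonymize_mentions(message: str, mapping: dict[str, str]) -> str:
    """Replace Slack-style <@USER_ID> mentions with stable placeholder names.

    The same user ID always receives the same pseudonym, so references stay
    consistent across the conversation and the mapping can be inverted later
    to de-anonymize the generated summary for the end user.
    """
    def substitute(match: re.Match) -> str:
        user_id = match.group(1)
        if user_id not in mapping:
            mapping[user_id] = PLACEHOLDER_NAMES[len(mapping) % len(PLACEHOLDER_NAMES)]
        return mapping[user_id]

    return re.sub(r"<@([A-Z0-9]+)>", substitute, message)

mapping: dict[str, str] = {}
print(pseudonymize_mentions("<@U123ABC> can you review the deck?", mapping))
# -> "James can you review the deck?"   mapping == {"U123ABC": "James"}
```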

This process involves identifying various kinds of PII in text-based messages and anonymizing/pseudonymizing them in such a way that the conversation’s context is preserved and valuable summaries can still be generated.

Challenges & Tools

Identifying PII is not a straightforward task. It usually requires a mix of different components, such as Named Entity Recognition, regex patterns, and a set of self-defined business rules to identify additional kinds of sensitive information (see the sketch below).
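As a sketch of what such a regex-based business rule could look like, Presidio lets you register a custom PatternRecognizer alongside its built-in recognizers. The ticket-ID rule here is hypothetical, purely for illustration:

```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Hypothetical business rule (illustrative, not our actual rule set):
# treat internal ticket IDs like "SPK-1234" as sensitive information.
ticket_pattern = Pattern(name="ticket_id", regex=r"\bSPK-\d{3,6}\b", score=0.9)
ticket_recognizer = PatternRecognizer(
    supported_entity="TICKET_ID", patterns=[ticket_pattern]
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(ticket_recognizer)

results = analyzer.analyze(text="Please close SPK-1234 today.", language="en")
print([(r.entity_type, r.start, r.end) for r in results])
# e.g. [("TICKET_ID", 13, 21)]
```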

This post focuses on the first part, Named Entity Recognition (NER), for which we are leveraging the presidio-analyzer and presidio-anonymizer. NER covers identifying entities such as person names, company names, emails, URLs, locations, and phone numbers (cf. Presidio’s list of supported entities); under the hood, Presidio uses a spaCy model to perform NER.
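A minimal sketch of the detection step, assuming Presidio’s default configuration (the example message and the printed scores are illustrative):

```python
# pip install presidio-analyzer presidio-anonymizer
# python -m spacy download en_core_web_lg   (the default model Presidio expects)
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

text = "Hi, I'm Carl - reach me at carl@example.com or +1 212 555 0123."

# Detect PII entities in the message; each result carries the entity
# type, the character span, and a confidence score.
results = analyzer.analyze(text=text, language="en")
for r in results:
    print(r.entity_type, text[r.start:r.end], round(r.score, 2))
# e.g. PERSON Carl 0.85
#      EMAIL_ADDRESS carl@example.com 1.0
#      PHONE_NUMBER +1 212 555 0123 0.75
```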

Once the PII is identified in a given message, the next step is to redact it in such a way that the context of the entire conversation isn’t lost – otherwise it is basically impossible for a language model to generate valuable summaries. We iterated through a few approaches before finding the best solution to this challenge:

First, we simply removed the PII from the conversation – this discarded a lot of the conversation’s context, and the generated summaries were very poor.

As a second approach, we encrypted the PII with a key – this added a lot of extra alphanumeric characters, and in turn tokens, to the text provided to the language model. The ciphertext disturbed the context of the conversation and often inflated the prompt’s total token count (hence also not resource-efficient), as illustrated below.
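To make the bloat concrete, here is a hypothetical illustration using Fernet from the cryptography package (not necessarily the exact scheme we used):

```python
from cryptography.fernet import Fernet

# Hypothetical illustration (not the exact scheme we used): encrypting a
# short name produces a long base64 token that costs the language model
# many extra tokens without carrying any meaning it can use.
key = Fernet.generate_key()
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"Carl").decode()
print(len("Carl"), len(ciphertext))  # 4 vs. 100 characters
```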

Finally, we landed on our current approach: replacing each piece of PII with its entity type as recognised by the tool, as visualized in detail below. This helped not only in preserving the context of the conversation, but also in reducing the overall number of tokens, hence optimizing the efficiency of our summarization. We are separately working on additional approaches that identify confidential data classes, workspace names, specific keywords, etc., together with a set of defined business rules to redact such information.
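A minimal sketch of this final approach (with an illustrative message; Presidio’s default AnonymizerEngine already performs this kind of replacement, substituting each detected span with its entity type):

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Carl shared https://example.com/doc and his email carl@example.com."
results = analyzer.analyze(text=text, language="en")

# The default "replace" operator substitutes each detected span with its
# entity type, which keeps the sentence structure intact for the summarizer.
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)
# e.g. "<PERSON> shared <URL> and his email <EMAIL_ADDRESS>."
```

If a different replacement is needed per entity type, custom OperatorConfig mappings can be passed via the anonymizer’s operators argument.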

Outcomes

As explained above, we decided to implement a solution that redacts PII by replacing the recognised data with its entity type. Below you can see a screenshot of the Spoke.ai Data Sandbox, a (for now) internal frontend to test our in-house models for data pre- and post-processing. We are working on making that sandbox available to our users soon, so they can see in full detail how we mask their data.

[Screenshot from Spoke.ai's Data Sandbox exemplifying how PII data is masked.]

Conclusion

By implementing these measures, we are able to process our users’ data not only efficiently but also responsibly, keeping their privacy as our top priority. We are continuously working on improving this solution, catching additional edge cases, and making sure all PII and sensitive information is redacted. If you have any open questions or feedback, feel free to reach out to us directly via LinkedIn or email.