How to maximize data utility while protecting data privacy

May 10, 2022 | By Kelly Grayson and Ananya Wanchoo
Organizations are finding ways to respect data privacy while gaining the insights they need to innovate; the value extracted from data in this way is known as data utility. Historically, organizations had limited options for using data without exposing it. Privacy-enhancing technologies (PETs) and other emerging technologies, such as distributed artificial intelligence, now give companies new options to improve data utility while protecting individual privacy.
The term PET refers to any technical method that protects the privacy or confidentiality of personal and sensitive information. Companies can combine PETs with other technologies to build best-in-class solutions while protecting personal information. These solutions can also reduce the risk of data compromise in the face of increasingly sophisticated cyberattacks. Below we explore five approaches that companies can use: anonymization, differential privacy, synthetic data, homomorphic encryption and artificial intelligence.
In general, anonymization is a powerful tool, since it allows organizations to innovate with data while providing a high level of privacy protection: the data is no longer personally identifiable. However, anonymization methodologies must meet increasingly strict requirements under privacy laws as well as high expectations from privacy regulators. Techniques such as k-anonymity, which generalizes records so that each individual is indistinguishable from at least k - 1 others, can lower the risk of re-identification while maintaining data utility.
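To make the idea concrete, here is a minimal, self-contained sketch of k-anonymity through generalization. The dataset, the quasi-identifiers (age and ZIP code), and the generalization rules (decade age buckets, truncated ZIPs) are illustrative assumptions, not part of any specific product:

```python
from collections import Counter

# Toy records: (age, zip_code) act as quasi-identifiers.
records = [
    (23, "10001"), (27, "10002"), (26, "10001"),
    (41, "10451"), (44, "10452"), (45, "10451"),
]

def generalize(record):
    """Coarsen quasi-identifiers: bucket age by decade, truncate the ZIP code."""
    age, zip_code = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zip_code[:3] + "**")

def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination appears at least k times."""
    return min(Counter(rows).values()) >= k

generalized = [generalize(r) for r in records]
print(is_k_anonymous(records, 3))      # False: the raw rows are all unique
print(is_k_anonymous(generalized, 3))  # True: each group now has >= 3 rows
```

After generalization, each record shares its (age range, ZIP prefix) combination with at least two others, so no individual stands out on those attributes alone.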
As an alternative or supplement to existing anonymization techniques, differential privacy has been gaining momentum as a tool for protecting privacy while enabling analysis. Differential privacy allows analysts to work with aggregated results while protecting the personal information in the underlying records. It works by adding statistical “noise” to results in a way that obscures the contribution of any one record while preserving the accuracy of aggregate statistics. The more noise, the greater the privacy protection, at some cost to accuracy; this trade-off is tuned with a parameter called epsilon, where a smaller epsilon means more noise and stronger privacy. Additionally, each data user is assigned a “privacy budget” that caps the total epsilon spent across queries, limiting the risk of re-identification of an individual. After the budget is exhausted, further querying is refused. Together, these mechanisms help keep individual records private against both direct querying and cyberattacks aimed at re-identifying data.
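The noise-and-budget mechanics above can be sketched in a few lines. This is an illustrative toy using the classic Laplace mechanism for a counting query (sensitivity 1); the dataset, the per-query epsilon, and the budget values are all assumptions for the example, not recommendations:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Add Laplace(0, sensitivity/epsilon) noise; smaller epsilon -> more noise."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    # Inverse-CDF sampling from the Laplace distribution.
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

ages = [23, 27, 26, 41, 44, 45]
true_count = sum(1 for a in ages if a > 30)   # counting query: sensitivity is 1

# A privacy budget caps the total epsilon an analyst may spend.
budget = 1.0
epsilon_per_query = 0.5
for _ in range(2):                            # two queries exhaust the budget
    budget -= epsilon_per_query
    noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=epsilon_per_query)
    print(round(noisy, 2))                    # close to 3, but randomized
# budget is now 0: a real system would refuse further queries here
```

Each released answer hovers around the true count of 3, but an attacker cannot tell from any single noisy answer whether one particular person is in the dataset.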
Rather than sharing an original dataset, companies are also increasingly turning to synthetic data generation, which uses varying mathematical approaches, including artificial intelligence, to create a statistically similar dataset. Synthetic data has close to the same statistical properties as the original without sharing specific values from the original dataset. Thus, the beauty of synthetic data is that it can mimic the values and attributes of the original dataset, and its underlying properties and utility, without including any of the underlying personal data. Synthetic data has advanced beyond model training and can now be used in production applications because of its increased utility.
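A deliberately simple sketch of the idea: fit a statistical model to an original column and sample fresh values from it. Production systems use far richer generators (for example, deep generative models), and the salary figures here are invented for illustration; only the fit-then-sample pattern is the point:

```python
import random
import statistics

original = [52_000, 61_000, 48_500, 75_000, 58_200, 66_300]  # e.g. salaries

# Fit a simple Gaussian to the original column...
mu = statistics.mean(original)
sigma = statistics.stdev(original)

# ...then sample brand-new values from the fitted distribution.
random.seed(0)  # seeded only so the example is reproducible
synthetic = [round(random.gauss(mu, sigma), 2) for _ in range(len(original))]

# The synthetic column approximates the original distribution without
# reproducing any individual's actual value.
print(round(statistics.mean(synthetic), 2))
```

A single Gaussian obviously cannot capture correlations between columns; that is exactly why real synthetic-data tools model the joint distribution, but the privacy rationale is the same.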
Another exciting new development in the PET world is homomorphic encryption. Previously, encrypted data had to be decrypted before it could be processed, forfeiting its protection during computation. In contrast, homomorphic encryption enables computation directly on encrypted data: analytics can be performed on the ciphertext to produce a usable result without ever exposing the underlying data during processing. If a third party processing the data is breached, no unencrypted data is exposed. Today, the biggest limitation is processing power, as homomorphic encryption is computationally intensive. However, companies are already leveraging the technology to create commercially viable applications.
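The core property, computing on ciphertexts, can be shown with a toy implementation of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of their plaintexts. This sketch is for intuition only; the primes are absurdly small and offer no real security:

```python
import math
import random

# Toy Paillier key material (p and q are far too small for real use).
p, q = 47, 59
n = p * q
n_sq = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)  # modular inverse of L(g^lam)

def encrypt(m):
    """Paillier encryption: c = g^m * r^n mod n^2, with random r coprime to n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    """Paillier decryption: m = L(c^lam mod n^2) * mu mod n, L(x) = (x-1)//n."""
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

c1, c2 = encrypt(15), encrypt(27)
# Multiplying ciphertexts adds the plaintexts: nothing is decrypted
# during the computation itself.
print(decrypt((c1 * c2) % n_sq))  # -> 42
```

A server holding only c1 and c2 can produce an encryption of 15 + 27 without ever seeing 15, 27, or the private key, which is precisely the property that makes outsourced analytics on encrypted data possible.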
Historically, training AI has relied on a two-stage process: (1) collecting data from multiple sources and storing it at a central location, and then (2) the learning itself. If the proper protections aren’t in place, this process risks exposing any personal information used to train the AI at both stages. Newer methods avoid the central storage-and-training model, addressing this exposure concern. One such method is on-device AI, which moves intelligence to smart devices (phones, automobiles, watches, speakers, etc.) at the edge of a given network. By providing AI functionality directly on-device, the risk to personal data is limited, since information is never extracted from the user to a location outside their control. A related method is to train AI on the devices themselves, known as federated learning. This approach allows edge devices to learn collaboratively within a shared framework while keeping all the training data on the device, providing additional confidence to data subjects that their raw data will not be transmitted.
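The federated pattern can be sketched with a deliberately tiny example in the style of federated averaging: each "device" runs gradient descent on its own local data to fit a one-parameter model, and a coordinating server averages only the resulting parameters. The model (y = w·x), the data, and the learning-rate and round counts are all assumptions chosen to keep the sketch short:

```python
# Minimal federated-averaging sketch: each "device" fits a slope locally;
# only the model parameter -- never the raw data -- leaves the device.
def local_update(w, data, lr=0.01, steps=50):
    """One device's gradient descent on y ~ w * x, using only its local data."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Raw data stays on each device (the true relationship here is y = 3x).
device_data = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
    [(5.0, 15.0)],
]

w_global = 0.0
for _ in range(5):                                # a few federated rounds
    local_ws = [local_update(w_global, d) for d in device_data]
    w_global = sum(local_ws) / len(local_ws)      # server averages parameters

print(round(w_global, 2))  # converges toward 3.0
```

The server only ever sees the three per-device parameter values, yet the shared model still learns the relationship present across all devices; real deployments add further protections, such as secure aggregation, so that even individual parameter updates are not exposed.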
Moving ahead: What the future holds
As data-driven innovations continue to grow, organizations will be increasingly expected to deliver enhanced privacy alongside their new technologies. Indeed, whether it is regulators calling for safe and secure processing of personal data, organizations prioritizing data protection on principle, or individuals themselves demanding that their personal information be handled in ways that are ethical, compliant, and to their benefit, PETs are here to stay. New privacy threats will continue to surface, requiring new solutions. Meanwhile, organizations will continue to innovate, leveraging a combination of PET solutions and other emerging technologies to respect the confidentiality and protection of personal data while also driving product innovation.
With the continued proliferation of data, how do we safeguard the privacy of individuals and keep them at the center of our product and solution design? Learn more in the latest edition of Mastercard Foundry's thought leadership series Signals.