Privacy Enhancing Technologies in Practice
The idea of decisions being driven by data is not new: it has been common practice in fields such as science, journalism, and marketing for decades. What is new is the dimension that data-driven decision-making has reached in the era of big data. With data on behaviour, preferences, location, and time being collected ubiquitously and linked together, there is enormous potential for a new data-driven society. Data protection regulations such as the GDPR have been put in place to ensure this potential is explored responsibly.
While some industries resisted the implementation of strict data protection regulations, seeing them as obstacles to progress, others saw them as a driver for market change. These market changes stem from scientific research: many Privacy Enhancing Technologies (PETs) have been a topic of research for a long time, but only recently have they matured into products with practical implementations.
In the Netherlands, several companies offer PET services, and TNO spoke to four of them to find out about the latest developments.
Virtual data lake
Roseman Labs’ main product is a virtual data lake that enables organisations to collaborate without exposing sensitive data to each other. Niek Bouman, CTO and co-founder of Roseman Labs, explains that their product can be seen as the combination of a relational database, which allows for common operations such as joins, filtering, and data analytics, with a range of Machine Learning models. The virtual data lake has the special property that all its content exists in “secret-shared” form: data is never concentrated in one place, but is instead encrypted and distributed over the servers that run the system. The secret-shared data is processed in a Secure Multi-Party Computation (MPC) protocol, in which the involved parties authorise beforehand which analyses may be run on the combined data. Data owners thus remain in control over how their data is processed, which helps to achieve GDPR-driven goals like data minimisation and proportionality.
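To illustrate the principle, here is a minimal sketch of additive secret sharing, the building block behind such “secret-shared” storage (a simplified illustration, not Roseman Labs’ implementation): each value is split into random shares that only sum to the original modulo a large prime, so no single server learns anything on its own, yet the servers can still compute sums on the shares.

```python
import secrets

PRIME = 2**61 - 1  # arbitrary large prime modulus for this sketch

def share(value, n_servers=3):
    """Split a value into additive shares that sum to it modulo PRIME."""
    parts = [secrets.randbelow(PRIME) for _ in range(n_servers - 1)]
    parts.append((value - sum(parts)) % PRIME)
    return parts

def reconstruct(parts):
    return sum(parts) % PRIME

# Each server holds one share of each value and can add its shares
# locally; the sums reconstruct to the sum of the originals, while
# no single server ever sees a plaintext value.
a_shares, b_shares = share(120), share(34)
sum_shares = [(x + y) % PRIME for x, y in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == 154
```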
With multiple stories of successful applications of their product, Roderick Rodenburg, CEO and co-founder, highlights the case of the National Cyber Security Centre (NCSC, part of the Dutch Ministry of Justice and Security), where MPC is used in SecureNed, a platform that enables organisations to share sensitive cyber-threat intelligence anonymously. One of the main benefits of using MPC here is that it removes the need to trust the other parties (NCSC and participating organisations) to use data solely for the agreed purpose: control over data usage is built into the MPC solution, and only authorised computations can be executed. As a bonus, this reduces the complexity of legal agreements between collaborating organisations, since purpose limitation is technically enforced.
Although Roseman Labs is actively addressing open technical challenges to better support the needs of their customers, such as adding new Machine Learning models and making better use of CPU parallelism, they also consider organisational challenges highly relevant. MPC brings a paradigm shift, in which organisations can run computations on encrypted data that is not directly controlled by them, within a legal framework that is new to them. Communication and implementation in this new paradigm are challenging, and Roseman Labs is working on helping customers overcome these challenges.
Private computation and governance
"Share insights, not data” is the motto of Linksight, a 2021 TNO spin-off company. Martine van de Gaar, Linksight’s CEO and co-founder explains that their focus is on facilitating privacy-by-design data collaborations. Organisations can install a Linksight “data station” and connect to other organisations running similar data stations. MPC protocols run between the data stations like trains between train stations, using (fully and partially) homomorphic encryption. This allows for computations on data while the data itself stays fully encrypted and thus unreadable.
Before any joint computation, however, the involved organisations select and agree on a set of rules that governs all actions within a data collaboration. The data stations check each computation request for compliance with the agreed-upon rules and log everything to a shared audit log. This allows each organisation to stay in control and avoid unwanted disclosure. For example, when an average is computed over too small a population, the data station detects this without learning the actual content and aborts before a result is revealed.
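As a hedged sketch of such a rule check (hypothetical names and threshold; Linksight’s rule engine is of course more elaborate), a data station could enforce a minimum group size before any aggregate is revealed:

```python
MIN_GROUP_SIZE = 15  # hypothetical threshold from the agreed collaboration rules

def release_average(total, count, audit_log):
    """Release an average only if the group is large enough; otherwise
    abort before any result is revealed, and log the decision."""
    if count < MIN_GROUP_SIZE:
        audit_log.append(f"denied: group of {count} below minimum of {MIN_GROUP_SIZE}")
        raise PermissionError("computation aborted by data station")
    audit_log.append(f"released: average over group of {count}")
    return total / count

log = []
print(release_average(12400.0, 31, log))  # 400.0
print(log)
```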
Pieter Verhagen, CCO and co-founder, believes this approach fits well with the healthcare sector, which is fragmented and has data collaborations that evolve constantly. As an example of their work, Pieter mentions the elderly care sector, where healthcare organisations cooperate with municipalities, “zorgkantoren” (regional care offices), and healthcare insurers.
The sector faces several challenges due to an ageing population and staff shortages. Using Linksight software, care organisations in the Delft, Zeeland, and Achterhoek regions are currently setting up regional data collaborations, in which they create real-time insights based on combined data. This helps them to understand their regional care ecosystem at a system level, and to design and monitor the cooperative interventions needed to address these challenges.
Next to helping its clients implement this technology, Linksight sees the lack of awareness of the potential of PETs as a main challenge. The company believes that by 2023 no organisation should be involved in data sharing that is not private-by-design. Linksight remarks that there are still situations where Excel files are exchanged by email, or, perhaps worse, where impactful data collaborations fail due to privacy concerns that could have been addressed with PETs. Linksight is on a mission to change that, and the first step is showcasing best practices.
Synthetic data generation
Synthetic data generation is a solution for organisations that work with sensitive data they cannot share. Dutch company Syntho provides an engine that is optimised to generate a variety of data types (such as financial or healthcare-related data) and can generate data in any language and alphabet (with customers in the EU, the US, and Japan), as well as datasets with multiple tables, time series, and geographical data.
The engine starts from original data and, using Artificial Intelligence technology, generates synthetic datasets that maintain the same relational and statistical properties as the original. Syntho does not process sensitive data itself; instead, it offers a self-service platform that can be deployed in the client’s own safe and secure environment, so that clients can generate synthetic data by themselves.
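A minimal sketch of the underlying idea, using a deliberately simple Gaussian model rather than the AI-based techniques an engine like Syntho’s employs: fit the joint distribution of the original data, then sample fresh records that preserve its means and correlations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an original, sensitive numeric dataset (e.g. age, monthly cost).
original = rng.multivariate_normal([40, 3000], [[90, 800], [800, 250000]], size=1000)

# "Fit" a very simple generative model: the empirical mean and covariance.
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# Sample synthetic records that preserve these statistical properties
# without reproducing any original record.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(np.corrcoef(original, rowvar=False)[0, 1])   # correlation in original data
print(np.corrcoef(synthetic, rowvar=False)[0, 1])  # closely matched in synthetic
```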
Synthetic data generation can be seen as an enabler of privacy-preserving analysis: compared with other PETs, it takes place at an earlier stage of a data collaboration, allowing organisations to conduct a range of exploratory analyses, test hypotheses, and work as if with real data, before technologies like MPC or Federated Machine Learning (FML) are applied. Syntho works with a diverse range of clients in domains such as healthcare, pharmaceuticals, and banking. What these sectors have in common is the need and ambition to innovate, for which they need sensitive data. Wim Kees Janssen, CEO and founder of Syntho, highlights the role synthetic data plays in data minimisation, in line with the recommendations of the AP, the Dutch data protection authority.
Syntho invests in knowledge sharing, for instance by hosting webinars and participating in conferences. Their ambition is to expand the reach of synthetic data beyond Europe, helping organisations comply with privacy regulations while innovating and staying competitive with players in countries with less stringent data protection. They emphasise that while organisations still see the field as new and experimental, synthetic data is already operational; the challenges lie in making it scalable and more widely adopted.
Federated Machine Learning
TNO also talked to BranchKey, a provider of FML technology and services. Founder and head of product, Diarmuid Kelly, agreed to provide us with a description of their solution.
FML is a method for the distributed training of machine learning models: it brings algorithms to the data, without any need to transfer data off-site. Models trained under FML are trained on location and then sent for aggregation with models from other distributed locations or organisations. During aggregation, a democratic model is constructed from the distributed, weighted contributions, which ensures that no individual model is ever left exposed.
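The weighted aggregation step can be sketched as follows (a minimal FedAvg-style illustration with hypothetical sites, not BranchKey’s implementation): each site contributes only its model parameters, weighted by the size of its local dataset.

```python
import numpy as np

def federated_average(client_models, client_sizes):
    """Build the joint model as the average of client parameters,
    weighted by the size of each client's local dataset (FedAvg-style)."""
    total = sum(client_sizes)
    return sum(m * (s / total) for m, s in zip(client_models, client_sizes))

# Three hypothetical sites train locally and share only their parameters.
site_models = [np.array([0.9, -1.2]), np.array([1.1, -0.8]), np.array([1.0, -1.0])]
site_sizes = [1000, 4000, 5000]

global_model = federated_average(site_models, site_sizes)
print(global_model)  # aggregated model; no raw data ever left a site
```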
FML is applicable across datasets belonging to separate organisations that have a common goal but do not wish to, or cannot, share their raw data. It is equally applicable inside organisations where the usage of data is restricted due to logistical, privacy, or access issues.
The technology behind BranchKey’s solution combines FML with differential privacy (and, in an experimental phase, MPC), whereby noise is added to each individual model before it is transferred, in order to enhance privacy. In this setting, data is left in place, and models are adjusted to variations in local data with added noise. Once a model is aggregated with several others, the individual data points used for training are hard to trace back from the final state of the model. BranchKey has experience in projects for financial transaction monitoring across organisations and for federating energy management systems for building assets (in collaboration with TNO), and has most recently started working with industrial marine equipment manufacturers on predictive maintenance and anomaly detection in the Netherlands and Germany. Collaboration while preserving data sovereignty is crucial to the functioning of an FML system on a European scale; trust management and data sharing are cumbersome activities that FML helps to streamline.
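A minimal sketch of this noise-adding step (illustrative parameters, not BranchKey’s configuration): clip each local update to bound its influence, then add Gaussian noise before it leaves the site.

```python
import numpy as np

rng = np.random.default_rng()

def privatise_update(update, clip_norm=1.0, noise_scale=0.8):
    """Clip a local model update and add Gaussian noise before it leaves
    the site (the Gaussian mechanism from differential privacy; the
    parameters here are illustrative, not calibrated to a formal budget)."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))  # bound influence
    return clipped + rng.normal(0.0, noise_scale * clip_norm, size=update.shape)

local_update = np.array([0.4, -0.7, 0.2])
print(privatise_update(local_update))  # the noisy update that is transferred
```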
Learn more about PETs
TNO has been conducting research on PETs and has been a partner in this market change from the beginning. We look forward to continuing to contribute to a more secure, data-driven society. TNO operates as a consortium builder, forming new national and international ecosystems with organisations that benefit from data collaboration. We focus on technical challenges, such as improving computational scalability, and we advise on the application and combination of PETs to make optimal use of each technology’s strengths.
PETs have evolved in the last few years and are now ready to be used in daily practice. Encountering data sharing challenges? Reach out to TNO for more information, or contact Roseman Labs, Linksight, Syntho, or BranchKey for their market-ready products and services.