Synthetic Data: A New Step Forward in Data Availability at Lifelines in Collaboration with Syntho


- by Saskia

Recently at Lifelines, we have been working on an innovation that makes our data more accessible for research while strengthening the privacy of our participants. Using synthetic data generation tooling from Syntho, we can now generate a dataset that closely preserves the statistical properties of the collected data without containing the actual records of our participants. The technique takes the real data as a starting point and learns its patterns in order to generate completely new, artificial data.

Synthetic data generation is a Privacy Enhancing Technique (PET), one of a class of methods that aim to protect and enhance the privacy of individuals. Such techniques help minimize the amount of personal information exposed and reduce the risk of privacy violations. For each data request from an investigator, we can generate synthetic data using tooling from Syntho, so each investigator receives their own unique synthetic dataset. In addition, we use differential privacy and k-anonymity, which allow us to fine-tune the required privacy level.
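To make the idea of k-anonymity a little more concrete, here is a minimal sketch in Python. It is only an illustration and not how the Syntho tooling works internally. The function name `k_anonymity`, the column names, and the toy records are all made up for this example. A dataset is k-anonymous when every combination of quasi-identifiers (attributes that could help single someone out) is shared by at least k records.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values for the given quasi-identifier columns."""
    if not records:
        return 0
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical toy records; the column names are assumptions for this sketch.
records = [
    {"age_band": "30-39", "sex": "F", "region": "North"},
    {"age_band": "30-39", "sex": "F", "region": "North"},
    {"age_band": "40-49", "sex": "M", "region": "North"},
    {"age_band": "40-49", "sex": "M", "region": "North"},
    {"age_band": "40-49", "sex": "M", "region": "North"},
]

k = k_anonymity(records, ["age_band", "sex", "region"])
print(k)  # 2: the smallest group (women aged 30-39) has two records
```

A higher k means each record hides in a larger group, at the cost of coarser data. That trade-off is what makes the privacy level adjustable.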

We evaluate the generated synthetic data on three properties: fidelity, utility, and privacy. These outcomes tell us how statistically similar the synthetic data is to the real data, whether relationships between variables are preserved (men should not suddenly be pregnant, for example), and whether the data can still be traced back to real participants. We base this on hard numbers as well as visualizations, as shown in the figure.
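As a rough illustration of what such checks could look like, the sketch below computes one simple number for each of the three questions: how far apart the distributions are, whether an impossible combination occurs, and whether a synthetic record is an exact copy of a real one. These are simplified stand-ins chosen for this example, not the metrics used by Lifelines or Syntho. All function names and the toy rows are hypothetical.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Statistical similarity: 0.0 means identical category shares, 1.0 means no overlap."""
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real_values) - synth_counts[c] / len(synthetic_values))
        for c in categories
    )


def count_rule_violations(rows, rule):
    """Relationships: count rows that break a plausibility rule."""
    return sum(1 for row in rows if not rule(row))


def exact_match_rate(real_rows, synthetic_rows):
    """Traceability: share of synthetic rows identical to some real row."""
    real_set = {tuple(sorted(row.items())) for row in real_rows}
    matches = sum(
        1 for row in synthetic_rows if tuple(sorted(row.items())) in real_set
    )
    return matches / len(synthetic_rows)


# Hypothetical toy data, invented for this sketch.
real = [
    {"sex": "F", "pregnant": True, "age": 31},
    {"sex": "F", "pregnant": False, "age": 45},
    {"sex": "M", "pregnant": False, "age": 52},
    {"sex": "M", "pregnant": False, "age": 38},
]
synthetic = [
    {"sex": "F", "pregnant": True, "age": 29},
    {"sex": "F", "pregnant": False, "age": 47},
    {"sex": "M", "pregnant": False, "age": 52},
    {"sex": "M", "pregnant": True, "age": 40},
]


def no_pregnant_men(row):
    return not (row["sex"] == "M" and row["pregnant"])


tvd_sex = total_variation_distance(
    [row["sex"] for row in real], [row["sex"] for row in synthetic]
)
violations = count_rule_violations(synthetic, no_pregnant_men)
match_rate = exact_match_rate(real, synthetic)

print(tvd_sex)     # 0.0: the share of men and women is the same in both sets
print(violations)  # 1: one synthetic man is marked as pregnant
print(match_rate)  # 0.25: one of four synthetic rows copies a real row exactly
```

In this toy case the sex distribution matches perfectly, yet the other two checks each flag a problem. This is why the three properties are assessed separately.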

Together with other experts and pioneers, we have come a long way. With the help of our partner Syntho, we successfully carried out the first explorations of the possibilities for data synthesis at Lifelines. Drawing on their extensive knowledge of this technique, we worked together on the first synthetic datasets.

Having successfully completed the initial exploratory phase, Lifelines will continue with the further deployment and adoption of synthetic data. From now on, researchers and other stakeholders can work with synthetic Lifelines data. Are you interested, or are you a researcher who would like to learn more about what synthetic data can do for your research? Let us know and we will be happy to help!