Tabular Synthetic Data Generation using CTGAN


In this post, we will talk about generating synthetic data from tabular data using generative adversarial networks (GANs). We will use the default implementation of the CTGAN [1] model.


Introduction

In the last post on GANs, we saw how to generate synthetic data from the Synthea dataset. Here's a link to that post for a refresher:

https://www.maskaravivek.com/post/gan-synthetic-data-generation/

As in the last post, we will be working with the publicly available Synthea dataset.

https://synthetichealth.github.io/synthea/

In this post, we will work with the patients.csv file and use only its continuous and categorical fields. We will remove other fields, such as name and email ID, which contain many unique values and are therefore difficult to learn.
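To make the idea of "too many unique values" concrete, here is a minimal sketch of one possible heuristic for deciding which columns to keep. It is not necessarily how the columns are chosen later in this post. The sample rows, the column names, the `select_learnable_columns` helper, and the `max_unique_ratio` threshold are all made up for illustration; the real patients.csv has different columns and far more rows. The sketch uses only the standard library so that it runs anywhere.

```python
import csv
import io

# Hypothetical sample data, NOT the real patients.csv schema.
SAMPLE_CSV = """name,email,gender,race,age
Alice Smith,alice@example.com,F,white,34
Bob Jones,bob@example.com,M,black,51
Carol Lee,carol@example.com,F,asian,34
Dan Wu,dan@example.com,M,asian,47
Eve Diaz,eve@example.com,F,white,51
Finn Ray,finn@example.com,M,white,29
"""


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def select_learnable_columns(rows, max_unique_ratio=0.9):
    """Keep numeric columns and text columns with few distinct values.

    max_unique_ratio is an assumed threshold: a text column is dropped
    when (distinct values / number of rows) reaches or exceeds it.
    """
    kept = []
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        if all(is_number(v) for v in values):
            kept.append(column)  # treat as a continuous field
            continue
        unique_ratio = len(set(values)) / len(values)
        if unique_ratio < max_unique_ratio:
            kept.append(column)  # treat as a categorical field
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept_columns = select_learnable_columns(rows)

# Project every row onto the columns that survived the filter.
filtered_rows = [{c: row[c] for c in kept_columns} for row in rows]
print(kept_columns)
```

On this toy data, name and email are unique in every row and get dropped, while gender, race, and age are kept. On a real dataset the threshold would need tuning, and picking the columns by hand is often simpler.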

Data Preprocessing

First, download the publicly available Synthea dataset and unzip it.