Tabular Synthetic Data Generation using CTGAN
In this post we will talk about generating synthetic data from tabular data using generative adversarial networks (GANs). We will use the default implementation of the CTGAN model.
In the last post on GANs we saw how to generate synthetic data from the Synthea dataset. Here’s a link to the post for a refresher:
As in the last post, we will work with the publicly available Synthea dataset.
Here we use only the patients.csv file, keeping its continuous and categorical fields. We remove the other fields, such as name and email ID, which contain many unique values and would therefore be difficult to learn.
First, download the publicly available Synthea dataset and unzip it.
We will use the default implementation of CTGAN, which is available here.
To use CTGAN, install it with pip. We will also install the table_evaluator library, which will help us compare the results with the original data.
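As a quick sanity check before going further, the sketch below tests whether both packages can be imported. The import names ctgan and table_evaluator, and the pip command in the comment, are assumptions based on how these packages are usually published.

```python
import importlib.util

# Import names of the two packages used in this post (assumed). They are
# usually installed with:  pip install ctgan table_evaluator
REQUIRED = ["ctgan", "table_evaluator"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```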
Remove unnecessary columns and encode all data
Next, we read the data into a dataframe and drop the unnecessary columns.
Index(['MARITAL', 'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'CITY', 'STATE', 'COUNTY', 'ZIP', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'], dtype='object')
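A minimal sketch of this step with pandas is shown below. The helper name load_patients is hypothetical, and the two-row CSV is made-up stand-in data so that the example runs without the download; the list of kept columns is taken from the Index output above.

```python
import io

import pandas as pd

# Columns kept in the post (copied from the Index output).
KEEP_COLUMNS = [
    "MARITAL", "RACE", "ETHNICITY", "GENDER", "BIRTHPLACE", "CITY",
    "STATE", "COUNTY", "ZIP", "HEALTHCARE_EXPENSES", "HEALTHCARE_COVERAGE",
]


def load_patients(source):
    """Read a patients CSV and drop every column not listed in KEEP_COLUMNS."""
    df = pd.read_csv(source)
    unwanted = [col for col in df.columns if col not in KEEP_COLUMNS]
    return df.drop(columns=unwanted)


# Made-up stand-in for patients.csv. Id and FIRST are examples of
# high-cardinality fields that get dropped.
toy_csv = io.StringIO(
    "Id,FIRST,MARITAL,RACE,ETHNICITY,GENDER,BIRTHPLACE,CITY,STATE,COUNTY,ZIP,"
    "HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE\n"
    "1,Ann,S,white,nonhispanic,F,Boston,Boston,Massachusetts,Suffolk,2108,1000.5,200.0\n"
    "2,Bob,M,asian,nonhispanic,M,Quincy,Quincy,Massachusetts,Norfolk,2169,2500.0,50.25\n"
)
data = load_patients(toy_csv)
print(data.columns)
```

With the real dataset, pass the path of the unzipped patients.csv file instead of the in-memory stand-in.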
Next, we define a list with column names for categorical variables. This list will be passed to the model so that the model can decide how to process these fields.
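One way to build that list is to start from the retained columns and exclude the continuous ones. The sketch below assumes that the two HEALTHCARE_* columns are the only continuous fields and that ZIP is treated as categorical; the post does not state either explicitly.

```python
# Columns retained after dropping the unnecessary ones.
COLUMNS = [
    "MARITAL", "RACE", "ETHNICITY", "GENDER", "BIRTHPLACE", "CITY",
    "STATE", "COUNTY", "ZIP", "HEALTHCARE_EXPENSES", "HEALTHCARE_COVERAGE",
]

# Assumption: only the two expense/coverage fields are continuous.
CONTINUOUS = ["HEALTHCARE_EXPENSES", "HEALTHCARE_COVERAGE"]

# Everything else is passed to the model as a categorical (discrete) column.
categorical_features = [col for col in COLUMNS if col not in CONTINUOUS]
print(categorical_features)
```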
Training the model
Next, we simply define an instance of CTGANSynthesizer and call the fit method with the dataframe and the list of categorical variables.
We train the model for only 300 epochs, as the discriminator and generator losses become quite low after that many epochs.
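The training step might look roughly like the sketch below. The wrapper train_ctgan is hypothetical. It assumes a ctgan release in which the number of epochs is a constructor argument and in which the class is exported as CTGAN, with a fallback to the older CTGANSynthesizer name used in this post; very old releases took epochs in fit instead, so check the version you have installed.

```python
def train_ctgan(df, discrete_columns, epochs=300):
    """Fit a CTGAN model on df, treating discrete_columns as categorical."""
    # Imported lazily so this function can be defined without ctgan installed.
    try:
        from ctgan import CTGAN as Synthesizer  # name in recent releases (assumed)
    except ImportError:
        from ctgan import CTGANSynthesizer as Synthesizer  # older name, as in the post

    model = Synthesizer(epochs=epochs)  # assumption: epochs is a constructor argument
    model.fit(df, discrete_columns)
    return model


# Usage, given the dataframe and the list of categorical columns:
# model = train_ctgan(data, categorical_features, epochs=300)
```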
Next, we simply call the model’s sample function to generate samples from the learned model. In this example we generate 1000 samples.
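As a small sketch, sampling can be wrapped in a helper like the one below. The name generate_samples is hypothetical; the only thing it relies on is that the fitted model exposes a sample(n) method, as described above.

```python
def generate_samples(model, n=1000):
    """Draw n synthetic rows from a fitted model that exposes sample(n)."""
    return model.sample(n)


# Usage with a fitted CTGAN model (the variable name `model` is an assumption):
# samples = generate_samples(model, 1000)
# print(samples.head())
```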
  MARITAL    RACE  ...  HEALTHCARE_EXPENSES  HEALTHCARE_COVERAGE
0       S   asian  ...         7.331230e+05          8940.917593
1     NaN   white  ...         1.540945e+06          3099.605568
2     NaN   asian  ...         1.517647e+06         11947.241606
3     NaN   white  ...         1.516137e+06         14091.349082
4       S  native  ...         1.534122e+06          5103.408672

[5 rows x 11 columns]
Now let’s do a feature-by-feature comparison between the generated data and the actual data. We will use Python’s table_evaluator library to compare the features.
We call the visual_evaluation method to compare the actual data (data) with the generated data.
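A sketch of that comparison is given below. The wrapper compare_tables is hypothetical; it assumes that table_evaluator exposes a TableEvaluator class taking the real table, the synthetic table and a cat_cols list, and that both tables share the same columns.

```python
def compare_tables(real, fake, cat_cols):
    """Plot real vs. synthetic feature distributions with table_evaluator."""
    # Imported lazily so this function can be defined without the package.
    from table_evaluator import TableEvaluator

    # Print both shapes first to confirm the two tables have the same columns.
    print(real.shape, fake.shape)

    evaluator = TableEvaluator(real, fake, cat_cols=cat_cols)
    evaluator.visual_evaluation()  # draws the comparison plots
    return evaluator


# Usage, where `samples` is an assumed name for the generated dataframe:
# compare_tables(data, samples, categorical_features)
```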
(1171, 11) (1000, 11)
As is apparent from the visualizations, the similarity between the original data and the synthetic data is quite high. The results are encouraging, given that we took a random dataset and applied the default implementation without any tweaks or data preprocessing.
The model can be used in various scenarios where data augmentation is required. It is worthwhile to highlight a few caveats:
- In this dataset we had only categorical and continuous variables, and the results were quite good.
- It would be useful to try it on datasets with date-time values.
- Also, this model cannot handle relational datasets by default. For example, there is no way of specifying primary key and foreign key constraints.
- Moreover, it cannot handle constraints by default. For example, a particular state should belong to a single country, but there is no way of specifying this constraint. The generated dataset can contain new (state, country) combinations that are not present in the original dataset.
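The last caveat can be checked after the fact. The sketch below lists value combinations that appear in the synthetic table but never in the original one. The helper unseen_combinations is hypothetical, and the STATE and COUNTRY columns and their values are made-up toy data, not part of the Synthea file.

```python
import pandas as pd


def unseen_combinations(real, synthetic, cols):
    """Return combinations of cols found in synthetic but absent from real."""
    # Cast to str so that missing values compare equal to each other.
    real_combos = set(real[cols].astype(str).itertuples(index=False, name=None))
    synth_combos = set(synthetic[cols].astype(str).itertuples(index=False, name=None))
    return synth_combos - real_combos


# Toy data: in the real table each state belongs to exactly one country.
real = pd.DataFrame({"STATE": ["Texas", "Ontario"], "COUNTRY": ["USA", "Canada"]})
synthetic = pd.DataFrame(
    {"STATE": ["Texas", "Ontario", "Texas"], "COUNTRY": ["USA", "USA", "USA"]}
)

print(unseen_combinations(real, synthetic, ["STATE", "COUNTRY"]))
# → {('Ontario', 'USA')}
```

Rows with such combinations could then be filtered out, at the cost of slightly changing the sampled distribution.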
There is a framework that mitigates some of the above issues. Check out SDV if you are interested. I will try to write a post about it in the future.
Here’s the link to the Google colab notebook with the complete source code.
 Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019