Synthetic Data Generation

Using Machine Learning Algorithm in Cyber Physical Systems

Funding source: Department of Science and Technology

Nodal Centre: C3iHub, IIT Kanpur

Funding Amount: INR 21,36,980.00

Status: Ongoing (2021-23)

Description:

Privacy laws such as the General Data Protection Regulation (GDPR), and the obligations they impose on organizations, act as a hurdle to the use of data in Cyber Physical Systems, especially when that data contains personal or sensitive personal data. This is where synthetic data generation tools play an important role. Another problem organizations face is that real data may not be available at all. Our proposed solution, SynD, addresses both problems. SynD is an AI-based platform for generating synthetic data: data created so that it retains the statistical properties of the real data while containing no personal or sensitive personal data. This lets organizations use data that they would otherwise be barred from using. The generated data is differentially private.

Thus, no additional privacy obligations arise from using the synthetic data. This also reduces the overhead costs an organization would otherwise incur in anonymizing the data or complying with privacy laws. SynD also helps organizations innovate, since data generated with SynD can be shared freely. Sharing data that contains personal data is subject to numerous restrictions, whereas synthetic data is not subject to these restrictions.
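To make the term "differentially private" concrete, the sketch below shows the textbook Laplace mechanism applied to a mean. It is an illustration only and not SynD's method, which this page does not describe. The names `laplace_noise` and `dp_mean`, the clipping bounds, and the sample ages are all hypothetical, and the sensitivity formula assumes the number of records is public and that neighbouring data sets differ in exactly one record.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng):
    """Return an epsilon-differentially private mean of values clipped to [lower, upper]."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    if not values:
        raise ValueError("values must not be empty")
    # Clipping bounds the influence any single record can have on the result.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 29, 52, 38, 47, 31]  # made-up illustrative values
    rng = random.Random(0)
    print(dp_mean(ages, lower=18, upper=90, epsilon=1.0, rng=rng))
```

A smaller `epsilon` gives a stronger privacy guarantee at the cost of a noisier answer. A synthetic data generator has to make the same trade-off, although across a whole data set rather than a single statistic.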

Research Gaps and What We Are Doing:

Many synthetic data generation tools, such as CTGAN and CTabGAN, have been designed to generate tabular data. Across small, medium and large data sets, these models have limitations in handling mixed data types, non-Gaussian distributions, multi-modal distributions, multi-modal skewed data, highly imbalanced categorical columns, long-tailed or skewed data, outliers, missing values and semantic integrity. Similarly, Time-GAN and Transformer-based GANs are state-of-the-art models for generating time series data, but they still have limitations in handling complex relations across long temporal sequences, complex dependencies in heterogeneous data, and multi-modal time series data. These are the precise problems addressed in this project, which handles tabular and time series data using two proposed novel models, SynD-Lite and SynD-Pro respectively. SynD-Lite handles single-table tabular data of varied nature and size, including personal data and sensitive personal data, and SynD-Pro handles the generation of time series data. Both are AI-based models for generating synthetic data. Further, SynD can be deployed on any cloud or on-premises system with a single click and provides a UI to users over the local network.
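Several of the difficult properties listed above can be measured on a column before a generator is trained. The following is a minimal sketch using only the Python standard library. The helper names `missing_rate`, `skewness` and `imbalance_ratio` are hypothetical and are not part of SynD, CTGAN or any other tool named here, and missing values are assumed to be represented as `None`.

```python
from collections import Counter


def missing_rate(column):
    """Fraction of entries in the column that are None."""
    if not column:
        raise ValueError("column must not be empty")
    return sum(v is None for v in column) / len(column)


def skewness(column):
    """Population skewness of the non-missing numeric values (0.0 for symmetric data)."""
    xs = [v for v in column if v is not None]
    if not xs:
        raise ValueError("column has no numeric values")
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        return 0.0  # a constant column has no skew
    third_moment = sum((x - mean) ** 3 for x in xs) / n
    return third_moment / var ** 1.5


def imbalance_ratio(column):
    """Count of the most frequent category divided by the count of the rarest one."""
    counts = Counter(v for v in column if v is not None)
    if not counts:
        raise ValueError("column has no categories")
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    income = [20, 22, 25, None, 24, 300]       # long right tail, one missing value
    label = ["normal"] * 95 + ["attack"] * 5   # highly imbalanced category
    print(missing_rate(income), skewness(income), imbalance_ratio(label))
```

A strongly positive skewness or a large imbalance ratio flags the kind of column that the paragraph above describes as hard for existing tabular generators.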

Showcasing our findings and results:

State of the art models:

Tabular Data

  • CTGAN
  • CTAB
  • CTAB++

Time-Series Data

  • Time-GAN
  • MTS-TGAN

Publications:

    Parul Yadav, Manish Gaur, Nishat Fatima, Saqib Sarwar, Quantitative and Qualitative Comparative Evaluation of Small, Medium and Large Scale Tabular Synthetic Data Generated using TVAE and CTGAN, TBA

    Patent:

    In Process

    Product:

    Applications:

    Members:

    PI:

    Dr Parul Yadav

    ML Researcher & Assistant Professor
    Department of Computer Science and Engineering
    Institute of Engineering and Technology, Lucknow
    (+91) 9838252188
    [email protected]

    Co-PI:

    Prof Manish Gaur

    PhD, University of Sussex, UK
    Fellow, University of Glasgow, UK
    Pro Vice Chancellor, AKTU Lucknow, India
    [email protected]

    Project Associates:

    Mr. Rahul Madhukar

    Project Associate-II

    Interns:

    Raj Yadav (CSE 2025)
    Vivek Kumar (CSE 2025)
    Gaurav Verma (CSE 2025)

    Contact Us

    DR PARUL YADAV
    ML Researcher & Assistant Professor
    Department of Computer Science and Engineering
    Institute of Engineering and Technology, Lucknow
    (https://ietlucknow.ac.in)
    (+91) 9838252188
    [email protected], [email protected]
    https://www.ietlucknow.ac.in/people/pyadav
    https://www.linkedin.com/in/dr-parul-yadav-3b89b276/