Sections: On Pure Synthetic Data Generation for Databases | Seminar Presentation Online Event | Background on the Topic of this Seminar | Bibliography | Tools | Databases | pgsynthdata Tool | Open Questions | Organisation | APPENDIX |

Seminar Database Systems for Master of Science in Engineering (MSE) at HSR Rapperswil - "the most beautiful campus of Switzerland" (Quicklinks: SeminarDatenbanksysteme, SeminarDatenbanksystemeHS1920, SeminarDatenbanksystemeHS2020)

On Pure Synthetic Data Generation for Databases ^

This seminar (spring semester 2020) offers an introduction to methods for generating synthetic data. It focuses on pure synthetic data generation, on databases, and on some open source tools, using PostgreSQL as an example. Advisor: Prof. Stefan Keller (HSR, Advisor Computer Science and Data Science). Keywords: synthetic data, artificial data, data anonymization, privacy, open data, databases, data engineering, performance tuning, software testing, open source, PostgreSQL. Contact: sfkeller(at)hsr(dot)ch.

Seminar Presentation Online Event ^

Tue, June 9, 2020, 17:00 - 18:00 (CEST/UTC+2) (Timezone Converter).

Program of the event:

  1. Welcome and Introduction - Stefan Keller (full professor at HSR)
  2. Short Overview of Synthetic Data Generation - Raphael Das Gupta (staff member of IFS HSR)
  3. Pure Synthetic Data Generation by the Example of a PostgreSQL- and Python-based Tool - Labian Gashi (MSE master's student in Computer Science)
  4. Q&A and Discussion

This event is organized by Prof. Stefan Keller (HSR) and is coordinated with the agenda of SwissPUG. Many thanks to Digitale Gesellschaft for hosting BigBlueButton.

>> Slides to download (~1.3MB, 3 PDFs) <<

Background on the Topic of this Seminar ^

Synthetic data production is a process that can range from anonymization of real data on the one hand to generation of pure synthetic sample data on the other. Data anonymization, as part of synthetic data generation, can for example be applied to open (government) data publishing or to the training phase in machine learning. Data anonymization can be tackled by syntactic models of anonymity and by differential privacy, and it has become a statistics research field of its own (see e.g. Clifton & Tassa, 2013).

In addition to the aforementioned applications, synthetic data generation methods are increasingly being applied to computer-supported software functionality testing (benchmarking and evaluation) and software load testing (software engineering), as well as to database performance tuning (database engineering).

While data anonymization has become a specialized area, this expertise is not necessarily accessible to data scientists and data engineers. What is clear, however, is that masking or obfuscating too few data points while leaving everything else intact may not protect against de-anonymization. Such an approach is therefore suspected of neither preserving privacy nor fulfilling regulations. To exemplify these doubts, the following observations from a blog post by Mostly.ai are cited:

This deciphering became possible through recent machine learning and big data analytics achievements which challenge privacy. Now, pure synthetic data can be a solution to the mentioned challenges and to the growing need to achieve better software development and optimized database engineering.
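The de-anonymization risk described above can be illustrated with a small Python sketch. It masks only the name column of a few made-up records and then counts how many rows remain unique on the untouched attributes. All records, column choices, and function names are hypothetical and serve only as an illustration; counting group sizes over such "quasi-identifiers" is the basic idea behind syntactic anonymity models such as k-anonymity, not a full privacy analysis.

```python
from collections import Counter

# Hypothetical records: (name, zip_code, birth_year, gender, diagnosis).
# All values are made up for illustration.
records = [
    ("Alice", "8640", 1985, "f", "asthma"),
    ("Bob", "8640", 1985, "m", "diabetes"),
    ("Carol", "8645", 1990, "f", "asthma"),
    ("Dave", "8645", 1990, "m", "flu"),
    ("Erin", "8645", 1990, "f", "flu"),
]


def mask_names(rows):
    """Mask only the name column and leave every other attribute intact."""
    return [("***",) + row[1:] for row in rows]


def quasi_identifier(row):
    """Attributes that stay visible and could be linked to outside knowledge."""
    return row[1:4]  # zip_code, birth_year, gender


masked = mask_names(records)

# Number of rows sharing each combination of quasi-identifier values.
group_sizes = Counter(quasi_identifier(row) for row in masked)

# Rows whose combination is unique remain linkable to a single person.
unique_rows = [row for row in masked if group_sizes[quasi_identifier(row)] == 1]

print(f"{len(unique_rows)} of {len(masked)} masked rows are still unique")
```

In this toy dataset most rows stay unique even though every name is hidden, which is the situation the paragraph above warns about.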

Three types of synthetic data can be distinguished according to Surendra & Mohan (2017):

  1. Hybrid synthetic data: The data is generated using both original and synthetic data.
  2. Partially synthetic data: Only values of the selected sensitive attribute are replaced with synthetic data.
  3. Fully synthetic data: The data is completely artificially generated and doesn't contain original data.

Similarly, a blog post from the company 'Synthesized' (2018) defines the following computer-supported data generation types:

  1. Anonymized data, produced by a 1-to-1 transformation from original data. Examples include noise obfuscation, masking, or encryption.
  2. Artificial data, produced by an explicit probabilistic model via data sampling.
  3. Synthetic data, produced by a model (configuration, rules) which in turn can be learned by statistics from original data.
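The three generation types listed above can be contrasted in a minimal Python sketch using only the standard library. The sample column, the noise range, the value bounds, and the choice of a normal distribution as the "model" are all assumptions made for illustration; neither the cited blog post nor this seminar prescribes them.

```python
import random
import statistics

# Hypothetical "original" column of ages; values are made up for illustration.
original_ages = [23, 35, 31, 47, 52, 29, 41, 38]

rng = random.Random(42)  # fixed seed so that runs are repeatable


def anonymize(values, noise=2):
    """Type 1: 1-to-1 transformation; each output value derives from one input value."""
    return [v + rng.randint(-noise, noise) for v in values]


def artificial(n, low=18, high=65):
    """Type 2: sample from an explicitly chosen model; no original data is involved."""
    return [rng.randint(low, high) for _ in range(n)]


def fit_model(values):
    """Learn a tiny statistical model (mean and standard deviation) from original data."""
    return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}


def synthesize(model, n):
    """Type 3: sample from the learned model; only its statistics are used, not the rows."""
    return [round(rng.gauss(model["mean"], model["stdev"])) for _ in range(n)]


anonymized_ages = anonymize(original_ages)
artificial_ages = artificial(len(original_ages))
model = fit_model(original_ages)
synthetic_ages = synthesize(model, len(original_ages))

print("anonymized:", anonymized_ages)
print("artificial:", artificial_ages)
print("synthetic: ", synthetic_ages)
```

Note that only `anonymize` keeps a row-by-row link to the original values, while `synthesize` needs the original data only once, to fit the model.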

This last classification fits well with what this seminar aims to investigate. To conclude this short seminar description, note again that it focuses on fully synthetic data in database engineering.

Figure 1: Three approaches to synthetic data (image by Synthesized 2018).

Bibliography ^

Main literature for this seminar:

Generator software oriented literature:

Additional literature:

Blog-Posts:

Tools ^

List of selected data synthetization tools and generators (open source, without claim to be complete):

Notes:

Databases ^

To evaluate the generated data, databases (i.e. datasets in the sense of sets of rows) are required whose selected tables have the following properties:

Datasets/databases:

Not suited datasets/databases:

pgsynthdata Tool ^

pgsynthdata - A synthetic data generation tool for PostgreSQL, written in Python, used for 'show-and-tell' and as a proof of concept.

Requirements:

Workflow of the tool:

Design & implementation decisions:

Sprints:

Open Questions ^

What to consider in random generator code:

Configuration:

Rejected issues and alternatives:

Ideas - probably out of scope:

Organisation ^

Schedule (milestones):

  1. First or second week of semester: Kickoff => Mon, February 24, 2020, 16:05 - approx. 17:00; room 8.225, HSR research building (Forschungsgebäude).
  2. Send paper outline to advisor => Week 4 (comments by Stefan)
  3. Send document draft to advisor and buddy (if any) before the mid-term presentation => Week 8+1
  4. Middle of semester: Mid-term presentation (all) => Week 9+1.
  5. Communication with buddy student if available/possible and on demand, peer-review of seminar thesis draft.
  6. Delivery of seminar thesis draft to advisor (two weeks before final presentation). => May 26, 2020
  7. After end of semester: Final Presentation: Tuesday, June 9, 2020, 17:00-18:00, as webinar
  8. Students and advisors give feedback to thesis.
  9. Final Delivery (see Deliverables below) to be sent to advisor => according to agreement, but before grades are issued; tbd. (before the start of the new semester)

General set-up:

Approach of Labian Gashi (proposal):

  1. Collect papers. Start documentation (intro, goal and theory part).
  2. Test and evaluate the existing tools DataSynthesizer and "Random database/dataframe generator" (all from the category "Anonymised Data", possibly "Artificial Data")
  3. Implement tool as category "Synthetic Data". Python 3.6 (for example with numpy, pandas and/or dateutil)
  4. Write documentation

Deliverables:

  1. Mid term drafts (PDF).
  2. Seminar Thesis as PDF and in original format incl. figures.
  3. Other artifacts: Code, extended queries and data, configurations etc.
  4. Presentation slides as PDF and in original format incl. figures.
  5. Other activities:
    1. Communication with buddy or, if not available, with advisor and support staff.
    2. Participation in discussions (mid-term and final presentation).

APPENDIX ^

Misc:

Tools to be investigated (tbc.):

More tools: