Seminar Database Systems for Master of Science in Engineering (MSE) at HSR Rapperswil - "the most beautiful campus of Switzerland" (Quicklinks: SeminarDatenbanksysteme, SeminarDatenbanksystemeHS1920, SeminarDatenbanksystemeHS2020)
This seminar (spring semester 2020) offers an introduction to methods for generating synthetic data. The seminar focuses on pure synthetic data generation, databases, and some open source tools, using PostgreSQL as an example. Advisor: Prof. Stefan Keller (HSR, Advisor Computer Science and Data Science). Keywords: synthetic data, artificial data, data anonymization, privacy, open data, databases, data engineering, performance tuning, software testing, open source, PostgreSQL. Contact: sfkeller(ät)hsr(dot)ch.
Tue, June 9, 2020, 17:00 - 18:00 (CEST/UTC+2) (Timezone Converter).
Program of the event:
- Welcome and Introduction - Stefan Keller (full professor at HSR)
- Short Overview of Synthetic Data Generation - Raphael Das Gupta (staff member of IFS HSR)
- Pure Synthetic Data Generation by the Example of a PostgreSQL- and Python-based Tool - Labian Gashi (master student MSE in Computer Science)
- Q&A and Discussion
This event is organized by Prof. Stefan Keller (HSR) and is coordinated with the agenda of SwissPUG. Many thanks to Digitale Gesellschaft for hosting BigBlueButton.
>> Slides to download (~1.3MB, 3 PDFs) <<
Synthetic data production is a process that ranges from data anonymization of real data on the one hand to the generation of purely synthetic sample data on the other. Data anonymization - as part of synthetic data generation - can, for example, be applied to open (government) data publishing or to the training phase in machine learning. Data anonymization can be tackled with syntactic models of anonymity and with differential privacy, and has become a statistics research field of its own (see e.g. Clifton & Tassa, 2013).
In addition to the aforementioned applications, synthetic data generation methods are increasingly being applied to computer-supported software functionality testing (benchmarking and evaluation) and software load testing (software engineering), as well as to database performance tuning (database engineering).
While data anonymization has become a specialized area, this expertise is not necessarily accessible to data scientists and data engineers. What is obvious, however, is that masking or obfuscating too few data points while leaving everything else intact does not necessarily protect against de-anonymization. Such data is therefore suspected of neither preserving privacy nor fulfilling regulations. To illustrate these doubts, the following observations from a blog post by Mostly.ai are cited:
- "Many mobile phone owners can be re-identified by few tracking points (indicating home and work)".
- "80% of credit card owners can be re-identified by 3 transactions, even when only the merchant and the date of the transaction are revealed".
- "87% of all people can be re-identified merely by their date-of-birth, gender and ZIP code of residence" (also cited by Clifton & Tassa, 2013).
This deciphering has become possible through recent achievements in machine learning and big data analytics, which challenge privacy. Pure synthetic data can be a solution to these challenges and to the growing need for better software development and optimized database engineering.
Three types of synthetic data can be distinguished according to Surendra & Mohan (2017):
- Hybrid synthetic data: The data is generated using both original and synthetic data.
- Partially synthetic data: Only values of the selected sensitive attribute are replaced with synthetic data.
- Fully synthetic data: The data is completely artificially generated and doesn't contain original data.
Similarly, a blog post by the company 'Synthesized' (source 2018) defines the following computer-supported data generation types:
- Anonymized data, produced by a 1-to-1 transformation from original data. Examples include noise obfuscation, masking, or encryption.
- Artificial data, produced by an explicit probabilistic model via data sampling.
- Synthetic data, produced by a model (configuration, rules) which in turn can be learned by statistics from original data.
This last classification fits well with what this seminar aims to investigate.
To conclude this short seminar description, note again that it focuses on fully synthetic data in database engineering.
Figure 1: Three approaches to synthetic data (image by Synthesized 2018).
Main literature for this seminar:
- Surendra, H., & Mohan, H. S. (2017). A review of synthetic data generation methods for privacy preserving data publishing. International Journal of Scientific & Technology Research, 6(3), 95-101. Online resource (visited May 2020) https://www.ijstr.org/final-print/mar2017/A-Review-Of-Synthetic-Data-Generation-Methods-For-Privacy-Preserving-Data-Publishing.pdf
- Hittmeir, M., Ekelhart, A., & Mayer, R. (2019). On the Utility of Synthetic Data: An Empirical Evaluation on Machine Learning Tasks. In Proceedings of the 14th International Conference on Availability, Reliability and Security (pp. 1-6). Online resource (visited June 2020) https://doi.org/10.1145/3339252.3339281 (PDF)
Generator software oriented literature:
- Ayala-Rivera, V., McDonagh, P., Cerqueus, T., & Murphy, L. (2013). Synthetic data generation using benerator tool. arXiv preprint arXiv:1311.3312. Online resource (visited May 2020) https://arxiv.org/pdf/1311.3312.pdf
- Ping, H., Stoyanovich, J., & Howe, B. (2017, June). DataSynthesizer: Privacy-preserving synthetic datasets. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management (pp. 1-5). Online resource (visited May 2020) https://faculty.washington.edu/billhowe/publications/pdfs/ping17datasynthesizer.pdf
- Hoag, J. E., & Thompson, C. W. (2009). A parallel general-purpose synthetic data generator. In Data Engineering (pp. 103-117). Springer, Boston, MA. (PSDG). Online resource (visited May 2020) https://dl.acm.org/doi/pdf/10.1145/1276301.1276305
- Dahmen, J., & Cook, D. (2019). SynSys: A synthetic data generation system for healthcare applications. Sensors, 19(5), 1181. Online resource (visited May 2020) https://www.mdpi.com/1424-8220/19/5/1181/pdf
- Gray, J., Sundaresan, P., Englert, S., Baclawski, K., & Weinberger, P. J. (1994, May). Quickly generating billion-record synthetic databases. In Proceedings of the 1994 ACM SIGMOD international conference on Management of data (pp. 243-252). Online resource (visited May 2020) https://jimgray.azurewebsites.net/papers/SyntheticDataGen.pdf
- Clifton, C., & Tassa, T. (2013). On syntactic anonymity and differential privacy. In 2013 IEEE 29th International Conference on Data Engineering Workshops (ICDEW) (pp. 88-93). IEEE. Online resource (visited May 2020) https://www.profsandhu.com/cs5323_s17/clifton-2013.pdf
- Han, R., Lu, X., & Xu, J. (2014, March). On big data benchmarking. In Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (pp. 3-18). Springer, Cham. Online resource (visited May 2020) https://arxiv.org/ftp/arxiv/papers/1402/1402.5194.pdf
- El Emam, K., Mosquera, L., & Hoptroff, R. (2020). Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data. 1st ed. O'Reilly Media. Web: https://www.amazon.com/Practical-Synthetic-Data-Generation-Availability/dp/1492072745
List of selected data synthetization tools and generators (open source, without claim to be complete):
- DataSynthesizer (Python, Open Source, requires some third-party Python modules including Numpy, Scipy, Pandas, and dateutil): DataSynthesizer on Github
- "pydbgen - Random database/dataframe generator" by Dr. Tirthajyoti Sarkar, USA (Python, Open Source, requires some third-party Python modules...): https://github.com/tirthajyoti/pydbgen
- "Databene Benerator": Software framework in Java for creating realistic and valid high-volume test data, used for load and performance testing and showcase setup. Data is generated from an easily configurable metadata model and exported to databases, XML, CSV or flat files. https://sourceforge.net/projects/benerator/ and http://databene.org/download/databene-benerator-manual-0.7.6.pdf
In order to evaluate the generated data, databases (i.e. datasets in the sense of sets of rows) with the following properties regarding selected tables are required:
- Tables need to be larger than 1,000 rows but smaller than 10 million rows.
- All basic attribute types should be covered (within one of the tables of the selected dataset): Boolean, Character types char() and varchar(), Numeric types int and float8, Temporal types date, time and timestamp.
- A table containing correlated values where one has exceptions, e.g. a status of 'married' must not occur when age < 18.
- A table containing correlated values where one is partially functionally dependent on another, e.g. "zip code -> zip name".
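The two correlation requirements above can, for illustration, be exercised with a small generator like this (a hypothetical sketch; the status values and the zip-code lookup table are invented, not taken from any seminar dataset):

```python
import random

# Assumed lookup table realizing the partial functional dependency
# zip code -> zip name (each zip maps to exactly one name).
ZIP_NAMES = {8640: "Rapperswil", 8001: "Zuerich", 3000: "Bern"}

def synth_row(rng: random.Random) -> dict:
    age = rng.randint(1, 90)
    # Exception rule: a status other than 'single' only occurs for adults,
    # i.e. 'married' never appears together with age < 18.
    status = rng.choice(["single", "married", "divorced"]) if age >= 18 else "single"
    zip_code = rng.choice(list(ZIP_NAMES))
    return {"age": age, "status": status,
            "zip_code": zip_code, "zip_name": ZIP_NAMES[zip_code]}

rows = [synth_row(random.Random(i)) for i in range(1000)]
# Both properties hold by construction:
assert all(r["status"] == "single" for r in rows if r["age"] < 18)
assert all(r["zip_name"] == ZIP_NAMES[r["zip_code"]] for r in rows)
```

A real generator would of course derive such rules from the original data or from configuration rather than hard-coding them.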
Not suited datasets/databases:
pgsynthdata - a synthetic data generation tool for PostgreSQL, written in Python, used for 'show-and-tell' and as a proof of concept. Preconditions:
- A (read-only) user has access to a real database/dataset in a PG 12 instance/cluster.
- The user prepares a new PG database (to be filled with synthetic data).
Workflow of the tool:
- Step 1: Connect to database from tool.
- Step 2: Extract and show statistic data from PG catalog (pg_class and pg_stats).
- Step 3: Create a new database and generate synthetic data based on the statistics extracted in Step 2.
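Step 2 can be sketched as follows. pg_class and pg_stats are real PostgreSQL system catalogs/views; the particular column selection and the fetch_stats helper are illustrative assumptions, not part of pgsynthdata:

```python
# Sketch: read PostgreSQL's collected statistics for the public schema.
# null_frac, n_distinct and most_common_vals are the per-column statistics
# that a generator could try to reproduce in the synthetic database.
STATS_QUERY = """
SELECT c.relname, c.reltuples,
       s.attname, s.null_frac, s.n_distinct, s.most_common_vals
FROM pg_class c
JOIN pg_stats s ON s.tablename = c.relname
WHERE s.schemaname = 'public';
"""

def fetch_stats(conn):
    """conn: an open DB-API connection (e.g. psycopg2) to DBNAMEIN."""
    with conn.cursor() as cur:
        cur.execute(STATS_QUERY)
        return cur.fetchall()
```

Note that pg_stats only reflects the state of the last ANALYZE run on the original database.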
Design & implementation decisions:
- Implement using Python 3 and PostgreSQL 12 (PG) as CLI.
- Generation of a new DB and new tables only (i.e. not running ANALYZE on DBNAMEIN, not overwriting any existing DBNAMEOUT)
- Sprint 1: If required, create a new DB (optional), create the same table definitions, and plain use of random() for types INTEGER and BIGINT.
- Sprint 2: Add random generation for data types FLOAT, DATE, TIME, TIMESTAMP, MD5
- Sprint 3: Add VARCHAR / TEXT ("own names" => lookup name list as separate data input file in same dir?)
- Sprint 4: Possibly add foreign key (FK) relationships
- Sprint 5: Add stuff related to pg_statistic_ext / pg_stats_ext (relates to user input CREATE STATISTICS)
- Sprint 6: Add more stuff, like "-r" option (which creates db/schema) and data type GEOMETRY POINT (see Stackexchange)
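The per-type random generators of Sprints 1-3 could be sketched like this (a minimal sketch; the value ranges and the type-name dispatch are assumptions, not the tool's actual defaults):

```python
import datetime
import random

def random_value(pg_type: str, rng: random.Random):
    """Return one plain-random value for a PostgreSQL column type."""
    t = pg_type.lower()
    if t in ("integer", "bigint"):
        return rng.randint(0, 2**31 - 1)          # Sprint 1: plain random()
    if t in ("double precision", "float8", "real"):
        return rng.uniform(0.0, 1e6)              # Sprint 2 (assumed range)
    if t == "date":
        # Sprint 2: a date between 1970-01-01 and roughly 2024 (assumed range)
        return datetime.date(1970, 1, 1) + datetime.timedelta(days=rng.randint(0, 20000))
    if t in ("character varying", "text"):
        # Sprint 3 would replace this with a name-list lookup
        return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(10))
    raise NotImplementedError(pg_type)

rng = random.Random(42)
sample = {t: random_value(t, rng) for t in ("integer", "float8", "date", "text")}
```

A statistics-driven version would draw from the distributions found in pg_stats instead of from uniform ranges.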
What to consider in random generator code:
- For config. parameters related to the original DB use 'COMMENT' in SQL schema (see https://www.postgresql.org/docs/current/sql-comment.html and below).
- Use a table comment if a table can be copied 1:1 instead of being filled with synthesized values (e.g. a lookup table with few values); use: # COMMENT ON TABLE "foo" IS 'PROPERTY=insensitive_copyable';
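The PROPERTY convention above could be read with a small helper (hypothetical; only the 'PROPERTY=' prefix and the ';'-separated flags follow from the convention described here):

```python
def parse_property_comment(comment: str) -> set:
    """Parse a SQL COMMENT string of the form 'PROPERTY=flag1;flag2;...'
    into a set of flags; any other comment yields an empty set."""
    if not comment or not comment.startswith("PROPERTY="):
        return set()
    return set(comment[len("PROPERTY="):].split(";"))

assert parse_property_comment("PROPERTY=insensitive_copyable") == {"insensitive_copyable"}
assert parse_property_comment("PROPERTY=en;own_name") == {"en", "own_name"}
assert parse_property_comment("just a regular comment") == set()
```

The comments themselves would be fetched via the obj_description()/col_description() catalog functions or from pg_description.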
Rejected issues and alternatives:
- Save schema information. Syntax of config. file: 1. TOML format, or 2. as Python file (config.py) - or none, reading directly from the original database (pg_class and pg_stats), possibly with COMMENTs?
Ideas - probably out of scope:
- Check "Last Analyze" (the last time an ANALYZE was performed) and, if outdated, issue a warning that the original database needs up-to-date statistics (see catalog queries and "CODE SNIPPETS" below)?
- Use a column comment to indicate its specific language (e.g. 'en' for English, used for language statistics) and whether it contains proper names ('own names', e.g. first names); use: # COMMENT ON COLUMN foo.name IS 'PROPERTY=en;own_name';
- Add a comment to columns of type varchar and text if they contain proper names.
- Do "CREATE STATISTICS" in order to capture 'dependencies'; managed in catalog 'pg_statistic_ext' (probably further work)
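The "Last Analyze" idea can be sketched against pg_stat_user_tables, which really does expose last_analyze and last_autoanalyze per table; the seven-day staleness threshold and the helper name are assumptions:

```python
import datetime

# Sketch: which user tables have stale (or missing) planner statistics?
LAST_ANALYZE_QUERY = """
SELECT relname, GREATEST(last_analyze, last_autoanalyze) AS analyzed_at
FROM pg_stat_user_tables;
"""

def stale_tables(rows, max_age=datetime.timedelta(days=7)):
    """rows: (relname, analyzed_at) tuples fetched with the query above.
    Returns tables never analyzed or analyzed longer than max_age ago."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return [name for name, ts in rows if ts is None or now - ts > max_age]
```

The tool could warn when stale_tables() is non-empty and suggest running ANALYZE on the original database first.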
- First or second week of semester: Kickoff => Mon, February 24, 2020, 16:05 - ca. 17:00; room HSR research building 8.225.
- Send paper outline to advisor => Week 4 (comments by Stefan)
- Send document draft to advisor and buddy (if any) before mid term presentation (Zwischenpräsentation) => Week 8+1
- Middle of semester: Mid Term Presentation all. => (Week 9+1).
- Communication with buddy student if available/possible and on demand, peer-review of seminar thesis draft.
- Delivery of seminar thesis draft to advisor (two weeks before final presentation). => May 26, 2020
- After end of semester: Final Presentation: Tuesday, June 9, 2020, 17:00 - 18:00, as webinar
- Students and advisors give feedback to thesis.
- Final Delivery (see Deliverables below) to be sent to advisor => according to agreement, but prior to grading; tbd. (before new semester start)
Approach of Labian Gashi (proposal):
- Collect papers. Start documentation (intro, goal and theory part).
- Test and evaluate the existing tools DataSynthesizer and "Random database/dataframe generator" (all from category "Anonymized Data", possibly "Artificial Data")
- Implement a tool of category "Synthetic Data" in Python 3.6 (for example with numpy, pandas and/or dateutil)
- Write documentation
- Mid term drafts (PDF).
- Seminar Thesis as PDF and in original format inc. figures.
- Other artifacts: Code, extended queries and data, configurations etc.
- Presentation slides as PDF and in original format inc. figures.
- Other activities:
- Communications with buddy or if not available with advisor and support staff.
- Participation at discussions (mid term and final presentation).
Tools to be investigated (tbc.):