DataHub

Synthetic data generation

DataHub is a set of Python libraries dedicated to the production of synthetic data for use in tests, machine learning training, statistical analysis, and other use cases. DataHub uses existing datasets to generate synthetic models. If no existing data is available, it uses user-provided scripts and data rules to generate synthetic data with out-of-the-box helper datasets.

Synthetic datasets are simply artificially manufactured sets, produced to a desired degree of accuracy. Real data does play a part in synthetic generation, depending on the realism you require. The product roadmap details the functionality planned in this respect.

DataHub's core is predominantly based around pandas data frames and object generation. A common question: now that I have a data frame of synthetic data, what do I do with it? The pandas library comes with an array of options here, so for the time being sinking to databases is out of scope for the core library. However, see the examples in the tests folder for some common patterns.

Note: As we build out a config-based synthetic spec generator, we will bring this back into scope. Please see our roadmap/issue list and get involved in the discussion.
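As an illustration of the pandas options mentioned above, here is a minimal sketch of sinking a data frame to a database. It uses only pandas and the standard-library `sqlite3` module, not any DataHub API. The function name `sink_to_sqlite`, the table name `trades`, and the column names are hypothetical and chosen purely for the example.

```python
import sqlite3

import pandas as pd


def sink_to_sqlite(df, conn, table):
    """Write a data frame to a SQLite table and return the stored row count.

    Hypothetical helper for illustration; not part of DataHub.
    """
    # DataFrame.to_sql accepts a sqlite3 connection directly.
    df.to_sql(table, conn, if_exists="replace", index=False)
    # Read the row count back to confirm the write.
    counts = pd.read_sql_query(f"SELECT COUNT(*) AS n FROM {table}", conn)
    return int(counts["n"].iloc[0])


# A stand-in for a data frame of synthetic data (column names are assumptions).
synthetic_df = pd.DataFrame(
    {"trade_id": [1, 2, 3], "notional": [1000.0, 2500.0, 400.0]}
)

conn = sqlite3.connect(":memory:")
row_count = sink_to_sqlite(synthetic_df, conn, "trades")
print(row_count)
```

The same data frame could instead be written to a file with `DataFrame.to_csv` or `DataFrame.to_json` if a database is not needed.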

Key documents

For information on how to get started with DataHub see our Getting Started Guide
For more technical information about DataHub and how to customize it, see the Developer Guide
For a high-level road map see Road Map
This project uses Gravizo for all diagrams and charts as highlighted in DataHub Issue 41.

Overview of Synthetic data

Synthetic data is information that is artificially manufactured rather than generated by real-world events.
Synthetic data is created algorithmically, and can be used as a stand-in for test datasets of production data.
Real data does play a part in synthetic data generation, depending on how realistic you want the output to be.
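The points above can be sketched in a few lines of pandas and NumPy. This is not DataHub's implementation or API. It is a simplified illustration that assumes each column can be modelled independently: numeric columns are sampled from a normal distribution fitted to the real data, and other columns are sampled according to their observed frequencies. The function name `generate_synthetic` and the sample columns are hypothetical.

```python
import numpy as np
import pandas as pd


def generate_synthetic(real, n_rows, seed=0):
    """Generate synthetic rows that mimic per-column statistics of `real`.

    Hypothetical helper for illustration; columns are treated independently,
    so correlations between columns are not preserved.
    """
    rng = np.random.default_rng(seed)
    synthetic = {}
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            # Numeric column: sample from a normal fitted to the real values.
            synthetic[col] = rng.normal(
                real[col].mean(), real[col].std(ddof=0), n_rows
            )
        else:
            # Categorical column: sample values by their observed frequency.
            freqs = real[col].value_counts(normalize=True)
            synthetic[col] = rng.choice(
                freqs.index.to_numpy(), size=n_rows, p=freqs.to_numpy()
            )
    return pd.DataFrame(synthetic)


# A tiny stand-in for an existing dataset (values are made up for the example).
real = pd.DataFrame(
    {"country": ["UK", "UK", "US", "DE"], "amount": [10.0, 12.0, 30.0, 8.0]}
)
synthetic = generate_synthetic(real, n_rows=100)
print(synthetic.head())
```

The more of the real data's structure the generator captures, the more realistic the output. This sketch captures only per-column distributions, which is the low-realism end of that trade-off.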

License

Distributed under the Apache License, Version 2.0.

SPDX-License-Identifier: Apache-2.0

