The Datasets project solves the problem of organizing data sets. It also aims to ensure experiment consistency and repeatability through data set immutability, unique identification, and usage and change logs.
This project is inspired by: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45390.pdf
Data set discovery and identification is based on the presence of the file `dataset.yaml`:

- `id` - UUID
- `name` - Human readable name
- `maintainer` - Email of the person responsible for the data set
- `tags` - Data set tags for simple identification
- `internal` - Denotes that the data set is not publicly available
- `data` - Paths to folders with data (inside the data set path)
- `url` - Public url for the data set
- `from` - id of the parent data set
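As a sketch, a minimal `dataset.yaml` could look like the following; every value is illustrative, not taken from a real data set:

```yaml
id: 123e4567-e89b-12d3-a456-426614174000
name: Example text corpus
maintainer: maintainer@example.com
tags:
  - nlp
  - corpus
internal: true
data:
  - raw
  - processed
url: https://example.com/datasets/example-corpus
```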
Generated:

- `type` - "fs" for the filesystem
- `changelog` - Changes detected in the data set
- `usages` - Reported usages (from the lib)
Generated from the fs (fields starting with `_` are paths in the container, rewritten based on `storage_replace` into the final fields - `paths`, ...):

- `paths`, `_paths` - Paths to the data set
- `links`, `_links` - Symlinks pointing to the data set
- `markdowns`, `_markdowns` - Markdown files found in the data set
- `characteristics` - Generated statistics of the data set (size, number of files, extensions)
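The container-to-real path rewriting described above can be sketched as follows; the function name and the assumption that `storage_replace` is a prefix-to-prefix mapping are mine, not the project's actual implementation:

```python
def replace_paths(container_paths, storage_replace):
    """Rewrite container path prefixes to real ones (illustrative sketch).

    storage_replace is assumed to map a container path prefix to the
    real path prefix, e.g. {"/container/data": "/mnt/moosefs/data"}.
    """
    result = []
    for path in container_paths:
        for prefix, real in storage_replace.items():
            if path.startswith(prefix):
                # Replace only the matching prefix; keep the rest of the path
                path = real + path[len(prefix):]
                break
        result.append(path)
    return result
```

With this sketch, `_paths` values collected inside the container would be passed through `replace_paths` to produce the final `paths` field.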
Configuration:

- `database_path` - Where the LMDB should be stored
- `iter_file_limit` - When searching for `dataset.yaml`, folders with more than this many files won't be scanned
- `datasets` - Paths to folders used for scanning
- `storage_replace` - Replace the container paths with the real ones
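An illustrative configuration might look like this; the field names come from the list above, but every value (and the exact mapping format of `storage_replace`) is an assumption:

```yaml
database_path: /var/lib/datasets/lmdb
iter_file_limit: 100000
datasets:
  - /mnt/moosefs/data
storage_replace:
  /container/data: /mnt/moosefs/data
```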
Data sets may be added through the API or via file system analysis. Other sources such as HDFS or databases may be added later.
The system is currently used with a distributed FS (MooseFS - similar to GFS or Ceph) mounted with FUSE. A local FS will also work well.
Any key-value database is suitable; a local LMDB is used right now. Another database may be plugged in by adding a connector that implements the `storage.Storage` interface. Aerospike will be officially supported soon.
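As a rough sketch, a connector could implement a small key-value interface like the one below; the method names are assumptions and may differ from the real `storage.Storage` interface:

```python
import abc


class Storage(abc.ABC):
    """Minimal key-value storage interface (hypothetical sketch)."""

    @abc.abstractmethod
    def get(self, key):
        """Return the value stored under key, or None if missing."""

    @abc.abstractmethod
    def put(self, key, value):
        """Store value under key."""


class MemoryStorage(Storage):
    """In-memory connector, e.g. for tests; LMDB or Aerospike
    connectors would implement the same interface."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value
```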
Planned features:

- data set monitoring + email notifications
Run the development environment with:

```
docker-compose up dev
```
Feel free to contribute.
© 2016 Vít Listík
Released under the MIT license.