Commit b14cc29

Add pds tags and ManagedBy fields (awslabs#1400)
* Update allenai-tqa.yaml
* Update gatk-sv-data.yaml
* Update ibl-brain-wide-map.yaml
* Update dandiarchive.yaml
* Update 3kricegenome.yaml
* Update aws-igenomes.yaml
* Update giab.yaml
* Update gmsdata.yaml
* Update hcp-openaccess.yaml
* Update hpgp-data.yaml
* Update human-microbiome-project.yaml
* Update ihart.yaml
* Update icgc.yaml
* Update mimiciii.yaml
* Update nanopore.yaml
* Update ucsc-genome-browser.yaml
* Update physionet.yaml
* Update openneuro.yaml
* Update 1000-genomes.yaml
* Update dandiarchive.yaml
* Update 3kricegenome.yaml
* Update cell-painting-image-collection.yaml
* Update physionet.yaml
* Update 3kricegenome.yaml
* Update aws-igenomes.yaml
* Update dandiarchive.yaml
* Update giab.yaml
* Update gmsdata.yaml
* Update hcp-openaccess.yaml
* Update hpgp-data.yaml
* Update human-microbiome-project.yaml
* Update icgc.yaml
* Update ihart.yaml
* Update mimiciii.yaml
* Update openneuro.yaml
* Update physionet.yaml
1 parent b2ae7e8 commit b14cc29

20 files changed, +38 -6 lines changed

datasets/1000-genomes.yaml

+3
@@ -2,12 +2,15 @@ Name: 1000 Genomes
 Description: The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.
 Documentation: https://github.com/awslabs/open-data-docs/tree/main/docs/1000genomes
 Contact: http://www.internationalgenome.org/contact
+ManagedBy: National Institutes of Health
 UpdateFrequency: Not updated
 Tags:
 - aws-pds
 - genetic
 - genomic
 - life sciences
+- whole genome sequencing
+- fastq
 License: Data from the 1000 Genomes Project is now available without embargo, following the final publication from the project. Use of the data should be cited in the usual way, with current details available at http://www.internationalgenome.org/faq/how-do-i-cite-1000-genomes-project.
 Resources:
 - Description: http://www.internationalgenome.org/formats

datasets/3kricegenome.yaml

+1
@@ -2,6 +2,7 @@ Name: 3000 Rice Genomes Project
 Description: The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries.
 Documentation: https://github.com/awslabs/open-data-docs/tree/main/docs/3kricegenome
 Contact: http://iric.irri.org/contact-us
+ManagedBy: '[International Rice Research Institute](https://www.irri.org/)'
 UpdateFrequency: Not updated
 Tags:
 - agriculture

datasets/allenai-tqa.yaml

+3 -1
@@ -4,7 +4,9 @@ Documentation: https://allenai.org/data/tqa
 
 ManagedBy: '[Allen Institute for AI](https://allenai.org)'
 UpdateFrequency: Not updated
-Tags: []
+Tags:
+- aws-pds
+- machine learning
 License: '[CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/)'
 Resources:
 - Description: Project data files in a public bucket

datasets/aws-igenomes.yaml

+1
@@ -2,6 +2,7 @@ Name: AWS iGenomes
 Description: Common reference genomes hosted on AWS S3. Can be used when aligning and analysing raw DNA sequencing data.
 Documentation: https://ewels.github.io/AWS-iGenomes/
 Contact: https://github.com/ewels/AWS-iGenomes/issues
+ManagedBy: '[SciLifeLab](https://opensource.scilifelab.se/)'
 UpdateFrequency: New data are added when available.
 Tags:
 - aws-pds

datasets/cell-painting-image-collection.yaml

+1
@@ -14,6 +14,7 @@ Description: |
 applications.
 Documentation: https://github.com/cytodata/cytodata-hackathon-2018
 Contact: Post on https://forum.image.sc/ and tag with "cellpainting"
+ManagedBy: The Broad Institute
 UpdateFrequency: irregularly
 Tags:
 - aws-pds

datasets/dandiarchive.yaml

+2 -1
@@ -9,7 +9,8 @@ Description: >
 [NIDM - Neuro Imaging Data Model](http://nidm.nidash.org/). Development of DANDI is
 supported by the National Institute of Mental Health.
 Documentation: http://dandiarchive.org
-Contact: Support form at https://www.dandiarchive.org
+Contact: '[DANDI Archive Help Desk](https://github.com/dandi/helpdesk/issues/new/choose)'
+ManagedBy: '[DANDI Archive](https://www.dandiarchive.org/team)'
 UpdateFrequency: New datasets deposited every month
 Tags:
 - aws-pds

datasets/gatk-sv-data.yaml

+1
@@ -7,6 +7,7 @@ Contact: [email protected]
 ManagedBy: "[Loka Inc.](https://loka.com/)"
 UpdateFrequency: Every 3 months
 Tags:
+- aws-pds
 - biology
 - bioinformatics
 - genetic

datasets/giab.yaml

+1
@@ -2,6 +2,7 @@ Name: Genome in a Bottle on AWS
 Description: "Several reference genomes to enable translation of whole human genome sequencing to clinical practice. On 11/12/2020 these data were updated to reflect the most [up to date GIAB release](https://www.nist.gov/programs-projects/genome-bottle)."
 Documentation: https://github.com/awslabs/open-data-docs/tree/main/docs/giab
 Contact: http://genomeinabottle.org/
+ManagedBy: '[National Institute of Standards and Technology](https://www.nist.gov/)'
 UpdateFrequency: New data are added as soon as they are available.
 Tags:
 - aws-pds

datasets/gmsdata.yaml

+2 -1
@@ -2,13 +2,14 @@ Name: The Genome Modeling System
 Description: The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.
 Documentation: https://github.com/genome/gms/wiki
 Contact: https://github.com/genome/gms/issues
+ManagedBy: Genome Institute at the Washington University School of Medicine in St. Louis
 UpdateFrequency: Not updated
 Tags:
 - aws-pds
 - genetic
 - genomic
 - life sciences
-License: GNU Lesser General Public License v3.0 https://github.com/genome/gms/blob/ubuntu-12.04/LICENSE
+License: '[GNU Lesser General Public License v3.0](https://github.com/genome/gms/blob/ubuntu-12.04/LICENSE)'
 Resources:
 - Description: https://gmsdata.s3.amazonaws.com/index.html
 ARN: arn:aws:s3:::gmsdata

datasets/hcp-openaccess.yaml

+1
@@ -2,6 +2,7 @@ Name: "The Human Connectome Project"
 Description: The Human Connectome Project aims to provide an unparalleled compilation of neural data, an interface to graphically navigate this data and the opportunity to achieve never before realized conclusions about the living human brain.
 Documentation: http://www.humanconnectomeproject.org
 
+ManagedBy: '[Connectome Coordination Facility](https://www.humanconnectome.org/ccf-staff)'
 UpdateFrequency: Uncertain
 Tags:
 - aws-pds

datasets/hpgp-data.yaml

+1
@@ -2,6 +2,7 @@ Name: Human PanGenomics Project
 Description: This dataset includes sequencing data, assemblies, and analyses for the offspring of ten parent-offspring trios.
 Documentation: https://github.com/human-pangenomics/hpgp-data
 Contact: https://github.com/human-pangenomics/hpgp-data/issues
+ManagedBy: '(Human Pangenome Reference Consortium)[https://humanpangenome.org/]'
 UpdateFrequency: Data will be added and updated as technologies improve or new data uses are encountered.
 Tags:
 - aws-pds

datasets/human-microbiome-project.yaml

+1
@@ -2,6 +2,7 @@ Name: The Human Microbiome Project
 Description: The NIH-funded Human Microbiome Project (HMP) is a collaborative effort of over 300 scientists from more than 80 organizations to comprehensively characterize the microbial communities inhabiting the human body and elucidate their role in human health and disease. To accomplish this task, microbial community samples were isolated from a cohort of 300 healthy adult human subjects at 18 specific sites within five regions of the body (oral cavity, airways, urogenital track, skin, and gut). Targeted sequencing of the 16S bacterial marker gene and/or whole metagenome shotgun sequencing was performed for thousands of these samples. In addition, whole genome sequences were generated for isolate strains collected from human body sites to act as reference organisms for analysis. Finally, 16S marker and whole metagenome sequencing was also done on additional samples from people suffering from several disease conditions.
 Documentation: https://commonfund.nih.gov/hmp
 Contact: https://commonfund.nih.gov/hmp/related_activities
+ManagedBy: '[The National Institutes of Health Office of Strategic Coordination - The Common Fund](https://commonfund.nih.gov/hmp)'
 UpdateFrequency: Uncertain
 Tags:
 - aws-pds

datasets/ibl-brain-wide-map.yaml

+2
@@ -5,6 +5,8 @@ Contact: [email protected]
 ManagedBy: "[IBL](https://www.internationalbrainlab.com)"
 UpdateFrequency: TBD
 Tags:
+- aws-pds
+- life sciences
 - neuroscience
 - neurophysiology
 - open source software

datasets/icgc.yaml

+4
@@ -2,12 +2,16 @@ Name: ICGC on AWS
 Description: The International Cancer Genome Consortium (ICGC) coordinates projects with the common aim of accelerating research into the causes and control of cancer. The PanCancer Analysis of Whole Genomes (PCAWG) study is an international collaboration to identify common patterns of mutation in whole genomes from ICGC. More than 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors are now freely available on Amazon S3 to credentialed researchers subject to ICGC data sharing policies.
 Documentation: https://dcc.icgc.org/icgc-in-the-cloud/aws
 
+ManagedBy: '[International Cancer Genome Collaboratory](https://dcc.icgc.org/)'
 UpdateFrequency: New data is added as soon as it is available.
 Tags:
 - aws-pds
 - cancer
+- genetic
 - genomic
 - life sciences
+- bam
+- vcf
 License: Data use is subject to the access and publication polices of the source. Distribution of the data is subject to ICGC Trusted Partner Approval. More information on terms of use is available at https://icgc.org/daco
 Resources:
 - Description: BAM and VCF files from the The PanCancer Analysis of Whole Genomes (PCAWG) study.

datasets/ihart.yaml

+4
@@ -2,13 +2,17 @@ Name: iHART Whole Genome Sequencing Data Set
 Description: iHART is the [Hartwell Foundation](http://www.thehartwellfoundation.org/)’s Autism Research and Technology Initiative. This release contains whole genome data from over 1000 families with 2 or more children with autism, of which biomaterials were provided by the Autism Genetic Resource Exchange ([AGRE](http://research.agre.org/)).
 Documentation: http://www.ihart.org/data
 
+ManagedBy: '[Stanford University](https://wall-lab.stanford.edu/projects/ihart/)'
 UpdateFrequency: The dataset may be updated with additional or corrected data on a need-to-update basis.
 Tags:
 - aws-pds
 - autism spectrum disorder
+- genetic
 - genomic
 - life sciences
 - whole genome sequencing
+- bam
+- vcf
 License: Data use is subject to the access and publication polices of the iHART. More information on terms of use is available at [iHART website](http://www.ihart.org/)
 Resources:
 - Description: BAM and VCF files from The iHART whole genome sequencing study.

datasets/mimiciii.yaml

+1
@@ -16,6 +16,7 @@ Description: |
 [request access to the MIMIC-III Clinical Database on AWS](https://physionet.org/projects/mimiciii/1.4/request_access/2).
 Documentation: https://mimic.physionet.org/
 Contact: https://mimic.physionet.org/help/
+ManagedBy: '[MIT Laboratory for Computational Physiology](https://lcp.mit.edu/)'
 UpdateFrequency: Not updated
 Tags:
 - aws-pds

datasets/nanopore.yaml

+4 -1
@@ -2,11 +2,14 @@ Name: Nanopore Reference Human Genome
 Description: This dataset includes the sequencing and assembly of a reference standard human genome (GM12878) using the MinION nanopore sequencing instrument with the R9.4 1D chemistry.
 Documentation: https://github.com/nanopore-wgs-consortium/NA12878
 Contact: https://github.com/nanopore-wgs-consortium/NA12878/issues
+ManagedBy: Nanopore Whole Genome Sequencing Consortium
 UpdateFrequency: Data will be added as methodology improves or new data uses are encountered.
 Tags:
 - aws-pds
+- genetic
 - genomic
 - life sciences
+- whole genome sequencing
 License: Nanopore Human Reference data is released under the Creative Commons CC-BY license and allows free, full and open access to all. For more details please refer to the data reuse and license section of the documentation.
 Resources:
 - Description: Nanopore Reference Human Genome
@@ -18,4 +21,4 @@ Resources:
 DataAtWork:
 Tutorials:
 Tools & Applications:
-Publications:
+Publications:

datasets/openneuro.yaml

+1
@@ -2,6 +2,7 @@ Name: OpenNeuro
 Description: OpenNeuro is a database of openly-available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalograpy (MEG). OpenfMRI is a project of the [Center for Reproducible Neuroscience at Stanford University](http://reproducibility.stanford.edu). Development of the OpenNeuro resource has been funded by the National Science Foundation, National Institute of Mental Health, National Institute on Drug Abuse, and the Laura and John Arnold Foundation.
 Documentation: http://openneuro.org
 Contact: Support form at https://openneuro.org
+ManagedBy: '[Stanford University Center for Reproducible Neuroscience](https://reproducibility.stanford.edu/)'
 UpdateFrequency: New datasets deposited every 4-6 days
 Tags:
 - aws-pds

datasets/physionet.yaml

+3 -2
@@ -1,7 +1,8 @@
 Name: Physionet
 Description: PhysioNet offers free web access to large collections of recorded physiologic signals (PhysioBank) and related open-source software (PhysioToolkit).
 Documentation: https://physionet.org/
-Contact: https://physionet.org/faq.shtml
+
+ManagedBy: '[MIT Laboratory for Computational Physiology](https://lcp.mit.edu/)'
 UpdateFrequency: Not updated
 Tags:
 - aws-pds
@@ -16,4 +17,4 @@ Resources:
 DataAtWork:
 Tutorials:
 Tools & Applications:
-Publications:
+Publications:

datasets/ucsc-genome-browser.yaml

+1
@@ -2,6 +2,7 @@ Name: UCSC Genome Browser Sequence and Annotations
 Description: "The UCSC Genome Browser is an online graphical viewer for genomes, a genome browser, hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-sep format and all binary files in custom formats, sometimes referred as 'gbdb'-files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome annotation track has been created by an academic research group, or, in a few cases, by commercial companies. Please acknowledge them by citing them. The information can be found by going to https://genome.ucsc.edu, selecting the respective genome assembly and clicking on the data track. At the end of the documentation, we provide a list of references and acknowledgements."
 Documentation: https://hgdownload.soe.ucsc.edu/downloads.html
 Contact: https://genome.ucsc.edu/contacts.html
+ManagedBy: University of California Santa Cruz Genome Institute
 UpdateFrequency: Monthly
 Tags:
 - aws-pds
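
Review note: most ManagedBy values added in this commit that carry a link use the `[text](url)` form, while the one in datasets/hpgp-data.yaml is written as `(text)[url]`. A check along the following lines could catch that kind of slip, along with a missing ManagedBy field or aws-pds tag. This is only a minimal sketch: `check_dataset` is a hypothetical helper, not part of the registry's tooling, and it scans lines of text rather than parsing YAML, so it assumes one `Key: value` pair per line.

```python
import re

# Hypothetical check, not part of the repository's tooling.
# Matches a link written in the swapped order: (text)[url]
REVERSED_LINK = re.compile(r"^\([^)]+\)\[[^\]\s]+\]$")


def check_dataset(text):
    """Return a list of problems found in one dataset file's text."""
    problems = []
    lines = text.splitlines()

    # Take the last top-level ManagedBy value, with surrounding quotes removed.
    managed_by = None
    for line in lines:
        if line.startswith("ManagedBy:"):
            managed_by = line[len("ManagedBy:"):].strip().strip("'\"")

    if not managed_by:
        problems.append("missing ManagedBy")
    elif REVERSED_LINK.match(managed_by):
        problems.append("ManagedBy link has brackets and parentheses swapped")

    # The tag may be indented in the real file, so compare stripped lines.
    if not any(line.strip() == "- aws-pds" for line in lines):
        problems.append("missing aws-pds tag")
    return problems
```

Run against the hpgp-data.yaml value shown in this diff, the sketch reports the swapped link; a plain-text value such as the one in 1000-genomes.yaml passes.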
