Merge pull request #6 from CBIIT/local-dataloader
Added Config Files
jonkiky authored Jul 10, 2024
2 parents 6f35801 + 499d27d commit 68cef16
Showing 7 changed files with 3,690 additions and 0 deletions.
88 changes: 88 additions & 0 deletions config/aboutPagesContent.yml
@@ -0,0 +1,88 @@
- page: '/submit'
title: "Data Submission"
primaryContentImage: https://raw.githubusercontent.com/CBIIT/datacommons-assets/ctdc_Assets/ctdc/images/aboutPages/About_CRDC.png
content:
- paragraph: "CTDC is not accepting external data submissions at this time. For more information on how to submit data to other data repositories within the Cancer Research Data Commons, please see $$[here](type:internal url:https://datacommons.cancer.gov/data/submit-data target:_blank )$$."
- page: '/developers'
title: "For Developers"
primaryContentImage: https://raw.githubusercontent.com/CBIIT/datacommons-assets/ctdc_Assets/ctdc/images/aboutPages/About_Developers.png
content:
- paragraph: "Users can query the CTDC data via a Graphical User Interface (GUI) or an Application Programming Interface (API). The CTDC GitHub repo is also available for those interested in accessing our codebase and documentation."
- paragraph: "$$#CTDC GUI#$$"
- paragraph: "The GUI provides users a distilled set of parameters (faceted querying) they can use to explore a subset of the CTDC data model. "
- paragraph: "$$#CTDC API#$$"
- paragraph: "CTDC is based on a graph database, featuring a GraphQL API (Java) and a React front-end (JavaScript). Each tier in the application stack is designed to be modular and adaptable for a variety of use-cases and scenarios. $$[A GraphQL API](type:internal url:/#/graphql target:_blank)$$ enables querying of the entire data model. The API is provided “as is”: there are no warranties or conditions arising out of usage of these services."
- paragraph: "$$#GITHUB#$$"
- paragraph: "The $$[ CTDC GitHub repo](https://github.com/CBIIT/crdc-ctdc-ui)$$ is available for research, usage, forking, and pull requests. The codebase is intended for sharing and building frameworks for related initiatives and projects. The CTDC GitHub repo has documentation about how to access the system, including endpoints and recommendations for tools and example queries. Both the project and documentation are maintained and updated in accordance with major and minor releases."
- page: '/purpose'
title: "Purpose"
primaryContentImage: https://raw.githubusercontent.com/CBIIT/datacommons-assets/ctdc_Assets/ctdc/images/aboutPages/About_Purpose.png
content:
- paragraph: "The goals of the CTDC are to advance cancer research and accelerate the development of innovative therapies by improving access to data from NCI-sponsored clinical studies, including genomic panel assay and clinical data. The CTDC does this through: "
- paragraph: "$$*Graphical User Interface (GUI):*$$ The CTDC’s GUI includes an Explore dashboard with search filters to help users visualize, explore, and navigate complex metadata without the need for coding or specialized technical skills. "
- paragraph: "$$*Data consolidation:*$$ The CTDC consolidates data from clinical studies funded by the NCI. This allows researchers to analyze data collectively, leading to deeper insights and a better understanding of cancer’s complexities. "
- paragraph: "$$*Data harmonization:*$$ Data harmonization ensures that data across studies within the CTDC are standardized and organized in a consistent manner to improve data compatibility, integration, and meta-analysis. "
- paragraph: "$$*Integration with NCI Cloud Resources:*$$ Users can easily transfer selected CTDC data to the $$[ Velsera Seven Bridges Cancer Genomics Cloud](https://datacommons.cancer.gov/analytical-resource/seven-bridges-cancer-genomics-cloud)$$ (SB-CGC), a cloud-based platform for cancer research funded by the NCI. Here, researchers can integrate multi-omic data across sources and leverage access to a multitude of tools and workflows for computation and analysis. "
- paragraph: "$$*Fueling collaborative research:*$$ By centralizing data and making them available through NCI’s Cloud Resources, the CTDC promotes secure collaboration among distributed research groups, fostering interdisciplinary partnerships. "
- paragraph: "$$*Democratizing data access:*$$ Data in the CTDC are made available at different access levels, including open (no registration required) and controlled access (registration required). The CTDC aims to make each dataset as open as possible while protecting participant privacy and adhering to regulations, agreements, and other considerations specific to each study. "
- paragraph: "$$*Alignment to F.A.I.R. data principles:*$$ The CTDC adheres to Findable, Accessible, Interoperable, and Reusable ($$[FAIR]( https://www.go-fair.org/fair-principles/)$$) principles for scientific data management and stewardship. CTDC seeks to provide clearly organized data and guidance enabling end users to search for, find, and access data of interest. The emphasis on harmonization described above promotes the interoperability of data within and across the CRDC ecosystem and beyond, as well as the reusability of data beyond the primary publication."
- page: '/support'
title: "Support"
primaryContentImage: https://raw.githubusercontent.com/CBIIT/datacommons-assets/ctdc_Assets/ctdc/images/aboutPages/About_Support.png
content:
- paragraph: "If you have any questions, please contact us at $$[[email protected]]([email protected])$$."
- page: '/cloud-computing'
title: "Cloud Computing"
primaryContentImage: https://raw.githubusercontent.com/CBIIT/datacommons-assets/ctdc_Assets/ctdc/images/aboutPages/About_CRDC.png
content:
- paragraph: "$$#CTDC and NCI’s Cloud Resources#$$ "
- paragraph: "The CTDC supports analysis via the $$[Seven Bridges Cancer Genomics Cloud](https://datacommons.cancer.gov/analytical-resource/seven-bridges-cancer-genomics-cloud-developed-velsera#)$$ (SB-CGC). SB-CGC supports data access through a web-based user interface, programmatic access to analytic tools and workflows, and collaborative data analysis and sharing pipelines. Users can transfer data of interest from the CTDC directly to SB-CGC, eliminating the need to download and store extremely large datasets. Through the SB-CGC, researchers can bring analysis tools to the data in the cloud, instead of the traditional process of bringing the data to the tools on local hardware. Analyzing data through the cloud offers many benefits including: "
- listWithDots :
- "State-of-the-art analysis using high-performance computing"
- "Remote access and flexibility for nationally or globally distributed teams"
- "On-demand computational capacity to scale resources as needed "
- paragraph: "Data brought to the SB-CGC can be analyzed using more than 200 preinstalled, curated bioinformatics tools and workflows. Researchers can also extend the functionality of the platform by adding their own data and tools via an intuitive software development kit. "
- paragraph: "$300 in credits are available to new users who want to test out the platform. "
- paragraph: "For more information on getting started with SB-CGC including onboarding videos and more, visit: $$[https://www.cancergenomicscloud.org/getting-started](https://www.cancergenomicscloud.org/getting-started )$$."
- page: '/data-use'
title: "CTDC Data Terms of Use"
primaryContentImage: https://raw.githubusercontent.com/CBIIT/datacommons-assets/ctdc_Assets/ctdc/images/aboutPages/About_CRDC.png
content:
- paragraph: "CTDC’s data terms of use are consistent with applicable international, national, tribal, and state laws and regulations, as well as institutional policies for data submission, access, and sharing to help enable broad data access to the extent possible."
- paragraph: "$$#DATA ACCESS#$$"
- paragraph: "Data is made available through open-access, registered access, and controlled access tiers. Visit our $$[Request Access](type:internal url:/#/request-access target:_blank )$$ page for more information. Access to controlled data is restricted to authorized users. Users and Users’ institutions are responsible for understanding terms of use and adhering to study-specific Data Use Agreement(s) (DUAs), Institutional Review Board policies, and other relevant guidelines. A signed specific DUA may be required to access controlled-access tier data for individual trials or studies. If a DUA is required, it will be provided as part of the data request process. "
- paragraph: "$$#RE-IDENTIFICATION#$$"
- paragraph: "Data available within the CTDC includes de-identified clinical study data subject to both general and dataset-specific data use policies. Users of any data provided by CTDC, whether open, registered or controlled access, agree not to attempt to reidentify any individual participant in any study represented by CTDC data, for any purpose whatever. This includes, but is not limited to, the use of analytical techniques of reidentification on genomic or clinical data. "
- paragraph: "$$#INTELLECTUAL PROPERTY#$$"
- paragraph: "NIH considers CTDC data as pre-competitive and discourages users from making IP claims derived directly from the available dataset(s). NIH-provided data, and conclusions derived thereof, shall remain freely available, without requirement for licensing. However, the NIH also recognizes the importance of the subsequent development of IP on downstream discoveries, especially in therapeutics, which will be necessary to support full investment in products that the public needs. "
- paragraph: "For more information about the CTDC and/or questions regarding intellectual property, please contact us at $$[[email protected]]([email protected])$$. "
- paragraph: "$$#CITING CTDC IN PUBLICATIONS:#$$"
- paragraph: "Whenever using CTDC data in a publication, please cite: "
- paragraph: "1. CTDC resource or individual study "
- paragraph: "           ● To cite the resource, cite the CTDC website ( $$[clinical.datacommons.cancer.gov](type:internal url:/#/ target:_blank )$$ ) "
- paragraph: "           ● To cite an individual study, either cite the CTDC study ID (e.g., NCT04314401) "
- paragraph: "OR"
- paragraph: "           ● Cite the study URL: (e.g., $$[https://clinical.datacommons.cancer.gov/#/study/NCT04314401](type:internal url:/#/study/NCT04314401 target:_blank )$$). "
- paragraph: "2. Primary publication of the data (when applicable) "
- paragraph: "           ● The primary publication from the original data producers is available on the individual study summary pages. "
- paragraph: "$$#QUESTIONS#$$"
- paragraph: "CTDC strongly encourages investigators to contact $$[[email protected]]([email protected])$$ with any questions or concerns related to publication of their analyses."
- page: '/data-harmonization'
title: "Data Harmonization"
primaryContentImage: https://raw.githubusercontent.com/CBIIT/datacommons-assets/ctdc_Assets/ctdc/images/aboutPages/About_CRDC.png
content:
- paragraph: "CTDC data elements have been aligned against NCI’s $$[ Data Standards Services ](https://datascience.cancer.gov/data-commons/data-standards-services)$$(DSS) common data elements (CDEs) curated through the CRDC in $$[ the Cancer Data Standards Registry and Repository](https://cadsr.cancer.gov/onedata/dmdirect/NIH/NCI/CO/CDEDD?filter=Administered%20Item%20%28Data%20Element%20CO%29.CDEDD%20Classification.P_ITEM_ID_VER=10466051v1)$$(caDSR). A list of the current CRDC Standard Data Elements can be found $$[here](https://cadsr.cancer.gov/onedata/clsc_datamanager_container.do?s=JTAycSUwNiUwQVQlMDUwelQrJTI0JTI5RyU3RiUyMkglMDAqc2IlMERxd3d1JTAwJTEzJTBEJTdDaiUyQ0VlJTJDcSUxNSUwNCUwRHIlMEElMDIlMDJ2c3IlMDIlN0MlMDk3JTJDYSUyNSUxRSUxMnclMDVyJTdCJTI3JTBEQnAlMjZRVC5NJTdDJTdGJTdDJTA5bEolMTQ )$$."
- page: "/data-model"
primaryContentImage: https://raw.githubusercontent.com/CBIIT/datacommons-assets/ctdc_Assets/ctdc/images/aboutPages/About_Model.png
title: "Data Model"
content:
- paragraph: "The CTDC data model is a representation of how data in the CTDC are arranged relative to each other. The data model is flexibly designed to accommodate an ever-expanding CTDC database."
- paragraph: "The SVG graphic below represents the current CTDC data model consisting of data nodes, node properties, and relationships (edges). It provides a comprehensive mapping of the system data, part of which may be viewed in the application and user interface. Additional nodes and properties beyond those presented on the front-end are available for inspection and querying via API at $$[https://clinical.datacommons.cancer.gov/#/graphql](type:internal url:/#/graphql target:_blank )$$. "
- paragraph: "Information about the graphic data model, including the model description files, can be found on GitHub at $$[https://github.com/CBIIT/ctdc-model](https://github.com/CBIIT/ctdc-model)$$."
secondaryZoomImageTitle: "The CTDC Data Model"
secondaryZoomImage: 'https://raw.githubusercontent.com/CBIIT/ctdc-model/ctdc-model/model-desc/ctdc-model.svg'
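
Note on the inline link markup used in config/aboutPagesContent.yml: paragraphs embed links as `$$[text](…)$$`, where the parenthesized part is either a bare URL or a set of `type:`, `url:` and `target:` attributes. The Python below is a minimal sketch of how such a string could be split into its parts, assuming only the two forms visible in this file. The function name `extract_links` and the returned dictionary layout are hypothetical; this is not the front-end's actual parser.

```python
import re

# Matches $$[link text](link target)$$ as it appears in the paragraphs above.
LINK_RE = re.compile(r"\$\$\[(?P<text>.*?)\]\((?P<target>.*?)\)\$\$")

# Attribute names seen in this file's "type:... url:... target:..." form.
ATTRIBUTE_KEYS = ("type", "url", "target")


def extract_links(paragraph):
    """Return one dict per $$[text](...)$$ link found in a paragraph string."""
    links = []
    for match in LINK_RE.finditer(paragraph):
        raw_target = match.group("target").strip()
        link = {
            "text": match.group("text").strip(),
            "url": raw_target,  # bare-URL form: the whole target is the URL
            "type": None,
            "target": None,
        }
        if raw_target.startswith("type:"):
            # Attribute form: whitespace-separated key:value tokens.
            for token in raw_target.split():
                key, _, value = token.partition(":")
                if key in ATTRIBUTE_KEYS:
                    link[key] = value
        links.append(link)
    return links


if __name__ == "__main__":
    sample = (
        "please see $$[here](type:internal "
        "url:https://datacommons.cancer.gov/data/submit-data target:_blank )$$."
    )
    for link in extract_links(sample):
        print(link)
```

Splitting each token on its first colon only keeps URLs such as `https://…` intact, which is why `partition` is used rather than a plain `split(":")`.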





43 changes: 43 additions & 0 deletions config/ctdc-local.yml
@@ -0,0 +1,43 @@
Config:
  temp_folder: tmp
  backup_folder: /tmp/data-loader-backups
  neo4j:
    # Location of Neo4j server, e.g., bolt://127.0.0.1:7687

  # Schema files' locations
  schema:
    - /Users/davenportaw/Downloads/cmb-data/ctdc_model_file.yml
    - /Users/davenportaw/Downloads/cmb-data/ctdc_model_properties_file.yml

  plugins:
    # - module: loader_plugins.visit_creator
    #   class: VisitCreator
    # - module: loader_plugins.individual_creator
    #   class: IndividualCreator

  # Property file location
  prop_file: /Users/davenportaw/Downloads/cmb-data/props-ctdc-cmb.yml

  # Skip validations, aka. Cheat Mode
  cheat_mode: false
  # Validations only, skip loading
  dry_run: false
  # Wipe out database before loading, you'll lose all data!
  wipe_db: true
  # Skip backup step
  no_backup: true
  # Automatically confirm deletion and database wiping (without asking user to confirm)
  no_confirmation: false
  # Max violations to display, default is 10
  max_violations: 10
  # Split transactions
  split-transactions: true

  # S3 bucket name, if you are loading from an S3 bucket
  s3_bucket:
  # S3 folder for dataset
  s3_folder:
  # Loading mode, can be UPSERT_MODE, NEW_MODE or DELETE_MODE, default is UPSERT_MODE
  loading_mode:
  # Location of dataset
  dataset:
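
Note on the option values in config/ctdc-local.yml: the comments state that `loading_mode` defaults to UPSERT_MODE and `max_violations` to 10, and this file sets `wipe_db: true` together with `no_backup: true`. The Python below is a minimal sketch of a pre-flight check over those options, assuming the `Config` mapping has already been read into a plain dictionary. The function `check_loader_options` and its warning text are hypothetical and are not part of the data loader itself.

```python
# Loading modes listed in the config comment.
VALID_LOADING_MODES = ("UPSERT_MODE", "NEW_MODE", "DELETE_MODE")


def check_loader_options(config):
    """Fill in documented defaults and flag risky option combinations.

    Returns (resolved, warnings); raises ValueError on an unknown loading mode.
    """
    resolved = dict(config)
    warnings = []

    # An empty "loading_mode:" key reads as None; the documented default applies.
    mode = resolved.get("loading_mode") or "UPSERT_MODE"
    if mode not in VALID_LOADING_MODES:
        raise ValueError(f"unknown loading_mode: {mode!r}")
    resolved["loading_mode"] = mode

    # Documented default for the number of violations to display.
    if resolved.get("max_violations") is None:
        resolved["max_violations"] = 10

    # Wiping the database while skipping the backup leaves nothing to restore.
    if resolved.get("wipe_db") and resolved.get("no_backup"):
        warnings.append("wipe_db is set with no_backup: existing data cannot be restored")

    return resolved, warnings


if __name__ == "__main__":
    options = {
        "wipe_db": True,
        "no_backup": True,
        "no_confirmation": False,
        "max_violations": 10,
        "loading_mode": None,
    }
    resolved, warnings = check_loader_options(options)
    print(resolved["loading_mode"])
    for message in warnings:
        print("warning:", message)
```

With the values committed here, a check of this kind would flag the `wipe_db`/`no_backup` combination, which may be intentional for a local development database.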