diff --git a/gen3/docs/gen3-resources/operator-guide/img/BRH_Discovery_Page.png b/gen3/docs/gen3-resources/operator-guide/img/BRH_Discovery_Page.png new file mode 100644 index 00000000..8d01e611 Binary files /dev/null and b/gen3/docs/gen3-resources/operator-guide/img/BRH_Discovery_Page.png differ diff --git a/gen3/docs/gen3-resources/operator-guide/submit-semi-structured-data.md b/gen3/docs/gen3-resources/operator-guide/submit-semi-structured-data.md new file mode 100644 index 00000000..b4347d99 --- /dev/null +++ b/gen3/docs/gen3-resources/operator-guide/submit-semi-structured-data.md @@ -0,0 +1,17 @@ + +# Semi-structured Data + +Semi-structured data is organized as unique identifiers with flexible key/value pairs (including nesting). The key/value pairs may be consistent between records, but are not required to be. This is typically used for storing publicly available metadata about available datasets or additional public metadata about samples. The MDS and AggMDS both include semi-structured data and power the Data Portal Discovery Page. + + +Because the structure of a commons' MDS and the Discovery Page configuration are very closely coupled, all content related to creating MDS records are included in the [Customize Gen3 Search Interface Section][Customize Gen3 Search Interface Section]. 
+ +Instructions for creating and modifying an MDS record can be found in the [Gen3 SDK][Gen3 SDK Discovery Page]. + +## Discovery Page +![BRH Discovery Page](/gen3-resources/operator-guide/img/BRH_Discovery_Page.png) + + + +[Customize Gen3 Search Interface Section]: /gen3-resources/operator-guide/customize-search/ +[Gen3 SDK Discovery Page]: https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/cli/discovery.py diff --git a/gen3/docs/gen3-resources/operator-guide/submit-structured-data.md b/gen3/docs/gen3-resources/operator-guide/submit-structured-data.md index 63ee2b8f..1a4f09f0 100644 --- a/gen3/docs/gen3-resources/operator-guide/submit-structured-data.md +++ b/gen3/docs/gen3-resources/operator-guide/submit-structured-data.md @@ -28,7 +28,7 @@ Template TSVs are provided in each node's page in the data dictionary. The prepared TSV files must be submitted in a specific order due to node links. Referring back to the graphical data model, a record cannot be submitted without first submitting the record(s) to which it is linked upstream (its "parent"). If metadata are submitted out of order, such as submitting a TSV with links to parent records that don't yet exist, the validator will reject the submission on the basis that the dependency is not present with the error message, "INVALID_LINK". -In a Gen3 Data Commons, **programs** and **projects** are two administrative nodes in the graph database that serve as the most upstream nodes. A program must be created first, followed by a project. Any subsequent data submission and data access, along with control of access to data, is done through the project scope. In some projects only a subset of submitters may have access to create a program or project. +In a Gen3 Data Commons, `programs` and `projects` are two administrative nodes in the graph database that serve as the most upstream nodes. A `program` must be created first, followed by a `project`. 
Any subsequent data submission and data access, along with control of access to data, is done through the project scope. In some projects only a subset of submitters may have access to create a program or project. Before you create a program and a project or submit any data, you need to grant yourself permissions. First, you will need to grant yourself access to **create** a program and second, you need to grant yourself access to *see* the program. You can **create** the program before or after having access to *see* it. @@ -39,13 +39,14 @@ Make sure to update user privileges: docker exec -it fence-service fence-create sync --arborist http://arborist-service --yaml user.yaml ``` -To create a program, visit the URL where your Gen3 Commons is hosted and append `/_root`. If you are running the Docker Compose setup locally, then this will be `localhost/_root`. Otherwise, this will be whatever you set the `hostname` field to in the creds files for the services with `/_root` added to the end. Here, you can choose to either use form submission or upload a file. We will go through the process of using form submission here, as it will show you what your file would need to look like if you were using file upload. Choose form submission, search for "program" in the drop-down list and then fill in the "dbgap_accession_number" and "name" fields. As an example, you can use "123" as "dbgap accession number" and "Program1" as "name". Click 'Upload submission json from form' and then 'Submit'. If the message is green ("succeeded:200"), that indicates success, while a grey message indicates failure. More details can be viewed by clicking on the "DETAILS" button. If you don't see the green message, you can control the sheepdog logs for possible errors and check the Sheepdog database (`/datadictionary`), where programs and projects are stored. 
If you see your program in the data dictionary, neglect the fact that at this time the green message does not appear and continue to create a project. +To create a program, visit the URL where your Gen3 Commons is hosted and append `/_root`. If you are running the Docker Compose setup locally, then this will be `localhost/_root`. Otherwise, this will be whatever you set the `hostname` field to in the creds files for the services with `/_root` added to the end. Here, you can choose to either use form submission or upload a file. We will go through the process of using form submission here, as it will show you what your file would need to look like if you were using file upload. Choose form submission, search for "program" in the drop-down list and then fill in the "dbgap_accession_number" and "name" fields. As an example, you can use "123" as "dbgap accession number" and "Program1" as "name". Click 'Upload submission json from form' and then 'Submit'. If the message is green ("succeeded:200"), that indicates success, while a grey message indicates failure. More details can be viewed by clicking on the "DETAILS" button. If you don't see the green message, you can control the sheepdog logs for possible errors and check the Sheepdog database (`/datadictionary`), where programs and projects are stored. If you see your program in the data dictionary, neglect the fact that at this time the green message does not appear and continue to create a project. To create a project, visit the URL where your Gen3 Commons is hosted and append the name of the program you want to create the project under. For example, if you are running the Docker Compose setup locally and would like to create a project under the program "Program1", the URL you will visit will be `localhost/Program1`. You will see the same options to use form submission or upload a file. This time, search for "project" in the drop-down list and then fill in the fields. 
As an example, you can use "P1" as "code", "phs1" as "dbgap_accession_number", and "project1" as "name". If you use different entries, make a note of the dbgap_accession_number for later. Click 'Upload submission json from form' and then 'Submit'. Again, a green message indicates success while a grey message indicates failure, and more details can be viewed by clicking on the "DETAILS" button. You can control in the `/datadictionary` whether the program and project have been correctly stored. After that, you're ready to start submitting data for that project keeping in mind that you must submit from "top to bottom" in the data model to make sure each new node points to an existing node in the database. If metadata are submitted out of order, such as submitting a TSV with links to parent records that don't yet exist, the validator will reject the submission on the basis that the dependency is not present with the error message, "INVALID_LINK". -Alternatively,the [Gen3 Submission sdk](https://uc-cdis.github.io/gen3sdk-python/_build/html/_modules/gen3/submission.html) has a comprehensive set of tools to enable users to script submission of programs and projects. + +As an alternative to creating the program and project nodes in the Data Portal, you can instead use the [Gen3 Submission SDK](https://uc-cdis.github.io/gen3sdk-python/_build/html/_modules/gen3/submission.html), which has a comprehensive set of tools to enable users to script submission of programs and projects. 
Sample Code for submission of a Program and Project to a data commons: ``` diff --git a/gen3/docs/gen3-resources/operator-guide/submit-unstructured-data.md b/gen3/docs/gen3-resources/operator-guide/submit-unstructured-data.md index c92e7bd7..e7669a07 100644 --- a/gen3/docs/gen3-resources/operator-guide/submit-unstructured-data.md +++ b/gen3/docs/gen3-resources/operator-guide/submit-unstructured-data.md @@ -1,20 +1,19 @@ -# Submit Data and Control Access -* create program and project: https://gen3.org/resources/operator/#4-programs-and-projects -* submit data: https://gen3.org/resources/operator/#5-how-to-upload-and-control-file-access-via-authz -* control access to data with AuthZ: https://gen3.org/resources/operator/#5-how-to-upload-and-control-file-access-via-authz +# Unstructured Data (Data Files) +Unstructured data are simply data files that do not necessarily conform to any particular schema or format. Some data files may be consistently structured (e.g., .bam or .png), but Gen3 treats these simply as files and does not check whether they conform to a particular format. To make data available to end users, you must first upload the files and associate them with the appropriate node in the data dictionary. -## Unstructured Data (Data Files) -Unstructured data are simply data files that have do not necessarily conform to any particular schema or format. Some data files may be consistently structured (e.g .bam or .png), but Gen3 treats these simply as files and does not check whether they conform to a particular format or not. To make data available to end users you must first upload the files and associate with the appropriate node in the data dictionary. +## Standard Submission Process ### 1. Prepare Project in Submission Portal - -In order to upload data files, at least one record in the `core_metadata_collection` node must exist. If your project already has at least one record in this node, you can skip to step 2 below. 
+ +In order to upload data files you must at minimum have a `program`, `project`, and at least one record in the `core_metadata_collection` node or other data containing node. To review how to submit the program and project nodes see [here](/gen3-resources/operator-guide/submit-structured-data/#the-order-of-node-submission-is-important). + +This documentation will utilize the core_metadata_collection node but other nodes can be used depending on your unique data model. If your project already has at least one record in a node of this type, you can skip to [step 2](#2-upload-data-files-to-object-storage). + +If your project already has at least one `core_metadata_collection` record you can skip to step 2 below. Do the following to create your first `core_metadata_collection` record: @@ -45,12 +44,12 @@ If you received any other message, then check the 'Details' to help determine th To view the records in the `core_metadata_collection` node in your project, you can go to: https://gen3.datacommons.io/example-training/search?node_type=core_metadata_collection -(replacing the first part of that URL with the URL of your actual project). +(replacing the `gen3.datacommons.io` with your commons base URL and `example-training` with an actual project name). ### 2. Upload Data Files to Object Storage -Adding files to your new Gen3 project can be done using one of two methods. The gen3-client tool (shown in steps below) offers users an easy way to upload files to Amazon s3 buckets while simultaneously indexing the files and assigning them each a unique GUID or object_id. Alternatively, if you are comfortable scripting and require something other than the default AWS bucket used by the gen3-client or your data files are already uploaded to their storage location in the cloud, we offer a [command line based file submission workflow](/resources/user/cli-submission). 
This method offers several other benefits including the possibility of using multiple cloud resources and submitting multiple batches of data set files at once. +Adding files to your new Gen3 project can be done using one of two methods. The gen3-client tool (shown in steps below) offers users an easy way to upload files to Amazon s3 buckets while simultaneously indexing the files and assigning them each a unique GUID or object_id. Alternatively, if you are comfortable scripting and require something other than the default AWS bucket used by the gen3-client or your data files are already uploaded to their storage location in the cloud, we offer an option for [indexing data already found in the cloud](#indexing-files-already-found-in-the-cloud). This method offers several other benefits including the possibility of using multiple cloud resources and submitting multiple batches of data set files at once. The following documentation will focus on using the [gen3-client](/resources/user/gen3-client) to upload data files, including spreadsheets, sequencing data (BAM, FASTQ), assay results, images, PDFs, etc., to Amazon s3 cloud storage. 1. Download the latest [compiled binary](https://github.com/uc-cdis/cdis-data-client/releases/latest) for your operating system. @@ -106,174 +105,635 @@ Once data files are successfully uploaded, the files must be mapped to the appro You should receive the message "# files mapped successfully!" upon success. -## Structured Data (Clinical or experimental data) -Data is structured if it is organized into records and fields, with each record consisting of one or more data elements (data fields). In biomedical data, data fields are often restricted to controlled vocabularies to make querying them easier. In Gen3 this would include clinical or experimental data submitted to the graph model, which is queriable via a GraphQL API. It can be flattened (via ETL) and the result viewable on the Data Portal Exploration Page. 
-After creating a data dictionary you are ready to submit structured data. These data are submitted in [tab-separated value (TSV)](https://gen3.org/resources/user/template-tsvs/) files for each node in the project, which can be downloaded from the "Dictionary" page of the data commons website. -It may be helpful to think of each TSV as a node in the graph of the data model. Column headers in the TSV are the properties stored in that node, and each row represents a "record" or "entity" in that node. When a TSV is successfully submitted, each row in that TSV becomes a single record in the node. -Properties in a node are either required or not, and this can be determined by referencing the data dictionary's viewer's "Required" column for a specific node. -There are a number of properties that deserve special mention: +* ** + +## Indexing files already found in the cloud + +### 1. Prepare Project with the Gen3 sdk tools + +A Gen3 project must be present in the [Sheepdog microservice](resources/developer/microservice/) before data files can be associated with it and indexed, though creating it does not strictly have to be the first step. To achieve this, the [Gen3 Submission sdk](https://uc-cdis.github.io/gen3sdk-python/_build/html/_modules/gen3/submission.html) has a comprehensive set of tools to enable users to script submission of programs and projects. Alternatively, the [GUI submission platform](/resources/user/submit-data#1) can be used to create a project. + +Sample Code for submission of a Program and Project to a data commons: +``` +from gen3.auth import Gen3Auth +from gen3.submission import Gen3Submission + +# Assumes credentials.json is an API key file downloaded from the commons +auth = Gen3Auth(refresh_file="credentials.json") +sub = Gen3Submission(auth_provider=auth) +sub.create_program(program_json) +sub.create_project('test_program', project_json) +``` -* `submitter_id`: Each record in every node will have a `submitter_id`, which is a unique alphanumeric identifier (any combination of ASCII characters) for that record across the whole project and is specified by the data submitter. 
It is entirely up to the data contributor what the submitter_id will be for each record in a project, but the string chosen must be unique within that project. -* `type`: Every node has a `type` property, which is simply the name of the node. By providing the node name in the "type" property, the submission portal knows which node to put the data in. +### 2. Selection and Granting Gen3 Secure Access to Cloud Resources -* `id`: Every record in every node in a data commons has the unique property `id`, which is not submitted by the data contributor but rather generated on the backend. The value of the property `id` is a 128-bit UUID (a unique 32 character identifier). -* `project_id` and `code`: Every project record in a data commons is linked to a parent `program` node and has the properties `project_id` and a `code`. The property `project_id` is the dash-separated combination of `program` and the project's `code`. For example, if your project was named 'Experiment1', and this project was part of the 'Pilot' program, the project's `project_id` would be 'Pilot-Experiment1', and the project's `code` would be 'Experiment1'. Finally, just like every record in the data commons, the project has the unique property `id`, which is not to be confused with the project's `project_id`. +As Gen3 is considered "cloud agnostic", any or even multiple cloud resources can be configured to contain data for controlled end-user access. If your data is already located in the cloud, please see the following [section](#3-upload-files-to-object-storage-with-cloud-resource-command-line-interface) for considerations in the structure and permissions settings. + +End-user access to cloud resources is enabled by signed-urls with authorization checks within Gen3 to ensure valid and secure access. 
Policies within the respective cloud resources should be configured in the Gen3 Fence Microservice to allow the [Gen3 Auth Service Bot - AWS](https://github.com/uc-cdis/fence/blob/master/fence/config-default.yaml#L656), [Gen3 Auth Service Bot - Azure](https://github.com/uc-cdis/fence/blob/master/docs/azure_architecture.md), or [Gen3 Auth Service Bot - Google](https://github.com/uc-cdis/fence/blob/master/docs/google_architecture.md) to access the data on behalf of the end user. + +##### AWS S3 example bucket policy for READ access: +``` +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "AllowListLocation", + "Effect": "Allow", + "Principal": { + "AWS": [ + "arn:aws:iam::895962626746:user/fence_bot" + ] + }, + "Action": [ + "s3:GetBucketLocation", + "s3:ListBucket" + ], + "Resource": "arn:aws:s3:::" + }, + { + "Sid": "AllowGetObject", + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::895962626746:user/fence_bot" + }, + "Action": "s3:GetObject", + "Resource": "arn:aws:s3:::/*" + } + ] +} +``` -Template TSVs are provided in each node's page in the data dictionary. +The location for the example AWS configuration posted above is available [here](https://github.com/uc-cdis/fence/blob/master/fence/config-default.yaml#L656). -![Template](/gen3-resources/operator-guide/img/Gen3_Dictionary_Subject_template_2020.png) -### Determine Submission Order via Node Links +### 3. Upload files to Object Storage with Cloud Resource Command Line Interface -The prepared TSV files must be submitted in a specific order due to node links. Referring back to the graphical data model, a record cannot be submitted without first submitting the record(s) to which it is linked upstream (its "parent"). If metadata are submitted out of order, such as submitting a TSV with links to parent records that don't yet exist, the validator will reject the submission on the basis that the dependency is not present with the error message, "INVALID_LINK". 
+Data can be uploaded to a single cloud resource or to several, as long as requirements for access and authorization are met. In order to support the many advantages of using Gen3’s standard tooling for CLI-DFS, data needs to first be organized and copied to cloud buckets following the guidelines detailed below. -The `program` and `project` nodes are the most upstream nodes and are created by a commons administrator. The first node submitted by data contributors after `core_metadata_collection` depends on the specific data dictionary employed by the data commons but is usually the `study` or `experiment` node, which points directly upstream to the `project` node. +#### Allocating Data in Buckets Based on User Access -Often the study participants are recorded in the `case` or `subject` node, and subsequently any clinical information (demographics, diagnoses, etc.), biospecimen data (biopsy samples, extracted analytes), or other experimental methods/details are linked to each case. +Gen3 can grant access only at bucket-level granularity. The data in a particular bucket should therefore be associated with a single access designation. -### More about Specifying Required Links +##### Bucket Allocation Example: +A user’s authorization may look something like: +A user has read access to phs001416.c1, phs001416.c2, phs000974.c2 -At least one link is required for every record in a TSV, and sometimes multiple links could be specified. The links are specified in a TSV with the variable header `.submitter_id`, where the back-reference of the upstream node record is linking to. The value of this link variable is the specific `submitter_id` of the parent record. TSV or JSON templates that list all the possible link headers can be downloaded from the Data Dictionary Viewer on the data commons' website. 
Properties that represent these links such as “subjects.submitter_id” or “studies.submitter_id” are array variables and can take either a single submitter_id or a comma-separated list of `submitter_id`s in the case that a single record links to multiple records in its parent node. +The data in buckets could be separated by phsid+consent code combinations (as this is the smallest granularity of data access required). -For example, there are four cases in two studies in one `project`. The `study` node was made with two study `submitter_id`s: "study-01" and "study-02". The "case.tsv" file uploaded to describe the study participants enrolled will have a corresponding study. +The following bucket structure supports the ingestion of dbGaP’s MESA and FHS projects (from TOPMed). Each project has 2 distinct consent groups, and the data is mirrored on both AWS and Google buckets. +TOPMed-MESA (phs001416.c1 and .c2) +TOPMed-FHS (phs000974.c1 and .c2) -| case | submitter_id | studies.submitter_id | +| Project | AWS buckets | Google buckets | | --- | --- | --- | -| 1 | case_1 | study-01 | -| 2 | case_2 | study-02 | -| 3 | case_3 | study-01 | -| 4 | case_4 | study-01 | +| MESA (consent group 1) | s3://nih-nhlbi-topmed-released-phs001416-c1 | gs://nih-nhlbi-topmed-released-phs001416-c1 | +| MESA (consent group 2) | s3://nih-nhlbi-topmed-released-phs001416-c2 | gs://nih-nhlbi-topmed-released-phs001416-c2 | +| FHS (consent group 1) | s3://nih-nhlbi-topmed-released-phs000974-c1 | gs://nih-nhlbi-topmed-released-phs000974-c1 | +| FHS (consent group 2) | s3://nih-nhlbi-topmed-released-phs000974-c2 | gs://nih-nhlbi-topmed-released-phs000974-c2 | + +
+ +With a setup similar to this, Gen3 is able to support signed URLs and fully configured end-user access. + +#### Bucket Population + +Once a data allocation scheme is determined, data can be uploaded accordingly to cloud buckets. It should be noted that while Amazon AWS and Google are the most supported cloud providers, Gen3 is cloud agnostic. Any method and hierarchy structure can be used for upload as long as the same parent directory is maintained with end user access. + +Regardless of the cloud platform, the CLI-DFS workflow requires file data gathered from its cloud location. Information such as file name, location, size, and md5sum is usually available from cloud platforms. Documentation for [AWS](https://aws.amazon.com/cli/), [Google](https://cloud.google.com/storage/docs/gsutil) and [Microsoft Azure](https://learn.microsoft.com/en-us/cli/azure/) should provide guidance on acquiring this information. + + >Note: The recommended (and simplest) way for Gen3 to provide controlled access to data, detailed here, is via Signed URLs. Signed URLs are the only fully cloud-agnostic method supported by Gen3 and additionally are supported by all major cloud resource providers. They also allow for short-lived, temporary access to the data for reduced risk. Lastly, utilizing signed URLs places very few restrictions on the organization of data within cloud bucket(s). + + +The files relevant to a Gen3 CLI-DFS Workflow submission: + +- Bucket mapping file: File that maps authorization designations to parent level bucket locations. + +- Bucket manifest file: Created for each submission and contains file level information (i.e. name, size, md5sum) + +- Indexing manifest: Created for each submission and submits both authorization and file level information into the [Indexd microservice](https://github.com/uc-cdis/indexd). + +The creation and submission of these files are covered below. 
+ +### 4. Create Bucket Mapping and Manifest Files -In this example cases 1, 2, and 4 all belong to "study-01", but case 2 belongs to "study-02". All the cases have different `submitter_id`s and these will be used in the subtending node that refers to a specific case. -> __NOTE:__ The `submitter_id` needs to be unique not only within one node, but across all nodes in a project. The combination of `submitter_id` and `project_id` must be unique. +The below is the Gen3 recommended indexing file schema. While other configurations are possible, they likely require significantly more administrative effort to maintain correct permissions in the cloud platform(s). -### Specifying Multiple Links +#### Bucket Mapping File +A bucket mapping file is used to maintain clear links between designated project authorization and parent level bucket locations. It should be maintained for the entire commons and appended when new datasets are ingested. The bucket mapping file should minimally contain the following fields and be presented in a tab delimited format. -Links can be one-to-one, many-to-one, one-to-many, and many-to-many. Since a single study participant can be enrolled in multiple studies, and a single study will have multiple cases enrolled in it, this link is "many-to-many". On the other hand, since a single study cannot be linked to multiple projects, but a single project can have many studies linked to it, the study -> project link is "many-to-one". 
-Properties that represent links, like “subjects.submitter_id” or “studies.submitter_id” are array variables and can take either a single submitter_id or a comma-separated list of submitter_ids in the case that a single record links to multiple records in its parent node. Using the example above, the entry in the `studies.submitter_id` can be "study-01, study-02". +The below example has 4 different authorizations for 8 bucket locations +##### Example Bucket Mapping File -#### Deprecated version +| bucket | authz | +| --- | --- | +| s3://nih-nhlbi-topmed-phs001416-c1 | phs001416.c1 | +| gs://nih-nhlbi-topmed-phs001416-c1 | phs001416.c1 | +| s3://nih-nhlbi-topmed-phs001416-c2 | phs001416.c2 | +| gs://nih-nhlbi-topmed-phs001416-c2 | phs001416.c2 | +| s3://nih-nhlbi-topmed-phs000974-c1 | phs000974.c1 | +| gs://nih-nhlbi-topmed-phs000974-c1 | phs000974.c1 | +| s3://nih-nhlbi-topmed-phs000974-c2 | phs000974.c2 | +| gs://nih-nhlbi-topmed-phs000974-c2 | phs000974.c2 | -In the above example, if "case_2" was enrolled in both "study-01" and "study-02", then there would be two columns to specify these links in the case.tsv file: "studies.submitter_id#1" and "studies.submitter_id#2". The values would be "study-01" for one of them and "study-02" for the other. +
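As a rough illustration of how the mapping file can be used, the sketch below loads a tab-delimited bucket mapping and looks up the authorization for a file URL by its parent bucket. The helpers `load_bucket_mapping` and `authz_for_url` are hypothetical names written for this example and are not part of any Gen3 tool; the `bucket` and `authz` column names are taken from the example table.

```python
import csv
import io

# Two rows copied from the example table; a real file would list every bucket.
EXAMPLE_MAPPING = (
    "bucket\tauthz\n"
    "s3://nih-nhlbi-topmed-phs001416-c1\tphs001416.c1\n"
    "gs://nih-nhlbi-topmed-phs001416-c2\tphs001416.c2\n"
)


def load_bucket_mapping(handle):
    """Read a tab-delimited mapping file into a {bucket: authz} dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["bucket"].rstrip("/"): row["authz"] for row in reader}


def authz_for_url(url, mapping):
    """Return the authz of the bucket that contains the given file URL."""
    for bucket, authz in mapping.items():
        # Require a "/" after the bucket name so "-c1" never matches "-c10".
        if url == bucket or url.startswith(bucket + "/"):
            return authz
    raise KeyError(f"no bucket mapping covers {url}")


mapping = load_bucket_mapping(io.StringIO(EXAMPLE_MAPPING))
print(authz_for_url("s3://nih-nhlbi-topmed-phs001416-c1/examplefile.txt", mapping))
```

Because each bucket carries exactly one authorization, a lookup by bucket prefix is enough to determine who may access any file stored beneath it.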
+In situations where Gen3 must support cloud-specific data access methods, Gen3 also requires the authz or acl column, which contains the granular access control (here representing access to the entire bucket). -| case | submitter_id | studies.submitter_id#1 | studies.submitter_id#2 | +The authz column coordinates with the user permissions set in the Gen3 microservices [Fence](https://github.com/uc-cdis/fence) and [Arborist](https://github.com/uc-cdis/arborist). + +#### Bucket Manifest File +The bucket manifest file should contain individual file level metadata for a single batch of ingestion files. This means there will be several bucket manifest files per data commons. It is recommended that they are represented in a tab-separated values (TSV) format, with each row minimally containing the following information: + +- File Name +- File Size +- File hash via md5sum +- Exact file url in the bucket location + +If files are mirrored between cloud locations, bucket urls can be appended together with a whitespace delimiter. 
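A minimal sketch of how one such row could be assembled for a file that is still available locally, using only the Python standard library. The helper names `md5sum` and `manifest_row`, the column headers, and the `example-bucket-c1` bucket names are assumptions made for illustration; for data that only exists in the cloud, the provider's own tooling would supply the size and checksum instead.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_row(path, bucket_prefixes):
    """Build [name, size, md5, urls] with mirrored URLs joined by a space."""
    name = os.path.basename(path)
    urls = [f"{prefix.rstrip('/')}/{name}" for prefix in bucket_prefixes]
    return [name, str(os.path.getsize(path)), md5sum(path), " ".join(urls)]


# Demonstration with a throwaway local file standing in for a data file.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "examplefile.txt")
    with open(path, "wb") as handle:
        handle.write(b"hello")
    row = manifest_row(path, ["s3://example-bucket-c1", "gs://example-bucket-c1"])

# Emit a header line and the row as tab-separated text.
print("\t".join(["File_name", "File_size", "md5sum", "bucket_urls"]))
print("\t".join(row))
```

Joining mirrored locations with a single space keeps them in one TSV field, which matches the whitespace-delimited convention described above.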
+ +In the below example of a bucket manifest file, please note the mirrored file bucket locations in S3 and GCP: + +##### Example Bucket Manifest File + +| File_name | File_size | md5sum | bucket_urls | | --- | --- | --- | --- | -| 1 | case_1 | study-01 | | -| 2 | case_2 | study-01 | study-02 | -| 3 | case_3 | study-01 | | -| 4 | case_4 | study-01 | | +| examplefile.txt | 123456 | sample_md5 | s3://nih-phs001416-c1/exfile.txt gs://nih-phs001416-c1/exfile.txt | +| otherexamplefile.txt | 123456 | different_md5 | s3://nih-nhlbi-topmed-released-phs001416-c1/otherexamplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c1/otherexamplefile.txt | +| examplefile.txt | 123456 | sample_md5 | s3://nih-nhlbi-topmed-released-phs001416-c2/examplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c2/examplefile.txt | +| otherexamplefile.txt | 123456 | different_md5 | s3://nih-nhlbi-topmed-released-phs001416-c2/otherexamplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c2/otherexamplefile.txt | + + +### 5. Create Indexing Manifest File -### Begin Metadata TSV Submissions +An Indexing Manifest File is submitted to the [Indexd microservice](https://github.com/uc-cdis/indexd) and is a combination of both the Bucket Mapping and Manifest file information. -To submit a TSV in the data portal: +While the two preceding files are not strictly necessary for maintenance and operation of a Gen3 data commons, they are recommended for ease of maintenance. For instance, if multiple authorization designations are required within a single bucket location, administrators will need to set them individually directly in the cloud platform as Gen3 has no capability to interact with cloud resource permissions in that manner. -1. Login to the data portal for the commons. +#### Indexd Microservice Overview -2. Click on "Submit Data" in the top navigation bar. 
+The [Indexd microservice](https://github.com/uc-cdis/indexd) is used by Gen3 to maintain an index of all files in a data commons and serves as the data source that several other microservices use to build various features of a Gen3 data commons. Central to Indexd is the assignment of a [Globally Unique Identifier (GUID)](https://dataguids.org/#) to each element added to the microservice. - ![Data Submission](/gen3-resources/operator-guide/img/Gen3_Toolbar_data_submission.png) +#### Globally Unique Identifier (GUID) +GUIDs are primarily used to track and provide the current location of data and are designed to persist even as data is moved or copied. Information regarding the concept of GUIDs, GUID generation and look up of particular GUIDs can be found at [dataguids.org](https://dataguids.org/#). -3. Click on "Submit Data" beside the project of interest to submit metadata. +#### Indexing Manifest Components and Structure +By default, GUIDs will be added to rows that lack an entry for that field when an indexing manifest is submitted to Indexd. GUIDs minted in this way are available both by querying Indexd and by referencing the submission output file that is generated. - ![Submit Data](/gen3-resources/operator-guide/img/Gen3_Data_Submission_submit_data.png) +As the Indexing Manifest is the file that is submitted to the [Indexd microservice](https://github.com/uc-cdis/indexd), it must be submitted as a tab-separated values file (.tsv) and contain the following fields: -4. Click on "Upload File". 
+- Globally Unique Identifier (GUID) - either generated by the Indexd microservice at the time of submission or provided by the user prior to submission
+- File Name
+- File Size
+- File hash via md5sum
+- Exact file URL in the bucket location
+- authz or acl authorization designation
- ![Upload and Submit](/gen3-resources/operator-guide/img/Gen3_Data_Submission_Use_Form.png)
+Users may notice that, with the exception of GUIDs, this file is a combination of the Bucket Mapping and Manifest files. If AWS or Google Cloud resources are used, Gen3 offers tools to produce bucket manifest files, documented at the following links:
-5. Navigate to the TSV and click "open". The contents of the TSV should appear in the grey box
-below.
+- [AWS S3 Bucket Manifest Generation](https://github.com/uc-cdis/cloud-automation/blob/master/doc/bucket-manifest.md)
+- [Google Bucket Manifest Generation](https://github.com/uc-cdis/cloud-automation/blob/master/doc/gcp-bucket-manifest.md)
-6. Click "Submit".
+ >Note: Bucket manifest generation scripts require using Gen3's full deployment code and, depending on the amount of data, calculating checksums for files can be costly and take time.
-A message should appear that indicates either success (green, "succeeded: 200") or failure (grey, "failed: 400"). Further details can be reviewed by clicking on "DETAILS", which displays the API response in JSON form. Each record/entity that was submitted, gets a true/false value for "valid" and lists "errors" if it was not valid.
+Below is an example of an Indexing Manifest File:
-For anything other than success, check the other fields for any information on the error with the submission. The most descriptive information will be found in the individual entity transaction logs.
Each line in the TSV will have its own output with the following attributes: +##### Example Indexing Manifest File + +| guid | File_name | File_size | md5sum | bucket_urls | auth | +| --- | --- | --- | --- | --- | --- | +| dg.4503/02... ...7103bbe | examplefile.txt | 34141 | c79... ...dbd | s3://nih-phs001416-c1/exfile.txt gs://nih-phs001416-c1/exfile.txt | [phs001416.c1] | +| dg.4503/00... ...0211dfg | otherexamplefile.txt | 562256 | 65a... ...bca | s3://nih-nhlbi-topmed-released-phs001416-c1/otherexamplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c1/otherexamplefile.txt | [phs001416.c1] | +| dg.4503/00... ...7103bbe | examplefile.txt | 36564 | dca... ...774 | s3://nih-nhlbi-topmed-released-phs001416-c2/examplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c2/examplefile.txt | [phs001416.c2] | +| dg.4503/01... ...0410nnd | otherexamplefile.txt | 2675 | 742... ...f1b | s3://nih-nhlbi-topmed-released-phs001416-c2/otherexamplefile.txt gs://nih-nhlbi-topmed-released-phs001416-c2/otherexamplefile.txt | [phs001416.c2] | + + +### 6. Submit file Indexing Manifest to Indexd + + +Once created, Gen3 offers an [Indexing sdk toolkit](https://uc-cdis.github.io/gen3sdk-python/_build/html/tools/indexing.html) to build, validate and map all files into a Gen3 datacommons. The sdk functions reconcile and add data to the indexd microservice. + +Sample code for validation and submission of a constructed indexing manifest file to indexd. 
+```
+from gen3.auth import Gen3Auth
+from gen3.tools.indexing.index_manifest import index_object_manifest
+from gen3.tools.indexing.validate_manifest_format import is_valid_manifest_format
+
+# Placeholder values: replace with your own commons URL, manifest file, and credentials.
+commons_url = "https://gen3.datacommons.io"
+file_path = "indexing_manifest.tsv"
+# One common way to build the authentication object; see the authentication SDK docs.
+authentication_object = Gen3Auth(refresh_file="credentials.json")
+
+# Check the manifest format before submitting anything to Indexd.
+if not is_valid_manifest_format(file_path):
+    raise ValueError(f"{file_path} is not a valid indexing manifest")
+
+# Submit the manifest; the output file name is derived from the manifest file name.
+index_object_manifest(
+    commons_url=commons_url,
+    manifest_file=file_path,
+    thread_num=8,
+    auth=authentication_object,
+    output_filename=file_path[:-4] + "_output.tsv",
+)
+```
+
+*Please refer to the [authentication SDK](https://uc-cdis.github.io/gen3sdk-python/_build/html/auth.html) for setup of the authentication_object used above.*
+
+*Note: Users in the Gen3 community have published [repos](https://github.com/jacquayj/gen3-s3indexer-extramural) that index large pre-existing S3 buckets (disclaimer: CTDS is not responsible for the content and opinions on the third-party repos).*
+
+
+### 7. Map files to a Data Node with the Gen3 SDK
+
+
+Once indexing is complete, Gen3 offers a [Submission SDK toolkit](https://uc-cdis.github.io/gen3sdk-python/_build/html/submission.html) to map indexed data files to nodes designated to contain data in the [data dictionary](/resources/user/dictionary/#what-is-a-data-dictionary-and-data-model) via the [Sheepdog microservice](https://github.com/uc-cdis/sheepdog). Unless single data files are being ingested, the SDK submission toolkit generally requires a tab-separated values (TSV) file, and the specific node requirements for each data file type are specified in the data dictionary. After mapping in Sheepdog is complete, the file metadata will be mapped from the [program and project](/resources/user/cli-submission#1-prepare-project-sdk) administrative nodes (previously created) to its respective data-containing nodes.
The mapping in Sheepdog is the basis for other search and query services, either natively in Sheepdog or after extraction, transformation, and load [(ETL)](/resources/operator/#8-etl-and-data-explorer-configurations) services have been performed.
+
+
+To continue your data submission, return to the main [Gen3 - Data Contribution](/resources/user/submit-data/#4-submit-additional-project-metadata) page.
+
+
+
+* **
+
+## Gen3 client instructions for uploading data
+
+The gen3-client provides an easy-to-use, command-line interface for uploading and downloading data files to and from a Gen3 data commons from the terminal or command prompt. Only information related to uploading is included below. For download instructions, please review the [Downloading Files Using the Gen3-client section](/gen3-resources/user-guide/access-data/#download-files-using-the-gen3-client).
+
+
+
+
+
+### 1. Installation Instructions
+
+Installation instructions are covered in the [Downloading Files Using the Gen3-client section](/gen3-resources/user-guide/access-data/#download-files-using-the-gen3-client).
+
+
+### 2. Configure a Profile with Credentials
+
+Profile configuration instructions are covered in the [Downloading Files Using the Gen3-client section](/gen3-resources/user-guide/access-data/#download-files-using-the-gen3-client).
+
+### 3. Upload Data Files using the Gen3 Client
+
+These instructions cover only the uploading capabilities of the gen3-client. Please refer to the [Download Files Using the Gen3 Client section](/gen3-resources/user-guide/access-data/#download-files-using-the-gen3-client) for instructions on downloading.
+
+For the typical data contributor, the `gen3-client upload` command should be used to upload data files to a Gen3 Data Commons.
The commands `upload-single` and `upload-multiple` are used only in special cases, for example, when a file or collection of files is uploaded to specific GUIDs *after* generating structured data records for the files. These two commands are described in further detail below, in the sections on uploading a single data file using a GUID and uploading multiple data files using a manifest.
+
+When data files are uploaded to a Gen3 data commons' object storage, they are assigned a unique, 128-bit ID called a ['GUID'](https://dataguids.org/), which stands for "globally unique identifier". GUIDs are generated by the system software, not provided by users, and they are stored in the property `object_id` of a data_file's structured data.
+
+When using the `gen3-client upload` command, a random, unique GUID will be generated and assigned to each data file that has been submitted, and an entry in the indexd database will be created for that file, which associates the storage location of the file with the file's object_id ("did" in the indexd record; see below for more details).
+
+#### Options and User Input Flags
+
+The following flags can be used with the `gen3-client upload` command:
+
+| Flag name | Required? | Default value | Explanation | Sample usage |
+| --- | --- | --- | --- | --- |
+| profile | Yes | N/A | The profile name that the user wishes to use from the config file. | `--profile=demo` |
+| upload-path | Yes | N/A | The directory or file containing the file(s) to be uploaded. | `--upload-path=../data_folder/` |
+| batch | No | false | If set to `true`, gen3-client will upload multiple files simultaneously. The maximum number of files that can be uploaded at the same time is specified by the `numparallel` option. | `--batch=true` |
+| numparallel | No | 3 | Number of uploads to run in parallel. Must be used together with the `batch` option. | `--numparallel=5` |
+| include-subdirname | No | false | Include subdirectory names in the file name. | `--include-subdirname=true` |
+| force-multipart | No | false | Force the use of multipart upload if possible. | `--force-multipart=true` |
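
The flags above can also be assembled programmatically, for example when uploads are launched from a script. The following is a minimal sketch, not part of the gen3-client itself: the helper name `build_upload_command` is hypothetical, and it only builds the argument list from the flags documented in the table.

```python
import shlex


def build_upload_command(profile, upload_path, batch=False, numparallel=3,
                         include_subdirname=False, force_multipart=False):
    """Build a `gen3-client upload` argument list (hypothetical helper)."""
    if not profile or not upload_path:
        raise ValueError("profile and upload_path are required flags")
    args = [
        "gen3-client",
        "upload",
        f"--profile={profile}",
        f"--upload-path={upload_path}",
    ]
    if batch:
        # numparallel only applies together with batch, so emit them as a pair.
        args.append("--batch=true")
        args.append(f"--numparallel={numparallel}")
    if include_subdirname:
        args.append("--include-subdirname=true")
    if force_multipart:
        args.append("--force-multipart=true")
    return args


command = build_upload_command("demo", "../data_folder/", batch=True, numparallel=5)
print(shlex.join(command))
```

To actually run the upload, the list could be passed to `subprocess.run(command, check=True)`, assuming the gen3-client is installed and the `demo` profile has been configured.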
+ + +Example of a single file upload: +``` +~> gen3-client upload --profile=demo --upload-path=test.txt +2019/11/19 12:45:41 Finish parsing all file paths for "/Users/demo/Documents/test.txt" + +The following file(s) has been found in path "/Users/demo/Documents/test.txt" and will be uploaded: + /Users/demo/Documents/test.txt + +2019/11/19 12:45:41 Uploading data ... +test.txt 25 B / 25 B [=======================================================================================================================================] 100.00% 0s +2019/11/19 12:45:41 Successfully uploaded file "/Users/demo/Documents/test.txt" to GUID 1a82043e-02ec-4974-a803-7c0fd33ecfd7. +2019/11/19 12:45:41 Local succeeded log file updated + + +Submission Results +Finished with 0 retries | 1 +Finished with 1 retry | 0 +Finished with 2 retries | 0 +Finished with 3 retries | 0 +Finished with 4 retries | 0 +Finished with 5 retries | 0 +Failed | 0 +TOTAL | 1 +``` + +Example of uploading all files within an folder: +``` +~/Documents> gen3-client upload --profile=demo --upload-path=test_dir +2019/11/19 13:12:47 Finish parsing all file paths for "/Users/demo/Documents/test_dir" + +The following file(s) has been found in path "/Users/demo/Documents/test_dir" and will be uploaded: + /Users/demo/Documents/test_dir/test.doc + /Users/demo/Documents/test_dir/test.jpg + /Users/demo/Documents/test_dir/test_1.txt + /Users/demo/Documents/test_dir/test_2.txt + +2019/11/19 13:12:48 Uploading data ... +test.doc 46 B / 46 B [=================================================================================================================================================================] 100.00% 0s +2019/11/19 13:12:48 Successfully uploaded file "/Users/demo/Documents/test_dir/test.doc" to GUID 7d1b41d9-002e-46d0-8934-6606d246ca30. +2019/11/19 13:12:48 Local succeeded log file updated +2019/11/19 13:12:48 Uploading data ... 
+test.jpg 50 B / 50 B [=================================================================================================================================================================] 100.00% 0s +2019/11/19 13:12:48 Successfully uploaded file "/Users/demo/Documents/test_dir/test.jpg" to GUID 59059e8d-29bf-4f8b-b9a4-2cd0ef2420f6. +2019/11/19 13:12:48 Local succeeded log file updated +2019/11/19 13:12:48 Uploading data ... +test_1.txt 30 B / 30 B [===============================================================================================================================================================] 100.00% 0s +2019/11/19 13:12:48 Successfully uploaded file "/Users/demo/Documents/test_dir/test_1.txt" to GUID 6f6686f1-45f2-4e8d-a997-a669b9419fd3. +2019/11/19 13:12:48 Local succeeded log file updated +2019/11/19 13:12:48 Uploading data ... +test_2.txt 27 B / 27 B [===============================================================================================================================================================] 100.00% 0s +2019/11/19 13:12:49 Successfully uploaded file "/Users/demo/Documents/test_dir/test_2.txt" to GUID d8ec2f5a-0990-495f-8192-ca2f037d6236. +2019/11/19 13:12:49 Local succeeded log file updated + + +Submission Results +Finished with 0 retries | 4 +Finished with 1 retry | 0 +Finished with 2 retries | 0 +Finished with 3 retries | 0 +Finished with 4 retries | 0 +Finished with 5 retries | 0 +Failed | 0 +TOTAL | 4 +``` + +Example of upload using wildcard. 
Here we specify `*txt` in the `--upload-path` to get only files with a "txt" extension in the "test_dir" directory: +``` +~/Documents> gen3-client upload --profile=demo --upload-path=test_dir/*txt +2019/11/19 15:49:07 Created folder "/Users/demo/.gen3/logs/" +2019/11/19 15:49:07 Finish parsing all file paths for "/Users/demo/Documents/test_dir/*txt" + +The following file(s) has been found in path "/Users/demo/Documents/test_dir/*txt" and will be uploaded: + /Users/demo/Documents/test_dir/test_1.txt + /Users/demo/Documents/test_dir/test_2.txt + +2019/11/19 15:49:07 Uploading data ... +test_1.txt 30 B / 30 B [===============================================================================================================================================================] 100.00% 0s +2019/11/19 15:49:07 Successfully uploaded file "/Users/demo/Documents/test_dir/test_1.txt" to GUID 956890a9-b8a7-4abd-b8f7-dd0020aaf562. +2019/11/19 15:49:07 Local succeeded log file updated +2019/11/19 15:49:07 Uploading data ... +test_2.txt 27 B / 27 B [===============================================================================================================================================================] 100.00% 0s +2019/11/19 15:49:07 Successfully uploaded file "/Users/demo/Documents/test_dir/test_2.txt" to GUID 6cf194f1-c68e-4976-8ca4-a0ce9701a9f3. +2019/11/19 15:49:07 Local succeeded log file updated + + +Submission Results +Finished with 0 retries | 2 +Finished with 1 retry | 0 +Finished with 2 retries | 0 +Finished with 3 retries | 0 +Finished with 4 retries | 0 +Finished with 5 retries | 0 +Failed | 0 +TOTAL | 2 + +``` + +Example using two wildcards in one path. 
Here we add `test_*/` to the `--upload-path` to upload files in more than one directory, and then we add `*.jpg` to add only the files from those directories with a ".jpg" extension: +``` +~/Documents> gen3-client upload --profile=demo --upload-path=./test_*/*.jpg +2019/11/19 15:53:12 Finish parsing all file paths for "/Users/demo/Documents/test_*/*.jpg" + +The following file(s) has been found in path "/Users/demo/Documents/test_*/*.jpg" and will be uploaded: + /Users/demo/Documents/test_dir/test.jpg + /Users/demo/Documents/test_dir_2/test_2.jpg + +2019/11/19 15:53:12 Uploading data ... +test.jpg 50 B / 50 B [=================================================================================================================================================================] 100.00% 0s +2019/11/19 15:53:13 Successfully uploaded file "/Users/demo/Documents/test_dir/test.jpg" to GUID 9bd009b6-e518-4fe5-9056-2b5cba163ca3. +2019/11/19 15:53:13 Local succeeded log file updated +2019/11/19 15:53:13 Uploading data ... +test_2.jpg 50 B / 50 B [===============================================================================================================================================================] 100.00% 0s +2019/11/19 15:53:13 Successfully uploaded file "/Users/demo/Documents/test_dir_2/test_2.jpg" to GUID 3d275025-8b7b-4f84-9165-72a8a174d642. +2019/11/19 15:53:13 Local succeeded log file updated + + +Submission Results +Finished with 0 retries | 2 +Finished with 1 retry | 0 +Finished with 2 retries | 0 +Finished with 3 retries | 0 +Finished with 4 retries | 0 +Finished with 5 retries | 0 +Failed | 0 +TOTAL | 2 +``` + +#### Local Submission History + +The application will keep track of which local files have already been submitted to avoid potential duplication in submissions. 
This information is kept in a .JSON file in the "logs" directory under the same user folder as where the `config` file lives, for example: + +``` +Mac/Linux: /Users/demo/.gen3/logs/_succeeded_log.json +Windows: C:\Users\demo\.gen3\logs\_succeeded_log.json +``` + +Each object in the succeeded log file is a key/value pair of the full path of a file and the GUID it is associated with. + +Example of a succeeded log JSON File: ``` -JSON { - "action": "update/create", - "errors": [ - { - "keys": [ - "species (the property name)" - ], - "message": "'Homo sapien' is not one of ['Drosophila melanogaster', 'Homo sapiens', 'Mus musculus', 'Mustela putorius furo', 'Rattus rattus', 'Sus scrofa']", - "type": "ERROR" - } - ], - "id": "1d4e9bb0-515d-4158-b14b-770ab5077d8b (the GUID created for this record)", - "related_cases": [], - "type": "case (the node name)", - "unique_keys": [ - { - "project_id": "training (the project name)", - "submitter_id": "training-case-02 (the record/entity submitter_id)" - } - ], - "valid": false, - "warnings": [] + "/Users/demo/test.gif":"65f5d77c-1b2a-4f41-a2c9-9daed5a59f14" } ``` -The "action" above can be used to identify if the node was newly created or updated. Updating a node is submitting to a node with the same `submitter_id` and overwriting the existing node entries. Other useful information includes the "id" for the record. This is the GUID for the record and is unique throughout the entirety of the data commons. The other "unique_key" provided is the tuple "project_id" and "submitter_id", which is to say the "submitter_id" combined with the "project_id" is a universal identifier for this record. +When you run a `gen3-client upload` command, the client will check the succeeded_log.json log file for the files found in the provided `--upload-path`. If a file in the `--upload-path` is found in the succeeded log file, it will be skipped. 
For example: +``` +~/Documents> gen3-client upload --profile=demo --upload-path=test.txt +2019/11/19 16:00:42 Finish parsing all file paths for "/Users/demo/Documents/test.txt" -To confirm that a data file is properly registered, enter the GUID of a data file record in the index API endpoint of the data commons: usually "https://gen3.datacommons.io/index/index/GUID", where "https://gen3.datacommons.io" is the URL of the Gen3 data portal and GUID is the specific GUID of a registered data file. This should display a JSON response that contains the URL that was registered. If the record was not registered successfully, it is likely an error message will occur. An error that says "access denied" might also occur if the user is not logged in or the session has timed out. Note, that for these user guides, https://gen3.datacommons.io is an example URL and can be replaced with the URL from other data commons powered by Gen3. +The following file(s) has been found in path "/Users/demo/Documents/test.txt" and will be uploaded: + /Users/demo/Documents/test.txt -> __Note:__ Gen3 users can also submit metadata using the Gen3 SDK, which is a Python library containing functions for sending standard requests to the Gen3 APIs. For example, the function `submit_file` from the **Gen3Submission** class will submit data in a spreadsheet file containing multiple records in rows to a Gen3 Commons. The code is open-source and available on [GitHub](https://github.com/uc-cdis/gen3sdk-python) along with [documentation for using it](https://uc-cdis.github.io/gen3sdk-python/_build/html/index.html). Furthermore, [this section](https://gen3.org/resources/user/analyze-data/#4-using-the-gen3-sdk) describes steps on how to get started. +2019/11/19 16:00:42 File "/Users/demo/Documents/test.txt" has been found in local submission history and has been skipped for preventing duplicated submissions. 
+Submission Results +Finished with 0 retries | 0 +Finished with 1 retry | 0 +Finished with 2 retries | 0 +Finished with 3 retries | 0 +Finished with 4 retries | 0 +Finished with 5 retries | 0 +Failed | 0 +TOTAL | 0 +``` -#### TSV Formatting Checklist +In the rare case that you need to upload the same file again, the success log file will need to be moved, modified, renamed, or deleted. Alternatively, the file itself can be moved or renamed, as the information stored in the succeeded_log.json is the file's full path. -1. Specify the node `type` for every row. This is the name of the node (or `node_id`), and it must be exactly the same for every row. -2. Specify the `submitter_id` of every record by entering a unique text identifier in each row. Make sure you don't use the same value in more than one row of your TSV because every record in a project must have a unique `submitter_id`! -3. Specify the links to the parent node(s) for each record. Note: parent records must exist before submitting child records! You can specify either the links with either the `parents.submitter_id` or the `parents.id` -4. Fill in all required properties. Every row in the TSV must have a value for all required properties. Optional properties can be filled in for only some rows or the column can be left out entirely. +#### Indexd records -Templates can be downloaded from the data dictionary page of your commons. See the [Gen3 Data Hub](https://gen3.datacommons.io/DD) as an example and click on the TSV option for each node. +When files are successfully uploaded by the gen3-client, the software service indexd creates a record for that file in the file index database, which can be accessed at the /index endpoint. For example, if the file's GUID is `5bcd2a59-8225-44a1-9562-f74c324d8dec`, enter the following URL in a browser or request it via the API to view its indexd record: https://nci-crdc-demo.datacommons.io/index/5bcd2a59-8225-44a1-9562-f74c324d8dec. 
+``` +{ + acl: [ ], + baseid: "d8dfffe8-ea07-4fff-9a2e-2405d4e061d7", + created_date: "2019-11-19T22:00:41.196521", + did: "5bcd2a59-8225-44a1-9562-f74c324d8dec", + file_name: "test.txt", + form: null, + - hashes: { + crc: "daaadff6", + md5: "2d282102fa671256327d4767ec23bc6b", + sha1: "e6c4fbd4fe7607f3e6ebf68b2ea4ef694da7b4fe", + sha256: "649b8b471e7d7bc175eec758a7006ac693c434c8297c07db15286788c837154a", + sha512: "bf9bac8036ea00445c04e3630148fdec15aa91e20b753349d9771f4e25a4f68c82f9bd52f0a72ceaff5415a673dfebc91f365f8114009386c001f0d56c7015de" + }, + metadata: { }, + rev: "9e1436a6", + size: 21, + updated_date: "2019-11-19T22:00:41.196528", + uploader: "my-email@uchicago.edu", + - urls: [ + "s3://ncicrdcdemo-data-bucket/5bcd2a59-8225-44a1-9562-f74c324d8dec/test.txt" + ], + - urls_metadata: { + s3://ncicrdcdemo-data-bucket/5bcd2a59-8225-44a1-9562-f74c324d8dec/test.txt: { } + }, + version: null +} +``` -### Troubleshooting and Finishing the Submission +#### Mapping uploaded files +Files that have been successfully uploaded now have a GUID associated with them, and there is also an associated record in the indexd database. However, in order for the files to show up in the data portal, the files have to be registered in the PostgreSQL database. In other words, indexd records exist for the files, but sheepdog records (that is, structured metadata in the graph model) don't exist yet. Thus, the files aren't yet associated with any particular program, project, or node. To create the structured data records for the files via the sheepdog service, the Data Portal offers a "Map My Files" UI, which can be reviewed [here](#3-map-uploaded-files-to-a-data-file-node). -If the submission throws errors or claims the submission to be invalid, it will be the submitter's task to fix the submission. The best first step is to go through the outputs from the individual entities, as seen in the previous section. 
The errors fields will give a rough description of what failed the validation check. The most common problems are simple issues such as spelling errors, mislabeled properties, or missing required fields. +#### Removing unwanted uploaded files -### Learning More About the Existing Submission +Before the files are mapped to a project's node in the data model, the files can be deleted both from indexd and from the cloud location by sending a delete request to the fence endpoint `/user/data/`. For example, to delete the file we checked in the index above, we'd send a delete API request to this URL: https://nci-crdc-demo.datacommons.io/user/data/5bcd2a59-8225-44a1-9562-f74c324d8dec -When viewing a project, clicking on a node name will allow the user to view the records in that node. From here a user can download, view, or completely delete records associated with any project they have delete access to. +For example, running [this script](https://github.com/uc-cdis/planx-bioinfo-tools/blob/master/submission_tool/delete_unmapped_files.py) will delete all the user's unmapped files from indexd and from the storage location using the fence endpoint: +``` +~> python delete_uploaded_files.py -a https://nci-crdc-demo.datacommons.io/ -u user@gen3.org -c ~/Downloads/demo-credentials.json +Found the following guids for uploader user@gen3.org: ['3d275025-8b7b-4f84-9165-72a8a174d642', '5bcd2a59-8225-44a1-9562-f74c324d8dec', '6cf194f1-c68e-4976-8ca4-a0ce9701a9f3', '956890a9-b8a7-4abd-b8f7-dd0020aaf562', '9bd009b6-e518-4fe5-9056-2b5cba163ca3'] +Successfully deleted GUID 3d275025-8b7b-4f84-9165-72a8a174d642 +Successfully deleted GUID 5bcd2a59-8225-44a1-9562-f74c324d8dec +Successfully deleted GUID 6cf194f1-c68e-4976-8ca4-a0ce9701a9f3 +Successfully deleted GUID 956890a9-b8a7-4abd-b8f7-dd0020aaf562 +Successfully deleted GUID 9bd009b6-e518-4fe5-9056-2b5cba163ca3 -![Node Click](/gen3-resources/operator-guide/img/Gen3_Model_Click_highlight.png) +``` -![Node 
Information](/gen3-resources/operator-guide/img/Gen3_Model_node_view.png) +### 3. How to Upload a Single Data File Using a GUID +If a data file has already been assigned a GUID via registration in a Gen3 data commons' indexd database, then the gen3-client can be used to upload the file associated with that GUID to object storage via the `gen3-client upload-single` command. +> __NOTE:__ For most data uploaders, using the `gen3-client upload` command followed by mapping files in the data portal is the preferred method for uploading files. See [this section](#2-upload-data-files-to-object-storage) of the documentation for details. +The GUID or `object_id` property for a submitted data file can then be obtained via graphQL query or viewing the data file JSON record in the graphical model of the project. +Example Usage: +``` +gen3-client upload-single --profile= --guid= --file= + +gen3-client upload-single --profile=demo --guid=b4642430-8c6e-465a-8e20-97c458439387 --file=test.gif + +Uploading data ... +test.gif 3.64 MiB / 3.64 MiB [==========================================================================================] 100.00% +Successfully uploaded file "test.gif" to GUID b4642430-8c6e-465a-8e20-97c458439387. +1 files uploaded. +``` +### 4. How to Upload Multiple Data Files Using a Manifest +Users can automate the bulk upload of data files by providing the gen3-client with an upload manifest. The upload manifest should follow the same format as the download manifest, which is described in the previous section. Minimally, the manifest file is a JSON that contains `object_id` fields. The value of each `object_id` field is the GUID (globally unique identifier) of a data file that will be uploaded. In this mode, we assume the filenames of data files to be uploaded are the same as the GUIDs. 
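
A manifest of this kind can be generated rather than written by hand. The sketch below assumes, as described above, that each file in the upload directory is named after its GUID. It also assumes the manifest is a JSON array of objects, which is how the download manifest is commonly structured; the helper name `build_upload_manifest` is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def build_upload_manifest(upload_dir):
    """Return one {"object_id": <GUID>} entry per file in upload_dir.

    Assumes every file name in upload_dir is the GUID of that file.
    """
    return [
        {"object_id": path.name}
        for path in sorted(Path(upload_dir).iterdir())
        if path.is_file()
    ]


# Demo with throwaway files whose names stand in for GUIDs.
with tempfile.TemporaryDirectory() as tmp:
    for guid in ("a12ff17c-2fc0-475a-9c21-50c19950b082",
                 "b12ff17c-2fc0-475a-9c21-50c19950b082"):
        (Path(tmp) / guid).write_text("example")
    manifest = build_upload_manifest(tmp)
    manifest_json = json.dumps(manifest, indent=2)

print(manifest_json)
```

Writing `manifest_json` to a file such as `manifest.json` gives a file that can be passed to the `--manifest` flag.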
+
+Example of manifest.json (Minimal):
+
+```
+[
+  {
+    "object_id": "a12ff17c-2fc0-475a-9c21-50c19950b082"
+  },
+  {
+    "object_id": "b12ff17c-2fc0-475a-9c21-50c19950b082"
+  },
+  {
+    "object_id": "c12ff17c-2fc0-475a-9c21-50c19950b082"
+  }
+]
+```
+
+The gen3-client will upload all the files in the provided manifest using the `gen3-client upload-multiple` command.
+
+Example Usage:
+
+```
+gen3-client upload-multiple --profile= --manifest= --upload-path=
+
+gen3-client upload-multiple --profile=demo --manifest=manifest.json --upload-path=upload
+
+Uploading data ...
+a12ff17c-2fc0-475a-9c21-50c19950b082 3.64 MiB / 3.64 MiB [==========================================================================================] 100.00%
+b12ff17c-2fc0-475a-9c21-50c19950b082 3.63 MiB / 3.63 MiB [==========================================================================================] 100.00%
+c12ff17c-2fc0-475a-9c21-50c19950b082 3.65 MiB / 3.65 MiB [==========================================================================================] 100.00%
+Successfully uploaded file "a12ff17c-2fc0-475a-9c21-50c19950b082" to GUID a12ff17c-2fc0-475a-9c21-50c19950b082.
+Successfully uploaded file "b12ff17c-2fc0-475a-9c21-50c19950b082" to GUID b22ff17c-2fc0-475a-9c21-50c19950b082.
+Successfully uploaded file "c12ff17c-2fc0-475a-9c21-50c19950b082" to GUID c22ff17c-2fc0-475a-9c21-50c19950b082.
+3 files uploaded.
+```
+### 5. Quick Start for Experienced Users or Cheat Sheet
+Quick start instructions are covered in the [Downloading Files Using the Gen3-client section](/gen3-resources/user-guide/access-data/#download-files-using-the-gen3-client).
-## Semi-structured Data
+### 6. Working from the Command Line
+Instructions for working from the command line are covered in the [Downloading Files Using the Gen3-client section](/gen3-resources/user-guide/access-data/#download-files-using-the-gen3-client).
diff --git a/gen3/mkdocs.yml b/gen3/mkdocs.yml index 21b756bb..cde54366 100644 --- a/gen3/mkdocs.yml +++ b/gen3/mkdocs.yml @@ -134,8 +134,7 @@ markdown_extensions: - md_in_html - toc: permalink: True - baselevel: 1 - toc_depth: 1-5 + toc_depth: 3 - pymdownx.superfences - pymdownx.details plugins: