Commit

responding to internal feedback on documentation
michaelfitzo committed Dec 19, 2024
1 parent 33cccd1 commit 79938f4
Showing 17 changed files with 141 additions and 114 deletions.
2 changes: 1 addition & 1 deletion gen3/docs/gen3-resources/developer-guide/microservices.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ This service handles reading from and writing to a user's S3 folder containing t
The Metadata Service provides an API for retrieving JSON metadata of GUIDs. It is a flexible option for "semi-structured" data (key:value mappings). The content of the MDS powers the Data Portal Discovery Page for a Data Commons. The Gen3 SDK can be used to upload and edit the metadata. This service includes a feature known as the aggregated metadata service (AggMDS), which caches metadata from the metadata services of multiple data commons. The AggMDS holds the content viewable in a Data Portal Discovery page for a Data Mesh.

## [Peregrine][peregrine github]
Peregrine is the metadata seeking service which responds to GraphQL search queries and translates them to queries over our graph-like source of truth postgres database for structured data. The GraphQL service allows Commons operators and users to precisely query only the information they are most interested in from the metadata collections. The service translates the GraphQL search into the appropriate statements which are run against the PostgreSQL backend before being returned as friendly JSON.
Peregrine is the metadata seeking service which responds to GraphQL search queries and translates them to queries over the graph-like postgres database for structured data. The service translates the GraphQL search into the appropriate statements which are run against the PostgreSQL backend before being returned as friendly JSON.

## [Requestor][requestor github]
Requestor exposes an API to manage access requests.
Expand Down
29 changes: 21 additions & 8 deletions gen3/docs/gen3-resources/glossary.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,15 @@
# Gen3 Glossary

## Allow Lists
An allow list is simply a list of users (identified based on your method of authentication) that controls which users have access to which data. It is in the form of a user.yaml file that is maintained by the operator of your Gen3 system. Gaining access may require you to sign a Data Use Agreement. Data access is granted at the program or project level.

Alternatively, an allow list may also refer to a list of authorized tools within a Gen3 workspace.
## API
Gen3 services expose APIs (Application Programming Interfaces), which allow users to interact directly with the system or data without using the Data Portal or GUI. An Open API refers to an API that does not require any authentication.
## Commons Services Operations Center (CSOC)
A Common Services Operations Center is an operations center operated by a commons services provider for setting up, configuring, operating, and monitoring data commons, data meshes, data hubs, and other data platforms for managing, analyzing, and sharing data.
## Crosswalk
Linking patients from across data commons where some patient data exists in commons A and additional data exists in commons B. This linkage is recorded in the metadata service. An example of how to set this up is found [here][crosswalk setup].
A crosswalk is typically used to link patients across data commons, where some patient data exists in commons A and additional data exists in commons B. This linkage enables metadata associations across commons and the promise of richer datasets. Crosswalks can be made for several types of metadata and are recorded in the metadata service. An example of how to set this up is found [here][crosswalk setup].
## Data Commons
A data commons co-locates data with cloud computing infrastructure and commonly used software services, tools, and applications for managing, integrating, analyzing and sharing data that are exposed through web portals and APIs to create an interoperable resource for a research community. A data commons provides services so that the data is findable, accessible, interoperable and reusable (FAIR).
## Data Dictionary
Expand Down Expand Up @@ -39,12 +45,7 @@ Structured data submitted to commons are stored in PostgreSQL. Querying data fro
FAIR data are data which meet the principles of findability, accessibility, interoperability, and reusability [12]. There is now an extensive literature on FAIR data.
## Framework Services
Framework Services or Data Commons Framework (DCF) Services is the term used by Gen3 to refer to data mesh services in the narrow middle architecture, for data meshes, such as the NCI Cancer Research Data Commons. These are a set of standards-based services with open APIs for authentication, authorization, creating and accessing FAIR data objects, and for working with bulk structured data in a machine-readable, self-contained format.
## Globally Unique Identifier (GUID)
A GUID is an essentially unique identifier that is generated by an algorithm so that no central authority is needed, but rather different programs running in different locations can generate GUID with a low probability that they will collide. A common format for a GUID is the hexadecimal representation of a 128-bit binary number.
## Kubernetes
An open-source system for automating deployment, scaling, and management of containerized applications, which Gen3 is built from.
## Microservice
Microservices are a software architecture that organizes software into small, independent services that communicate over well-defined APIs. These services can be developed, set up, and scaled independently. A more traditional architecture is to put all the APIs and other required functionality into a single application. This is sometimes called a monolithic architecture. Microservices provide important advantages for large-scale systems that require scalability and must continue to evolve even as their code base grows very large, but increases the complexity of operating small-scale systems.

## Flattened Data
Structured data that has been processed via Tube and stored in elasticsearch to accelerate searchability.
## Gen3 Client
Expand All @@ -69,6 +70,18 @@ A simple list of most relevant microservices are included below. For a descript
* Tube
* Workspace Token Service

## Globally Unique Identifier (GUID)
A GUID is an essentially unique identifier that is generated by an algorithm so that no central authority is needed, but rather different programs running in different locations can generate GUIDs with a low probability that they will collide. A common format for a GUID is the hexadecimal representation of a 128-bit binary number. Some external systems may use the term Universally Unique Identifier (UUID), which is essentially the same thing.
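
As a minimal illustration of the idea (not a description of how Gen3 itself mints identifiers), Python's standard `uuid` module can generate a random 128-bit identifier locally, with no central authority involved:

```python
import uuid

# Generate a random (version 4) UUID locally; no coordination with
# any central service is needed.
guid = uuid.uuid4()

# Canonical text form: 32 hexadecimal digits in five hyphen-separated groups.
print(guid)

# The same value written as 32 hexadecimal characters without hyphens.
print(guid.hex)

# The underlying integer always fits in 128 bits.
print(guid.int < 2**128)
```

The hyphenated form is 36 characters long, while the bare hexadecimal form is 32 characters, which is where the "32 character identifier" description comes from.
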
## Graph model
The graph model refers to the structured data within a Gen3 data commons. The "graph" is defined by the relationship between nodes that is specified in the data dictionary. This can be flattened via Tube and stored in elasticsearch to accelerate searchability.
## Kubernetes
An open-source system for automating deployment, scaling, and management of containerized applications, on which Gen3 is built.
## Manifest
Usually refers to a file manifest. This is a JSON-formatted file that includes GUIDs, file names, md5 checksums, and file sizes for files of interest. It can be used by the gen3 client or SDK to download files, provided a user has the appropriate credentials.
## Microservice
Microservices are a software architecture that organizes software into small, independent services that communicate over well-defined APIs. These services can be developed, set up, and scaled independently. A more traditional architecture is to put all the APIs and other required functionality into a single application. This is sometimes called a monolithic architecture. Microservices provide important advantages for large-scale systems that require scalability and must continue to evolve even as their code base grows very large, but increase the complexity of operating small-scale systems.
## Persistent Drive
This is a directory in a Gen3 workspace that allows a user to store files that will remain available after termination of the workspace session. It will be represented as `pd`.
## Portable Format for Biomedical data (PFB)
PFB is a serialization file format designed to store bio-medical data and metadata. The format is built on top of Avro to make it fast, extensible and interoperable between different systems. You can find the GitHub repo [here][PFB GitHub] and the publication [here][PFB Pub].
## Workspace
Expand All @@ -84,7 +97,7 @@ Gen3 workspaces are secure data analysis environments in the cloud that can acce
[Data Portal User Guide]: user-guide/portal.md
[Microservices]: developer-guide/microservices.md
[Gen3 client docs]: user-guide/access-data.md#installation-instructions
[SDK docs]: user-guide/search.md#exporting-structured-data-programmatically
[SDK docs]: user-guide/search.md#the-gen3-sdk
[PFB GitHub]: https://github.com/uc-cdis/pypfb
[PFB Pub]: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010944
[workspace use]: user-guide/analyze-data.md
2 changes: 1 addition & 1 deletion gen3/docs/gen3-resources/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,6 @@ This section contains the bulk of the Gen3 technical documentation. It is broke


* **Gen3 User Guide** - This is for the data scientist, researcher, or analyst who needs to explore, download, or analyze data found within an existing instance of Gen3.
* **Gen3 Operator Guide** - This is for those organizations who operate their own Gen3 instances. It will include content on how to deploy, configure, and maintain a Gen3 instances; configuring a data dictionary and uploading data; and customizing the frontend.
* **Gen3 Operator Guide** - This is for those organizations who operate their own Gen3 instances. It will include content on how to deploy, configure, and maintain a Gen3 instance; configure a data dictionary and upload data; and customize the frontend.
* **Gen3 Developer Guide** - This is for a software engineer who wants to extend Gen3 either by contributing to the source code or by integrating Gen3 services into a larger system. This section will cover the Gen3 architecture including the individual microservices and how they interact with each other.
* **Glossary** - This section can be used as reference for terminology found within Gen3 technical documentation.
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
---
tags:
- submission
---

# Semi-structured Data

Expand Down
28 changes: 15 additions & 13 deletions gen3/docs/gen3-resources/operator-guide/submit-structured-data.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
---
tags:
- submission
---


# Structured Data (Clinical or experimental data)
Expand All @@ -16,7 +20,7 @@ There are a number of properties that deserve special mention:

* `type`: Every node has a `type` property, which is simply the name of the node. By providing the node name in the "type" property, the submission portal knows which node to put the data in.

* `id`: Every record in every node in a data commons has the unique property `id`, which is not submitted by the data contributor but rather generated on the backend. The value of the property `id` is a 128-bit UUID (a unique 32 character identifier).
* `id`: Every record in every node in a data commons has the unique property `id`, which is not submitted by the data contributor but rather generated on the backend. The value of the property `id` is a 128-bit GUID (a unique 32 character identifier).

* `project_id` and `code`: Every project record in a data commons is linked to a parent `program` node and has the properties `project_id` and a `code`. The property `project_id` is the dash-separated combination of `program` and the project's `code`. For example, if your project was named 'Experiment1', and this project was part of the 'Pilot' program, the project's `project_id` would be 'Pilot-Experiment1', and the project's `code` would be 'Experiment1'. Finally, just like every record in the data commons, the project has the unique property `id`, which is not to be confused with the project's `project_id`.
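
The naming rule for `project_id` can be sketched in a few lines of Python. The helper name `build_project_id` is hypothetical and exists only for illustration; it is not part of any Gen3 library:

```python
def build_project_id(program: str, code: str) -> str:
    """Join a program name and a project code with a dash.

    Hypothetical helper that mirrors the rule described above.
    """
    return f"{program}-{code}"


# The example from the text: project 'Experiment1' in program 'Pilot'.
project_id = build_project_id("Pilot", "Experiment1")
print(project_id)
```

Note that the backend-generated `id` of the project record is a separate value and cannot be derived this way.
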

Expand Down Expand Up @@ -108,14 +112,11 @@ To submit a TSV in the data portal:

3. Click on "Submit Data" beside the project of interest to submit metadata.

![Submit Data][submit data]

4. Click on "Upload File".

![Upload and Submit][upload file]

5. Navigate to the TSV and click "open". The contents of the TSV should appear in the grey box
below.
5. Navigate to the TSV from your local directory and click "open". The contents of the TSV should appear in the grey box below.

6. Click "Submit".

Expand Down Expand Up @@ -156,6 +157,15 @@ To confirm that a data file is properly registered, enter the GUID of a data fil

> __Note:__ Gen3 users can also submit metadata using the Gen3 SDK, which is a Python library containing functions for sending standard requests to the Gen3 APIs. For example, the function `submit_file` from the **Gen3Submission** class will submit data in a spreadsheet file containing multiple records in rows to a Gen3 Commons. The code is open-source and available on [GitHub](https://github.com/uc-cdis/gen3sdk-python) along with [documentation for using it](https://uc-cdis.github.io/gen3sdk-python/_build/html/index.html). Furthermore, [this section](https://gen3.org/resources/user/analyze-data/#4-using-the-gen3-sdk) describes steps on how to get started.
### Review submitted structured data

To review the content of submitted data, you can start from the [directions above](#begin-metadata-tsv-submissions) and, instead of selecting "Upload File" in Step 4, review the graph below. You can select particular nodes to view individual records, where you have the option to delete, view, or download.

> Note: Users who are not authorized to submit data may see a “Browse Data” button instead of “Submit Data”. These users will still have access to view the graph and individual nodes, but not to upload or delete.

The number you see underneath the node name, for example ‘subject’, reflects the number of records in that node of the project. The “Toggle View” button is used to show or hide nodes in the data model that the project has no records for.




### TSV Formatting Checklist
Expand All @@ -174,13 +184,7 @@ Templates can be downloaded from the data dictionary page of your commons. See
If the submission throws errors or claims the submission to be invalid, it will be the submitter's task to fix the submission. The best first step is to go through the outputs from the individual entities, as seen in the previous section. The error fields will give a rough description of what failed the validation check. The most common problems are simple issues such as spelling errors, mislabeled properties, or missing required fields.


## Learning More About the Existing Submission

When viewing a project, clicking on a node name will allow the user to view the records in that node. From here a user can download, view, or completely delete records associated with any project for which they have delete access.

![Node Click][Node Click]

![Node Information][Node Informatin]


<!--Links -->
Expand All @@ -192,5 +196,3 @@ When viewing a project, clicking on a node name will allow the user to view the
[toolbar submission]: img/Gen3_Toolbar_data_submission.png
[submit data]: img/Gen3_Data_Submission_submit_data.png
[upload file]: img/Gen3_Data_Submission_Use_Form.png
[Node Click]: img/Gen3_Model_Click_highlight.png
[Node Informatin]: img/Gen3_Model_node_view.png
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
---
tags:
- submission
---

# Unstructured Data (Data Files)

Expand Down
14 changes: 12 additions & 2 deletions gen3/docs/gen3-resources/user-guide/access-data.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ Operators can also take advantage of the Requestor Service for dynamic authoriza


## Download Files Using the Gen3-client
The gen3-client provides an easy-to-use, command-line interface for uploading and downloading data files to and from a Gen3 data commons from the terminal or command prompt, respectively. In some systems "download" may be restricted to only within a Gen3 Workspace.
The gen3-client provides an easy-to-use, command-line interface for uploading and downloading data files to and from a Gen3 data commons from the terminal or command prompt, respectively. In some systems "download" may be restricted to only within a Gen3 Workspace. Note that Gen3 also comes with an SDK tool that can perform many of the same functions as the client for downloading along with many other features not found in the client. You can read more about the Python SDK tool [here][sdk_github].

This guide has the following sections:

Expand Down Expand Up @@ -74,7 +74,7 @@ To check that your copy of the client is working and confirm the version, the to

### Configure a Profile with Credentials

Before using the gen3-client to upload or download data, the gen3-client needs to be configured with API credentials downloaded from the user’s data commons Profile (via Windmill data portal):
Before using the gen3-client to upload or download data, the gen3-client needs to be configured with API credentials downloaded from the user’s data commons Profile:

1. To download the “credentials.json” from the data commons, the user should start from that commons’ Windmill data portal, followed by clicking on “Profile” in the top navigation bar and then creating an API key. In the popup window which informs the user that an API key has been successfully created, click the “Download json” button to save a local copy of the API key.

Expand Down Expand Up @@ -248,6 +248,15 @@ Most programs require some sort of user input to run properly. Some programs wil
For example, when configuring a profile with the client, the user must specify the `configure` option and also specify the profile name, API endpoint, and credentials file by adding the flags `--profile`, `--apiendpoint` and `--cred` to the end of the command (see [configuring a profile section][config profile] above for specific examples).


### Expired Token

Many commons limit how long a token remains valid before it expires. Once it has expired, you may receive an error such as:

``RequestNewAccessToken with error code 401``

If this happens (and you are still authorized to access the data), you can download a new API token and re-create your profile using the previously used command.
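
To check ahead of time whether a key is close to expiring, one option is to read its expiry timestamp directly. The sketch below assumes the API key is a JSON Web Token (JWT) carrying a standard `exp` claim; that is an assumption to verify against your own commons. The helper names are hypothetical, and the code only reads the claim without verifying the token's signature:

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> float:
    """Return the `exp` claim (seconds since the epoch) from a JWT payload.

    Assumes a three-part JWT; the signature is NOT verified.
    """
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return float(payload["exp"])


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """Report whether the token's `exp` time is at or before `now`."""
    current = time.time() if now is None else now
    return jwt_expiry(token) <= current


def demo_token(exp: float) -> str:
    """Build an unsigned, JWT-shaped string for demonstration only."""
    def segment(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return ".".join([segment({"alg": "none"}), segment({"exp": exp}), "sig"])


print(is_expired(demo_token(0)))                   # expiry in 1970
print(is_expired(demo_token(time.time() + 3600)))  # expiry in one hour
```

If your downloaded credentials file stores the key under a field such as `api_key` (again, an assumption to check against your own file), that value could be passed to `is_expired` before starting a long download.
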


<!-- AuthN/Z -->

[configure auth]: ../operator-guide/gen3-authn-methods.md
Expand All @@ -257,6 +266,7 @@ For example, when configuring a profile with the client, the user must specify t

<!--Gen3 client -->
[Gen3 Client]: https://github.com/uc-cdis/cdis-data-client/releases/latest
[sdk_github]: https://github.com/uc-cdis/gen3sdk-python
[PATH]: access-data.md#working-from-the-command-line
[img create API key]: img/Gen3_Keys.png
[config profile]: access-data.md#configure-a-profile-with-credentials

0 comments on commit 79938f4
