major refactoring #59

Merged
merged 7 commits into from
Feb 8, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -13,3 +13,5 @@ __pycache__
paper/paper.pdf
paper/jats/
venv

.qodo
134 changes: 94 additions & 40 deletions README.md
@@ -23,7 +23,8 @@ First, clone the repository on your local machine and then install the package u
```bash
$ git clone https://github.com/PRIDE-Archive/pridepy
$ cd pridepy
$ pip install .
$ poetry build
$ pip install dist/*.whl
```

Install with setup.py:
@@ -34,73 +35,126 @@ $ cd pridepy
$ poetry build
$ pip install dist/pridepy-{version}.tar.gz
```
# Usage and Documentation

# Examples
This Python CLI tool, built using the Click module, already provides detailed usage instructions for each command. To avoid redundancy and clutter in this README, you can access the usage instructions directly from the CLI.
Use the command below to view the list of available commands:

Download all the raw files from a dataset(eg: PXD012353).
Warning: Raw files are generally large in size, so it may take some time to download depending on the number of files and file sizes.
```bash
$ pridepy --help
Usage: pridepy [OPTIONS] COMMAND [ARGS]...

`-p`: in download specifies protocol (ftp default):
- **ftp**: FTP protocol
- **aspera**: using the aspera protocol
- **globus**: PRIDE globus endpoint (_the data is downloaded through https_)
Options:
--help Show this message and exit.

Commands:
download-all-public-raw-files Download all public raw files...
download-file-by-name Download a single file from a...
get-files-by-filter get paged files :return:
get-files-by-project-accession get files by project accession...
get-private-files Get private files by project...
get-projects get paged projects :return:
get-projects-by-accession get projects by accession...
stream-files-metadata Stream all files metadata in...
stream-projects-metadata Stream all projects metadata...

```
> [!NOTE]
> Please make sure you are using Python 3, not Python 2.7.

## Downloading a project from PRIDE Archive

The main purpose of this tool is to download data from the PRIDE Archive. Here is how to download all the raw files from a dataset (e.g., PXD012353).

```bash
$ pridepy download-all-public-raw-files -a PXD012353 -o /Users/yourname/Downloads/foldername/ -p aspera
```
- `-a` flag is used to specify the project accession number.
- `-o` flag is used to specify the output directory.
- `-p` flag is used to specify the protocol (**aspera, ftp, globus**)

> [!IMPORTANT]
> Currently, pridepy supports multiple download protocols: ftp, aspera, globus, and s3. With ftp and aspera, files are transferred over those protocols directly; pridepy includes the Aspera client. For globus and s3, the tool uses the HTTPS endpoints of both services. Read the whitepaper to learn more about the performance of each protocol.

Additional options:

- `-skip` flag is used to skip the download of files that already exist in the output directory.
- `--aspera_maximum_bandwidth` flag is used to specify the maximum bandwidth for the Aspera download. The default value is 100M.
- `--checksum_check` flag is used to check the checksum of the downloaded files. The default value is False.
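
The download command above can also be assembled from a script. The following is a minimal sketch, assuming `pridepy` is installed and on the `PATH`; `build_download_command` is a hypothetical helper that is not part of pridepy, and it uses only the `-a`, `-o`, and `-p` flags documented above.

```python
import shlex


def build_download_command(accession: str, output_dir: str, protocol: str = "ftp") -> list[str]:
    """Build the argv list for `pridepy download-all-public-raw-files` (hypothetical helper)."""
    return [
        "pridepy",
        "download-all-public-raw-files",
        "-a", accession,   # project accession, e.g. PXD012353
        "-o", output_dir,  # output directory
        "-p", protocol,    # protocol: aspera, ftp, or globus
    ]


if __name__ == "__main__":
    command = build_download_command("PXD012353", "/Users/yourname/Downloads/foldername/", "aspera")
    # Print the shell-quoted command; pass the list to subprocess.run(command, check=True) to execute it.
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems when the output directory contains spaces.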

## Download single file by name

Instead of downloading all the files of a project, users may want to download a single file whose name they already know. Here is how to download a single file by name.

Download single file by name:
```bash
$ pridepy download-file-by-name -a PXD022105 -o /Users/yourname/Downloads/foldername/ -f checksum.txt -p globus
```

>**NOTE**: Currently we use Globus URLs (when `-p globus` is used) via HTTPS, not the Globus protocol. For more information about Globus, see [Globus documentation](https://www.globus.org/data-transfer).
Please be aware that the additional parameters are the same as the previous command [Downloading a project from PRIDE Archive](#downloading-a-project-from-pride-archive).

Search projects with keywords and filters
```bash
$ pridepy search-projects-by-keywords-and-filters --filter accession==PXD012353
## Download project files by category

Users may be interested in downloading files by category. Here is how to download files by category. The different categories are available in the PRIDE Archive:

$ pridepy search-projects-by-keywords-and-filters --keyword PXD012353
- RAW: Raw data files
- PEAK: Peak list files
- SEARCH: Search engine output files
- OTHER: Other files
- RESULT: Result files
- SPECTRUM LIBRARIES: Spectrum libraries
- FASTA: FASTA files

```bash
$ pridepy download-files-by-category -a PXD022105 -o /Users/yourname/Downloads/foldername/ -c RAW -p ftp
```

Stream metadata of all projects as json and write it to a file
Please be aware that the additional parameters are the same as the previous command [Downloading a project from PRIDE Archive](#downloading-a-project-from-pride-archive).

>[!IMPORTANT]
> We also implemented a direct command to download RAW files from a project which is the most common use case.

## Download private files

Users and especially reviewers may be interested in downloading private files. Here is how to download private files.

First, the user can list the private files of a project:

```bash
$ pridepy stream-projects-metadata -o all_pride_projects.json
$ pridepy list-private-files -a PXD022105 -u yourusername -p yourpassword
```

Stream metadata of all files as json and write it to a file. Project accession can be specified as an optional parameter
This command lists the private files of the project PXD022105, including the file name, file size, and download link.

Then the user can download the private files:

```bash
$ pridepy stream-files-metadata -o all_pride_files.json
OR
$ pridepy stream-files-metadata -o PXD005011_files.json -a PXD005011
$ pridepy download-file-by-name -a PXD022105 -o /Users/yourname/Downloads/foldername/ --username yourusername --password yourpassword -f checksum.txt
```

This Python CLI tool, built using the Click module,
already provides detailed usage instructions for each command. To avoid redundancy and potential clutter in this README, you can access the usage instructions directly from the CLI
Use the below command to view a list of commands available:
>[!WARNING]
> To download private files, use the same command as for downloading a single file by name, adding the username and password. Specifying a protocol is unnecessary in this case because the tool uses HTTPS to download the files. At the moment this is the only protocol allowed because of the infrastructure of PRIDE private files (read the whitepaper for more information).

## Streaming metadata

One of the great features of PRIDE and pridepy is the ability to stream metadata of all projects and files. This is useful for users who want to analyze the metadata of all projects and files locally.

Stream metadata of all projects as JSON and write it to a file:

```bash
$ pridepy --help
Usage: pridepy [OPTIONS] COMMAND [ARGS]...
$ pridepy stream-projects-metadata -o all_pride_projects.json
```

Options:
--help Show this message and exit.
Stream the metadata of all files as JSON and write it to a file:

Commands:
download-all-public-raw-files Download all public raw files...
download-file-by-name Download a single file from a...
get-files-by-filter get paged files :return:
get-files-by-project-accession get files by project accession...
get-private-files Get private files by project...
get-projects get paged projects :return:
get-projects-by-accession get projects by accession...
stream-files-metadata Stream all files metadata in...
stream-projects-metadata Stream all projects metadata...

```bash
$ pridepy stream-files-metadata -o all_pride_files_metadata.json
```
# NOTE
Stream the files metadata of a specific project as JSON and write it to a file:

Please make sure you are using Python3, not Python 2.7 version.
```bash
$ pridepy stream-files-metadata -o PXD005011_files.json -a PXD005011
```

# White paper

12 changes: 2 additions & 10 deletions pridepy/authentication/authentication.py
@@ -27,11 +27,7 @@ def get_token(self, username, password):
url = self.base_url + "/login"
headers = {"Content-type": "application/json", "Accept": "text/plain"}
credentials = (
'{"Credentials":{"username":"'
+ username
+ '", "password":"'
+ password
+ '"}}'
'{"Credentials":{"username":"' + username + '", "password":"' + password + '"}}'
)

response = requests.post(url, data=credentials, headers=headers)
@@ -55,8 +51,4 @@ def validate_token(self, token):

response = requests.post(url, headers=headers)

return (
response.ok
and response.status_code == 200
and response.text == "Token Valid"
)
return response.ok and response.status_code == 200 and response.text == "Token Valid"
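
The `credentials` payload in `get_token` is assembled by string concatenation, which produces invalid JSON if a username or password contains a double quote or a backslash. Below is a minimal sketch of building the same body with `json.dumps`, assuming the login endpoint accepts the `{"Credentials": {...}}` shape shown in the diff; `build_credentials` is a hypothetical helper, not part of pridepy.

```python
import json


def build_credentials(username: str, password: str) -> str:
    """Return the login payload as a JSON string (hypothetical helper)."""
    # json.dumps escapes quotes, backslashes, and control characters,
    # so the payload stays valid JSON for any username or password.
    return json.dumps({"Credentials": {"username": username, "password": password}})


if __name__ == "__main__":
    body = build_credentials("yourusername", 'pa"ss\\word')
    # Round-trip to confirm the password survives serialization unchanged.
    print(json.loads(body)["Credentials"]["password"])
```

The returned string could be passed as `data=` to `requests.post` in place of the concatenated value, leaving the rest of `get_token` unchanged.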