Search by key word #60

Merged
merged 4 commits into from
Feb 10, 2025
11 changes: 10 additions & 1 deletion README.md
@@ -58,6 +58,7 @@ Commands:
get-projects-by-accession get projects by accession...
stream-files-metadata Stream all files metadata in...
stream-projects-metadata Stream all projects metadata...
search-projects-by-keywords-and-filters Search all projects by keywords...

```
> [!NOTE]
@@ -135,7 +136,7 @@ $ pridepy download-file-by-name -a PXD022105 -o /Users/yourname/Downloads/folder
>[!WARNING]
> To download private files, the user should use the same command as downloading a single file by name. The only difference is that the user should provide the username and password. However, protocol in this case is unnecessary as the tool will use the https protocol to download the files. At the moment we only allow this protocol because of the infrastructure of PRIDE private files (read the whitepaper for more information).

## Streamming metadata
## Streaming metadata

One of the great features of PRIDE and pridepy is the ability to stream metadata of all projects and files. This is useful for users who want to analyze the metadata of all projects and files locally.

@@ -156,6 +157,14 @@ Stream the files metadata of a specific project as JSON and write it to a file:
$ pridepy stream-files-metadata -o PXD005011_files.json -a PXD005011
```

## Search projects by keywords and filters

Get the Project metadata by keywords and filters

```bash
$ python -m pridepy.pridepy search-projects-by-keywords-and-filters -f projectTags==Proteometools,organismsPart==Pancreas -k human -sd DESC -sf accession -sf submissionDate
```

# White paper

A white paper is available at [here](paper/paper.md). We can build it as PDF using pandoc.
66 changes: 57 additions & 9 deletions pridepy/pridepy.py
@@ -126,7 +126,7 @@ def download_all_public_raw_files(
"--category",
required=True,
help="Category of the files to be downloaded",
type=click.Choice("RAW,PEAK,SEARCH,RESULT,SPECTRUM_LIBRARY,OTHER, FASTA".split(",")),
type=click.Choice("RAW,PEAK,SEARCH,RESULT,SPECTRUM_LIBRARY,OTHER,FASTA".split(",")),
)
def download_all_public_category_files(
accession: str,
@@ -312,23 +312,71 @@ def stream_files_metadata(accession, output_file):
@click.option(
"-k",
"--keyword",
required=False,
default="",
required=True,
help="The entered word will be searched among the fields to fetch "
"matching pride. The structure of the keyword is : *:*",
"matching pride."
)
@click.option(
"-f",
"--filter",
required=False,
help="Parameters to filter the search results. The structure of the "
"filter is: field1==value1, field2==value2. Example "
"accession==PRD000001",
)
Comment on lines +319 to +326

💡 Verification agent

🧩 Analysis chain

Clarify filter parsing logic.

The help text suggests using comma-separated field-value pairs (e.g., "field1==value1, field2==value2"), but the implementation doesn't show how these are parsed. Please verify:

  1. Is the filter string parsed correctly?
  2. Are multiple filters supported?

🏁 Script executed:

#!/bin/bash
# Description: Check if filter parsing logic exists in the Project class

# Search for filter parsing logic in the Project class
ast-grep --pattern $'class Project {
  $$$
  search_by_keywords_and_filters($$$) {
    $$$
  }
  $$$
}'

# Search for any filter-related code
rg -A 5 "filter.*==.*"

Length of output: 1141


Clarify Filter Parsing Implementation

The current implementation only shows a help text suggesting the use of comma‑separated field–value pairs, but no explicit code was found that splits or otherwise processes the filter string. In particular, a search for filter‐related parsing (e.g. splitting on commas or "==") within the Project class (and specifically in the search_by_keywords_and_filters function) didn’t reveal any parsing logic. This raises two concerns:

• The filter string does not appear to be explicitly parsed to convert something like "field1==value1, field2==value2" into structured filters.
• There is no clear evidence that multiple filters are handled, even though the documentation suggests that they should be.

It would be advisable to verify whether filter processing is handled elsewhere or to add explicit parsing (with corresponding inline comments and tests) so that the behavior matches the help text.
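
One way to close the gap described above is a small helper that turns the comma-separated filter string into explicit pairs before it reaches the search call. The following is a minimal sketch, not pridepy's implementation: `parse_filters` is a hypothetical name, and the `==` separator and comma delimiter are assumed from the `--filter` help text.

```python
from typing import List, Optional, Tuple


def parse_filters(raw: Optional[str]) -> List[Tuple[str, str]]:
    """Split 'field1==value1, field2==value2' into (field, value) pairs.

    Hypothetical helper: the syntax is assumed from the --filter help text.
    """
    pairs = []
    for chunk in (raw or "").split(","):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty input and trailing commas
        # Split on the first '==' only, so a value may itself contain '=='.
        field, sep, value = chunk.partition("==")
        field, value = field.strip(), value.strip()
        if not sep or not field or not value:
            raise ValueError(f"Malformed filter {chunk!r}; expected field==value")
        pairs.append((field, value))
    return pairs


if __name__ == "__main__":
    print(parse_filters("projectTags==Proteometools, organismsPart==Pancreas"))
```

Raising on a malformed pair surfaces typos at the command line instead of sending an ambiguous string to the server. This sketch does not handle filter values that themselves contain commas; that case would need an escaping rule that the help text does not define.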

@click.option(
"-ps",
"--page_size",
required=False,
default=100,
type=click.IntRange(min=1, max=1000),
help="Number of results to fetch in a page",
)
@click.option(
"-p",
"--page",
required=False,
default=0,
type=click.IntRange(min=0),
help="Identifies which page of results to fetch",
)
@click.option(
"-sd",
"--sort_direction",
required=False,
default="DESC",
help="Sorting direction: ASC or DESC",
)
@click.option(
"-sf",
"--sort_fields",
required=False,
default=["submission_date"],
multiple=True,
help="Field(s) for sorting the results on. Default for this "
"request is submission_date. More fields can be separated by "
"comma and passed. Example: submissionDate,accession",
type=click.Choice("accession,submissionDate,diseases,organismsPart,organisms,instruments,softwares,"
"avgDownloadsPerFile,downloadCount,publicationDate".split(",")),
)
def search_projects_by_keywords_and_filters(
keyword, filter, page_size, page, date_gap, sort_direction, sort_fields
keyword, filter, page_size, page, sort_direction, sort_fields
):
"""
TODO: @selva this function and command line should be reimplemented.
TODO: The idea is that the user can type a keyword or keywords and filters and get all the files projects in
TODO: JSON. Please remember to update the README.
Search all projects by keywords and filters
Parameters:
keyword (str): keyword to search in entire project.
filter (str): filter the search results. field1==value1
page_size (int): no of records or projects per page
page (int): Page number
sort_direction (str): sort direction of the results based on sortfield
sort_fields (str): field to sort the results by.
"""
project = Project()
sf = ', '.join(sort_fields)
logging.info(
project.search_by_keywords_and_filters(
keyword, filter, page_size, page, sort_direction, sort_fields
keyword, filter, page_size, page, sort_direction, sf
)
)
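
The command joins the `-sf` values with `', '`, while the option's help text shows a comma-separated example without spaces (`submissionDate,accession`). Whether the backend tolerates the extra space is not visible in this diff. Below is a minimal sketch of a stricter join, assuming the API expects no whitespace between fields. `join_sort_fields` and `ALLOWED_SORT_FIELDS` are hypothetical names; the allowed values are copied from the `click.Choice` list above.

```python
from typing import Iterable

# Values copied from the click.Choice list in this diff.
ALLOWED_SORT_FIELDS = (
    "accession", "submissionDate", "diseases", "organismsPart", "organisms",
    "instruments", "softwares", "avgDownloadsPerFile", "downloadCount",
    "publicationDate",
)


def join_sort_fields(fields: Iterable[str]) -> str:
    """Validate, de-duplicate (keeping order) and join sort fields with ','.

    Hypothetical helper; assumes the API wants 'a,b' rather than 'a, b'.
    """
    seen = []
    for field in fields:
        if field not in ALLOWED_SORT_FIELDS:
            raise ValueError(f"Unknown sort field: {field!r}")
        if field not in seen:
            seen.append(field)
    return ",".join(seen)


if __name__ == "__main__":
    print(join_sort_fields(("accession", "submissionDate")))
```

Note that the declared default, `submission_date`, is not among the `click.Choice` values (`submissionDate` is), so this sketch would reject it.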
