Regulation GAF #14

Open
hunter-moseley opened this issue Jan 20, 2021 · 13 comments

@hunter-moseley
Member

Would be nice if GOcats could generate a regulation GAF.

@ehinderer ehinderer self-assigned this Jan 20, 2021
@ehinderer
Member

ehinderer commented Jan 27, 2021

To document the planned changes:

Planning to identify regulatory inferences in GO by incorporating the inferred regulatory ancestors of regulates/negatively_regulates/positively_regulates edges into the list of annotations associated with a gene/gene product in a gene annotation file (GAF). This "regulatory GAF" (rGAF) should allow for enrichment of regulatory mechanisms when used as input for hypergeometric enrichment analyses.

Four types of rGAF will exist: one for each type of regulation edge in GO, and one that combines all three:

  • regulates
  • positively_regulates
  • negatively_regulates
  • (any regulation)

The inference logic is as follows:
If (A) -[regulates/positively_regulates/negatively_regulates]-> (B);
and (A) -[is_a]-> (A') -[is_a]-> (A'');
and (B) -[is_a/part_of/part_of_some*]-> (B') -[is_a/part_of/part_of_some*]-> (B'')

Then all instances of genes with annotations B, B', or B'' will, in the rGAF, instead be annotated to A, A', and A'' (the full ancestor set of hypernym relations).
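
A minimal sketch of the inference logic above, on a toy graph. The term names (`A`, `A1`, `B2`, ...), the dictionaries, and both helper functions are hypothetical placeholders for illustration; none of this is GOcats code or real GO data.

```python
# Hypothetical toy ontology: child -> set of parents. Not real GO terms.
IS_A = {"A": {"A1"}, "A1": {"A2"}, "B": {"B1"}}
PART_OF = {"B1": {"B2"}}  # stands in for part_of and part_of_some edges
REGULATES = [("A", "regulates", "B")]  # (regulator, edge type, regulatee)


def ancestors(term, *relation_maps):
    """Return the term plus everything reachable upward via the given relations."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for relation in relation_maps:
            for parent in relation.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
    return seen


def regulatory_terms(annotation):
    """Map one original annotation to the regulator terms it implies."""
    inferred = set()
    for regulator, _edge_type, regulatee in REGULATES:
        # B, B', B'': follow is_a and part_of-like edges from the regulatee.
        if annotation in ancestors(regulatee, IS_A, PART_OF):
            # A, A', A'': follow only is_a edges from the regulator.
            inferred |= ancestors(regulator, IS_A)
    return inferred
```

Here a gene annotated to `B2` would pick up `A`, `A1`, and `A2`, while a gene annotated only to `A` would pick up nothing.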

The process should be accomplished with three nested loops:

  1. Iterate through gene annotations provided in the original GAF (with direct annotations).
  2. Iterate through all edges in GO, searching for regulates/positively_regulates/negatively_regulates/(any) edges
  3. When a regulation edge is found, if the object of the edge or any of its ancestors (B, B', or B'' in the example above) is in the set of annotations for the gene/gene product in the original GAF, create a new annotation set which includes A and its ancestors as described in the inference logic above. Replace the original annotation set for genes found with regulation edges with this new set in the rGAF.
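
The steps above could be sketched roughly as follows, processing one gene at a time. This is an illustration under assumptions, not the planned implementation: `build_rgaf`, the dictionary layouts, and the precomputed `a_plus`/`b_plus` ancestor sets are all hypothetical, and dropping genes with no matching regulation edge is my reading of the description.

```python
def build_rgaf(gene_annotations, regulation_edges, a_plus, b_plus, edge_types=None):
    """Build a regulatory annotation set per gene.

    gene_annotations: dict of gene -> set of directly annotated terms.
    regulation_edges: iterable of (regulator, edge_type, regulatee) triples.
    a_plus: dict of regulator -> set holding it and its is_a ancestors.
    b_plus: dict of regulatee -> set holding it and its ancestors.
    edge_types: set of edge types to keep; None means any regulation edge.
    """
    rgaf = {}
    for gene, terms in gene_annotations.items():          # step 1
        new_terms = set()
        for regulator, edge_type, regulatee in regulation_edges:  # step 2
            if edge_types is not None and edge_type not in edge_types:
                continue
            if terms & b_plus[regulatee]:                 # step 3
                new_terms |= a_plus[regulator]
        if new_terms:
            # Original annotations are replaced, not merged.
            rgaf[gene] = new_terms
    return rgaf
```

Passing `edge_types={"positively_regulates"}` (or the other edge names) would give one of the single-type rGAFs, and leaving it as `None` gives the "any regulation" variant.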

In the rGAF, original gene annotations (and their ancestors) are not associated with the original gene; the rGAF contains regulatory annotations exclusively. However, in future iterations we may enable a special case of rGAF that also includes the original annotations.

* part_of_some is a logical approximation of the inverse of has_part, where the interpretation is that some instances of the ancestors of one concept are part of the other concept (non-universal; i.e. some but not all instances of B part_of B' if the original relation was B' has_part B). This logical approximation is appropriate in the context of gene annotation enrichment; see Hinderer et al. 2019.
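
As a small illustration of that footnote, the inversion could look like the following. The function name and the tuple layout are assumptions made for this sketch; the relation label `part_of_some` is the one used above.

```python
def part_of_some_edges(has_part_edges):
    """Invert (whole, part) has_part pairs into part_of_some edges.

    Each "whole has_part part" becomes "part part_of_some whole", read as
    "some, but not necessarily all, instances of part are part of whole".
    """
    return [(part, "part_of_some", whole) for whole, part in has_part_edges]
```

So an original `B' has_part B` edge, given as `("B'", "B")`, would yield `("B", "part_of_some", "B'")`.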

@hunter-moseley
Member Author

In creating the rGAF, the original gene annotations must not be included, since they do not represent the regulation relationship.

There could be an option to include original gene annotations that match an A set, but this should be an option and not the default. Also, if this option is allowed, then the ancestors of any original gene annotation matching an A set would need to be included as well. The resulting rGAF would thus include the direct annotations of the regulator (A) and the regulation annotations based on matching B.

@ehinderer
Member

I've updated the description of the planned changes. Do they look correct now? I'll hopefully have some time to work on it this week, as long as I'm understanding the intention properly.

@hunter-moseley
Member Author

Just to be clear, you need to check if a gene's specific annotation is a member of the B_plus_ancestors set. This was not explicitly stated.

@ehinderer
Member

Okay, check the italicized changes and hopefully I've captured it accurately now!

@hunter-moseley
Member Author

That clearly states what should be done. By the way, it would be good to have options that limit the A_plus_ancestors and B_plus_ancestors sets to just A and B, respectively. Something like --limit-regulator (for A) and --limit-regulatee (for B).
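
A rough sketch of how those two options might behave, using Python's standard `argparse` purely for illustration (GOcats's actual command-line handling may be wired differently). The flag names come from the suggestion above; `build_parser`, `closure_or_self`, and the `prog` name are hypothetical.

```python
import argparse


def build_parser():
    """Parser exposing only the two proposed limiting flags."""
    parser = argparse.ArgumentParser(prog="rgaf-sketch")
    parser.add_argument("--limit-regulator", action="store_true",
                        help="use only A instead of A plus its ancestors")
    parser.add_argument("--limit-regulatee", action="store_true",
                        help="use only B instead of B plus its ancestors")
    return parser


def closure_or_self(term, plus_ancestors, limit):
    """Return just the term when limited, else its precomputed ancestor set."""
    return {term} if limit else plus_ancestors[term]
```

With `--limit-regulator` set, `closure_or_self("A", a_plus_ancestors, args.limit_regulator)` would return only `{"A"}`.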

@ehinderer
Member

ehinderer commented Feb 2, 2021

I'm wondering if I should just add a new argument to gocats.categorize_dataset() for outputting the rGAF? The issue is that we aren't necessarily interested in categorizing the annotations in this use case.

Alternatively, I could write a new top-level function. This would mean that you could run it from the command line. That function would:

  • Create a GOcats GOgraph that included relationship edges
  • Import the original GAF
  • Perform the rGAF creation steps listed above.

Also, I created a new branch for tracking these changes. I think it's best to work within GitHub for these changes, since we're already in release versions.

@hunter-moseley
Member Author

Would suggest a new top-level function.
If done the right way, the rGAF could be later categorized.

@ehinderer
Member

Okay, working on it now!

@ehinderer
Member

@hunter-moseley When you get a chance, could you double-check my logic in the new commit on rGAF? Here's the permalink to the new create_regulatory_gaf() method.

I am running out of memory when doing this. I believe including all ancestors of each annotation is too permissive. It's leading to a lot of regulatory annotations being added. From the few I looked at, they looked reasonable. But I'd like to make sure I'm not doing anything silly before suggesting we move this to the computing cluster.

@hunter-moseley
Member Author

hunter-moseley commented Feb 10, 2021

mapped_rgaf_array needs to be built from a set of node.id, otherwise you are likely to have a lot of duplicates.
This may be why you are running out of memory.
Also, you need to make sure you are analyzing one gene's worth of annotations at a time.
Otherwise, mapped_rgaf_array is going to have a lot of duplicates.
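
To illustrate the point about duplicates, here is a minimal sketch that collects term IDs into one set per gene before any rows are written. It is not the code in the commit: the function names and the `(gene_id, term_id)` row layout are assumptions, and the IDs in the usage note are placeholders.

```python
from collections import defaultdict


def unique_terms_per_gene(rows):
    """Collapse (gene_id, term_id) pairs into one set of term IDs per gene."""
    per_gene = defaultdict(set)
    for gene_id, term_id in rows:
        per_gene[gene_id].add(term_id)  # a set silently drops repeats
    return per_gene


def rgaf_rows(per_gene):
    """Yield each (gene_id, term_id) pair exactly once, in sorted order."""
    for gene_id in sorted(per_gene):
        for term_id in sorted(per_gene[gene_id]):
            yield gene_id, term_id
```

For example, feeding in `("g1", "T1")` twice yields a single `("g1", "T1")` row, so the output grows with the number of distinct gene/term pairs rather than with the number of inference paths that reach them.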

@ehinderer
Member

Changes are reflected in the rGAF branch. Was this tested, and is it safe to merge?

@hunter-moseley
Member Author

hunter-moseley commented Oct 11, 2022 via email
