Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Query classes and related code #492

Merged
merged 23 commits into from
Nov 7, 2023
Merged

refactor: Query classes and related code #492

merged 23 commits into from
Nov 7, 2023

Conversation

DaniBodor
Copy link
Collaborator

@DaniBodor DaniBodor commented Sep 6, 2023

In this PR:

  • Query class has been converted into a dataclass and all common arguments are defined in the parent class instead of the children.
  • ProteinProteinInterfaceResidueQuery and ProteinProteinInterfaceAtomicQuery were merged into a single child dataclass of Query, as were SingleResidueVariantResidueQuery and SingleResidueVariantAtomicQuery
    • resolution is now given as an attribute instead of being a separate class
  • Moved as much code as possible into parent class
    • unbound functions were unified and made internal methods of parent.
    • unused/obsolete functions were removed:
      • code related to adding hydrogens
      • _get_atom_node_key function
      • potentially more that I've lost track of
    • build function is largely defined in parent class, with specifics defined in child class as internal _build_helper methods.
      • it feels like build function could be further unified, but my tests doing this kept failing.
  • Reorganized module
    • QueryCollection at the bottom now
  • Update external modules to match this
    • tests
    • docs
    • notebooks
    • etc

Not yet done:

fixes: #480 and #490

Final TO DO list:

  • Move parameters to parent class
  • Switch to dataclasses
  • Unify atomic and residue version of each query into one
    • fix all tests, linting, READMEs, notebooks until here
  • Move as much as possible into parent class
    • Make main build function in parent, have specific helper functions in child classes.
    • Remove unbound functions or make them methods or parent if still useful.
    • Unify build methods for PPI and SVR (if possible)
  • Make sure all tests and external modules work with new classes
  • Update docstrings
    • Check how this should work with inheritance, what is listed in which docstring, etc
    • Check if it's possible to inherit docstrings.
      • Haven't found a nice way of doing this.
    • Check for all methods as well (not all necessarily require one).
  • Remove code related to inlcuding hydrogens
    • check that this is indeed OK
  • Clarify distance_cutoff vs radius across different Query classes (see here)
  • Resolve conflicts with docs: type hinting and docstrings in molstruct #497
    • Potentially others
  • Check all #TODOs throughout script
  • Refactor duplicate code from buildgraph module

@DaniBodor DaniBodor force-pushed the 480_new branch 2 times, most recently from 47b97ee to 3cd2a8c Compare November 1, 2023 13:21
@DaniBodor
Copy link
Collaborator Author

DaniBodor commented Nov 1, 2023

I fixed the attribute issue above in 3cd2a8c. Can you take a look and see if this is what you were thinking?

Can you also take a look at the TODOs I put in lines 250 and 335 (same issue) and 462 of query.py and give me your opinion on those.

When that is done, I can finally merge this PR :)

Note:

@DaniBodor
Copy link
Collaborator Author

Also, should we remove codacy with this PR?

no, let's remove it in a separate PR? This one is already so big and unwieldly, that I don't want to add anything that doesn't need to be here.

@gcroci2
Copy link
Collaborator

gcroci2 commented Nov 1, 2023

Also, should we remove codacy with this PR?

no, let's remove it in a separate PR? This one is already so big and unwieldly, that I don't want to add anything that doesn't need to be here.

Done in #517

@gcroci2
Copy link
Collaborator

gcroci2 commented Nov 1, 2023

I fixed the attribute issue above in 3cd2a8c. Can you take a look and see if this is what you were thinking?

Can you also take a look at the TODOs I put in lines 250 and 335 (same issue) and 462 of query.py and give me your opinion on those.

When that is done, I can finally merge this PR :)

Note:

line 250: I think it's actually fine to throw an error since only 1 chain identifier is needed to identify the chain where the variant is. If more are given then the user hasn't understood the purpose of chain_ids.
line 335: same
line 462: I don't see any TODO.

One final thing: looking at those lines made me think about something I commented earlier about, but was lost someway. I think it should be clearer that chain_ids refers to the chains involved in the definition of whatever query (e.g., the chain identifier of the variant), but does not limit the usage of other eventual chains present in the pdb files. It could be indeed that the radius or the cutoff distance defined is enough to include other chains, not considered in the chain_ids variable because its purpose is another one. So, for example in SingleResidueVariantQuery, the doc string says the chain identifier(s) in the pdb file (generally a single capital letter) but I think it should be something like the chain identifier of the variant residue in the pdb file (generally a single capital letter). For ProteinProteinInterfaceQuery, it could be something like the two chain identifiers of the interface in the pdb file (generally single capital letters).

Also, if there are more things to discuss let's just have a meeting because I have the impression that finding stuff in this PR is becoming very complex and inefficient ;)

@gcroci2 gcroci2 mentioned this pull request Nov 2, 2023
@DaniBodor
Copy link
Collaborator Author

I fixed the attribute issue above in 3cd2a8c. Can you take a look and see if this is what you were thinking?

Have you looked at this also?

Can you also take a look at the TODOs I put in lines 250 and 335 (same issue) and 462 of query.py and give me your opinion on those.

line 462: I don't see any TODO.

This is about error catching in _process_one_query:
Right now, if there's an error, it's caught and that query is skipped. My guess is that this was put in place so that it didnt crash due to a handfull of faulty pdb files.
My question was whether we want to make this optional, and by default raise an error. I think generally, users should have clean data to work with, and if not pro-actively set this to ignore errors. What do you think?

One final thing: looking at those lines made me think about something I commented earlier about, but was lost someway. I think it should be clearer that chain_ids refers to the chains involved in the definition of whatever query (e.g., the chain identifier of the variant), but does not limit the usage of other eventual chains present in the pdb files. It could be indeed that the radius or the cutoff distance defined is enough to include other chains, not considered in the chain_ids variable because its purpose is another one. So, for example in SingleResidueVariantQuery, the doc string says the chain identifier(s) in the pdb file (generally a single capital letter) but I think it should be something like the chain identifier of the variant residue in the pdb file (generally a single capital letter). For ProteinProteinInterfaceQuery, it could be something like the two chain identifiers of the interface in the pdb file (generally single capital letters).

Clarified in 23357f0

@DaniBodor DaniBodor marked this pull request as draft November 3, 2023 14:02
prospector suddenly flagged a lot of code that wasnt touched here.
Probably because I updated my local versions for prospector and pylint.
I now also updated the versions in the toml to match my versions.
also list internal `QueryCollection` attributes in init
also change "atomic" to "atom" to be consistent with "residue" when setting `resolution`
also remove some TODOs and made some style improvements
@DaniBodor DaniBodor marked this pull request as ready for review November 3, 2023 16:22
@gcroci2 gcroci2 linked an issue Nov 6, 2023 that may be closed by this pull request
@DaniBodor DaniBodor merged commit 97db708 into dev Nov 7, 2023
11 of 12 checks passed
@DaniBodor DaniBodor deleted the 480_new branch November 7, 2023 10:30
@gcroci2 gcroci2 added the SS label Jan 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update prospector and pylint in pyproject.toml Is _get_atom_node_key obsolete? Refactor Query classes
3 participants