Implementing `gemmi`-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) #4712

marinegor · 2024-09-20T21:57:20Z

Fixes #2367 and also extends #4303

Changes made in this Pull Request:

Uses the gemmi library (link) to parse mmCIF files.
Adds the classes `MMCIFReader(base.SingleFrameReaderBase)` and `MMCIFParser(TopologyReaderBase)` for that.

As a bonus, this implementation could potentially read any of the gemmi-supported formats (source):

mmCIF (PDBx/mmCIF),
PDB (with popular extensions),
mmJSON

With slight modifications, this would also allow reading mmCIF files with multiple models sharing the same topology, as well as more feature-rich parsing of PDB files (the same code, without changes, can be used for parsing altlocs, charges, etc. from all of these formats).

However, I'm slightly lost on what needs to be done next for this PR to be merged, so I'm asking if someone could help me navigate here (tagging @richardjgowers as the author of the original PDBx implementation, #4303).

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

Developers certificate of origin

I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.

📚 Documentation preview 📚: https://mdanalysis--4712.org.readthedocs.build/en/4712/

pep8speaks · 2024-09-20T21:57:28Z

Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file package/MDAnalysis/coordinates/MMCIF.py:

Line 28:80: E501 line too long (84 > 79 characters)
Line 41:80: E501 line too long (85 > 79 characters)
Line 42:80: E501 line too long (93 > 79 characters)
Line 61:80: E501 line too long (104 > 79 characters)
Line 65:80: E501 line too long (87 > 79 characters)
Line 67:80: E501 line too long (107 > 79 characters)

In the file package/MDAnalysis/topology/MMCIFParser.py:

Line 2:24: W291 trailing whitespace
Line 60:80: E501 line too long (111 > 79 characters)
Line 72:80: E501 line too long (123 > 79 characters)
Line 82:80: E501 line too long (122 > 79 characters)
Line 106:80: E501 line too long (108 > 79 characters)
Line 113:80: E501 line too long (80 > 79 characters)
Line 128:80: E501 line too long (91 > 79 characters)
Line 175:80: E501 line too long (126 > 79 characters)
Line 185:80: E501 line too long (125 > 79 characters)
Line 224:80: E501 line too long (126 > 79 characters)
Line 242:80: E501 line too long (140 > 79 characters)
Line 281:80: E501 line too long (87 > 79 characters)
Line 292:80: E501 line too long (90 > 79 characters)

In the file package/MDAnalysis/topology/PDBParser.py:

Line 56:80: E501 line too long (80 > 79 characters)
Line 57:80: E501 line too long (84 > 79 characters)

In the file package/MDAnalysis/topology/__init__.py:

Line 335:26: W292 no newline at end of file

In the file testsuite/MDAnalysisTests/datafiles.py:

Line 48:80: E501 line too long (103 > 79 characters)
Line 81:80: E501 line too long (80 > 79 characters)
Line 97:80: E501 line too long (86 > 79 characters)
Line 271:80: E501 line too long (90 > 79 characters)
Line 340:80: E501 line too long (104 > 79 characters)
Line 387:80: E501 line too long (83 > 79 characters)
Line 436:80: E501 line too long (80 > 79 characters)
Line 463:80: E501 line too long (80 > 79 characters)
Line 481:80: E501 line too long (80 > 79 characters)
Line 493:80: E501 line too long (80 > 79 characters)
Line 494:80: E501 line too long (80 > 79 characters)
Line 497:80: E501 line too long (83 > 79 characters)
Line 498:80: E501 line too long (86 > 79 characters)
Line 546:80: E501 line too long (82 > 79 characters)
Line 547:80: E501 line too long (82 > 79 characters)
Line 549:80: E501 line too long (88 > 79 characters)
Line 551:80: E501 line too long (88 > 79 characters)
Line 552:80: E501 line too long (81 > 79 characters)
Line 777:80: E501 line too long (81 > 79 characters)
Line 778:80: E501 line too long (87 > 79 characters)
Line 779:80: E501 line too long (84 > 79 characters)
Line 780:80: E501 line too long (85 > 79 characters)
Line 781:80: E501 line too long (83 > 79 characters)

Comment last updated at 2024-10-25 11:17:29 UTC

github-actions · 2024-09-20T21:59:32Z

Linter Bot Results:

Hi @marinegor! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location	Outcome
main package	⚠️ Possible failure
testsuite	⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/11148966346/job/30986736623

Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

richardjgowers

Looks good so far, will require a small test file to check reader/parser halves.

richardjgowers · 2024-09-21T15:10:13Z

package/MDAnalysis/coordinates/MMCIF.py

+        pass
+
+
+class MMCIFWriter(base.WriterBase):


I wouldn't include this at this stage, Writer is optional

richardjgowers · 2024-09-21T15:10:31Z

package/MDAnalysis/topology/MMCIFParser.py

+from .base import TopologyReaderBase
+
+
+def _into_idx(arr: list[int]) -> list[int]:


Document what this does, ideally with an example
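One way to satisfy this request is a docstring with a doctest-style example. The body of `_into_idx` is not shown in this thread, so the sketch below only assumes that it turns a sequence of per-atom identifiers (such as residue sequence numbers) into zero-based consecutive indices. The name `into_idx_sketch` and its behavior are hypothetical and may not match the PR.

```python
def into_idx_sketch(arr: list[int]) -> list[int]:
    """Replace each run of equal values with a zero-based running index.

    Hypothetical stand-in for ``_into_idx``; the real function may differ.

    Example
    -------
    >>> into_idx_sketch([10, 10, 11, 11, 11, 15])
    [0, 0, 1, 1, 1, 2]
    """
    idx: list[int] = []
    current = -1
    previous = None
    for position, value in enumerate(arr):
        # A new run starts at the first element or whenever the value changes.
        if position == 0 or value != previous:
            current += 1
            previous = value
        idx.append(current)
    return idx
```

Under this assumption a repeated identifier that reappears later (for example `[1, 2, 1]`) starts a new index each time, which is the kind of edge case a docstring example makes explicit.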

richardjgowers · 2024-09-21T15:11:33Z

package/MDAnalysis/topology/MMCIFParser.py

+            record_types,  # res.het_flag
+            tempfactors,  # at.b_iso
+            residx,  # _into_idx(res.seqid.num)
+        ) = map(  # this is probably not pretty, but it's efficient -- one loop over the mmcif


Are all the fields here guaranteed in a valid pdbx? One benefit to working column by column is that you can do optional columns

marinegor · 2024-09-21

Do you have an example of a PDBx file in mind, or something like a test set for them? I've never actually worked with the format, since as far as I know RCSB only offers PDB or mmCIF.

richardjgowers · 2024-09-22

PDBx is mmCIF. The download links here will give you an example file: https://www.rcsb.org/structure/4ake. We use 4ake elsewhere in the testsuite. In my experience, the PDB and mmCIF versions of the same entry sometimes aren't completely identical, so I wouldn't worry about trying to align the PDB and PDBx tests.
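The column-by-column idea raised in this thread can be sketched without any parser at all. The example below is only an illustration: the `columns` dict stands in for whatever the mmCIF library returns, and the split into `REQUIRED` and `OPTIONAL` tags is an assumption made for the example, not a statement about what the PDBx/mmCIF dictionary mandates.

```python
# Hypothetical column store: tag name -> list of raw string values.
# In a real reader these columns would come from the mmCIF parser.
columns = {
    "_atom_site.label_atom_id": ["N", "CA", "C"],
    "_atom_site.Cartn_x": ["1.0", "2.5", "3.1"],
    # "_atom_site.B_iso_or_equiv" is deliberately absent here.
}

# Illustrative split only; which tags are truly mandatory is an assumption.
REQUIRED = ("_atom_site.label_atom_id", "_atom_site.Cartn_x")
OPTIONAL = ("_atom_site.B_iso_or_equiv", "_atom_site.pdbx_formal_charge")


def read_columns(table):
    """Return (attrs, missing): required columns raise, optional ones are skipped."""
    attrs = {}
    missing = []
    for tag in REQUIRED:
        if tag not in table:
            raise ValueError(f"required column {tag!r} not found")
        attrs[tag] = table[tag]
    for tag in OPTIONAL:
        if tag in table:
            attrs[tag] = table[tag]
        else:
            missing.append(tag)
    return attrs, missing


attrs, missing = read_columns(columns)
```

With this layout a file that lacks temperature factors or charges still loads, and the `missing` list can feed a warning instead of an exception.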

richardjgowers · 2024-09-21T15:12:17Z

package/MDAnalysis/topology/MMCIFParser.py

+            np.array,
+            list(
+                zip(
+                    *[


I'm struggling to follow the logic here, a comment breaking down what this double nested loop iteration into a zip is doing would be nice
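For readers unfamiliar with the pattern, here is a minimal sketch of what a nested loop fed into `zip(*...)` does. The `residues` data and attribute names are made up for illustration, and plain lists are used where the diff excerpt applies `np.array`.

```python
# Hypothetical nested structure: residues, each holding (atom name, B-factor) pairs.
residues = [
    ("ALA", [("N", 10.0), ("CA", 11.5)]),
    ("GLY", [("N", 9.0)]),
]

# Step 1: the nested comprehension emits one tuple per atom (a "row").
rows = [
    (atom_name, resname, bfactor)
    for resname, atoms in residues      # outer loop: residues
    for atom_name, bfactor in atoms     # inner loop: atoms in that residue
]

# Step 2: zip(*rows) transposes the rows into columns, one tuple per attribute.
# Step 3: map(list, ...) converts each column into a list.
names, resnames, bfactors = map(list, zip(*rows))
```

The result is one sequence per attribute, all of the same length and in atom order, built in a single pass over the structure.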

richardjgowers · 2024-09-21T15:12:54Z

package/pyproject.toml

@@ -78,6 +78,7 @@ extra_formats = [
    "pytng>=0.2.3",
    "gsd>3.0.0",
    "rdkit>=2020.03.1",
+    "gemmi", # for mmcif format


This will probably be optional, so other imports will have to respect that too
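A common way to respect an optional dependency is to guard the import and fail only when the feature is actually used. The sketch below shows that general pattern; the flag `HAS_GEMMI` and the helper `require_gemmi` are hypothetical names chosen for this example, not necessarily what the PR uses.

```python
try:
    import gemmi  # optional dependency, only needed for mmCIF support
except ImportError:
    HAS_GEMMI = False
else:
    HAS_GEMMI = True


def require_gemmi() -> None:
    """Raise a helpful error if gemmi is not installed (hypothetical helper)."""
    if not HAS_GEMMI:
        raise ImportError(
            "This reader needs the optional 'gemmi' package; "
            "install it to read mmCIF files."
        )
```

Calling such a helper at the start of a reader's or parser's constructor keeps the package importable without gemmi, while users who try to open an mmCIF file get a clear message.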

codecov · 2024-10-24T19:54:16Z

Codecov Report

Attention: Patch coverage is 85.54217% with 12 lines in your changes missing coverage. Please review.

Project coverage is 93.59%. Comparing base (101008b) to head (e80632c).
Report is 1 commits behind head on develop.

Files with missing lines	Patch %	Lines
package/MDAnalysis/coordinates/MMCIF.py	75.75%	5 Missing and 3 partials ⚠️
package/MDAnalysis/topology/MMCIFParser.py	91.11%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4712      +/-   ##
===========================================
- Coverage    93.65%   93.59%   -0.06%     
===========================================
  Files          175      189      +14     
  Lines        21564    22715    +1151     
  Branches      3023     3028       +5     
===========================================
+ Hits         20195    21261    +1066     
- Misses         925     1005      +80     
- Partials       444      449       +5

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

marinegor · 2024-10-24T20:38:42Z

Hi @richardjgowers, I think the PR is ready now, and I believe I have also addressed your comments.

orbeckst · 2024-12-04T00:23:45Z

@richardjgowers do you have the bandwidth to look after this PR? If not please unassign yourself. Thanks!

marinegor · 2024-12-19T23:26:14Z

just wanted to bump it before the holidays @richardjgowers and/or @orbeckst

orbeckst · 2024-12-19T23:49:02Z

Sorry, I won't be reviewing, I haven't had to deal with pdbx & friends (and I have plenty of other PRs to review). Someone with experience should have a look. Maybe hustle on Discord in #developers to see if you can motivate someone to have a look — perhaps @yuxuanzhuang @fiona-naughton @RMeli ??? (I believe @richardjgowers is currently oo.)

Regardless, I would at least make sure that codecov shows good patch coverage.

If you have a specific question then I'd ask the question instead of waiting for a review — sometimes people have enough time/energy to answer a question in 2 mins but not 30 mins for a code review.

yuxuanzhuang · 2024-12-20T03:40:47Z

I’d really love to take some time to review this and get it merged, but after the holiday! I spoke with @BradyAJohnston the other day about how support for reading mmcif files in MDA would be a fantastic addition for MolecularNodes. Maybe @BradyAJohnston might be interested in taking a look at this? No pressure at all to actually review it, though!

BradyAJohnston · 2024-12-20T03:47:38Z

I'm certainly interested in reviewing, as this would potentially help unify the backends to just MDAnalysis for Molecular Nodes, but I won't have time before the holidays; it will have to wait until next year!

marinegor · 2024-12-20T21:18:12Z

I guess I'll relax then and plan it for after the holidays @BradyAJohnston! Only if you have the capacity though :)

I can't suggest you as a reviewer but I guess you can do that yourself, right?
Also, I'd be happy to get any testing suggestions -- I have a feeling that the current approach is a bit sloppy, and there's no clear distinction between topology and coordinates testing (mainly because they're both stored in one file).

@orbeckst I will, thanks! I felt that it's a bit too early to test specific exceptions and warnings since perhaps their specifics will change

hmacdope · 2024-12-26T09:24:31Z

I'll also try to review after the holidays.

marinegor added 13 commits May 22, 2024 20:04

Start working on MMCIF parser

aa2a88f

Add first (not working) version of MMCIFReader and MMCIF topology parser

218cf43

Do some squashing

7f78e02

Remove inherited docs

6682d6e

Try improving the parsing

817f3a0

Try three independent loops over the model

3cc8c80

Merge remote-tracking branch 'upstream/develop' into feature/mmcif

f1bf325

Add gemmi dependency

d21c220

necessary params

2a1be15

finished sorting atom attrs

77645e6

add function for transformation into *idx

91e6942

oh damn seems to finally be working

9a0c086

remove TODOs

9c731df

Remove debug prints

8b40ec7

richardjgowers reviewed Sep 21, 2024

marinegor added 13 commits September 23, 2024 00:31

Merge branch 'develop' into feature/mmcif

bdcbd73

try to pack things into separate class in utils?

401a4d3

remove unnecessary functions

9c336bd

copy all loops into separate functions

def88e4

Move loops over structures into functions

cabfd37

Move coordinate fetching into function for the coordinate reader as well

4c9d930

Fix imports

184491a

Start adding documentation

3de8565

Reference MMCIFParser in PDBParser

ca6ebbb

Add documentation for trajectory and topology parsers

45077ad

Add mmcif tests

9a1a59a

Update format specifications

27c10d6

Write simple tests

950cfcf

marinegor added 5 commits October 24, 2024 14:06

Merge remote-tracking branch 'upstream/develop' into feature/mmcif

8d1a8b5

update github action with gemmi

ef29338

fix gemmi import errors

caca17e

add mmcif testfiles

f0e49cc

add mmcif to __all__

b7ada7c

marinegor marked this pull request as ready for review October 24, 2024 20:37

marinegor requested a review from richardjgowers October 24, 2024 20:38

add black instead of ruff

e80632c

orbeckst changed the title ~~[WIP] Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON)~~ Implementing gemmi-based mmcif reader (with easy extension to PDB/PDBx and mmJSON) Dec 4, 2024

orbeckst assigned richardjgowers Dec 4, 2024

RMeli mentioned this pull request Dec 10, 2024

add active PRs to black ignore #4815

Merged

5 tasks

hmacdope self-assigned this Dec 26, 2024
