Segmentation Fault after lots of Featurizer failed errors with long sequence #162

Open
entropybit opened this issue Jan 19, 2025 · 0 comments


Hi,

I got a segmentation fault during a boltz run on a rather long sequence (>4000 AA). My input YAML file was this:

version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: [A]
      sequence: MAGSGAGVRCSLLRLQETLSAADRCGAALAGHQLIRGLGQECVLSSSPAVLALQTSLVFSRDFGLLVFVR KSLNSIEFRECREEILKFLCIFLEKMGQKIAPYSVEIKNTCTSVYTKDRAAKCKIPALDLLIKLLQTFRS SRLMDEFKIGELFSKFYGELALKKKIPDTVLEKVYELLGLLGEVHPSEMINNAENLFRAFLGELKTQMTS AVREPKLPVLAGCLKGLSSLLCNFTKSMEEDPQTSREIFNFVLKAIRPQIDLKRYAVPSAGLRLFALHAS QFSTCLLDNYVSLFEVLLKWCAHTNVELKKAALSALESFLKQVSNMVAKNAEMHKNKLQYFMEQFYGIIR NVDSNNKELSIAIRGYGLFAGPCKVINAKDVDFMYVELIQRCKQMFLTQTDTGDDRVYQMPSFLQSVASV LLYLDTVPEVYTPVLEHLVVMQIDSFPQYSPKMQLVCCRAIVKVFLALAAKGPVLRNCISTVVHQGLIRI CSKPVVLPKGPESESEDHRASGEVRTGKWKVPTYKDYVDLFRHLLSSDQMMDSILADEAFFSVNSSSESL NHLLYDEFVKSVLKIVEKLDLTLEIQTVGEQENGDEAPGVWMIPTSDPAANLHPAKPKDFSAFINLVEFC REILPEKQAEFFEPWVYSFSYELILQSTRLPLISGFYKLLSITVRNAKKIKYFEGVSPKSLKHSPEDPEK YSCFALFVKFGKEVAVKMKQYKDELLASCLTFLLSLPHNIIELDVRAYVPALQMAFKLGLSYTPLAEVGL NALEEWSIYIDRHVMQPYYKDILPCLDGYLKTSALSDETKNNWEVSALSRAAQKGFNKVVLKHLKKTKNL SSNEAISLEEIRIRVVQMLGSLGGQINKNLLTVTSSDEMMKSYVAWDREKRLSFAVPFREMKPVIFLDVF LPRVTELALTASDRQTKVAACELLHSMVMFMLGKATQMPEGGQGAPPMYQLYKRTFPVLLRLACDVDQVT RQLYEPLVMQLIHWFTNNKKFESQDTVALLEAILDGIVDPVDSTLRDFCGRCIREFLKWSIKQITPQQQE KSPVNTKSLFKRLYSLALHPNAFKRLGASLAFNNIYREFREEESLVEQFVFEALVIYMESLALAHADEKS LGTIQQCCDAIDHLCRIIEKKHVSLNKAKKRRLPRGFPPSASLCLLDLVKWLLAHCGRPQTECRHKSIEL FYKFVPLLPGNRSPNLWLKDVLKEEGVSFLINTFEGGGCGQPSGILAQPTLLYLRGPFSLQATLCWLDLL LAALECYNTFIGERTVGALQVLGTEAQSSLLKAVAFFLESIAMHDIIAAEKCFGTGAAGNRTSPQEGERY NYSKCTVVVRIMEFTTTLLNTSPEGWKLLKKDLCNTHLMRVLVQTLCEPASIGFNIGDVQVMAHLPDVCV NLMKALKMSPYKDILETHLREKITAQSIEELCAVNLYGPDAQVDRSRLAAVVSACKQLHRAGLLHNILPS QSTDLHHSVGTELLSLVYKGIAPGDERQCLPSLDLSCKQLASGLLELAFAFGGLCERLVSLLLNPAVLST ASLGSSQGSVIHFSHGEYFYSLFSETINTELLKNLDLAVLELMQSSVDNTKMVSAVLNGMLDQSFRERAN QKHQGLKLATTILQHWKKCDSWWAKDSPLETKMAVLALLAKILQIDSSVSFNTSHGSFPEVFTTYISLLA DTKLDLHLKGQAVTLLPFFTSLTGGSLEELRRVLEQLIVAHFPMQSREFPPGTPRFNNYVDCMKKFLDAL ELSQSPMLLELMTEVLCREQQHVMEELFQSSFRRIARRGSCVTQVGLLESVYEMFRKDDPRLSFTRQSFV DRSLLTLLWHCSLDALREFFSTIVVDAIDVLKSRFTKLNESTFDTQITKKMGYYKILDVMYSRLPKDDVH 
AKESKINQVFHGSCITEGNELTKTLIKLCYDAFTENMAGENQLLERRRLYHCAAYNCAISVICCVFNELK FYQGFLFSEKPEKNLLIFENLIDLKRRYNFPVEVEVPMERKKKYIEIRKEAREAANGDSDGPSYMSSLSY LADSTLSEEMSQFDFSTGVQSYSYSSQDPRPATGRFRRREQRDPTVHDDVLELEMDELNRHECMAPLTAL VKHMHRSLGPPQGEEDSVPRDLPSWMKFLHGKLGNPIVPLNIRLFLAKLVINTEEVFRPYAKHWLSPLLQ LAASENNGGEGIHYMVVEIVATILSWTGLATPTGVPKDEVLANRLLNFLMKHVFHPKRAVFRHNLEIIKT LVECWKDCLSIPYRLIFEKFSGKDPNSKDNSVGIQLLGIVMANDLPPYDPQCGIQSSEYFQALVNNMSFV RYKEVYAAAAEVLGLILRYVMERKNILEESLCELVAKQLKQHQNTMEDKFIVCLNKVTKSFPPLADRFMN AVFFLLPKFHGVLKTLCLEVVLCRVEGMTELYFQLKSKDFVQVMRHRDDERQKVCLDIIYKMMPKLKPVE LRELLNPVVEFVSHPSTTCREQMYNILMWIHDNYRDPESETDNDSQEIFKLAKDVLIQGLIDENPGLQLI IRNFWSHETRLPSNTLDRLLALNSLYSPKIEVHFLSLATNFLLEMTSMSPDYPNPMFEHPLSECEFQEYT IDSDWRFRSTVLTPMFVETQASQGTLQTRTQEGSLSARWPVAGQIRATQQQHDFTLTQTADGRSSFDWLT GSSTDPLVDHTSPSSDSLLFAHKRSERLQRAPLKSVGPDFGKKRLGLPGDEVDNKVKGAAGRTDLLRLRR RFMRDQEKLSLMYARKGVAEQKREKEIKSELKMKQDAQVVLYRSYRHGDLPDIQIKHSSLITPLQAVAQR DPIIAKQLFSSLFSGILKEMDKFKTLSEKNNITQKLLQDFNRFLNTTFSFFPPFVSCIQDISCQHAALLS LDPAAVSAGCLASLQQPVGIRLLEEALLRLLPAELPAKRVRGKARLPPDVLRWVELAKLYRSIGEYDVLR GIFTSEIGTKQITQSALLAEARSDYSEAAKQYDEALNKQDWVDGEPTEAEKDFWELASLDCYNHLAEWKS LEYCSTASIDSENPPDLNKIWSEPFYQETYLPYMIRSKLKLLLQGEADQSLLTFIDKAMHGELQKAILEL HYSQELSLLYLLQDDVDRAKYYIQNGIQSFMQNYSSIDVLLHQSRLTKLQSVQALTEIQEFISFISKQGN LSSQVPLKRLLNTWTNRYPDAKMDPMNIWDDIITNRCFFLSKIEEKLTPLPEDNSMNVDQDGDPSDRMEV QEQEEDISSLIRSCKFSMKMKMIDSARKQNNFSLAMKLLKELHKESKTRDDWLVSWVQSYCRLSHCRSRS QGCSEQVLTVLKTVSLLDENNVSSYLSKNILAFRDQNILLGTTYRIIANALSSEPACLAEIEEDKARRIL ELSGSSSEDSEKVIAGLYQRAFQHLSEAVQAAEEEAQPPSWSCGPAAGVIDAYMTLADFCDQQLRKEEEN ASVIDSAELQAYPALVVEKMLKALKLNSNEARLKFPRLLQIIERYPEETLSLMTKEISSVPCWQFISWIS HMVALLDKDQAVAVQHSVEEITDNYPQAIVYPFIISSESYSFKDTSTGHKNKEFVARIKSKLDQGGVIQD FINALDQLSNPELLFKDWSNDVRAELAKTPVNKKNIEKMYERMYAALGDPKAPGLGAFRRKFIQTFGKEF DKHFGKGGSKLLRMKLSDFNDITNMLLLKMNKDSKPPGNLKECSPWMSDFKVEFLRNELEIPGQYDGRGK PLPEYHVRIAGFDERVTVMASLRRPKRIIIRGHDEREHPFLVKGGEDLRQDQRVEQLFQVMNGILAQDSA CSQRALQLRTYSVVPMTSRLGLIEWLENTVTLKDLLLNTMSQEEKAAYLSDPRAPPCEYKDWLTKMSGKH 
DVGAYMLMYKGANRTETVTSFRKRESKVPADLLKRAFVRMSTSPEAFLALRSHFASSHALICISHWILGI GDRHLNNFMVAMETGGVIGIDFGHAFGSATQFLPVPELMPFRLTRQFINLMLPMKETGLMYSIMVHALRA FRSDPGLLTNTMDVFVKEPSFDWKNFEQKMLKKGGSWIQEINVAEKNWYPRQKICYAKRKLAGANPAVIT CDELLLGHEKAPAFRDYVAVARGSKDHNIRAQEPESGLSEETQVKCLMDQATDPNILGRTWEGWEPWM
      msa: ./bfd_uniref_hits.a3m

for which I got the following output:

```
(boltz)> user@node dna_pkcs $ boltz predict dna_pkcs.yaml --recycling_steps 10 --diffusion_samples 25 --cache /work --use_msa_server
Checking input data.
Running predictions for 1 structure
Processing input data.
100%|___________________________________________________________________________________________________________________________| 1/1 [00:00<00:00,  1.46it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Predicting: |                                                                                                                           | 0/? [00:00<?, ?it/s]
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error index 16908288 is out of bounds for axis 0 with size 16908288. Skipping.
Featurizer failed on dna_pkcs with error [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 1113409024 bytes. Error code 12 (Cannot allocate memory). Skipping.
Featurizer failed on dna_pkcs with error [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 1103923548 bytes. Error code 12 (Cannot allocate memory). Skipping.
Featurizer failed on dna_pkcs with error [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 1103923548 bytes. Error code 12 (Cannot allocate memory). Skipping.
Featurizer failed on dna_pkcs with error [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 1103923548 bytes. Error code 12 (Cannot allocate memory). Skipping.
Featurizer failed on dna_pkcs with error [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 1103923548 bytes. Error code 12 (Cannot allocate memory). Skipping.
Featurizer failed on dna_pkcs with error [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 1103923548 bytes. Error code 12 (Cannot allocate memory). Skipping.
Featurizer failed on dna_pkcs with error [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 70090384 bytes. Error code 12 (Cannot allocate memory). Skipping.
Featurizer failed on dna_pkcs with error [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 70090384 bytes. Error code 12 (Cannot allocate memory). Skipping.
Failed to load input for dna_pkcs with error Unable to allocate 16.1 MiB for an array with shape (16908288,) and data type [('res_type', 'i1')]. Skipping.
Failed to load input for dna_pkcs with error Unable to allocate 1.01 MiB for an array with shape (33247,) and data type [('name', 'i1', (4,)), ('element', 'i1'), ('charge', 'i1'), ('coords', '<f4', (3,)), ('conformer', '<f4', (3,)), ('is_present', '?'), ('chirality', 'i1')]. Skipping.
Failed to load input for dna_pkcs with error Unable to allocate 1.01 MiB for an array with shape (33247,) and data type [('name', 'i1', (4,)), ('element', 'i1'), ('charge', 'i1'), ('coords', '<f4', (3,)), ('conformer', '<f4', (3,)), ('is_present', '?'), ('chirality', 'i1')]. Skipping.
Failed to load input for dna_pkcs with error Unable to allocate 1.01 MiB for an array with shape (33247,) and data type [('name', 'i1', (4,)), ('element', 'i1'), ('charge', 'i1'), ('coords', '<f4', (3,)), ('conformer', '<f4', (3,)), ('is_present', '?'), ('chirality', 'i1')]. Skipping.
Failed to load input for dna_pkcs with error Unable to allocate 1.01 MiB for an array with shape (33247,) and data type [('name', 'i1', (4,)), ('element', 'i1'), ('charge', 'i1'), ('coords', '<f4', (3,)), ('conformer', '<f4', (3,)), ('is_present', '?'), ('chirality', 'i1')]. Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
...
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Failed to load input for dna_pkcs with error . Skipping.
Traceback (most recent call last):
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1243, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3977782) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/miniconda3/envs/boltz/bin/boltz", line 8, in <module>
    sys.exit(cli())
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/boltz/main.py", line 640, in predict
    trainer.predict(
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 858, in predict
    return call._call_and_handle_interrupt(
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 897, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1020, in _run_stage
    return self.predict_loop.run()
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 178, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/loops/prediction_loop.py", line 121, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
    batch = super().__next__()
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
    batch = next(self.iterator)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1448, in _next_data
    idx, data = self._get_data()
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1402, in _get_data
    success, data = self._try_get_data()
  File "/opt/miniconda3/envs/boltz/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1256, in _try_get_data
    raise RuntimeError(
RuntimeError: DataLoader worker (pid(s) 3977782) exited unexpectedly
Predicting: |          | 0/? [8:50:24<?, ?it/s]
```
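
For anyone reading the log above, here is a small sanity check on the numbers it prints. It is only a sketch: the values are copied from the log, and the factor 4096 is my assumption (a guess at an MSA depth cap), not something the log confirms. It shows that the out-of-bounds index factors cleanly as 4096 × 4128, which would be consistent with a sequence of a bit over 4000 residues, and that the arrays NumPy later fails to allocate are only about 16 MiB and 1 MiB, which suggests host RAM was already exhausted by then rather than any single array being too big.

```python
# Numbers copied verbatim from the log in this issue.
failing_index = 16_908_288   # "index ... is out of bounds for axis 0 with size ..."
n_atoms = 33_247             # shape of the atom array that failed to allocate

# ASSUMPTION: 4096 is a guessed MSA depth cap; the log does not state this.
assumed_msa_cap = 4096
rows, remainder = divmod(failing_index, assumed_msa_cap)

# Bytes per record of the atom dtype printed in the log:
# name 4 x i1, element i1, charge i1, coords 3 x f4, conformer 3 x f4,
# is_present bool, chirality i1.
atom_record_bytes = 4 * 1 + 1 + 1 + 3 * 4 + 3 * 4 + 1 + 1

MIB = 1024 ** 2
res_type_mib = failing_index * 1 / MIB        # one i1 field per entry
atoms_mib = n_atoms * atom_record_bytes / MIB

print(f"{failing_index} = {assumed_msa_cap} x {rows} (remainder {remainder})")
print(f"res_type array: {res_type_mib:.1f} MiB")
print(f"atom array:     {atoms_mib:.2f} MiB")
```

Both computed sizes match the "16.1 MiB" and "1.01 MiB" figures in the "Failed to load input" messages, so the dtype reading above seems right.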

(I put ... where the "Featurizer failed" and "Failed to load input" messages just repeated many times.)
Any help on getting this to run would be appreciated. Or is this sequence simply too long?
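
One thing that may be worth ruling out before blaming length alone: the sequence as pasted above contains spaces and line breaks. I do not know whether Boltz strips whitespace itself, so the following is a minimal sketch of cleaning and validating a sequence before writing it into the YAML. `clean_sequence` and the allowed alphabet are hypothetical helpers for illustration, not part of Boltz.

```python
# Standard 20 amino acids plus X for unknown; an assumed alphabet for this sketch.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" + "X")


def clean_sequence(raw: str) -> str:
    """Remove all whitespace, upper-case, and reject unexpected characters."""
    seq = "".join(raw.split()).upper()
    bad = sorted(set(seq) - ALLOWED)
    if bad:
        raise ValueError(f"unexpected characters in sequence: {bad}")
    return seq


if __name__ == "__main__":
    # Short illustrative fragment with the same kind of spacing as the post.
    fragment = "MAGSGAGVRC SLLRLQETLS\nAADRCGAALA"
    cleaned = clean_sequence(fragment)
    print(len(cleaned), cleaned)
```

Running the full sequence through something like this gives one contiguous string and its exact residue count, which is useful to quote when asking whether a given length is supported.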