Regularly stuck / hanging during extraction of files #461

HenryJones23 · 2023-03-09T16:27:26Z

I tried to index about 2.000.000 files of many types, which occupy a total of around 5TB on a NAS drive. Before starting the process, I made sure that the following options were deactivated: 'OCR', 'OCR of images in PDF documents', 'SpaCy NER', 'Stanford NER', 'Segmentation to pages' and 'Sentence segmentation'. Those are nice to have, but for now I am only interested in an indexed content search.

It took my system around 14 hours to index all 2.000.000 files, and of course, while it was indexing, it was also extracting the content of some files. And here comes the problem:

After the extraction of a couple of thousand files, the Solr server crashed. I restarted the machine and tried restarting several services, but OSS was just stuck and did not continue indexing. I thought that maybe this problem was caused by some large zip files I had, so I deleted the index, purged the rabbit queue, excluded all zip files and started indexing again.
This time, Solr did not crash, but after the extraction of around 30.000 files, OSS got stuck again. I tried stopping / restarting the opensemanticetl, tika and solr services, and at first this seemed to nudge OSS into extracting another 20.000 files. But then it got stuck for good at around 50.000 files, and nothing I did nudged it to continue indexing.
I thought maybe it was just too many files, so I again deleted the index, purged the rabbit queue and reindexed a subfolder with only 15.000 files, a total of only 100GB. Again, OSS got stuck after indexing around 5.000 files, so I tried stopping / restarting the opensemanticetl, tika and solr services. Again, at first this seemed to nudge OSS into extracting another couple of files, but it got stuck again and won't extract the remaining 6.000 files, and nothing I can think of nudges it to continue extraction.

Now, the curious thing is that in all three cases, all CPU cores of the machine showed around 100% load, even though there clearly was no visible progress.

Checking the status of the opensemanticetl service always shows the same warning for all pool workers:

etl_tasks[number]: [date time: WARNING/ForkPoolWorker-1] Connecting to Tika server (will retry in 120 seconds) failed. Exception: ('Connection aborted.', RemoteDisconnected('remote end closed connection without response'))
etl_tasks[number]: [date time: WARNING/ForkPoolWorker-1] Retrying to connect to Tika server in 120 seconds(s).
...

But stopping or restarting Tika does not seem to help. Also Flower shows 18 active workers, but absolutely no progress. And when I click on any 'active' task, it says 'Unknown task 'xyc''. There is no terminate button.
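The Tika warning above can be cross-checked independently of the ETL workers by sending a single request to the Tika server and seeing whether it answers at all. This is only a minimal sketch: it assumes the Tika server listens on localhost at Tika's default port 9998, which may not match the actual OSS setup, and `probe_tika` is a hypothetical helper name, not part of OSS.

```python
import http.client
import urllib.error
import urllib.request

# Assumption: Tika server on localhost at its default port 9998.
# Adjust host and port to whatever the OSS installation actually uses.
TIKA_URL = "http://localhost:9998/tika"


def probe_tika(url=TIKA_URL, timeout=10.0):
    """Send one GET to the Tika server and return (ok, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
        # Connection refused, timeout, or the remote end closing the
        # connection without a response (as in the warning above).
        return False, f"{type(exc).__name__}: {exc}"
```

Calling `probe_tika()` inside the guest while the workers are stuck would show whether Tika itself is unresponsive (a timeout or a closed connection) or whether only the workers fail to reach it.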

In the meantime I have seen that other people have probably encountered the very same problem: see #282. The last comment there is from July 23, 2021, but the issue is still open, so I assume that the problem still exists.

Has anyone got any idea what can be done to mend this? Your help would be much appreciated.

I am running OSS (Installation package from 22.10.08) in VirtualBox. My hardware and software information:
Host: AMD Ryzen 9 with 12C/24T, 64GB RAM, Windows 10 Pro
Guest: 18 Cores, 48GB RAM, Lubuntu 20.04.5

HenryJones23 mentioned this issue Mar 17, 2023

"Import Status: Running file import" stuck. #282

Open
