Significant amounts of timeouts while using threading on Grobid Docker Service #939

Open
matthieu-perso opened this issue Aug 8, 2022 · 2 comments

matthieu-perso commented Aug 8, 2022

Configuration

  • Using the Docker service both locally (2017 Mac, 8 GB RAM) and as a GCP Cloud Run instance (4 GB RAM, 80 threads)

Problem

  • In both cases, I tried to speed up training by implementing a very basic thread pool that calls the service.
  • Both locally and in the cloud, 20% of my threads time out with the standard Grobid message [TIMEOUT] PDF to XML conversion timed out.
  • Even with a low number of workers (5), I still get a significant number of timeouts.
  • I'd assume my machines are powerful enough to run the software, so it shouldn't be a capacity limit, but my knowledge here is clearly limited.

Why would the service time out so fast? Are there any workarounds if I want all requests to be completed?

Code (for the local instance; the cloud version is identical except for the URL and token)

import concurrent.futures
import glob
import time

import requests

GROBID_URL = 'http://localhost:8070/api/processFulltextDocument'


def requesting(path, index):
    '''Send one PDF to the GROBID service and return (response text, index).'''
    cloud_token = ""  # only needed for the cloud instance
    headers = {'Authorization': f"bearer {cloud_token}"}

    # Open the PDF in a context manager so the handle is closed after the request.
    with open(path, 'rb') as pdf:
        response = requests.post(GROBID_URL, files={'input': pdf}, headers=headers)
    return response.text, index


def main():
    filelist = glob.glob('./download/unpacked/**/*.pdf', recursive=True)

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as thread_pool:
        futures = [thread_pool.submit(requesting, path, index)
                   for index, path in enumerate(filelist)]

        # Write each result as soon as its request completes.
        for future in concurrent.futures.as_completed(futures):
            data, index = future.result()
            with open(f'thread_{index}.xml', 'w') as f:
                f.write(data)


if __name__ == '__main__':
    start_time = time.time()
    main()
    print("--- %s seconds ---" % (time.time() - start_time))
Owner

kermitt2 commented Aug 8, 2022

Hello @MatthieuMoullecDev!

Thank you for the interest in Grobid and the issue.

You can use the Grobid Python client, which is very well tested and has scaled to 12M PDFs. If you don't handle server availability (503 responses), you will certainly get these timeouts; the Python client handles that for you.

Beyond that, the main way to avoid timeouts is through the server settings. You can have a look at the FAQ entry on the topic here. From your description, I think the two important aspects are the amount of RAM and the number of threads. The thread settings in the client and in the Grobid server need to match the number of threads actually available on the server.
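
For readers who want to see what handling 503 responses could look like in a hand-rolled script, here is a minimal sketch. It is not the Python client's implementation. It only assumes what is stated above, namely that the server answers 503 when it is not available. The names call_with_503_retry and send_pdf and the parameters max_attempts and wait_seconds are hypothetical, and the default values are arbitrary.

```python
import time


def call_with_503_retry(send, max_attempts=10, wait_seconds=5.0, sleep=time.sleep):
    """Call send() until it returns a status other than 503 (hypothetical helper).

    send is a zero-argument callable returning (status_code, body).
    The last (status_code, body) is returned if every attempt got a 503.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 503:
            return status, body
        # Server is busy: wait before asking again, except after the last attempt.
        if attempt < max_attempts:
            sleep(wait_seconds)
    return status, body


def send_pdf(path, url='http://localhost:8070/api/processFulltextDocument'):
    """Send one PDF to the service once; returns (status_code, response text)."""
    import requests  # imported here so the retry helper itself has no dependency

    with open(path, 'rb') as pdf:
        response = requests.post(url, files={'input': pdf})
    return response.status_code, response.text
```

In the script from the question, each worker could then call call_with_503_retry(lambda: send_pdf(path)) instead of posting once, and check the returned status before writing the XML file.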

Author

matthieu-perso commented Aug 9, 2022

Hey Patrice,

Thanks for your quick and helpful reply!

I saw the Python client but was struggling with an error, which I have since managed to debug (write-up here). I will give it a go.

Thanks for the link to the production FAQ. I will follow these guidelines and go from there.
