Celery troubleshooting

zagorsky edited this page Aug 12, 2021 · 3 revisions

For problems with data download/data processing, sending push notifications, or running Forest

Celery is the task queue that Beiwe uses for sending push notifications, processing/batching data, and running Forest. On a scalable deployment, Celery runs on the "data processing" aka "worker" server.

This page is a work in progress. Please add to it if you come up with new helpful info.

First things to check:

  • Read the most recent messages in the ubuntu user's mail: open the file with nano /var/mail/ubuntu and jump to the end (Ctrl + End). Any error messages output by cron should appear there.
  • Read the most recent messages in ~/celery_processing.log, ~/celery_push_send.log, and ~/celery_forest.log to see if any of them include an error.
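If you'd rather not page through each log by hand, here is a minimal Python sketch that scans the tail of the three log files named above and prints lines that look like errors. The function name, the 200-line window, and the "ERROR"/"Traceback" markers are assumptions made for illustration; they are not part of Beiwe.

```python
from collections import deque
from pathlib import Path

# Log files named on this page; adjust if your deployment writes them elsewhere.
LOG_FILES = [
    "~/celery_processing.log",
    "~/celery_push_send.log",
    "~/celery_forest.log",
]


def recent_error_lines(path, tail=200, markers=("ERROR", "Traceback")):
    """Return the lines among the last `tail` lines of `path` that contain a marker.

    The marker strings are a guess at what an error looks like in these logs.
    """
    log_path = Path(path).expanduser()
    if not log_path.exists():
        return []
    with log_path.open(errors="replace") as f:
        last_lines = deque(f, maxlen=tail)  # keeps only the final `tail` lines
    return [
        line.rstrip("\n")
        for line in last_lines
        if any(marker in line for marker in markers)
    ]


if __name__ == "__main__":
    for name in LOG_FILES:
        hits = recent_error_lines(name)
        print(f"{name}: {len(hits)} suspicious line(s) in the tail")
        for line in hits[-5:]:  # show at most the five most recent hits
            print("   ", line)
```

A missing log file is reported as zero hits rather than raising, so the script can be run on a server that only uses some of the three queues.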

Is the server out of disk space?

Check if the server is out of disk space by running df (df -h prints human-readable sizes). If your main partition is close to 100% used, that's likely your problem. Here's how to increase your disk and partition size:

  1. In the AWS web console, find your EC2 instance, use that to find the Volume, and increase the size of the volume. The AWS documentation on modifying EBS volumes covers this.
  2. When SSHed into the server, extend the partition and then extend the filesystem. The AWS documentation on extending a Linux file system after resizing a volume covers this.
    1. You'll likely run into the error mkdir: cannot create directory '/tmp/growpart.xxxx': No space left on device. The easiest way around that is to delete some unnecessary files; one way is to purge everything older than 30 days from the systemd journal: sudo journalctl --vacuum-time=30days.
  3. Check that supervisord is still up, and restart it if not. It's probably sufficient to just run processing-start and/or sudo processing-restart.
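To check disk usage from a script (for example, as an early warning before the disk fills up), a minimal Python sketch using the standard library's shutil.disk_usage could look like this. The function names and the 90% warning threshold are assumptions for illustration. The percentage is computed as used / total, so it can differ by a few points from the Use% column that df prints.

```python
import shutil


def percent_used(total, used):
    """Percentage of the filesystem in use, rounded to one decimal place."""
    return round(100 * used / total, 1)


def disk_report(path="/", warn_at=90.0):
    """Summarize usage of the filesystem that contains `path`.

    `warn_at` is an arbitrary threshold chosen for this sketch.
    """
    usage = shutil.disk_usage(path)  # named tuple of total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    return {
        "path": path,
        "percent_used": pct,
        "free_gib": round(usage.free / 2**30, 2),
        "nearly_full": pct >= warn_at,
    }


if __name__ == "__main__":
    print(disk_report("/"))
```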

Is Celery running?

If Celery isn't running, one symptom is that ArchivedEvents aren't being created.

To check if Celery is running, SSH into the worker server, and run htop. If Celery is running, you should see something like this in the htop console:

|- /usr/bin/python /usr/bin/supervisord
|  |- python3 -m celery -A services.celery_push_notification...
|  |- python3 -m celery -A services.celery_data_processing w...
|  |  |- python3 -m celery -A services.celery_data_processin...
|  |  |- python3 -m celery -A services.celery_data_processin...
|  |- python3 -m celery -A services.celery_forest worker -Q ...
|     |- python3 -m celery -A services.celery_forest worker ...
|     |- python3 -m celery -A services.celery_forest worker ...

If you don't see that, then Celery probably isn't running. One likely culprit is supervisord not running. Start it by running processing-start.
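The same check can be done without reading htop by eye. Below is a minimal Python sketch that looks for the command-line fragments visible in the htop listing above in the output of ps. The labels, function names, and the choice of ps -eo args are assumptions for illustration; the module names are truncated in the listing, so the sketch only matches the visible prefixes.

```python
import subprocess

# Substrings taken from the htop listing above. The full module names are
# truncated there, so only the visible prefixes are matched.
EXPECTED = {
    "supervisord": "supervisord",
    "push notifications": "services.celery_push_notification",
    "data processing": "services.celery_data_processin",
    "forest": "services.celery_forest",
}


def missing_processes(ps_output, expected=EXPECTED):
    """Return the labels whose substring appears in no line of `ps_output`."""
    lines = ps_output.splitlines()
    return [
        label
        for label, needle in expected.items()
        if not any(needle in line for line in lines)
    ]


def current_process_list():
    """Full command lines of all processes, via `ps -eo args` (Linux procps)."""
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the worker server, missing_processes(current_process_list()) returns an empty list when everything in the listing is present. If "supervisord" is in the result, starting it with processing-start is the first thing to try.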

Is RabbitMQ down?

If you open the Beiwe Celery logs in the home directory (~/celery_processing.log, ~/celery_push_send.log, or ~/celery_forest.log) and see this message, then RabbitMQ is probably down:

[ERROR/MainProcess] consumer: Cannot connect to amqp://beiwe:**@HOSTNAME:50000//: [Errno 111] Connection refused.
Trying again in 32.00 seconds... (16/100)

Try running sudo rabbitmqctl status. If the status output tells you that the node isn't running, check the logs in /var/log/rabbitmq. One procedure for restarting RabbitMQ, in outline:

  1. Back up what you want from the RabbitMQ logs directory, and then delete the contents of the directory: rm /var/log/rabbitmq/*
  2. sudo service rabbitmq-server start
  3. sudo rabbitmqctl start_app
  4. If step 3 gives you an error, run sudo service rabbitmq-server restart
  5. Finally, run sudo rabbitmqctl status again to confirm that it's running.
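The "[Errno 111] Connection refused" in the log message above means nothing accepted a TCP connection on the broker's host and port. To confirm from the worker server whether the port is reachable again after a restart, a minimal Python sketch using only the standard library might look like this. The function name and the default timeout are assumptions for illustration; the check only tests the TCP connection and does not speak AMQP or verify credentials.

```python
import socket


def broker_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (Errno 111), timeouts, and DNS failures.
        return False
```

Call it with the host and port from your own log message, e.g. broker_port_open("HOSTNAME", 50000) with HOSTNAME replaced by the value your deployment uses. A False result while rabbitmqctl status reports a running node would point at a network or port configuration problem rather than at RabbitMQ itself.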

Other possibilities

  • Does your participant have a current FCM token? (Check using the database shell)

  • Did an ArchivedEvent get created? When a push notification fails to send, it often creates an ArchivedEvent in the database with a status message that gives some information.

  • If your events are failing with the error message google.auth.exceptions.RefreshError: ('invalid_grant: Invalid JWT Signature.', '{"error":"invalid_grant","error_description":"Invalid JWT Signature."}'), your backend Firebase credentials key is probably invalid. Go to the IAM & Admin console. (To get there from the Firebase Console, click the Gear Icon -> Project Settings -> Service Accounts -> n service accounts [under "All service accounts"]). Once you're in the IAM & Admin console, click "Service Accounts" and then "Manage Keys" on the relevant Service Account.
