Add retry/backoff to get diagnostics data #112

Closed
vpetersson opened this issue Oct 18, 2021 · 9 comments · Fixed by #113

Comments

@vpetersson
Contributor

vpetersson commented Oct 18, 2021

The hm-config container currently crashes with the following error if it is launched before the hm-diag container has been fully initiated:

 gateway-config  2021-10-18 17:44:50,728 - [DEBUG] - gatewayconfig.gatewayconfig_app - (gatewayconfig_app.py).__init__ -- /opt/gatewayconfig/gatewayconfig_app.py:(43) - Read eth0 mac address 00:BD:27:EF:6D:E7 and wlan0 FF:FF:FF:FF:FF:FF
 gateway-config  Traceback (most recent call last):
 gateway-config    File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
 gateway-config      "__main__", mod_spec)
 gateway-config    File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
 gateway-config      exec(code, run_globals)
 gateway-config    File "gatewayconfig/__main__.py", line 74, in <module>
 gateway-config      main()
 gateway-config    File "gatewayconfig/__main__.py", line 29, in main
 gateway-config      start()
 gateway-config    File "gatewayconfig/__main__.py", line 63, in start
 gateway-config      FIRMWARE_VERSION
 gateway-config    File "/opt/gatewayconfig/gatewayconfig_app.py", line 47, in __init__
 gateway-config      pub_key = diagnostics_json['PK']
 gateway-config  KeyError: 'PK'
 gateway-config  Sentry is attempting to send 1 pending error messages
 gateway-config  Waiting up to 2 seconds
 gateway-config  Press Ctrl-C to quit
 gateway-config  Agent registered
[CHG] Controller 00:1A:7D:DA:71:13 Pairable: yes
 gateway-config  [bluetooth]# pairable off
 gateway-config  [bluetooth]# quit
method return time=1634579096.751134 sender=org.freedesktop.DBus -> destination=:1.52 serial=3 reply_serial=2
 gateway-config     array [
 gateway-config        string "org.freedesktop.DBus"
 gateway-config        string ":1.52"
 gateway-config        string "com.helium.Miner"
 gateway-config        string ":1.46"
 gateway-config     ]
 gateway-config  DBus is now accepting connections
 gateway-config  2021-10-18 17:44:59,378 - [DEBUG] - __main__ - (__main__.py).validate_env -- gatewayconfig/__main__.py:(50) - Starting with the following ENV:
 gateway-config          SENTRY_DSN=https://[email protected]/5725518
 gateway-config          BALENA_APP_NAME=HELIUM-TESTNET
 gateway-config          BALENA_DEVICE_UUID=204f46e27d5a44bc65fac817ebc77e5b
 gateway-config          VARIANT=NEBHNT-OUT1
 gateway-config          ETH0_MAC_ADDRESS_FILEPATH=/sys/class/net/eth0/address
 gateway-config          WLAN0_MAC_ADDRESS_FILEPATH=/sys/class/net/wlan0/address
 gateway-config          DIAGNOSTICS_JSON_URL=http://localhost:80?json=true
 gateway-config          ETHERNET_IS_ONLINE_FILEPATH=/sys/class/net/eth0/carrier
 gateway-config          FIRMWARE_VERSION=2021.10.07.2
 gateway-config  
 gateway-config  2021-10-18 17:44:59,648 - [DEBUG] - gatewayconfig.gatewayconfig_app - (gatewayconfig_app.py).__init__ -- /opt/gatewayconfig/gatewayconfig_app.py:(43) - Read eth0 mac address 00:BD:27:EF:6D:E7 and wlan0 FF:FF:FF:FF:FF:FF
 gateway-config  Traceback (most recent call last):
 gateway-config    File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
 gateway-config      "__main__", mod_spec)
 gateway-config    File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
 gateway-config      exec(code, run_globals)
 gateway-config    File "gatewayconfig/__main__.py", line 74, in <module>
 gateway-config      main()
 gateway-config    File "gatewayconfig/__main__.py", line 29, in main
 gateway-config      start()
 gateway-config    File "gatewayconfig/__main__.py", line 63, in start
 gateway-config      FIRMWARE_VERSION
 gateway-config    File "/opt/gatewayconfig/gatewayconfig_app.py", line 47, in __init__
 gateway-config      pub_key = diagnostics_json['PK']
 gateway-config  KeyError: 'PK'

This is not a big deal, as the container recovers automatically. However, we should clean this up with better retry logic in the near future.
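
For illustration, the retry logic mentioned above could look roughly like the sketch below. It is not the fix that was eventually merged; the function names (`fetch_diagnostics`, `get_diagnostics_with_retry`), the attempt count, and the delay values are all assumptions made for the example. The only details taken from the log are that the diagnostics data is JSON served over HTTP and that the crash comes from a missing 'PK' key.

```python
import json
import time
import urllib.request


def fetch_diagnostics(url, timeout=5):
    """Fetch and decode the diagnostics JSON from the given URL (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def get_diagnostics_with_retry(fetch, required_key="PK", max_attempts=6,
                               base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch() until its result contains required_key.

    Waits base_delay, 2*base_delay, 4*base_delay, ... (capped at max_delay)
    between attempts. All defaults are illustrative assumptions.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            data = fetch()
            if isinstance(data, dict) and required_key in data:
                return data
            # The service answered, but the data is not populated yet.
            last_error = KeyError(required_key)
        except (OSError, ValueError) as exc:
            # urllib's URLError subclasses OSError (service not listening yet);
            # json.JSONDecodeError subclasses ValueError (incomplete response).
            last_error = exc
        if attempt < max_attempts - 1:
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError(
        f"diagnostics not ready after {max_attempts} attempts"
    ) from last_error
```

In the config container this could be called as `get_diagnostics_with_retry(lambda: fetch_diagnostics(url))`, with `url` taken from DIAGNOSTICS_JSON_URL, before reading `diagnostics_json['PK']`. Passing `fetch` and `sleep` as parameters keeps the backoff loop testable without a network or real delays.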

@shawaj
Member

shawaj commented Oct 18, 2021

@vpetersson can we just add a depends-on flag to the docker compose?

@vpetersson
Contributor Author

If balena had a newer version of docker compose, that would have worked. However, the version they are using is unfortunately rather primitive and won't guarantee start order.

@shawaj
Member

shawaj commented Oct 18, 2021

Depends on does work...
https://www.balena.io/docs/reference/supervisor/docker-compose/

You can even add a healthcheck command as well
https://docs.docker.com/compose/compose-file/compose-file-v2/#healthcheck

(Although it says "Only array form is supported" for depends_on on balena so not sure it'll be possible to use that here)

Another possibility is using command in docker compose

@shawaj
Member

shawaj commented Oct 18, 2021

@vpetersson this might be a better way to handle the dbus-wait and diagnostics wait in the miner and config containers actually rather than shell code in the start scripts

@vpetersson
Contributor Author

Strange. I'm pretty sure I looked for health check before in Balena and determined it wasn't supported. Looks like I was wrong (or they added it later).

Yes, a combination of depends_on and healthcheck is the way to go here as we delegate this logic to docker instead

https://docs.docker.com/compose/compose-file/compose-file-v2/
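
As a rough sketch of what the healthcheck side could run: Docker treats a healthcheck command that exits 0 as healthy and 1 as unhealthy, so a one-shot probe of the diagnostics endpoint is enough. The names `is_ready` and `healthcheck` are hypothetical, and treating "the 'PK' key is present and non-empty" as the readiness condition is an assumption based on the traceback above, not something the hm-diag project confirms.

```python
import json
import urllib.request


def is_ready(payload, required_key="PK"):
    """Return True once the diagnostics payload carries a non-empty required_key."""
    return isinstance(payload, dict) and bool(payload.get(required_key))


def healthcheck(url, timeout=3, opener=urllib.request.urlopen):
    """Probe the diagnostics endpoint once.

    Returns 0 (healthy) or 1 (unhealthy), matching the exit codes Docker
    expects from a healthcheck command. `opener` is injectable for testing.
    """
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # Endpoint not reachable yet, or the response was not valid JSON.
        return 1
    return 0 if is_ready(payload) else 1
```

A small script wrapping this would end with `sys.exit(healthcheck(url))`, and the compose healthcheck entry for the diagnostics service would point at that script. Unlike a retry loop in the config container, this probe does not retry by itself; the repeat interval and retry count are left to Docker's healthcheck settings.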

@shawaj
Member

shawaj commented Oct 19, 2021

The only thing I'm not sure of is where it says "Only array form is supported" for depends on. Does that mean we can't pass a healthcheck to it?

Also I'm pretty sure I read somewhere that the supervisor automatically kills and restarts a container if a healthcheck fails. But can't seem to find where I read that. But worth keeping in mind anyway

@vpetersson
Contributor Author

Don't worry about this. I'm sorting it in https://github.com/NebraLtd/hm-diag/pull/171/files

@shawaj
Member

shawaj commented Oct 20, 2021

@vpetersson
Contributor Author

Superseded by NebraLtd/helium-miner-software#177
