at_lookup causes a crash when lookups fail to root (possibly other places too) #156

Open
cconstab opened this issue Mar 31, 2022 · 7 comments
Labels: bug (Something isn't working)

Comments

@cconstab
Member

cconstab commented Mar 31, 2022

Describe the bug
If a device is offline, at_lookup can crash the application instead of waiting for connectivity to come back. In this case the trigger looks like a failed name lookup, but whatever the cause, nothing in @ should crash when the network is down; it should wait until the network is available.

To Reproduce
Steps to reproduce the behavior:

  1. Create a small app that connects to a secondary (ColinSnippets/ssh_control works)
  2. Run the program with no network connection
  3. Watch it exit with an unhandled exception

Expected behavior
@ apps should never crash. They should handle being offline gracefully and reconnect when possible.
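
As an illustration of the expected behaviour, here is a minimal sketch of an app-level retry wrapper with exponential backoff. `retryWhileOffline` and its parameters are hypothetical names, not part of at_lookup, and the default assumes the failure surfaces as a `SocketException` from dart:io.

```dart
import 'dart:async';
import 'dart:io';

/// Hypothetical helper (not an at_lookup API): runs [action] and retries it
/// while it fails with an error that [isTransient] accepts, doubling the
/// wait between attempts up to [maxDelay].
Future<T> retryWhileOffline<T>(
  Future<T> Function() action, {
  bool Function(Object error)? isTransient,
  Duration initialDelay = const Duration(seconds: 1),
  Duration maxDelay = const Duration(seconds: 60),
  int? maxAttempts, // null means keep trying until the action succeeds
}) async {
  // Assumption: by default only socket-level failures count as "offline".
  final transient = isTransient ?? (Object e) => e is SocketException;
  var delay = initialDelay;
  var attempt = 0;
  while (true) {
    attempt++;
    try {
      return await action();
    } catch (e) {
      // Anything that is not a transient network error is passed through.
      if (!transient(e)) rethrow;
      // Give up once the optional attempt limit is reached.
      if (maxAttempts != null && attempt >= maxAttempts) rethrow;
      await Future<void>.delayed(delay);
      final doubled = delay * 2;
      delay = doubled > maxDelay ? maxDelay : doubled;
    }
  }
}
```

A call such as `await retryWhileOffline(() => InternetAddress.lookup('root.atsign.org'))` would then wait for name resolution to recover instead of failing on the first attempt. Note that in the trace below the error reaching `main` is a plain `Exception: Secondary server not found`, so an app wrapping `authenticate` this way would need to pass its own `isTransient` predicate.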

Screenshots

pi@raspberrypi:~/Colin-snippets/ssh_control $ bin/ssh_control
initializing storage
INFO|2022-03-31 03:00:52.990549|HiveBase|commit_log_f15959d1046b21a3e727245571dcd5c697956835e967a0a273db44d5681ac682 initialized successfully

AtServer.getHiveSecretFromFile file found
INFO|2022-03-31 03:00:53.003552|HiveBase|f15959d1046b21a3e727245571dcd5c697956835e967a0a273db44d5681ac682 initialized successfully

SEVERE|2022-03-31 03:00:53.016314|AtLookup|AtLookup.findSecondary connection to root.atsign.org exception: SocketException: Failed host lookup: 'root.atsign.org' (OS Error: Temporary failure in name resolution, errno = -3)

Unhandled exception:
Exception: Secondary server not found
#0      AtLookupImpl.createConnection (package:at_lookup/src/at_lookup_impl.dart:270)
<asynchronous suspension>
#1      AtLookupImpl._sendCommand (package:at_lookup/src/at_lookup_impl.dart:550)
<asynchronous suspension>
#2      AtLookupImpl.authenticate (package:at_lookup/src/at_lookup_impl.dart:415)
<asynchronous suspension>
#3      AtOnboardingServiceImpl.authenticate (package:at_onboarding_cli/src/at_onboarding_service_impl.dart:187)
<asynchronous suspension>
#4      main (file:///home/pi/Colin-snippets/ssh_control/bin/ssh_control.dart:29)
<asynchronous suspension>
pi@raspberrypi:~/Colin-snippets/ssh_control $

Additional context
This is critically important for IoT use cases and also for mobile apps.

cconstab added the bug label Mar 31, 2022
@cconstab
Member Author

@VJag, I would love your thoughts on this one and on how to handle it gracefully at the @ platform level. Thanks.

@VJag
Member

VJag commented Mar 31, 2022

I will certainly analyse.

@VJag
Member

VJag commented Mar 31, 2022

@cconstab I have captured my analysis here:

https://docs.google.com/spreadsheets/d/1KE22RrWzIKPvR1NTDZWKcfaU1sFnCEhJeHoDYjd3QBA/edit?usp=sharing

The "Network usage analysis" tab captures the current network usage, and the "Solution" tab captures the new abstraction I am proposing.

Please let me know if my analysis is in line with what was expected.

@cconstab
Member Author

cconstab commented Apr 2, 2022

Dealing with on/off and intermittent network connectivity needs to be core to @. It's important in mobile but critical in IoT.

I think this needs to be looked at this sprint.

Thoughts?

@gkc @VJag @nickelskevin

@gkc
Contributor

gkc commented Apr 2, 2022

I agree. Every network-related error needs clearly defined, predictable, well-tested behaviour, and equally reliable, predictable, well-tested behaviour on reconnect.

I'd like the focus for this sprint to be:

  1. Graceful, reliable, fully tested handling of intermittent network availability in the client libraries
  2. Lots more tests, ensuring we cover all recovery scenarios for inter-server connection errors

The problems which we discovered in e2e tests this week are most easily discovered in unit tests where you can more easily control the environment by plugging in stubs and mocks. This is especially true for network interactions.
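
To make the point about stubs concrete, here is a minimal sketch of injecting the name lookup so a test can simulate an offline device without touching the network. `RootFinder`, `HostLookup`, and `flakyLookup` are hypothetical names for illustration; this is not how `AtLookupImpl` is actually structured.

```dart
import 'dart:io';

/// Function type compatible with dart:io's InternetAddress.lookup when it is
/// called with only the host argument.
typedef HostLookup = Future<List<InternetAddress>> Function(String host);

/// Hypothetical stand-in for a root lookup client (not the real AtLookupImpl).
/// The lookup function is injected so tests can replace it with a stub.
class RootFinder {
  RootFinder({HostLookup? lookup}) : _lookup = lookup ?? InternetAddress.lookup;

  final HostLookup _lookup;

  /// Returns true if [host] resolves, false if name resolution fails.
  Future<bool> isReachable(String host) async {
    try {
      return (await _lookup(host)).isNotEmpty;
    } on SocketException {
      return false;
    }
  }
}

/// Stub that fails the first [failures] calls with a SocketException and
/// resolves to the IPv4 loopback address afterwards.
HostLookup flakyLookup(int failures) {
  var calls = 0;
  return (String host) async {
    calls++;
    if (calls <= failures) {
      throw SocketException("Failed host lookup: '$host'");
    }
    return [InternetAddress.loopbackIPv4];
  };
}
```

With `RootFinder(lookup: flakyLookup(1))`, the first `isReachable` call sees a failed lookup and the second sees the network "come back", which lets a unit test exercise both the failure path and the recovery path deterministically.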

@gkc
Contributor

gkc commented Apr 3, 2022

I'd add:

  1. Review how errors are passed from the underlying core libraries to application code, agree on any enhancements that need to be made, and implement them
  2. Ensure that the apps have clear visibility of all relevant client-side state, i.e. document and agree on what enhancements are needed to give app code that visibility, and implement those enhancements
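
One possible shape for these two points, purely as a sketch for discussion: a typed exception that lets app code distinguish network failures from other errors, and a small monitor that exposes connectivity state as a stream. `LinkState`, `NetworkUnavailableException`, and `LinkMonitor` are hypothetical names, not existing SDK types.

```dart
import 'dart:async';

/// Hypothetical connectivity states; the names are illustrative only.
enum LinkState { online, offline, reconnecting }

/// Hypothetical typed error so apps can tell network failures apart from
/// other failures instead of matching on a generic Exception message.
class NetworkUnavailableException implements Exception {
  NetworkUnavailableException(this.message, {this.cause});

  final String message;
  final Object? cause; // the underlying error, if any

  @override
  String toString() => 'NetworkUnavailableException: $message';
}

/// Hypothetical holder of client-side connectivity state that apps can
/// observe. Library code would call [report]; app code listens to [changes].
class LinkMonitor {
  final StreamController<LinkState> _controller =
      StreamController<LinkState>.broadcast();
  LinkState _current = LinkState.offline;

  /// The most recently reported state.
  LinkState get current => _current;

  /// Emits only when the state actually changes.
  Stream<LinkState> get changes => _controller.stream;

  void report(LinkState next) {
    if (next == _current) return; // ignore repeated reports of the same state
    _current = next;
    _controller.add(next);
  }

  Future<void> close() => _controller.close();
}
```

An app could then listen to `changes` to show an offline indicator, and catch `NetworkUnavailableException` specifically to decide when retrying makes sense.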

@gkc
Contributor

gkc commented Apr 3, 2022

Tagging @sarika01 also. Making all of this happen will need close collaboration across all of the engineers irrespective of whether they've been more focussed on apps or client SDK or server - the more cross-pollination that happens, the better. It'd be great to see 'core' developers working on app widgets and 'app' developers working on 'core' libraries!
