"SOCKET timeouts" causing lockups of an entire device when communicating with backend. #48

Open
chrisbloomfieldcollie opened this issue Jul 5, 2023 · 11 comments

Comments

@chrisbloomfieldcollie

I am currently using the thinger library on about 20 MKR NB devices that connect over LTE-M or NB-IoT, and that number is about to jump to 80 devices. For that reason I desperately need a solution to the problem I am having.

Basically, there seem to be three scenarios where the thinger library causes my devices to lock up, and the only way to recover them is to use a watchdog timer that resets the device when a lockup is detected. This has been an OK solution until now; however, it is happening so frequently (once per hour per device on average) that it affects the battery life of my devices, as they need to go through the startup sequence every time.

The three different errors I get in these scenarios are "Writing bytes [FAIL]", "[_SOCKET] cannot read from socket!" and "[_SOCKET] Timeout!". All of them cause my device to lock up indefinitely. Screenshots are attached.

The SOCKET timeouts seem to happen more frequently on some afternoons. At those times there seem to be more people in our office building and potentially in the buildings around us (more devices using the network?).

It has been a problem the whole time I have been using this library with this device but I solved it temporarily with a watchdog reset.

Someone else also seems to have had a similar issue when using the GSM version on the MKR https://community.thinger.io/t/mkr-gsm-1400-losing-connection-to-thinger-io/2991

Can someone help with this issue ASAP? It is causing us a lot of downstream problems with our product.

Thanks!

Screen Shot 2023-04-06 at 20 13 25
Screen Shot 2023-04-06 at 20 21 07
Screen Shot 2023-04-09 at 11 54 35
Screen Shot 2023-04-06 at 20 24 34
Screen Shot 2023-04-06 at 20 01 00
Screen Shot 2023-04-06 at 21 04 27
Untitled

@colinvdspek

I have some of these NB boards too and have the same issue! Would love to know how to fix this.

@alvarolb
Member

alvarolb commented Jul 6, 2023

Hi, do you need your devices to be permanently connected to the platform?

NB devices usually sleep most of the time, then wake up, connect to the internet, and transmit data, especially if they are powered by batteries.

Building reliable NB-IOT solutions requires some more engineering according to the specific use case, and probably the general-purpose Arduino library for thinger.io is not the best approach here.

  • What are the specific requirements of your use case?
  • Are you able to monitor the connection with the network, e.g., by sending AT+CEREG?
  • Did you test the connection stability without peripherals or surrounding code, i.e., just the library with the keep-alives?
  • Did you try using an RTOS to run tasks in parallel to the thinger connection, so that the device is not blocked waiting for network connectivity?
  • Did you check that the NB-modem is in its latest version?
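
As a rough illustration of the AT+CEREG check suggested above, the sketch below parses the `<stat>` field out of a `+CEREG: <n>,<stat>[,...]` read response (the reply to the query form `AT+CEREG?`). The function names `parseCeregStat` and `isRegistered` are hypothetical and not part of any library mentioned in this thread. How the response string is obtained from the modem is deliberately left out, and unsolicited `+CEREG:` notifications, which omit `<n>`, are not handled.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: extracts <stat> from a "+CEREG: <n>,<stat>[,...]"
// read response. Returns -1 if the text does not match that shape.
int parseCeregStat(const char* response) {
    const char* p = std::strstr(response, "+CEREG:");
    if (p == nullptr) return -1;
    p += std::strlen("+CEREG:");

    // <stat> is the field after the first comma (the first field is <n>).
    const char* comma = std::strchr(p, ',');
    if (comma == nullptr) return -1;

    char* end = nullptr;
    long stat = std::strtol(comma + 1, &end, 10);
    if (end == comma + 1) return -1;  // no digits found after the comma
    return static_cast<int>(stat);
}

// Per 3GPP TS 27.007, stat 1 = registered (home network) and
// stat 5 = registered (roaming). Other values mean not registered.
bool isRegistered(int stat) {
    return stat == 1 || stat == 5;
}
```

A device loop could poll this periodically and start a reconnect only when `isRegistered` returns false for several checks in a row, rather than waiting for the watchdog to fire.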

@chrisbloomfieldcollie
Author

Hi @alvarolb

Thanks for your reply.

Yes, we do indeed need to be connected to the platform permanently. I realise the NB device is not a good system to be using long term, but we chose it so that we could get our system up and running as fast as possible and iterate quickly from there. The reliability doesn't need to be perfect, but right now we depend on getting our current solution working as best we can so that we can demo it for an investment round. For that reason we would love to find a viable workaround or solution.

The requirements are:

  • We will have 65 devices in the field at one time
  • The devices themselves need to be performing actions continuously
  • They need to be able to send some sensor and location data up every 3 mins
  • They need to connect reliably, pull a bunch of properties from thinger on startup, and also send some logging data up
  • A few times a day they will receive commands to change actions
  • They need to be able to reset if they have an issue.
  • They also need to survive as long as possible on battery

Currently the problem is that the device loses connection so regularly, and needs to be reset by the watchdog timer so many times, that it is impractical and burns extra battery. Could you point me in a rough direction to try and fix this lockup? For example, how could I get it to try again if I get this socket fail error?

I haven't tried to monitor the connection with AT+CEREG yet but I will try that. Could I then easily trigger a reconnect if I detect it has been lost?

We tested the library without peripherals a while ago, but we will try testing it again in the same scenario in which we are getting these issues.

I have a basic RTOS in place yes. There are not a crazy amount of tasks although the GPS task can take up to 100ms. What is the maximum time you would recommend between handle() runs?

All of our NB SARA chips get the latest firmware version (at least I think so; it's L0.0.00.00.05.08,A.02.04) before we use them.

Thanks in advance!

@alvarolb
Member

alvarolb commented Jul 8, 2023

Please check the firmware version, as I think the latest is 05.12.

I have read about many issues regarding the MKRNB1500's stability, especially when the modem hangs. In the meantime, I have released a new Arduino Library 2.26.0 to try to improve the connection stability. It has not been tested properly, so try it and let me know if it improves anything.

> I have a basic RTOS in place yes. There are not a crazy amount of tasks although the GPS task can take up to 100ms. What is the maximum time you would recommend between handle() runs?

100ms will not be a problem. You can call it at practically any rate under a minute, but a longer interval will make the device less responsive to API requests; e.g., if you call it every 5 seconds, you can expect a delay of up to 5 seconds when calling a device function.
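
To make the interval trade-off above concrete, here is a minimal sketch of a non-blocking interval gate that could decide when a periodic call such as handle() is due. The class name `IntervalGate` is hypothetical and is not part of the thinger library; the timestamp is assumed to be a 32-bit millisecond counter of the kind Arduino's `millis()` returns, and the unsigned subtraction keeps the comparison correct across counter wraparound.

```cpp
#include <cstdint>

// Hypothetical helper: reports when at least intervalMs has elapsed since
// the last time it fired. The first call always fires.
class IntervalGate {
public:
    explicit IntervalGate(uint32_t intervalMs)
        : intervalMs_(intervalMs), lastMs_(0), started_(false) {}

    // nowMs is a free-running millisecond counter that may wrap around.
    bool due(uint32_t nowMs) {
        // Unsigned subtraction gives the elapsed time even after wraparound.
        if (!started_ || static_cast<uint32_t>(nowMs - lastMs_) >= intervalMs_) {
            lastMs_ = nowMs;
            started_ = true;
            return true;
        }
        return false;
    }

private:
    uint32_t intervalMs_;
    uint32_t lastMs_;
    bool started_;
};
```

In a task loop, a gate built with `IntervalGate gate(5000);` would let the periodic call run only when `gate.due(now)` returns true, leaving the rest of the loop free for other work. The chosen interval then bounds the worst-case delay before an incoming request is noticed.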

@chrisbloomfieldcollie
Author

Hi @alvarolb

Thanks for the info. We have tried upgrading the firmware to 05.12 (it was a mission), but it does not solve the issue.

The new version of which library exactly? How do I find it?

Thanks!

@alvarolb
Member

Hi, I released a new Arduino library for Thinger.io with version 2.26.0. Update it via Arduino IDE.

@chrisbloomfieldcollie
Author

Hi @alvarolb

Just to update you: we have updated to the latest library version and we are still getting the same errors. Is there anything you could point us to so that we could try to get to the bottom of this issue ourselves?

Thanks in advance.

Chris

@alvarolb
Member

Hi @chrisbloomfieldcollie,

I have an MKRNB1500 here and will test it today. Just curious, what is your network provider?

@alvarolb
Member

Just received an MKRNB1500 and have it connected with a basic sketch. I'll update you on its performance. Have you experimented with different SIM cards or antennas?

Image

On another note, I've come across some issues related to the MKRNB1500, with numerous customers reporting errors, firmware problems, and hangs. It's concerning that Arduino doesn't seem to maintain or support this hardware, and there are no responses on their forums.

At thinger.io, we're using custom NB-IOT hardware based on ESP32 and Quectel BC660K for two different projects. Is there a specific reason you need to use the MKRNB1500? Perhaps we could explore alternative options.

@chrisbloomfieldcollie
Author

Hi @alvarolb

Our network provider is KPN here in The Netherlands. We experimented with Tele2 but found KPN to be more reliable. We haven't experimented with antennas yet. Is there anything you would recommend?

We are aware of the issues with the MKRNB. We also have problems with the device locking up, and we have built circuitry into our device to perform a hard reset of the SARA module when we detect this issue, which seems to fix it. The problem outlined in this thread, though, I am reasonably certain is a software issue on the Arduino side (and I think in the thinger library), as it is fixed by just a software reset on the Arduino.

We chose the MKRNB systems for their speed of development for the particular prototypes we are building. We need these to work until the end of October so we can get investment; after that we will look for more reliable alternatives, so we would be happy to discuss your solution then.

I'm curious about the results of your testing with the MKRNB.

Chris

@alvarolb
Member

I think it is not a problem with the Thinger.io Arduino library, but a bad implementation in the MKRNB libraries, which are responsible for talking to the modem via AT commands. You can run your own tests: just create a simple sketch with another protocol, like MQTT, and check how it behaves. Looking at the number of issues on the forums with the MKRNB1500 (from people who are not using thinger.io), I am fairly sure the library is stuck somewhere else, waiting for a response from the modem or something similar.

In my first attempt, the MKRNB1500 stayed connected for 8 hours, then disconnected. I will keep checking it over the next few days.
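
If the hang really is a wait on the modem that never returns, one general mitigation is to give every such wait a deadline. The sketch below shows that pattern in plain C++ under stated assumptions: `waitWithDeadline`, `poll` and `nowMs` are hypothetical names, the clock is assumed to be a 32-bit millisecond counter, and nothing here reflects how the MKRNB or thinger.io libraries are actually implemented.

```cpp
#include <cstdint>
#include <functional>

enum class WaitResult { Ready, TimedOut };

// Hypothetical helper: calls poll() repeatedly until it returns true or
// until timeoutMs has elapsed according to nowMs(). It never blocks forever.
WaitResult waitWithDeadline(const std::function<bool()>& poll,
                            const std::function<uint32_t()>& nowMs,
                            uint32_t timeoutMs) {
    const uint32_t start = nowMs();
    // Unsigned subtraction keeps the elapsed time correct across wraparound.
    while (static_cast<uint32_t>(nowMs() - start) < timeoutMs) {
        if (poll()) {
            return WaitResult::Ready;
        }
    }
    return WaitResult::TimedOut;
}
```

On `TimedOut` the caller can close the socket and retry, or escalate to a modem reset, instead of relying on the watchdog to restart the whole device.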

3 participants