Network Stability #7
Thank you very much for the detailed report on the error that was occurring. The report makes it sound like there might be a problem with traffic going through the USB dongle, which then causes problems with the subscribe.py script. However, the devices still being registered on the [bbbb::100] webpage suggests otherwise.
After the socket error, does the "last seen" column on the [bbbb::100]/sensors page keep increasing without resetting? If so, there is a problem with 6lbr and the dongle handling the traffic. If not, the problem could be with the MQTT broker.
Also, what is your '#define NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE' set to? Increasing it by powers of 2 (8, 16, 32, 64 max) will greatly help increase the traffic through the mesh, at a cost of power consumption. As you increase the number of devices, you will probably have to raise it further.
Jon,
During what we believe to be the crash and automatic recovery of the 6lbr/eth0/br0 framework, the page at http://[bbbb::100] disappears (page not found error). When it comes back, the sensors are mostly gone from the listing at [bbbb::100]/sensors (usually one or two have reconnected by the time refreshing brings up the page), though they re-register automatically over the next minute or so.
After the socket error, the canaries return to just (what I assume is) a basic check-in once per minute; the "last seen" does reset accordingly. It is the consistent, near-simultaneous recovery of the network framework and timeout of the subscribe.py client in mosquitto that leads us to believe the border router itself is the point of failure.
In /contiki/examples/canary/mqtt_protobuf_demo/project-conf.h, line 47 reads:
#define NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE 8
I'll try increasing this value and see if the many-canary network setups are more resilient. Does this also affect the rate at which canaries publish their own sensor readings? Prior to the firmware update, each canary would publish a data point approximately once per second; post-update, that dropped to about once per 10 seconds. It would be helpful to be able to tune that frequency for certain deployment scenarios.
I'm copying Dr. Goldblum on the thread so she can independently offer information and/or ask questions relevant to the issue. Let me know if there's any other information that would aid with diagnostics.
Thanks,
Chris
In a previous email it was stated that the VM did not exhibit these network errors. Is this true for the hardware and the VM together, or just the simulator?
It was only tested with the simulator. However, I think we may have found the issue. The subscribe.py script uses the following calls:
mClient.subscribe("c", 0)
mClient.loop_forever()
Since the MQTT unsubscribe function is never called, we never fully terminate the loop when an interruption occurs, and any child processes still running may prevent re-subscription without a reboot. We are currently working on modifying the subscription to allow for these corner cases and will test whether this improves stability.
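For illustration, a minimal sketch of that clean-teardown idea, assuming the paho-mqtt client library, a broker reachable at bbbb::100 on the default port, and the "c" topic from subscribe.py; this is not the project's actual patch:

import paho.mqtt.client as mqtt

mClient = mqtt.Client()
mClient.connect("bbbb::100", 1883, keepalive=60)
mClient.subscribe("c", 0)
try:
    mClient.loop_forever()
except KeyboardInterrupt:
    # Explicit teardown on Ctrl+C: unsubscribe and close the socket so
    # nothing lingers to block a later re-subscription.
    mClient.unsubscribe("c")
    mClient.disconnect()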
Is the data rate set by NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE or MSG_INTERVAL?
Jon,
I’ve modified two of the three values you mentioned (specifics below) in the firmware source, re-compiled it, and flashed the image onto all of our canaries. In addition, I wrote a couple of functions into subscribe.py that handle network setup and trigger an automatic rebuild of the network when an on_disconnect callback is received.
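Roughly, the sketch below shows the shape of that change, assuming paho-mqtt; the helper name, broker address, and retry policy are illustrative rather than the exact code now in subscribe.py:

import time
import paho.mqtt.client as mqtt

BROKER = "bbbb::100"   # assumed broker address (the 6lbr host)
TOPIC = "c"

def on_connect(client, userdata, flags, rc):
    # Re-subscribe on every (re)connect so a rebuilt network session
    # picks up where the old one left off.
    client.subscribe(TOPIC, 0)

def on_disconnect(client, userdata, rc):
    if rc != 0:
        print("link dropped (rc=%d); rebuilding network session" % rc)

def setup_network(client):
    # Hypothetical helper: keep retrying until the route to the host is
    # back, instead of dying with "Error 113: No Route To Host".
    while True:
        try:
            client.connect(BROKER, 1883, keepalive=60)
            return
        except OSError as err:   # errno 113 is EHOSTUNREACH
            print("connect failed (%s); retrying in 5 s" % err)
            time.sleep(5)

client = mqtt.Client()
client.on_connect = on_connect
client.on_disconnect = on_disconnect
setup_network(client)
client.loop_forever(retry_first_connection=True)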
SENSOR_READING_PERIOD: kept equal to (CLOCK_SECOND)
CC26XX_WEB_DEMO_DEFAULT_PUBLISH_INTERVAL: decreased from (CLOCK_SECOND * 10) to (CLOCK_SECOND)
My understanding of these settings was that the sensors were (and still are) being read approximately once per second, but that only every 10th reading was being published. By decreasing the publish interval to match the sensor reading period, we are now publishing every sensor reading.
NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE: increased from 8 to 64, with no apparent change in stability as a function of the number of devices connected.
My earlier notion (that the border router software being run off the dongle antenna was the point of failure) seems supported by the lack of improvement in stability when changing the radio toggling frequency on the canaries. A quick Google search on stability issues with many connections running through 6lbr yielded multiple high-ranked results on problems people were having when connecting double-digit numbers of ContikiOS devices through 6lbr. At least one thread mentioned having success with Contiki’s rpl-border-router. Do you see any reason why we shouldn’t pursue the use of Contiki’s rpl-border-router, given that the goal is to have (very) many simultaneously connected devices?
Thanks,
Chris
…On Wed, Oct 18, 2017 at 11:35 AM, steelsmithj wrote:
NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE controls the "wake up" period in the radio duty cycle protocol. To conserve power, each node turns its radio on and off very quickly and checks for an incoming message. Increasing the check rate makes the radio turn on more often, giving a higher on rate. This makes sure that messages travel through the mesh more reliably, but at the cost of holding the radio on more. As more nodes are added to the mesh, this value will need to increase to keep up with the higher amount of data.
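For intuition, the check rate is the number of channel checks per second, so the implied wake-up interval is just its reciprocal; a quick back-of-the-envelope (assuming ContikiMAC-style duty cycling):

for rate in (8, 16, 32, 64):
    print("check rate %2d -> channel check every %.2f ms" % (rate, 1000.0 / rate))

which works out to 125 ms, 62.5 ms, 31.25 ms, and 15.625 ms respectively.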
#define CC26XX_WEB_DEMO_DEFAULT_PUBLISH_INTERVAL (CLOCK_SECOND * 10)
Located in https://github.com/PureEngineering/contiki/blob/master/examples/canary/mqtt_protobuf_demo/cc26xx-web-demo.h, this controls how often a message is published over MQTT. This rate cannot be made faster than one second without changing core Contiki code.
#define SENSOR_READING_PERIOD (CLOCK_SECOND)
Located in https://github.com/PureEngineering/contiki/blob/master/examples/canary/mqtt_protobuf_demo/cc26xx-web-demo.c, this controls how fast the sensors are polled. Having this larger than your publish interval will result in duplicate information.
CC26XX_WEB_DEMO_DEFAULT_PUBLISH_INTERVAL is the period at which the canary will send a packet through 6lbr to the MQTT broker. It is safe to assume that every sensor reading will be published if both are set to CLOCK_SECOND, although they are two separate timers, so there is a small possibility that a value can be published twice before a new reading, or read twice between publishes.
The NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE will increase performance within the mesh, not to the 6lbr router. As you start spreading out canaries and they begin needing to hop multiple times to reach the router, a higher check rate will increase the performance of the entire mesh. The dongle does not perform a channel check; it is always receiving. So when a message is sent to it, assuming no collisions and that it is not already reading a message, it should receive it. The more devices that are a single hop away from the dongle, the lower its performance.
At a base level, 6lbr is designed to use the rpl-border-router example; RPL is the routing protocol 6lbr is using. We can try using the rpl-border-router, but we lose a lot of functionality (no [bbbb::101], MQTT won't work as easily, etc.). I think that in order to have very many connected devices, the publish intervals need to be longer, which is counterintuitive to your testing at the moment. The end goal of this mesh network is very fast communication between canaries (high channel check rate), with a single message sent through 6lbr after the canaries have communicated with each other.
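To make the two-timer race concrete, here is a toy timeline in plain Python (not Contiki code) with the read timer running 3% slow against the publish timer; the drift value is purely illustrative:

# Toy timeline: reads nominally every 1.03 s, publishes every 1.00 s.
read_times = [1.03 * k for k in range(1, 40)]
pub_times = [1.00 * k for k in range(1, 40)]

events = sorted([(t, "read") for t in read_times] +
                [(t, "pub") for t in pub_times])
latest, prev_published = 0, None
for t, kind in events:
    if kind == "read":
        latest += 1                     # a fresh sensor reading arrives
    elif latest == prev_published:      # two publishes with no read between
        print("t=%.2f s: reading #%d published twice" % (t, latest))
    if kind == "pub":
        prev_published = latest

With these numbers the duplicate shows up around t = 35 s; with the drift reversed, a reading would be skipped instead.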
The subscriber client script subscribe.py will occasionally time out and stop logging data without a fatal error. This has been observed to be preceded by the LED on the CC2531 USB dongle going through the same pattern of flashes as when the network is first being established. After interrupting subscribe.py via Ctrl+C in the terminal, it fails to reconnect to the network, throwing "Error 113: No Route To Host." Any active canaries are able to reconnect. A reboot of the computer hosting the network is necessary to reconnect subscribe.py.