Failed join when end node too close to gateway #557

Open
2 tasks done
chopmann opened this issue Nov 4, 2021 · 11 comments

@chopmann
Contributor

chopmann commented Nov 4, 2021

FROM: https://forum.chirpstack.io/t/failed-join-when-end-node-too-close-to-gateway/12537/3

  • The issue is present in the latest release.
  • I have searched the issues of this repository and believe that this is not a duplicate.

What happened?

I'm working with the end node mere meters away from our test gateway here at the office. Join fails most of the time because the join-request message is received on multiple channels.

What did you expect?

Join succeeding. For this corner case to be fixed, de-duplication needs to be a bit smarter and only replace an ongoing join request with a new one if less than X time has passed (1 s would work) and the new request has a higher RSSI than the previous one.

Steps to reproduce this issue

Steps:

  • End node transmits a join request on 868.500 MHz.

  • Gateway receives the join request on 868.500 MHz (RSSI -63) and on the adjacent channel 868.300 MHz (RSSI -97).

  • Gateway sends both join requests to the network server.

  • Network server responds to the join request on 868.500 MHz with a join accept on 868.500 MHz.

  • Network server responds to the join request on 868.300 MHz with a join accept on 868.300 MHz, followed by an RX2 join accept on 869.525 MHz. This is the join that remains valid to the network server.

  • End node receives the join accept on 868.500 MHz and now thinks it has successfully joined.

Surely you see the problem: to the network server, the assigned network address and session key are the ones it sent last on 868.300 MHz, whereas to the end node they are the ones it received on 868.500 MHz.

As a result, both happily think the join was successful but communication is impossible because network address and session keys are different.

Could you share your log output?

Your Environment

Component Version
Application Server v?.?.?
Network Server
Gateway Bridge
Chirpstack API
Geolocation
Concentratord
@iggarpe

iggarpe commented Nov 10, 2021

I'm the poster of the message on the forum. Some extra info:

I'm using network server 3.15.3

The problem is clearly that, in the network server, the second join request received overwrites the first after the join accept has already been sent. If the gateway, as in my case, sends the network server the true join request first and then the false join request on the adjacent channel with a much lower RSSI, both are answered with a join accept, but only the false one remains as the valid join to the network server, whereas the end node actually receives the first join accept and happily thinks it has successfully joined.

The fix would be super easy: when more than one join request is received in a small time window (1s?) from the same end node, overwrite the previous one only if the RSSI is way higher.

This would work no matter the order in which the gateway sends the join requests, true one first / false one first.
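
A minimal sketch of the proposed heuristic in Go (the 1 s window, the RSSI margin and all names are illustrative assumptions, not existing ChirpStack code):

```go
package main

import (
	"fmt"
	"time"
)

// pendingJoin describes the join-request currently being processed for a device.
type pendingJoin struct {
	receivedAt time.Time
	rssi       int // dBm
}

const (
	replaceWindow = time.Second // "a small time window (1s?)"
	rssiMargin    = 6           // "only if the RSSI is way higher", in dB
)

// shouldReplace reports whether a newly received copy of a join-request should
// replace the one already being processed for the same end node: only when it
// arrives within the window AND has a clearly higher RSSI.
func shouldReplace(current pendingJoin, newRSSI int, now time.Time) bool {
	withinWindow := now.Sub(current.receivedAt) < replaceWindow
	return withinWindow && newRSSI >= current.rssi+rssiMargin
}

func main() {
	realJoin := pendingJoin{receivedAt: time.Now(), rssi: -63}
	// A ghost copy arrives microseconds later on an adjacent channel at -97 dBm:
	fmt.Println(shouldReplace(realJoin, -97, time.Now())) // false -> the real join-request stays valid
}
```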

@brocaar
Owner

brocaar commented Nov 15, 2021

This issue has been brought up a couple of times. What happens is that when a device is really close to the gateway, it will over-drive the gateway hardware, causing a "ghost" packet. Thus in the end the uplink is reported on two frequencies.

Currently the de-duplication logic does not inspect the LoRaWAN payload. It starts the de-duplication based on the raw payload + frequency, meaning that when the same payload is reported on two frequencies, there are two de-duplication functions running simultaneously and the first one "wins".
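
A minimal sketch of that keying behaviour (the hashing and key layout below are illustrative assumptions, not the actual implementation): because the frequency is part of the key, the same join-request reported on 868.5 MHz and 868.3 MHz ends up under two different keys, so two independent de-duplication runs are started.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// deduplicationKey derives a de-duplication key from the raw PHYPayload plus
// the uplink frequency in Hz (key prefix and hashing are made up for illustration).
func deduplicationKey(phyPayload []byte, frequency uint32) string {
	h := sha256.New()
	h.Write(phyPayload)
	var f [4]byte
	binary.BigEndian.PutUint32(f[:], frequency)
	h.Write(f[:])
	return fmt.Sprintf("lora:ns:up:%x", h.Sum(nil))
}

func main() {
	joinRequest := []byte{0x00, 0x01, 0x02} // the very same PHYPayload bytes

	fmt.Println(deduplicationKey(joinRequest, 868500000)) // real uplink
	fmt.Println(deduplicationKey(joinRequest, 868300000)) // ghost copy -> different key,
	// so a second, independent de-duplication run is started and both get a join-accept.
}
```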

I'm not sure if over-driving the gateway radios can cause any permanent damage to the gateway. Maybe somebody else can comment on this. My assumption is that you should try to avoid this.

As well, I'm not sure what would be the best, secure and still performant solution for this, assuming this scenario doesn't cause any harm to the gateway radios.

The reason why the de-duplication logic includes the frequency in its key is that a security issue was reported a while ago which would allow a replay with a better RSSI / SNR to "steal" the downlink path. One could replay the uplink within the configured de-duplication duration using a different frequency, thereby breaking the downlink path (e.g. letting the LNS respond using a different frequency or time).

I'm open to suggestions.

(also posted on the forum)

@urbie-mk2
Contributor

I have access to multiple gateways in development and device production where LoRaWAN connectivity is tested.
Devices have a range of about 5 m to the gateway. So far the gateway, which uses the Semtech hardware reference implementation, has been running for 3 years after having produced 10,000 nodes. So damaging the hardware is not likely, but I have never seen this behaviour before where the gateway receives and forwards a ghost packet.
Are there steps to reproduce this reliably?
I updated production to the latest version set a month ago.

@mmrein

mmrein commented Nov 15, 2021

As I've written on the forum, I would not expect damage to the hardware (given the power levels used by LoRa), but this close proximity is also not what "LOng RAnge" is designed for.

You should be able to reproduce this simply by moving the device closer to the gateway. I currently have a test device 2-3 m from the test gateway and I have been receiving quite a lot of ghost packets.

I solved it simply by using a 50 ohm RF load instead of the gateway's original antenna. It also gives me RF numbers closer to real-world conditions (-110 dBm RSSI, for example).

@csanso-limit

Hello @brocaar, are there any plans to implement this fix and when can we expect it?

Although it is mentioned multiple times that it only occurs in close proximity, our client is stating and proving otherwise, and he is upset because he has to deploy hundreds of nodes in the next two weeks, communicating uplinks and downlinks every 5 minutes and sending a join request once every 5 hours, which means an incorrect join will result in hours of lost information and lost downlinks.

I am confused, however, as to why ChirpStack approves both duplicated join requests if they both have the same DevNonce; perhaps if ChirpStack did not accept the repeated DevNonce join request, in most cases this error would not occur at all.

In our case it seems like the network server first receives the real join request with the correct frequency, and then, before having time to reply with a join accept, it immediately receives the second join request (a duplicate on a different frequency) and then proceeds to reply with a join accept for the first join request received.

And it seems like what ends up happening is that the end device receives the new DevAddr from the first join request but the Application Server ends up with the DevAddr generated from the second join request.

This morning the time difference between the duplicated join requests was 8 microseconds, which is why it has no time to reply before the duplicate is received.
Times: 2021-12-09T08:31:58.439906Z and 2021-12-09T08:31:58.439914Z

Perhaps the quickest solution would be to reject the second, duplicate join request because its DevNonce is the same as the previous join request's DevNonce. I'm not sure why it's not currently rejecting it; it could have to do with the small time difference.
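
A sketch of that DevNonce guard (illustrative only; the types, TTL and helper names are assumptions, not the actual ChirpStack code path):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// joinKey identifies a join attempt by device and DevNonce.
type joinKey struct {
	devEUI   [8]byte
	devNonce uint16
}

// nonceGuard remembers recently accepted join attempts for a short TTL.
type nonceGuard struct {
	mu   sync.Mutex
	seen map[joinKey]time.Time
	ttl  time.Duration
}

func newNonceGuard(ttl time.Duration) *nonceGuard {
	return &nonceGuard{seen: make(map[joinKey]time.Time), ttl: ttl}
}

// accept returns false if the same DevEUI/DevNonce pair was already accepted
// within the TTL, so the duplicated ("ghost") join-request is dropped instead
// of being answered with a second join-accept.
func (g *nonceGuard) accept(devEUI [8]byte, devNonce uint16, now time.Time) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	k := joinKey{devEUI: devEUI, devNonce: devNonce}
	if t, ok := g.seen[k]; ok && now.Sub(t) < g.ttl {
		return false
	}
	g.seen[k] = now
	return true
}

func main() {
	g := newNonceGuard(2 * time.Second)
	var dev [8]byte // some DevEUI
	fmt.Println(g.accept(dev, 0x1234, time.Now())) // true: first join-request is processed
	fmt.Println(g.accept(dev, 0x1234, time.Now())) // false: duplicate with same DevNonce is dropped
}
```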

@brocaar
Owner

brocaar commented Dec 9, 2021

Yes, I'm planning to address this, but I don't have an ETA for this yet.

@mmrein

mmrein commented Dec 9, 2021

Perhaps the quickest solution would be to reject the second, duplicate join request because its DevNonce is the same as the previous join request's DevNonce. I'm not sure why it's not currently rejecting it; it could have to do with the small time difference.

I wouldn't be sure that the first packet is always the right one. It is the first one demodulated and received by the server, and it usually is the one with the strongest signal, but that doesn't mean the gateway could not demodulate the incorrect one first.

A quick workaround you can try is to set the default data rate of your device to a higher value.

@cairb

cairb commented Apr 13, 2022

@brocaar we encountered this problem too and this is my initial fix. Please advise if you find this solution problematic:
(screenshot of the proposed code change)

@brocaar
Owner

brocaar commented Apr 14, 2022

@cairb please note that the mutex only applies to a single instance. In case of multiple instances, it doesn't prevent another NS instance from handling the "ghost" join-request.
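
For illustration only: a guard that works across instances needs shared state instead of an in-process mutex. A sketch using a Redis SET NX lock (the key name, TTL and helper are assumptions, not the ChirpStack API; the NS already uses Redis, so such a lock would not add a new dependency):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

// acquireJoinLock returns true only for the first NS instance that sees this
// DevEUI/DevNonce combination within the TTL; any other instance (or a later
// ghost copy) gets false and skips sending a second join-accept.
func acquireJoinLock(ctx context.Context, rdb *redis.Client, devEUI string, devNonce uint16) (bool, error) {
	key := fmt.Sprintf("lora:ns:join:lock:%s:%d", devEUI, devNonce) // hypothetical key layout
	return rdb.SetNX(ctx, key, "locked", 5*time.Second).Result()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	ok, err := acquireJoinLock(ctx, rdb, "0102030405060708", 0x1234)
	if err != nil {
		panic(err)
	}
	fmt.Println("this instance handles the join-request:", ok)
}
```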

@cairb

cairb commented Apr 14, 2022

@brocaar Thanks for the quick reply. There is only one instance of the NS running right now, on an embedded ARM gateway. What would be the scenarios that require multiple NS instances?

danieroux added a commit to danieroux/chirpstack-network-server that referenced this issue Apr 26, 2022
…nt data loss being experienced

It defaults to false. If set to true, it will ignore the frequency on which the packets come in from the gateways. This allows a ghost packet to be gracefully collected as Just Another Packet to de-duplicate with.

Without this setting a ghost packet gets in and overrides an established `dev_addr`, leading to data being lost until the next JOIN request, with the edge device unaware that it has lost its JOIN status.

We have not been able to trace where the ghost JOINs come from. This stops those from being a problem for now.

- brocaar#557 (comment) is not what we are experiencing, our devices are far apart
- This fixes brocaar#566 for us
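
For illustration, a sketch of the behaviour that option describes (the flag name and key derivation are hypothetical; see the commit itself for the real change): with the flag enabled, the frequency is left out of the de-duplication key, so the ghost copy collapses into the same de-duplication set as the real uplink.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// dedupConfig holds the hypothetical setting described above (defaults to false).
type dedupConfig struct {
	IgnoreFrequency bool
}

// dedupKey includes the frequency only when the option is disabled.
func dedupKey(cfg dedupConfig, phyPayload []byte, frequency uint32) string {
	h := sha256.New()
	h.Write(phyPayload)
	if !cfg.IgnoreFrequency {
		var f [4]byte
		binary.BigEndian.PutUint32(f[:], frequency)
		h.Write(f[:])
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	payload := []byte{0x00, 0x01, 0x02}
	cfg := dedupConfig{IgnoreFrequency: true}
	// Both copies now map to a single key and are de-duplicated together:
	fmt.Println(dedupKey(cfg, payload, 868500000) == dedupKey(cfg, payload, 868300000)) // true
}
```
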
@brocaar
Owner

brocaar commented Apr 27, 2022

I have just pushed the following change: 1b50594. I believe this should fix at least the duplicated OTAA accept in case of a ghost uplink.
