-
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
If interrupt occurs while doing the asynchronous read of I2C, the poll will be rejected and the chip can be left in unrecoverable state #59
Comments
…ous read of I2C, the poll will be rejected and the chip can be left in unrecoverable state Bump version number Provide tiered read using setTimeout
There is one additional alternative I did not consider at first. Thoughts appreciated. |
I've attached a sample test file that illustrates how we could do delayed retries should something fail. This follows the promise reject retry design pattern from here: https://stackoverflow.com/questions/38213668/promise-retry-design-patterns This would allow us to have a finite set of retries with a delay that starts small and grows to a max value. I'm actually thinking of implementing this in my other modules. What do you think? Lynde |
Thank you for your explanations and figuring out this issue. 👍️ In my opinion it would be the best option to delay the execution of the Using the sync i2c functions should be no option here since this will block the event loop of the whole node processes and so may affect applications using this module. |
... instead of rejecting them if an other poll is currently active Added method `isPolling()` to get the current polling state May fix #59
@lynniemagoo Please see the pollqueue branch and let me know what you think. 🙂 |
I reviewed the code on the branch. Will be able to test later this evening. It's a more elegant solution than the test file I submitted above. I like it. However be forewarned that it does come with memory issues. Consistent Pin Changes or repeated calls to doPoll() in short timeframes can cause possible memory issues! Use Case 1 - Interrupt-Driven Polling Assumption is that PCF857x configured to manage interrupts and poll on interrupt. In theory, we'd queue 25 promises before 1 promise could be serviced. The promise-queue as-is would likely be OK for short bursts of pin changes as eventually the promise queue would be drained, but if is a constant interval of 5ms is present, then expect larger memory growth and if the condition does not cease, out of memory exceptions. The solution that I proposed in the attached test file limits the amount of retries that can be done to 5 and also provides the ability to hold off subsequent requests using setTimeout. There might be a better/hybrid solution. The interrupt flag in the chip is cleared when the chip is read. So, all we really need to guarantee is that a _poll() will occur after the interrupt occurs. In theory, if the promise queue depth is > 1, this condition will always be satisfied as a subsequent promise will eventually execute _poll() to clear the chip. Use Case 2 - Manual Polling Here, we have the same conditions described in Use Case 1 above, promises are being queued faster than they can be serviced. On might say you cannot fix stupidity but things happen. Proposal - Modified Promise-Queue implementation Within _handleInterrupt() and doPoll(), what we might want to do here is to do a conditional enqueue() where if depth is 0 or 1, add a promise and if depth is > 1 then do nothing in _handleInterrupt() or immediately reject in doPoll(). This would require either an additional signature or an overload on the PromiseQueue's enqueue() method but would allow us to elegantly solve the problem. Thoughts appreciated. |
Oh yes... I completely missed that a poll may be triggered more often than the i2c bus can handle the requests. You're totally right! I like your proposal of the modified promise queue implementation. I've added a commit containing maximum length limit for the queue. Now one poll may be active and one queue promise is allowed. Any additional try to add something to the queue will be rejected. |
Just reviewied. Nice job!!!! I will test it later this evening. I thought we might be able to centralize a finally block but upon further review, I was incorrect. So I edited deleted modified dequeue() code from this post. However, I think that in pcf857x.ts, we should change the max queue length to 2 (or 3 - see below) from 1. A max limit of 1 results the same behavior as before, where if back to back interrupts fire and we are in middle of first poll when subsequent interrupt occurs (we'd reject if queue full) or if the application code manually polls and this poll is active at the time of the interrupt (see race conditions notes below), we could fail to service the interrupt as well. Changing limit to 2 allows us to recover from back-to-back interrupts and would also allow an application that needs to call doPoll() alongside managing interrupts. But as usual, as I type this, I'm wondering if 2 is sufficient. Consider fast interrupts and a user doing a poll from application code. If the user poll is active, and then an interrupt occurs, we'd still be able to service the first interrupt as the limit is 2. However, if 2nd interrupt occurs, and we reject, there is a race condition where the last I2C read might have completed just prior to the interrupt and the chip would not be read which leaves us as before. For this reason, I'd say let's go the safer route and support a maximum limit of 3 vs 1 in pcf857x.ts. Thoughts? |
Hello Peter, I got some time to test using the maxQueueDepth = 1 and am seeing rejected promises from _handleInterrupt. I have updated the .js versions of promise-queue and pcf857x to put in .catch() for the promises as well as logging stack trace where reject(s) occur. The example.js in the zip attached here is the file I'm using to test. I changed the address to 0x20 and added pins 4, 5, 6 as additional inputs. Then, I added setInterval doPoll() for 3ms. On my test harness, from the log inside the zip, you can see the end result. If enqueue() resulting from doPoll() rejects inside the interval, all is well. However, if enqueue() resulting from _handleInterrupt rejects, we have handled promise rejection. In the JS files, I attempted to capture the promise reject and a stack trace with hopes of fixing this issue. Please see my changes in pcf857x.js as an attempt to fix. Thoughts appreciated. |
Update and good news - : I pulled the project a 2nd time and used a slightly modified example file. I'm happy to report that the unhandled promise on _handleInterrupt went away. Likely the original cause was due to me stupidly adding some catch handlers in pcf857x.js which I then reverted when I was able to isolate the cause. Anyway, attached is the example.js that I used successfully. I actually added a couple of console.log entries to the promise-queue.js file but did not include them here. With the short poll interval alongside manual button presses causing interrupts, I do see a rejects periodically in the interval, with no apparent consistency with a queue depth of 1, For my 3rd test run, I'm using maxQueueDepth = 1 and I've disabled the interval and am 100% reliable on interrupts. The good news is that possibly due to timing, I'm not seeing any occurrences of rejection due to fast manual button presses. To be 100% safe and if you agree, I'll reiterate what I said above and recommend we go with a depth of 3 to fully cover 2 interrupts plus one manual poll being handled concurrently. A max of 2 would likely prevent issue from occurring but I usually err on the side of caution and would choose 3 instead. Let me know when you want me to test the next round of changes. Thanks again for jumping into this quickly. As I mentioned, my cat9555 and mcp23017 projects are modelled after this project (literally a lift of the TS code with register reads/writes instead). With your permission, I'd also like to include this new logic in those private projects which I may make public when all done. Thanks, |
One final note. I have the 3ms interval poll disabled and using maxDepth = 1 and have now been able to see occurrences where back to back interrupts can occur and the reject happens during the 2nd _handleInterrupt call. When I see this happen, it's usually within the first 30 seconds of process spin-up. After that, things settle down. But this test does indeed reinforce in my mind the need to have a limit > 1 to avoid a race I mentioned above that could happen should the I2C read might have completed just prior to the interrupt and therefore interrupts could get stuck. Overall, the progress today makes me very happy and suggest to me that we need to have max depth as 2 or 3. I leave that decision up to you. :-) maxDepth = 2 - Should document limitation that application can either manually poll or or enable interrupts only, but no both. |
I left the process active (interrupt only) for over 5 hours and periodically ran multiple button presses simultaneously and over the 5 hours, I got around 11 rejects when the maxQueueDepth was 1. Going back to my thoughts earlier, what I did not test is 2 different intervals each doing a poll which would be way overkill. However, I think with a queue depth of 3, we could handle it. |
Thank you for the great analysis! I think we should use a queue length of 3 to be save, which means that there may be effectively up to 4 polls "active" (one that is currently processed and 3 waiting). So we can be absolutely sure to not miss anything. If someone requests more polls, this can/should be rejected since they are definitely not needed. The rejection in Also this limit and possible rejection should be documented. I'll do this probably tomorrow. Of course you can use this also in your two derived projects. :) |
Thanks. I will retest when you are done
When you use depth of 3, that means that the array can only have 2 entries max. There can be 1 currently processing that has been shifted out of the queue and then up to 2 more. If you try to add another (3rd promise to the array), it will be rejected. You mentioned 4 so was a bit confused. Did you mean 3 in flight and 1 completed meaning that the resolve or reject in pcf857x has been called?
The change to 3 is ideal but as you mentioned you wanted to work on documentation, I wanted to clarify the exact behavior.
|
It's a little bit confusing... The intended behavior with node-pcf8574/src/promise-queue.ts Lines 51 to 55 in 3b91474
So with |
Because of the >= on line 53, there can never be more than 2 entries in the queue, not 3. So with a limit of 3 there can be 2 in queue and one active for a total of 3, not 4 |
🤔 |
In the words of Homer Simpson, Dohhhh! Somehow I was not seeing that but when you outlined it, yes, I see it now. We shall stick with 3 as that is extra bullet-proofing. Let me know when you button it up and I'll regression if needed. |
Hey, I see you bumped the major version. I was wondering if it also makes sense at this time to change the pcf8574/pcf8575 constructor instead of an 'initialValue:? any' instead supply an 'options: NodeJS.Dict' and pull the initialValue from 'options'. If we were to do something like that, then we could also provide the ability to override the maxQueueDepth for the PromiseQueue. I struggle with this one quite often as putting a Dictionary in there (NodeJS.Dict) is not ideal solution and really goes against strong typing but is a well-established javascript constructor pattern. Thoughts? |
Hey. Wanted to update you. I put latest promise-queue into my MCP23017 module (maxDepth=3) which is the most sensitive IO chip I have and if you don't read the chip after an interrupt, for sure it will stop sending interrupts. (same goes for CAT9555). MCP23017 works beautifully with 0 rejects as we expected and so does PCF8574. I have really exercised the hardware for both PCF8574 and MCP23017 and this these changes are solid. On a side note, you should have also gotten some emails about being added as a collaborator on some private projects. I did this in transparency and in the MCP and CAT modules, you will see I removed unnecessary semicolons. These projects are scoped to my user space in case I do want to publish to NPM. Right now, I'm thinking I want to combine the code into a single project as @LynnieMagoo/automation-expander |
We could use an Interface to describe an options ( Great to hear, that all works as expected now! :) Thank you for the invitation to your projects. I'll have a look at them later today. No semicolons... uff ... I'm going crazy if there are missing semicolons. 😅 |
Just an idea... The current abstract Soon I'll need something for an MCP3424 (4 channel ADC). This could be added as well using some abstract base class for ADC IDs. |
Using 3 queued poll requests we can be sure to not miss any change by interrupt or manually triggered poll. See #59
Whatever you decide on the semicolons is fine. I will follow your lead there as well as the construction pattern in my derivatives. Check out the mcp and cat projects as to how I used you same model with chips supporting register reads |
Mcp23017 and pcf8574 One final suggestion. In the examples, change the default address to 0x20. I've tried in the states to get a PCF8574A from Amazon and every time I find a source with an image of PCF8574A, the always ship boards with PCF8574. |
Ignore this for now - See comment below with PeterStuff3.zip attached. Hi Peter, I'm doing some more testing today with PCF8574 and I can confirm that we have resolved the issue with the interrupt rejection. Now, what I'm starting to see are missed interrupts. As I'm receiving interrupts, and processing pin state changes, I'm writing values to the chip as well via I2C. It appears that on the PCF8574, writing to the chip will also clear the need to fire an interrupt. For example, my pins linked in software as follows: Pin 7 - Input -> software forwarded to Pin 0 - Output So what is happening is that as I write updates to I need to do some more digging into the datasheets but here's what I'm thinking.
I don't think this will have any affect but with the console.log lines I added for testing, I'm for sure seeing that a read followed by multiple writes is causing an issue. One other thing I thought about is to guarantee the ordering of things in the queue. For example, instead of having writes go directly to I2C (_setInternalState), have the write request added to the poll queue. Like I said, I need to do some more testing but these changes have definitely affected timing somewhere and I sometimes end up having usually 1 LED stuck on (Low Level) when I've released all the buttons following a press. |
Ignore this for now - See comment below with PeterStuff3.zip attached. And another note. I'm not just seeing this on the PCF8574, I'm also seeing this on the MCP23017 which leads me to believe the problem is in the chip poll algorithms which I've standardized or is in my application code which just follows the observer pattern and upon an input pin emitting an event, we write a value to the corresponding output pin. Will keep you posted over the next few days. |
Ignore this for now - See comment below with PeterStuff3.zip attached. I will continue to experiment. Right now, my delegation simply causes setPin() to be called on the PCF857X for each pin following each interrupt. If 4 pins changed, we do 4 writes, if 3, 3 writes, etc... The pollQueue is the appropriate solution. Just want to determine if we should also expand pollQueue to also handle writes. |
Ignore this for now - See comment below with PeterStuff3.zip attached. Happy Holidays. |
CODE STYLE ISSUE? - See Line 309 of pcf857x.ts. You are missing a 'break' on the 2nd case of the switch. Do you also want to add 'default:' to your switches? |
Peter, So, I went back to simple basic modified example.js to do exactly what my larger application is doing but without extra event listeners. I'm simply pressing and releasing 4 buttons from pics above at the same time. I am able to reproduce this issue rather quickly. From initial review, I first thought the cause was a single interrupt containing 2 pin changes which causes 2 writes but the 2nd interrupt that occurred has 2 inputs that changed. So, from all my comments above, ignore all of them for now please except for this one and the one right above regarding code style (missing the break; and default: in switches). This is starting to make me think the issue about missing interrupts, although we did see rejects without the poll queue was a combination of the on-off issue above as well as the reject when actively polling. If you get a chance, please review the logs in the zip files here. Start with PeterStuff4.zip. I've annotated the example3a failure run. I will admit the problem takes longer to reproduce when we use process.nextTick() to do the pin writes but it is still there. Here's my final assessment: Root Cause: Missing GPIO interrupts from epoll underneath onoff. The pcf857x.ts code is good. The only way to truly solve this problem will be using a combination of interrupt-driven and timer-based polling. My thoughts are to use setInterval polls of 250ms to get things back in sync. Curious about your thoughts about this... But.... Let me know your thoughts. Lyndel |
It's 1:06am here in Texas. I've now just reproduced the same issue on the MCP23017 module. The problem is either 1 of 2 things. 1) The epoll missed interrupt issue or 2) Mechanical switches that are generating too much noise for chips. The inputs are all in pullup mode and the mechanical switches when closed, ground the pin. I'm of the opinion that this is an epoll scenario as it does not occur regularly and in a repeatable timeframe. I will make plans in my larger application to simply add an configurable interrupt recovery poll interval that can be adjusted as needed. 500-1000ms should be adequate. I'm testing with 10000ms now and it allows me to see that indeed the timed poll will clear the issues. The behavior I am seeing for both these chips is that we typically get an interrupt per-pin. So, for the MCP23017, we could in theory get 16 quick interrupts. For PCF8574, we could get up to 8, and for the PCF8575, we could get up to 16 all based on the number of pins on the chip. Hypothetically, the first interrupt would complete way before the last interrupt occurs. To be 100% sure we are capable of servicing all possible interrupts, I recommend we do the following. Set the promise-queue max depth to the number of pins on the chip (as at the time promise-queue is constructed, we don't know which pins will be used as inputs vs outputs) plus an additional 1 for a manual poll. This is indeed overkill but does limit the memory footprint as we both want and should also support both manual polls as well as interrupt-driven polls. Getting some rest now. Hope these notes don't overwhelm too much. Regards, |
I guess I lied. I'm still awake. I've done additional testing with maxQueueDepth = 3 and I still cannot get a reject to happen. So my recommendation above is likely unneeded. Let's go with our original agreed value of 3 for now to be able to handle up to 4 polls (1 acttive and 3 in queue). |
This may be our issue: Consider the code from onoff that uses raw epoll events if debounce is not active (which is the case for PCF8574):
The PCF module is only configured for the 'falling' edge. This means we only poll on 'falling' detection when interrupt is low. If we were to configure for 'both', this would mean we'd be queuing 2 promises and doing double reads. But, if we onoff is not sent an event on an interrupt transitioning to low, we'd catch it on the high side when the chip is read and interrupt is cleared. I will test this later today and get back with my findings and we can discuss if we want to change or allow the rising/falling to be configured. |
Thanks again for your deep investigation! I'll have a deeper look at all this later today. Your buttons seem not to be hardware debounced. Maybe this causes some of your issues? If you press such a button it will bounce several times what may cause multiple interrupts at the port expander. If the interrupt/poll logic is fast enough you will get multiple input change events or false reads. If someone is having trouble using the internal interrupt logic, he could also setup the gpio outside of this module and just call Yeah 16 Inputs may trigger 16 interrupts in theory. But the current queue length should be enough nevertheless since one read can detect multiple changes at once. Event if some interrupts may occur if the queue is full (and so a new poll will be rejected), the final read will get the changes. |
OK. We are 3 days away from Christmas. Let's think this over the next few days before finalizing a decision. The thing I dislike most about Github is that there is no collaboration tool other than issue notes, etc. IMO, email, zoom, or facetime is better place for these discussions as sometimes lots of trial and error are needed. However, I want to drop in a 3 notes after sleeping a few hours.
a) Desired Enhancement - Support 'both' for the GPIO rising + falling which results in 2 polls. additional optional parameter 'gpioEdge: EdgeType' enableInterrupt() with default being 0x00 falling. (enum EdgeType{ falling = 0x00, rising = 0x01, both=0x02} ) b) User manages GPIO externally and calls doPoll(). Option b) might be best but as PCF8574 manages gpio.unexport as well, I'd opt for also supporting Option a) above. As I stated in 1) above - The module is simple, easy to use. If we could also support a) the user can chose whichever best meets their needs (for me, I would simply change my application layer to enableInterrupts using Both and let the PCF/CAT/MCP modules do the rest).
The benefit of such event would allow a user via a single event to manage the chip as a whole (not pin-by-pin). Adding such an event, requires a small change to setAllPins. We would need to accept either a boolean or a number as a single parameter, update the internal state based on values contained within. (by within, I mean honoring the inputBitMask, invertMask, etc). This would also align I2C write counts with read counts versus the user having to call setPin(pin, value) either 8 or 16 times resulting in 8 or 16 I2C writes. Let me know what you think! |
Attached here is a zip that includes changes above with the exception of the new event. I defer to your guidance and strengths on the declaration of the event. However, I did implement the change to setAllPins to support both number and boolean types. Hope it meets with your approval. I'm not too happy about re-exporting type GpioEdge but it seemed the simplest way to do this - Thoughts on how we might do it better? Overall, I think the code changes came out clean. I also ran es-lint to be able to tidy things up for you to review. I ran my previous example3.js file supplying 'both' for the Edge and I've not been able to reproduce any failures as 'both', although not ideal (requires double reads if see both edges), will adequately handle either the epoll or hardware issues (not sure which it is) EDIT - No sooner than I'd completed typing this post, I went back to my test harness and even 'both' does not address the issue. I'm still seeing failures. So for now, let's consider only adding new 'read' event and updates to setAllPins to support both number and boolean. Lyndel |
Greetings Peter, I think I'm finally done for the day. Let me know what you think of this. Attached are the changes that I propose we release. I've not updated the README nor have I updated examples in the attached but I did test the new 'poll' event locally and it works perfectly. Unzip above over the poll-queue branch src folder, run a diff and you will see my edits. All this being said, I'm still missing an interrupt every now and then. I suspect that the PCF/MCP is not sending it to the Pi. But, if it is, then it is being lost in epoll. Either way, anything more we do in this area of interrupt issues is a IMO, a waste of time. The only thing I can think of now that might be improved because it causes multiple reads/writes during startup are the setInputPin and setOutputPin. Would be very nice if we had a single function like:
|
In the zip I attached above, I accidentally did not remote the optional Edge parameter from enableInterrupt. I'm thinking that if we want to leave it in, we need to re-export a type GpioEdge that aliases to the Edge type in the onoff module. If you don't want to leave it in, then we can revert the import line at the top and then hard-code to 'falling' as was before. |
Peter take a look @ my private repo that I added you to. The master branch is promise-based, there's also a promise and a callback branch but Master has what think will benefit us both. I've been testing it and I have some news to share once you are back from the new year. Lyndel |
Hello Peter,
I've spent the past few days trying to get to the bottom of an issue. I have a project I'm working on that monitors various IO Expander chips and it relies solely on interrupts to detect changes once the Chip class is constructed. I've specifically modelled a cat9555 and mcp23017 after the pcf857X typescript code.
If one is solely relying on interrupts to drive chip state instead of using doPoll() , it is entirely possible that should multiple pins change arbitrarily, that multiple interrupts can occur. Should an interrupt occur when reading the chip asynchronously in _doPoll() then a rejection ("An other poll is active") would immediately occur. Once this rejection occurs, it has been proven via my testing on all my IO chips including both a PCF8574 and PCF8575 that no additional interrupts will fire until an external doPoll() is requested. In my base code, I added a timeout to repeatedly call doPoll() but that is not an ideal solution as 98%+ of the time, no issues arise. So, I decided to go one layer lower, to the chip code.
The condition I'm speaking of can be avoided if upon detecting the error, issue one or more setTimeout calls to invoke a poll after the first I2C read has completed and the chip is available for a subsequent doPoll() call.
I will be opening a PR with a fix for this issue. The proposed fix uses up to 2 attempts to read the chip via _doPoll().
Let me know your thoughts. I've tested this fix across all my IO Expander modules and it works beautifully. I've not been able to get into a condition where an interrupt occurs and the code doesn't finally read the chip to clear the interrupt.
Happy Holidays,
Lyndel
The text was updated successfully, but these errors were encountered: