-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upload via bootloader times out for bigger sketches #9
Comments
From @quartzjer on November 12, 2014 1:43 FYI, I've seen the exact same behavior when flashing via the IDE and there's an stk500 timeout, sometimes it just blinks green, sometimes it's solid green, I think it just means that the flash was incomplete/corrupted and the bootloader is looping restart or stuck trying to boot bad code? |
From @erictj on November 12, 2014 1:44 Regarding the stuck-green-light-while-flashing issue that seems to be cropping up--some observations:
Sounds like we have a couple of moving targets with regards to stable flashing via chrome: one is the ongoing automatic updates of the chrome browser at random times, and the other is the increasing size of bootstraps with most subsequent updates. Perhaps a good approach for regression testing would be to make a custom bootstrap (maybe with a huge PROGMEM string initialized) to get the hex size close to, say 255k, and add that to part of the integration test. On every update of Chrome, we auto-run a 255k flash via chrome 10 times and ensure it works every time. |
The bootloader checks address 0, if it is 0xffff it doesn't boot the sketch and restarts itself, which results in the blinking you get when you just burn a bootloader after erasing the chip (the bootloader does one blink each time it is run). When a flash somehow fails, it seems unlikely that address 0 is 0xffff, but perhaps it runs the sketch, which halfway does something weird and ends up doing a jump 0 or something. Before the bootloader boots the sketch, it turns the leds off (that is - I think it accidentally leaves the pullups on, so they might light up very dimly, but they should certainly nog light up fully). This means that when the led lights up solid green, it never leaves the bootloader for some reason (one alternative explanation could be that the sketch jumps back to the bootloader before you can see the led is off, except that I think the led turns off halfway through the bootloader already). As for the solid green led - it seems I can reproduce this often when I first plug in a scout in the morning. Replugging, or using the on/off switch is then sufficient to solve this, so this is likely a separate problem. |
From @quartzjer on November 14, 2014 21:3 if it helps, I'm pretty sure I watched my scout (re)boot in the middle of a flash, which of course caused the stk500 timeout... so maybe the bootloader or the 16u2 is triggering a reset? |
I dove into the timeout with avrdude aspect of this issue. It somehow looks like this is more likely to occur on one scout than another, but that might be dumb coincidence, or perhaps the two scouts I tested are running slightly different 16u2 or bootloader code, not sure (the "better" one was one of my first scouts from before we were officially shipping). Anyway, I hung a logic analyzer on the TX/RX pins, and here's what happens:
So far facts, now speculation. I think that the bootloader misses a byte, or a checksum error occurs:
The value of this bootloader timeout is a bit ill-defined. There is a loop that checks the UART status which is executed F_CPU times, so the actual timeout depends on the instructions generated (but I'd expect 2-3 seconds, certainly > 1 second, which is inconsistent with my observations above). The avrdude source is a bit vague on the actual timeout, but when no data is being received, it should timeout after 5 seconds. AVR068, the STK500V2 spec, says:
Meaning our bootloader does not comply with the spec. So, I'm not really sure what is the root cause of our problem, but a flipped bit or slightly off bitrate (I'd have to capture at > 1Mhz samplerate to check this) doesn't seem entirely unlikely here. In any case, it seems like mismatches between timeouts (in particular, avrdude doesn't retry fast enough) is the reason it doesn't recover from this. I'll see if I can dive into the bootloader to see what is the underlying cause here. For HQ, we might be able to fiddle with retry timeouts and fix this there, for avrdude / Arduino IDE this might be more tricky. |
I had a quick look at the chrome app code and whipped up Pinoccio/chrome-app-pinoccio#31. It's rough and untested, but it might just work. If it does, we'll probably need to polish it a bit more before merging, though. |
I did a bit more digging today. It turns out my previous suspicions about missing a byte or failing a checksum weren't correct. The last datapacket (before the timeouts) is correctly received and writting to flash, but then the bootloader locks in I'm not entirely sure how it gets from this wait loop to actually starting the sketch. I'm sure it doesn't just continue with the code as you'd expect. Sometimes it seems it passes the start of the main() function again (based on some pin toggles I put there), but sometimes it doesn't. Based on the Looking at the disassembly doesn't show anything weird though. I can't really explain what happens there though - at worst it could infinitely loop, but the pin toggle I added suggests that's not what happens (and the "reset" also disproves that). Out of time for today. Summarizing, it seems something weird is going on. I can't really explain what I'm seeing based on the code. So either I'm missing some strange corner case, or there is some hardware problem (e.g. causing the CPU to behave out of spec)... |
From @gvdw on December 9, 2014 11:18 If it's blocking in Could be useful to use a serial port monitor that timestamps the serial traffic so we can try and see where it goes pear shaped. Also, making a sketch that manually sets an obvious RGB torch colour and while(1) as early in the sketch as possible (ie, in the setup function before Scout.Setup()) would help indicate if the bootloader has been left. Scout.setup() and Scout.loop() can be left in to chew up flash space, this sounds like a timing issue or lost message that is becoming more apparent/likely to occur with bigger sketch sizes. |
That's what I initially though, but it actually seems that the bootloader (running in the 256rfr2) stalls or locks up or otherwise breaks with the expected program flow, so it doesn't actually send any reply at all.
I'm debugging with a logic analyzer, so I can see the 'Hello from pinoccio' prompt show up on the serial lines when the sketch was booted, so that's already covered. |
One more test run I did today shows that at some point during the flashing, it runs the |
When I define a timer interrupt that triggers after every instruction and writes the current program counter to the second serial port, the problem stops occuring. When I set the timer to let the program to run for 16 us between interrupts (which themselves also take around 16us) the timeout still happens. The program counters that are output show that it is happily running around in a loop, waiting for flashing to finish and then the next interrupt the counter is 0xb7 (word address, 0x16e in bytes). This suggests that the program counter got reset to 0x0 at some point. I thought that the problem might be that an interrupt got accidentally enabled and triggered, but the fact that my timer is still triggered disproves this (since interrupts are disabled when an interrupt triggers). To doublecheck, I globally disabled interrupts and the timeout still happens. |
Looking at the contents of the flash afterwards, it seems 0x0 contains a |
One more datapoint: Bypassing the 16u2 using a sparkfun FTDI breakout didn't help either, I got the timeout once and the command failed once. It's a bit tricky to test, because (presumably because of the extra resistor, and/or the capacitor already on the reset pin, I couldn't get autoreset to work. |
From @gvdw on December 11, 2014 12:18 it only happens in the second half of the flash right? could it be some sort of near vs far issue? when I was testing the bootloader originally I tested with sketches right up to APP_END, and I don't remember having any issues. so I wonder what's different about the current situation. are you testing via chrome still or via direct serial/avrdude/arduino IDE? |
And this morning I tried to reproduce the problem on a Arduino Mega 2560 (running our bootloader minus RF stuff), but it ran for dozens of uploads without any timeouts. |
And one more test: Disabling the RF stuff (wibo and uracoli) and the LED code (as needed on the mega2560) doesn't fix the problem, I can still trigger the timeout then. |
@matthijskooijman Do you have a frequency counter? If so, what are you seeing on the 16U2 crystal output? Is it a solid 16MHz with a low ppm? (I'll check to see if we have one here at R/GA) |
(BTW, you must be using the new Saleae Logic+Scope product? Amazing product!) |
I can probably measure the frequency with the Logic8 I have. However, I'm not sure what you're looking for - it's the bootloader that is misbehaving, the 16u2 just seems to be an innocent bystander here. |
@matthijskooijman It seems that there are quiet a lot of hickups on VCC when TX/RX traffic starts. Have you tried bypassing the TI 3.3V regulator with a clean and stable 3.3V rail? Another way to fix it might be to add ~1mF cap at the output of the regulator. |
Eric tested on the scout he has with a more stable regulator, that didn't help at all. I think I tried running with a big cap on the 3V3 line, but I'm not sure, so I'll try again. Putting an actual external 3.3V on that line is a bit tricky, because the line isn't available on any header pin (ICSP pads are probably the best access) and I'd need to bypass USB power into the regulator... |
In earlier testing, I found that flashing a hex file with random data would usually (or even always?) work, but then a subsequent upload attempt would fail right away (avrdude never got any reply at all). I suspected this was caused by the bootloader overwriting itself and that the cause for this was the same as for the timeout, but it turns out neither are true - reading back the flash shows the bootloader is untouched. This problem was caused by my simplification of the bootloader, I removed some code that wasn't essential to the timeout (e.g. led handling and unused stk500 commands). It turns out that the bootloader timeout depends on the led handling. The timeout is 10,000 iterations of a loop that contains a 1us delay, which would result in 10ms of timeout (avrude seems to typically need 200-300ms for its first command). However, the led code contains The reason the first upload did work, is that, when the first flash word is 0xffff, the bootloader restarts instead of starting the application, so it kept listening on serial instead of actually timing out. I had also found that flashing all zeroes would always make avrude recover after the timeout. When the timeout happens because of a random jump into the program, I think the zeroes (together with any 0xff's left from erasing the flash when burning the bootloader) are just nop instructions, or at least instructions that do not cause any looping. This means that after a short while, all of the flash has been executed, and the bootloader starts again. If the timing is right, this allows avrdude to resync and continue where it left off. This doesn't work when writing a real sketch, since that sketch would then start running, and never run all the way up to the bootloader again. So, neither of these things seem to be actually related to the timeout, unfortunately. They do point out that the timeout handling in the bootloader could do with a bit of a cleanup... |
I have one more datapoint that contradicts this. When filling the flash with For reference, this is what I used to generate the rjmp instructions:
|
Very useful new info: If I modify the bootloader to map writes to the first half of the flash into the second half, the timeout happens a lot sooner (saw 4%, 6% and 7% just now). This rules out the "timeout happens after some time due to charge buildup / drift / whatever" and confirms that the problem is really related to the flash location (probably to the second half of the flash, but that's not entirely confirmed here). |
More importantly, it reduces the time of each test run by 30 seconds or so :-p |
And one more interesting observation: If I modify a register between It doesn't seem the register itself matters - I saw this with the It doesn't seem to be timing related - inserting nops equivalent to the register-modifying instruction in the same place does not seem to work. Removing the else branch of the address sanity check if seems to also prevent the problem, presumably because the structure of the code changes. Looking at the dissassembly indeed shows significant changes. Surprisingly, there is also an unexpected amount of change from just inserting a For reference, here is the huge assembly diff from adding that one line: https://gist.github.com/matthijskooijman/887c7a328f6cb3b7a225 |
A bit of closer inspection shows that, because I removed the EEPROM command from the switch, the |
Sidestepping from the previous "it stops breaking when I press here" observation, I further reduced the bootloader code to a minimal program that just flashes zeroes into the upper half of the application over and over again. No serial is involved, so it's not really a bootloader anymore, but it's enough to reproduce the problem. Only erasing didn't show the problem, so it seems that both erasing and programming is needed. I've added some debug output to port B (active low leds, blue on PB4, red on PB5, green on PB6) and PORTE (for my logic analyzer). When you run it, it blinks the red and blue led in turn. This should continue indefinitely, but after a while (usually within 10-15 seconds), it stops blinking and the green led lights up (usually together with the red led, though I think I've also seen it with the blue led lit). This indicates the bootloader has restarted, without any reset reason listed in The code is here, in the bootloader-debug branch: https://github.com/Pinoccio/hardware-pinoccio/blob/e201a7af3f16578c76e23e28ac86e361229c986d/firmware/bootloader/src/main.c There is a Makefile in the parent directory, but for reference, these are the compiler options used:
Using this code, I made a logic analyzer trace. This first image shows the start of the trace, showing the PE2 assertion at startup and then alternating between pulling PE4 low during page erase and pulling PE5 low during page write. In between (visible when you zoom in closer), both pins are high while filling up the page buffer. This continues exactly like this, until the next trace. This shows the end of the trace - during the page write (PE5 is low) execution somehow never passes the point where it sets PE5 high again and instead ends up at the start of the bootloader (presumably it ends up at 0x0 or somewhere else in the application section and then runs up to the bootloader again, but I did not verify this for this particular test). As expected from the code, the trace doesn't change anymore after this. Still suspecting the powersupply, I tried connecting a 100uF cap to VCC, which didn't help. It did seem to soften the voltage changes a bit - A bit more curvy instead of spiky (changes are not as pronounced as the traces I posted previously, min/max values measured were in the 3.15 - 3.32V range). However, I manually pushed the cap's pin against the ICSP VCC pad, so perhaps actually soldering something to the pad, or even directly to the AVR VCC pins might yield better results... |
For reference, here's the assembly generated for the code: https://gist.github.com/matthijskooijman/e0268c9d222aa72001f0 |
I finally managed to try this on a Zigbit module, and the problem also occurs there. This should rule out any hardware problems with the Pinoccio board and indicate that it's really a problem in the 256rfr2 chip. I used the "ATmega256RFR2 ZigBit Xplained Pro Extension" board (ATZB-256RFR2-XPRO-ND). The setup is simple: I connected power to J4 and my logic analyzer to pins 1-4 of J3 (which map to PF0 - PF3). I also connected the logic analyzer ground to a spare ground pin: For power, I tried both the 3.3V pin of a Pinoccio board, as well as an Arduino Mega (to rule out problems with the Pinoccio's power supply). Here's trace of the output, just like above: This trace was generated using this code: https://raw.githubusercontent.com/Pinoccio/hardware-pinoccio/20f636963d38e7681adf2ec3dc081486412139e4/firmware/bootloader/src/main.c To get the code onto the board, here's the steps to take:
Note that -Os is required, without it the problem does not seem to trigger (presumably because timing is essential). When running this code, the yellow and green leds should blink alternatingly (to fast to notice). If it breaks (that is, main is re-executed a second time, without MCUSR being set), the led led will turn on and typically the yellow led remains on, indicating that it went rampant during the writing of the page. For completeness, here's the dissassembly of the code, which confirms that there is no way that execution should be able to restart:
|
Daniel Wax from Atmel has confirmed that this is actually a bug in the
|
Great work Matthijs, this has been a real weird issue but it's good to get confirmation of where the problem lies and a way to mitigate. Unfortunately, it means existing bootloaders are susceptible and will need to be updated if people experience issues. I've been testing Daniel's fix and so far have flashed a full sketch over 100 times without any issues. We have seen variation in the past where the chip will work OK for 10+ flashes but hitting the hundred mark makes me quite optimistic. Have made a new branch for testing changes, really nothing to it at this stage but helps keep track: |
Nice, thanks for confirming! As for committing the fix - I was planning to move the bootloader into this repo instead of core-pinoccio. I still have to commit my work to master properly (this issue has sidetracked me), but the code is already in the debug branch. Could you commit your fix on top of 85ecda8 in this repo and push that to a branch? That commit includes a cleaned up file layout, with a single Makefile for compilation. |
Quickly looking at your code, it looks good. One remark though - I think the sei() should be exactly before the sleep instruction, to prevent a race condition (given that we can be certain that the spm instruction will take a bit of time, I don't think it's really going to happen here, but in general sei just before sleep is the recommended practice. This works becuase sei() guarantees that the next instruction will be executed before any interrupts fire, guaranteeing the SPM interrupt will wake us up, even if it is already pending when you go to sleep). |
your 85ecda8 commit has a file "datasheets\RF\2450FB15L0001 Balun:Filter.pdf" - the colon is an illegal character under windows so it is forcing me to delete the file if I commit. And for some reason I am getting 403 denied trying to push it under Debian. must be a sign I should go to bed, I'll try again tomorrow (but I'm happy for you to put the changes in the right place if you're keen). Good call on sei() too, let's put it just before the sleep. |
@gvdw, I added your fix to https://github.com/Pinoccio/hardware-pinoccio/commits/spm-fix. I moved the sei() and also shuffled some other code around in a second commit, does that look ok to you? I haven't actually tested this code yet, though, I'll do that ASAP. I've also removed the colon from that filename, hope that works better for you :-) |
thanks Matthijs :) your changes look good at a quick glance, I won't get a chance to test it for a couple of weeks but will revisit then. |
From @jingman on November 12, 2014 1:19
Copied from original issue: Pinoccio/server-hq#191
The text was updated successfully, but these errors were encountered: