
Faster communication with host/chip #256

Draft · wants to merge 27 commits into main
Conversation

@hunse (Collaborator) commented Oct 28, 2019

This PR makes a number of changes that speed up superhost <-> chip communication:

  • Use a host snip that communicates with the superhost via a socket (faster than RPC).
  • Use a larger packet size.
  • Run the chip and superhost simultaneously (rather than one, then the other). This adds a single-timestep delay between superhost and chip.

TODO:

  • Implement communication between host and learning snip.
  • Clean up commit history: squash the timing commits, and squash input channel buffering into the larger-packet-size commit (the latter makes the former obsolete).
  • Clean up code (run black on each commit; ensure pylint etc. pass).
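The socket transport in the first bullet could look roughly like the following. This is a minimal sketch, not the PR's actual implementation: `send_spikes`, `recv_exact`, the count-header framing, and the little-endian `<i` format are all illustrative assumptions.

```python
import socket
import struct


def send_spikes(sock, spikes):
    """Send one spike packet: a 4-byte count header, then the packed spikes.

    Packing many spikes into one packet amortizes per-send overhead, which
    is the point of the larger packet size in this PR.
    """
    payload = struct.pack("<i%di" % len(spikes), len(spikes), *spikes)
    sock.sendall(payload)


def recv_exact(sock, nbytes):
    """Read exactly `nbytes` from a stream socket (recv may return short)."""
    chunks = []
    while nbytes > 0:
        chunk = sock.recv(nbytes)
        if not chunk:
            raise ConnectionError("socket closed before full packet arrived")
        chunks.append(chunk)
        nbytes -= len(chunk)
    return b"".join(chunks)
```

Compared with an RPC round-trip per value, a raw stream socket with explicit framing like this avoids serialization and dispatch overhead on every step.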

@hunse (Collaborator, Author) commented Oct 29, 2019

The larger packet size (825a680) fixes #254.

@hunse force-pushed the faster-comm branch 5 times, most recently from cc37ef6 to df56f68 on October 30, 2019 at 20:35
@xchoo (Member) commented Oct 30, 2019

Just a comment that you need to run
`sudo apt-get install g++-arm-linux-gnueabihf`
to get the host snips to compile. Might be something to add to the documentation (though it may already be in the NxSDK docs).

@hunse (Collaborator, Author) commented Nov 8, 2019

I've added several more commits to make things faster for learning specifically.

a384ca2 takes advantage of nengo/nengo#1581 to skip checking on the outputs of our internal nodes (from builder/inputs.py), since we control the outputs of these nodes and can ensure they're safe.

6e49272 is the one I'm least sure about, since it does add a one-timestep delay between host and chip (I think), but with the benefit that host and chip can run at the same time, so things are faster. It might be nice to have a way to turn this on or off.

@arvoelke (Contributor) commented Nov 9, 2019

> since it does add a one time-step delay between host and chip (I think), but with the benefit that host and chip can be running at the same time so things are faster. It might be nice to have a way to turn this on or off.

Can this be generalized in the same way as https://github.com/ctn-waterloo/nengo_brainstorm/pull/22, in particular by allowing multiple steps of delay to buffer on the input and/or output sides while things run asynchronously? That would cover the cases both before and after this change.

@hunse (Collaborator, Author) commented Nov 11, 2019

It could be generalized (that's what #26 was doing way back in the day), but that uses a different mechanism (we actually buffer things), rather than just having the chip and superhost run at the same time. And even if we buffer, there would still be a performance advantage to running the chip and superhost simultaneously. So I think they're independent features.
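The trade-off being discussed here (concurrency at the cost of one step of latency) amounts to a one-deep pipeline: at each wall-clock slot, the chip consumes the host output from the previous slot while the host computes the next one. A small illustrative sketch, where `pipelined_schedule` is a hypothetical helper, not code from this PR:

```python
def pipelined_schedule(steps):
    """Return (slot, host_step, chip_input_step) tuples for a 1-deep pipeline.

    In each slot the host computes step `slot` while the chip processes the
    host output produced in the previous slot, so host and chip can run
    concurrently, but the chip sees host data delayed by one step.
    """
    schedule = []
    for slot in range(steps + 1):
        host = slot if slot < steps else None  # host busy computing `slot`
        chip = slot - 1 if slot > 0 else None  # chip consumes previous output
        schedule.append((slot, host, chip))
    return schedule
```

With sequential execution, each step costs `t_host + t_chip`; pipelined, each slot costs roughly `max(t_host, t_chip)`, which is where the speedup comes from. A multi-step buffer (as in nengo_brainstorm#22) would generalize `slot - 1` to `slot - k`.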

@hunse force-pushed the faster-comm branch 2 times, most recently from 709725d to 2f16524 on November 13, 2019 at 14:14
hunse and others added 11 commits November 13, 2019 09:54

Some import statements just imported `scipy` when we needed
`scipy.sparse`. Import order differences made this an occasional
bug. Fixes #252.

This allows us to do a proper `bones-check` with `black`.

The hardware tests are still in 3.5.2 to support NxSDK.

This commit also fixes some slight changes by `nengo-bones` 0.6.0
that were missed in the upgrade commit because of the missing
`bones-check`.

Not backwards compatible with previous versions.

This is useful for testing SNIPs.

- Add a timer around the `Simulator._run_steps` call, to measure the
  time taken for all steps.
- Connect to the board outside the timing loop, so that this does not
  count towards the step time.
- Add a timer specific to SNIPs, to get the most accurate timing
  (after we call the board run function, so all setup has happened).

This reduces unnecessary communication with the chip.

Previously, fixed checking of `neurons_per_dimension` and a fixed
value for `add_to_container` made `get_ensemble` not particularly
useful for users trying to make their own `DecodeNeurons`. Now,
these are configurable, and default to the values that users would
likely want.

The host SNIP runs on the host and facilitates communication
with the superhost using sockets. This is faster than using
the default RPC interface.

We also take care to make sure both the host and chip SNIPs
end properly, by sending a message with a negative spike count.
This helps to eliminate board hangs.

To allow the host SNIP to work with multiple `run` calls, we
keep it idling in between `run` calls, waiting for a message.
If the board disconnects before a subsequent `run` call,
the negative spike count message will tell the host SNIP to stop.

This improves performance by reducing the number of channel reads.

The socket between the superhost and host was dropping data when
trying to send larger numbers of spikes. This seems to be solved
by getting rid of the step counter on the host SNIP; something about
sending the number of steps as a separate message at the start
threw things off in the socket, I guess.

- Also assert one block per core with learning
- Also get core less often (outside loop) in learn SNIP

Previously, `Simulator._collect_receiver_info` spent significant time
calling `receive` on each receiver to load information into a queue
in the receiver, and then getting it back out again. We now skip that
step, and just do everything right in `_collect_receiver_info`.

- Eliminating the `hasattr` call in `_collect_receiver_info` also
  has a significant effect on speed.
- Simpler queueing in `HostReceiveNode` avoids a `while` loop and helps
  with speed there.

We used to do this copy in Nengo; now we don't, so we need to copy here.

This allows the Nengo model on the (super)host to run
simultaneously with the chip, reducing the time per step but adding
a one-step delay between the (super)host model and the chip model.

This makes things faster by not requiring us to increment a counter
through the node, and also ensures we get all the data.

We also fix the time in this node to use zero-based timesteps
rather than one-based timesteps (to conform with core Nengo).
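The negative-spike-count shutdown described in the host-SNIP commit message above might look roughly like this on the receiving side. This is a hedged sketch: `host_loop`, `_recv_exact`, and the 4-byte little-endian framing are assumptions for illustration, not nengo-loihi's actual code.

```python
import struct


def _recv_exact(sock, nbytes):
    """Read exactly `nbytes` from a stream socket, or fewer if it closes."""
    data = b""
    while len(data) < nbytes:
        chunk = sock.recv(nbytes - len(data))
        if not chunk:
            return data  # connection closed
        data += chunk
    return data


def host_loop(sock):
    """Process spike packets until a negative count arrives, then exit.

    Each packet is a 4-byte count header followed by `count` 4-byte spike
    words. A negative count is the shutdown sentinel, so the loop ends
    cleanly instead of blocking forever (which is what causes board hangs).
    Between `run` calls the loop simply idles here, waiting for a header.
    """
    processed = 0
    while True:
        header = _recv_exact(sock, 4)
        if len(header) < 4:
            break  # connection dropped
        (count,) = struct.unpack("<i", header)
        if count < 0:
            break  # sentinel: superhost says stop
        _recv_exact(sock, 4 * count)  # consume the spike payload
        processed += count
    return processed
```

The key design point is that the sentinel travels in-band on the same channel as the data, so the host SNIP needs no extra control connection to learn that a run has ended.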
@hunse hunse mentioned this pull request Nov 15, 2019
@hunse hunse mentioned this pull request Mar 18, 2020
@tbekolay tbekolay marked this pull request as draft December 13, 2021 21:22