Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DATA-2672] Refactor data capture and data sync #4161

Merged
merged 60 commits into from
Sep 26, 2024

Conversation

nicksanford
Copy link
Member

@nicksanford nicksanford commented Jul 2, 2024

Ticket

Also fixes the following flakey tests:
https://viam.atlassian.net/browse/DATA-2880
https://viam.atlassian.net/browse/RSDK-8249
https://viam.atlassian.net/browse/RSDK-7934

Also fixes pathological behavior around retrying uploading arbitrary files:
https://viam.atlassian.net/browse/DATA-3114

Also changes the max_sync_threads default from 1000 to runtime.NumCPU():
https://viam.atlassian.net/browse/DATA-3064

Also changes data collectors to only call the capture method from a single goroutine rather than spawning a new goroutine per capture interval:
https://viam.atlassian.net/browse/DATA-2771
For why see: https://viam.atlassian.net/browse/DATA-2771?focusedCommentId=34912

Also changes sync_interval_mins to a default of 0.1 if unset:
https://viam.atlassian.net/browse/DATA-2506

This PR refactors the builtin datamanager into datamanager.builtin.sync and datamanager.builtin.capture packages.

Screenshot 2024-08-20 at 4 29 56 PM

builtin is responsible for taking the deps & config from resource graph and validating that they are in a state where initialization / reconfiguration are possible (they are only impossible if new errors are introduced at compile time).

It also refactors tests to:

  1. reduce reliance on mocked out time
  2. cover more cases
  3. be faster

It also moves data collector functionality previously under the builtin data manager package into the data package, where the only users of that functionality were data collectors.

It also changes the signature of the builtin datamanager constructor to allow us to test how the builtin datamanager behaves when it's connection with app:

  1. hasn't yet initialized
  2. initialized but is unhealthy
  3. initialized and is healthy

Not managing the lifecycle of the connection with the app was the source of several bugs in the past few months, so investing the time to manage that connection appropriately seemed like an important goal of this refactor.

I'm splitting this PR into sub-PRs. Once each of these are merged, I'll rebase this PR on main & it will continue to shrink until there is nothing more we can split out of this PR.

Changes in this PR:

I believe all of these are fixing bugs / risks in the existing code:

Builtin:

  1. Document & centralize datamanger sync & capture configuration as well as default config values & business logic (whether or not scheduled sync is running, whether or not, where the capture dir is, and whether or not a new config is or is no equal to a previous config).
  2. Have config.Validate return an error if any of the following config values are negative which will cause reconfiguration to safely fail (previously this would panic or cause resource exhaustion / other buts): sync_interval_mins, file_last_modified_millis, maximum_capture_file_size_bytes, delete_every_nth_when_disk_full,

Sync:

  1. Add logging for reporting cloud connectivity state changes
  2. Add logging & reporting on data sync upload stats
  3. Change max_sync_threads default to runtime.NumCpu() / 2
  4. Change the goroutine which walks directories and enqueues the files to stop running if any of the following happen: a. the datamanger.Close method is called b. the datamanager. Reconfigure method is called with a config that requires changing 3. In the case of the data manager's Sync grpc method being called, the GRPC request's context is canceled.
  5. Fixes datamanager's Sync grpc method panicking if called while the robot is offline
  6. Don't enqueue files to be synced that are already in progress (previously, these files would be dropped by one of the goroutines in the upload pool now the code doesn't create this waste by checking if the file is in progress before enqueueing it.
  7. Don't enqueue files to be synced that are 0 bytes in length (previously, this would result in data capture files being moved to the failed directory and arbitrary files entering an exponential backoff retry loop, which would leave one upload worker busy for a long as the file remains empty, potentially forever).
  8. Add logging for when files are moved to the failed directory
  9. Add logging for when exponential backoff succeeds (debug)
  10. Add logging for when exponential backoff gives up due to a terminal error (warn)
  11. Add logging for when exponential backoff retries due to a transient error (info) and include connectivity information in transient error retries.
  12. Wrap errors that occur in sync workers so that it is easier to follow which files hit which errors.
  13. Changes the size of the files to sync go channel from the number of max sync threads (1000 by default) to a blocking channel. This prevents queueing more files than we have available workers. This makes the behavior of sync more reliable & reduces the likelihood that we try to upload the same file twice.
  14. While walking directories, check if a file is already being processed (i.e., the file deleter is about to delete it, or an upload worker is trying to upload it).

Capture:

  1. Change builtin.resourceMethodMetadata to be named capture.collectorMetadata

Linked PRs:

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jul 2, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jul 5, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jul 10, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jul 10, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Jul 12, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 26, 2024
@nicksanford nicksanford merged commit ea997e0 into viamrobotics:main Sep 26, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants