Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use copy of state for state dumps to file #365

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

johannaengland
Copy link
Contributor

Scope and purpose

Mentioned in #364.

Contributor Checklist

Every pull request should have this checklist filled out, no matter how small it is.
More information about contributing to Zino can be found in the
README.

  • Added a changelog fragment for towncrier
  • Added/amended tests for new/changed code
  • Added/changed documentation
  • Linted/formatted the code with black, ruff and isort, easiest by using pre-commit
  • The first line of the commit message continues the sentence "If applied, this commit will ...", starts with a capital letter, does not end with punctuation and is 50 characters or less long. See https://cbea.ms/git-commit/
  • If applicable: Created new issues if this PR does not fix the issue completely/there is further work to be done

@johannaengland johannaengland added the bug Something isn't working label Aug 30, 2024
@johannaengland johannaengland self-assigned this Aug 30, 2024
Copy link

codecov bot commented Aug 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.66%. Comparing base (e2164d5) to head (209d561).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #365   +/-   ##
=======================================
  Coverage   98.66%   98.66%           
=======================================
  Files          77       77           
  Lines        9706     9707    +1     
=======================================
+ Hits         9576     9577    +1     
  Misses        130      130           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johannaengland johannaengland force-pushed the bugfix/use-state-copy-for-statedump branch from 73e016c to 209d561 Compare August 30, 2024 13:38
Copy link

Copy link
Member

@lunkwill42 lunkwill42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this looks like the right approach, though I think you probably want a deep copy.

Also not sure how expensive an operation it is for PyDantic to copy the entire structure, which can grow quite large. If running against the entire set of Uninett routers, is there a noticable difference in the runtime of this function before and after this PR?

@@ -42,8 +42,9 @@ class ZinoState(BaseModel):
def dump_state_to_file(self, filename: str):
"""Dumps the full state to a file in JSON format"""
_log.debug("dumping state to %s", filename)
copied_state = self.model_copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may actually want a deep copy of the state to ensure nothing is being updated from another thread:

https://docs.pydantic.dev/latest/api/base_model/#pydantic.BaseModel.model_copy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I try to use deep copy I get the following exception

2024-09-03 09:48:07,045 - ERROR - apscheduler.executors.default (asyncio_0) - Job "ZinoState.dump_state_to_file (trigger: interval[0:05:00], next run at: 2024-09-03 09:53:07 CEST)" raised an exception
Traceback (most recent call last):
  File "/home/johanna/zino/.venv/lib/python3.10/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/home/johanna/zino/src/zino/utils.py", line 33, in wrapper
    result = func(*args, **kwargs)
  File "/home/johanna/zino/src/zino/state.py", line 45, in dump_state_to_file
    copied_state = self.model_copy(deep=True)
  File "/home/johanna/zino/.venv/lib/python3.10/site-packages/pydantic/main.py", line 311, in model_copy
    copied = self.__deepcopy__() if deep else self.__copy__()
  File "/home/johanna/zino/.venv/lib/python3.10/site-packages/pydantic/main.py", line 777, in __deepcopy__
    _object_setattr(m, '__dict__', deepcopy(self.__dict__, memo=memo))
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/home/johanna/zino/.venv/lib/python3.10/site-packages/pydantic/main.py", line 789, in __deepcopy__
    deepcopy({k: v for k, v in self.__pydantic_private__.items() if v is not PydanticUndefined}, memo=memo),
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 206, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 238, in _deepcopy_method
    return type(x)(x.__func__, deepcopy(x.__self__, memo))
  File "/usr/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.10/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.10/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 206, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/usr/lib/python3.10/copy.py", line 172, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.10/copy.py", line 271, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.10/copy.py", line 211, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.10/copy.py", line 211, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.10/copy.py", line 161, in deepcopy
    rv = reductor(4)
TypeError: cannot pickle '_asyncio.Future' object

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, that seems pretty weird... what could be putting a Future object into this structure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm investigating it - it is not FlappingStates - I checked out a commit before it was added and the same error happens

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with PlannedMaintenances

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I have figured out that the problem lies with ZinoState.events - I will investigate further which of them lead to problems

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then again, I'm not sure that this actually solves the issue at hand. I still get the occasional error about dicts changing size while Pydantic is iterating them to copy them, which was the reason for doing a copy to begin with. And, of course, this makes sense. The copy is being made in the same thread that is doing the dumping, so if the structure is being changed by another thread, the same problem will still occur.

The only way to ensure we don't get this error is to ensure that no other threads are concurrently modifying the data structure while it's being serialized to disk (or while it's being copied). This can be done by either making sure we're not running multiple threads, or by adding locks (which will complicate things quite a bit).

It could also be "worked around", by simply retrying the serialization several times if it fails due to these errors - which would be safe enough when #366 is merged, as that would protect us from leaving a corrupt state file when the error inevitably occurs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do believe the reason the dumping is happening in a separate thread is because the method dump_state_to_file() is not declared as async. When this function is scheduled to run using APscheduler, apscheduler will automatically defer its execution to a separate worker thread because it is a non-async function and would block the entire event loop.

I think the safest way to run the function would be to declare it as an async function, which lets it run in the main event loop thread. It will, of course, block the event loop while serializing - which could be ok if it's fast, disastrous if it is slow. A more complicated implementation would be to implement it as a function that copies the state in the main thread, and then defers the actual serialization of this copy to a worker thread (since the disk i/o is what would block the most).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants