Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC modification for Differentiated Artifacts Framework #1996

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

romain-intel
Copy link
Contributor

This includes both serialization/deserialization customization as well as directed artifact processing.

Examples coming soon.

@romain-intel romain-intel marked this pull request as ready for review August 29, 2024 06:19
@nflx-mf-bot
Copy link
Collaborator

Testing[813] @ 7bd1dcd

Copy link
Contributor Author

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few clarifying comments.

@@ -60,10 +60,13 @@ def save_blobs(self, blob_iter, raw=False, len_hint=0):

Parameters
----------
blob_iter : Iterator over bytes objects to save
blob_iter : Iterator[Union[bytes, Tuple[bytes, str]]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added documentation for the specific raw flag

raw : bool, optional
Whether to save the bytes directly or process them, by default False
len_hint : Hint of the number of blobs that will be produced by the
len_hint : int, default 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clearly need to fix the comment (it seems borked before too)

@@ -76,23 +79,40 @@ def save_blobs(self, blob_iter, raw=False, len_hint=0):

def packing_iter():
for blob in blob_iter:
raw_setting = raw
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic is to use raw setting but allow each blob to override. Compression methods would be fixed for now (not sure if there is a lot of value in allowing a huge amount of customization here) but we could add another one ourselves.

@@ -30,7 +27,12 @@ def __init__(
# produce a set of SHA keys to prefetch based on artifact names
prefetch = set()
for ds in self.task_datastores:
prefetch.update(ds.keys_for_artifacts(prefetch_data_artifacts))
l = [
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chage is because there can now be more than one key for each artifact so this is now a list of list.

if self._datastore and name in self._datastore:
# load the attribute from the datastore...
x = self._datastore[name]
# ...and cache it in the object for faster access
setattr(self, name, x)
return x
# For MetaflowArtifact, we extract the representation of it to present to
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store representation for user but keep original artifact as well.

@@ -129,8 +149,15 @@ def __init__(
self._objects = {}
self._info = {}
elif self._mode == "r":
if data_metadata is not None:
# We already loaded the data metadata so just use that
if data_metadata is not None and not any(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic to fetch metadata from S3 if the one from the metadata is incomplete. This way, we don't change metadata view at all. This is only used through client.

@@ -740,8 +801,19 @@ def artifacts_iter():
# '_' as they are used by the Metaflow runtime.
delattr(flow, var)
yield var, val
art = flow._orig_artifacts.get(var)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change in behavior possible here since we keep things we serialize if they require it. For others, no change. Note that the idea would be that post_serialize could hollow out the MetaflowArtifact as well to keep memory consumption low.

@nflx-mf-bot
Copy link
Collaborator

Testing[813] @ 7bd1dcd had 14 FAILUREs.

@nflx-mf-bot
Copy link
Collaborator

Testing[813] @ 7bd1dcd had 15 FAILUREs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants