-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metadata support with tests #155
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, and sorry for the delay in getting to this PR 🙏
I think there are a couple of inconsistencies that should be straightforward to fix. I also suggested an additional test to verify metadata reuse.
More generally, I think we should exactly mimic the cache structure of the Hub CLI because we're almost there anyway, and it could potentially be useful down the line.
Great work, thanks a lot! 🙌
Sources/Hub/Downloader.swift
Outdated
@@ -11,6 +11,9 @@ import Combine | |||
|
|||
class Downloader: NSObject, ObservableObject { | |||
private(set) var destination: URL | |||
private(set) var metadataDestination: URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used by the Downloader at all, or is it a remnant of deleted commits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remnant of deleted commits, just reverted this to the original
Sources/Hub/HubApi.swift
Outdated
case consistencyError(String) | ||
case diskSpaceError(String) | ||
case permissionError(String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are unused
Sources/Hub/HubApi.swift
Outdated
} | ||
|
||
func isValidSHA256(_ hash: String) -> Bool { | ||
let sha256Pattern = "^[0-9a-f]{64}$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let sha256Pattern = "^[0-9a-f]{64}$" | |
let sha256Pattern = "^[0-9a-f]{40}$" |
I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you are right, but that's for SHA256 etags that represent LFS files. Commit hashes will never match this, we have to use this pattern instead.
Sources/Hub/HubApi.swift
Outdated
/// Reference: https://github.com/huggingface/huggingface_hub/blob/b2c9a148d465b43ab90fab6e4ebcbbf5a9df27d4/src/huggingface_hub/_local_folder.py#L263 | ||
/// | ||
/// - Parameters: | ||
/// - localDir: The local directory where files are downloaded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// - localDir: The local directory where files are downloaded. | |
/// - localDir: The local directory where metadata files are downloaded. |
Thanks for the review! I'm considering adding offline mode support under HubApi, controlled by a boolean flag in the constructor. If the flag isn’t provided, it could default to using something like isExpensive as a fallback. This would allow us to return already downloaded files after verification without requiring a new download when on cellular data. Curious to hear your thoughts! |
Tests/HubTests/HubApiTests.swift
Outdated
@@ -167,7 +167,7 @@ class HubApiTests: XCTestCase { | |||
} | |||
|
|||
class SnapshotDownloadTests: XCTestCase { | |||
let repo = "coreml-projects/Llama-2-7b-chat-coreml" | |||
var repo = "coreml-projects/Llama-2-7b-chat-coreml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this back to let
and update testDownloadSmolLargeFile
to define a new ivar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a confusion with the hash validation of the commit hashes versus the etags of LFS files.
I would suggest to:
- Add another validation function for commit hashes (40 chars)
- Write a test (if possible, it may be convoluted) that retrieves metadata from an LFS file and verifies that:
- The commit hash matches the commit validation function
- The etag hash matches the isValidSHA256 function
Sources/Hub/HubApi.swift
Outdated
} | ||
|
||
func isValidSHA256(_ hash: String) -> Bool { | ||
let sha256Pattern = "^[0-9a-f]{64}$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you are right, but that's for SHA256 etags that represent LFS files. Commit hashes will never match this, we have to use this pattern instead.
Sources/Hub/HubApi.swift
Outdated
let remoteCommitHash = remoteMetadata.commitHash ?? "" | ||
|
||
// Local file exists + metadata exists + commit_hash matches => return file | ||
if isValidSHA256(remoteCommitHash) && downloaded && localMetadata != nil && localCommitHash == remoteCommitHash { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The isValidSHA256
check will always fail here on the remoteCommitHash
.
Sources/Hub/HubApi.swift
Outdated
// => means it's an LFS file (large) | ||
// => let's compute local hash and compare | ||
// => if match, update metadata and return file | ||
if localMetadata != nil && isValidSHA256(remoteEtag) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isValidSHA256
will succeed when it's an LFS file.
Sure, sounds great! Maybe get this merged first and then open a new PR, to make things simpler? |
Sounds good! |
On a different note: after these changes get merged, calling snapshot will trigger a lot of redownloads on mobile devices due to the absence of existing metadata files. This is expected behavior, but please be mindful of the potential impact on bandwidth and storage usage. |
ah, you are right. Do you think we should add the offline mode first then? And/or tag this as a new version |
I'm curious about your opinion on changing the download logic to first download metadata files if they're missing (and redownload non-lfs files), then only redownload lfs files if their checksums don't match the etag or if they're missing. This could safeguard against high bandwidth usage without enabling offline mode, which would completely restrict any new downloads |
That sounds like a great idea! |
Just made the changes and added extra test cases! |
With these changes, full LFS files will still potentially be downloaded when calling snapshot, but only if they don't match the remote hash, which is good because the local model is stale in that case. I think this is safe to merge before offline mode since current behavior of snapshot is already expected to make a decision on downloading the full file internally. Once offline mode is implemented, devs will be able to safely call snapshot everytime with an extra offline mode parameter that simply checks if the local files are valid based on the stored metadata. This is useful for the case that a device is on an expensive or limited cell connection, and devs want to make a decision (or even prompt the user) if it is ok to download in this state, or just continue to use a stale model in offline mode until on a better connection. |
This PR aims to bring over some of huggingface_hub Python library's capabilities to swift-transformers: