Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metadata support with tests #155

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

ardaatahan
Copy link
Contributor

This PR aims to bring over some of huggingface_hub Python library's capabilities to swift-transformers:

  • Adds file metadata support to prevent unnecessary redownloads, ensure file integrity, and update to the latest file versions when needed.
  • Disables automatic redirection during head requests to fix getFileMetadata, which currently doesn't return expected commit hash and etag for lfs files.
  • Adds additional tests to verify metadata handling and download logic.

@pcuenca pcuenca mentioned this pull request Jan 26, 2025
Copy link
Member

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, and sorry for the delay in getting to this PR 🙏

I think there are a couple of inconsistencies that should be straightforward to fix. I also suggested an additional test to verify metadata reuse.

More generally, I think we should exactly mimic the cache structure of the Hub CLI because we're almost there anyway, and it could potentially be useful down the line.

Great work, thanks a lot! 🙌

@@ -11,6 +11,9 @@ import Combine

class Downloader: NSObject, ObservableObject {
private(set) var destination: URL
private(set) var metadataDestination: URL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used by the Downloader at all, or is it a remnant of deleted commits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remnant of deleted commits, just reverted this to the original

Comment on lines 154 to 156
case consistencyError(String)
case diskSpaceError(String)
case permissionError(String)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are unused

Sources/Hub/HubApi.swift Show resolved Hide resolved
Sources/Hub/HubApi.swift Show resolved Hide resolved
}

func isValidSHA256(_ hash: String) -> Bool {
let sha256Pattern = "^[0-9a-f]{64}$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let sha256Pattern = "^[0-9a-f]{64}$"
let sha256Pattern = "^[0-9a-f]{40}$"

I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you are right, but that's for SHA256 etags that represent LFS files. Commit hashes will never match this, we have to use this pattern instead.

Sources/Hub/HubApi.swift Show resolved Hide resolved
Sources/Hub/HubApi.swift Outdated Show resolved Hide resolved
Tests/HubTests/HubApiTests.swift Show resolved Hide resolved
/// Reference: https://github.com/huggingface/huggingface_hub/blob/b2c9a148d465b43ab90fab6e4ebcbbf5a9df27d4/src/huggingface_hub/_local_folder.py#L263
///
/// - Parameters:
/// - localDir: The local directory where files are downloaded.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// - localDir: The local directory where files are downloaded.
/// - localDir: The local directory where metadata files are downloaded.

@ardaatahan
Copy link
Contributor Author

Thanks for the review!

I'm considering adding offline mode support under HubApi, controlled by a boolean flag in the constructor. If the flag isn’t provided, it could default to using something like isExpensive as a fallback.

This would allow us to return already downloaded files after verification without requiring a new download when on cellular data.

Curious to hear your thoughts!

@ardaatahan ardaatahan requested a review from pcuenca January 30, 2025 23:00
@@ -167,7 +167,7 @@ class HubApiTests: XCTestCase {
}

class SnapshotDownloadTests: XCTestCase {
let repo = "coreml-projects/Llama-2-7b-chat-coreml"
var repo = "coreml-projects/Llama-2-7b-chat-coreml"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this back to let and update testDownloadSmolLargeFile to define a new ivar.

Copy link
Member

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a confusion with the hash validation of the commit hashes versus the etags of LFS files.

I would suggest to:

  • Add another validation function for commit hashes (40 chars)
  • Write a test (if possible, it may be convoluted) that retrieves metadata from an LFS file and verifies that:
    • The commit hash matches the commit validation function
    • The etag hash matches the isValidSHA256 function

}

func isValidSHA256(_ hash: String) -> Bool {
let sha256Pattern = "^[0-9a-f]{64}$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you are right, but that's for SHA256 etags that represent LFS files. Commit hashes will never match this, we have to use this pattern instead.

let remoteCommitHash = remoteMetadata.commitHash ?? ""

// Local file exists + metadata exists + commit_hash matches => return file
if isValidSHA256(remoteCommitHash) && downloaded && localMetadata != nil && localCommitHash == remoteCommitHash {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The isValidSHA256 check will always fail here on the remoteCommitHash.

// => means it's an LFS file (large)
// => let's compute local hash and compare
// => if match, update metadata and return file
if localMetadata != nil && isValidSHA256(remoteEtag) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isValidSHA256 will succeed when it's an LFS file.

@pcuenca
Copy link
Member

pcuenca commented Jan 31, 2025

I'm considering adding offline mode support under HubApi, controlled by a boolean flag in the constructor. If the flag isn’t provided, it could default to using something like isExpensive as a fallback.

This would allow us to return already downloaded files after verification without requiring a new download when on cellular data.

Sure, sounds great! Maybe get this merged first and then open a new PR, to make things simpler?

@ardaatahan
Copy link
Contributor Author

I'm considering adding offline mode support under HubApi, controlled by a boolean flag in the constructor. If the flag isn’t provided, it could default to using something like isExpensive as a fallback.
This would allow us to return already downloaded files after verification without requiring a new download when on cellular data.

Sure, sounds great! Maybe get this merged first and then open a new PR, to make things simpler?

Sounds good!

@ardaatahan ardaatahan requested a review from pcuenca January 31, 2025 19:24
@ardaatahan
Copy link
Contributor Author

On a different note: after these changes get merged, calling snapshot will trigger a lot of redownloads on mobile devices due to the absence of existing metadata files. This is expected behavior, but please be mindful of the potential impact on bandwidth and storage usage.

@pcuenca
Copy link
Member

pcuenca commented Jan 31, 2025

calling snapshot will trigger a lot of redownloads

ah, you are right. Do you think we should add the offline mode first then? And/or tag this as a new version 0.2.0.

@ardaatahan
Copy link
Contributor Author

ah, you are right. Do you think we should add the offline mode first then? And/or tag this as a new version 0.2.0.

I'm curious about your opinion on changing the download logic to first download metadata files if they're missing (and redownload non-lfs files), then only redownload lfs files if their checksums don't match the etag or if they're missing. This could safeguard against high bandwidth usage without enabling offline mode, which would completely restrict any new downloads

@pcuenca
Copy link
Member

pcuenca commented Feb 2, 2025

That sounds like a great idea!

@ardaatahan ardaatahan requested a review from pcuenca February 3, 2025 17:58
@ardaatahan
Copy link
Contributor Author

Just made the changes and added extra test cases!

@ZachNagengast
Copy link
Collaborator

With these changes, full LFS files will still potentially be downloaded when calling snapshot, but only if they don't match the remote hash, which is good because the local model is stale in that case. I think this is safe to merge before offline mode since current behavior of snapshot is already expected to make a decision on downloading the full file internally.

Once offline mode is implemented, devs will be able to safely call snapshot everytime with an extra offline mode parameter that simply checks if the local files are valid based on the stored metadata. This is useful for the case that a device is on an expensive or limited cell connection, and devs want to make a decision (or even prompt the user) if it is ok to download in this state, or just continue to use a stale model in offline mode until on a better connection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants