[GSoC 2025] Privacy-preserving and efficient AI model training across multi-clusters #825

Open
yanmxa opened this issue Feb 10, 2025 · 0 comments
Labels
enhancement New feature or request

yanmxa commented Feb 10, 2025

Description

Open Cluster Management (OCM) streamlines multi-cluster workload management through APIs that align with SIG-Multicluster standards. Beyond traditional workload orchestration, OCM enables scalable AI training and inference across distributed environments.

As machine learning (ML) expands across clusters, data privacy becomes a critical concern. ML models rely on vast datasets, making it essential to safeguard sensitive information across clusters without compromising model performance.

This project integrates Federated Learning (FL) into OCM, enabling privacy-preserving, collaborative model training without transferring raw data between clusters. Instead, training occurs locally where the data resides, ensuring compliance, enhancing efficiency, and reducing bandwidth and storage costs.
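
To make the "train locally, share only model updates" idea concrete, here is a minimal Go sketch of sample-weighted federated averaging, one common aggregation strategy in FL. It is an illustration only, not the prototype's code: the `ClusterUpdate` type and the `FedAvg` function are hypothetical names, and the model is reduced to a flat slice of weights.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterUpdate is a hypothetical container for what one cluster reports
// after a round of local training: its model weights and how many local
// samples it trained on. The raw data itself never leaves the cluster.
type ClusterUpdate struct {
	Cluster    string
	Weights    []float64
	NumSamples int
}

// FedAvg returns the average of the reported weights, where each cluster
// contributes in proportion to its number of training samples.
func FedAvg(updates []ClusterUpdate) ([]float64, error) {
	if len(updates) == 0 {
		return nil, errors.New("no updates to aggregate")
	}
	dim := len(updates[0].Weights)
	total := 0
	for _, u := range updates {
		if len(u.Weights) != dim {
			return nil, fmt.Errorf("cluster %s: got %d weights, want %d", u.Cluster, len(u.Weights), dim)
		}
		if u.NumSamples <= 0 {
			return nil, fmt.Errorf("cluster %s: sample count must be positive", u.Cluster)
		}
		total += u.NumSamples
	}
	global := make([]float64, dim)
	for _, u := range updates {
		share := float64(u.NumSamples) / float64(total)
		for i, w := range u.Weights {
			global[i] += share * w
		}
	}
	return global, nil
}

func main() {
	updates := []ClusterUpdate{
		{Cluster: "cluster1", Weights: []float64{1.0, 2.0}, NumSamples: 100},
		{Cluster: "cluster2", Weights: []float64{3.0, 4.0}, NumSamples: 300},
	}
	global, err := FedAvg(updates)
	if err != nil {
		panic(err)
	}
	fmt.Println(global) // prints [2.5 3.5]
}
```

Only the weight vectors and sample counts cross cluster boundaries here, which is where the bandwidth and privacy benefits described above come from.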

By leveraging OCM’s Placement, ManifestWork, and other APIs, we standardize FL workflows and seamlessly integrate frameworks like Flower and OpenFL through a unified interface. This approach harnesses OCM’s capabilities to deliver scalable, cost-efficient, and privacy-preserving AI solutions in multi-cluster environments.
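
As a rough sketch of how Placement and ManifestWork could fit together for this use case, the Go program below wraps an FL client Job in one ManifestWork per selected cluster and prints the result as JSON. It uses plain maps rather than the OCM Go API types to stay dependency-free. The cluster names stand in for the output of a Placement decision, and the image name, the `SERVER_ADDRESS` variable, and both helper functions are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clientJob builds a Kubernetes Job manifest for an FL client.
// The image and the SERVER_ADDRESS variable are illustrative assumptions.
func clientJob(image, serverAddr string) map[string]any {
	return map[string]any{
		"apiVersion": "batch/v1",
		"kind":       "Job",
		"metadata":   map[string]any{"name": "fl-client", "namespace": "default"},
		"spec": map[string]any{
			"template": map[string]any{
				"spec": map[string]any{
					"restartPolicy": "Never",
					"containers": []any{
						map[string]any{
							"name":  "client",
							"image": image,
							"env": []any{
								map[string]any{"name": "SERVER_ADDRESS", "value": serverAddr},
							},
						},
					},
				},
			},
		},
	}
}

// manifestWorkFor wraps the given manifests in a ManifestWork. On the hub,
// a ManifestWork lives in the namespace named after the target managed
// cluster, which is how the work agent on that cluster finds it.
func manifestWorkFor(cluster string, manifests ...map[string]any) map[string]any {
	items := make([]any, 0, len(manifests))
	for _, m := range manifests {
		items = append(items, m)
	}
	return map[string]any{
		"apiVersion": "work.open-cluster-management.io/v1",
		"kind":       "ManifestWork",
		"metadata":   map[string]any{"name": "fl-client", "namespace": cluster},
		"spec": map[string]any{
			"workload": map[string]any{"manifests": items},
		},
	}
}

func main() {
	// Stand-in for the clusters a Placement would select.
	selected := []string{"cluster1", "cluster2"}
	job := clientJob("example.com/fl-client:latest", "fl-server.example.com:8080")
	for _, cluster := range selected {
		out, err := json.MarshalIndent(manifestWorkFor(cluster, job), "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

In a real controller the selected clusters would be read from the Placement's decisions and the ManifestWork would be created through the Kubernetes API rather than printed.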

Expected Outcome

  • Comprehensive Documentation:

    • Define the scenarios addressed by the prototype, highlighting its purpose and value.
    • Provide an intuitive and architectural comparison between Federated Learning (FL) and OCM, mapping FL terminology to OCM APIs to showcase OCM’s native support for FL.
    • Illustrate the complete Federated Learning workflow within Open Cluster Management.
  • Extended Prototype (or CRD) Support:

    • Enable persistence of the aggregated model in AWS S3 (the prototype currently supports only a native PVC).
    • Extend compatibility to support additional Federated Learning frameworks like OpenFL (currently supports Flower). This requires understanding how OpenFL works, containerizing it, and integrating it into the prototype.
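
One way to approach the storage item above is to hide the persistence target behind a small interface, so that a PVC-backed store and an S3-backed store become interchangeable. The Go sketch below is an assumption about how that could look, not the prototype's design: `ModelStore` and `DirStore` are hypothetical names, and only the directory-backed variant (for example, the mount path of a PVC) is implemented.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ModelStore is a hypothetical abstraction over where the aggregated
// model is persisted after each training round.
type ModelStore interface {
	Save(round int, model []byte) error
	Load(round int) ([]byte, error)
}

// DirStore persists models as files in a directory, such as the path
// where a PVC is mounted inside the aggregation server's container.
type DirStore struct {
	Dir string
}

// path returns the file that holds the model for the given round.
func (s DirStore) path(round int) string {
	return filepath.Join(s.Dir, fmt.Sprintf("round-%d.bin", round))
}

// Save writes the serialized model for a round, creating Dir if needed.
func (s DirStore) Save(round int, model []byte) error {
	if err := os.MkdirAll(s.Dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(s.path(round), model, 0o644)
}

// Load reads back the serialized model for a round.
func (s DirStore) Load(round int) ([]byte, error) {
	return os.ReadFile(s.path(round))
}

func main() {
	dir, err := os.MkdirTemp("", "fl-model")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	var store ModelStore = DirStore{Dir: dir}
	if err := store.Save(1, []byte("serialized-weights")); err != nil {
		panic(err)
	}
	got, err := store.Load(1)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(got))
}
```

An S3-backed type would implement the same two methods using an S3 client, with the round number mapped to an object key, so the aggregation code would not need to change when the backend does.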

Recommended Skills

Golang, Kubernetes, Federated Learning, Open Cluster Management, Scheduling

Mentor(s)

Meng Yan (@yanmxa, [email protected]) - primary
Qing Hao (@haoqing0110, [email protected])

References

Open Cluster Management
Federated Framework - Flower
Federated Framework - OpenFL
Placement concept
ManifestWork concept
Federated Learning Controller for Open Cluster Management
Implementing a controller
Generating CRDs

Discussion

Feel free to raise your questions here. You can also reach out to us in the Slack channel.

yanmxa added the bug label on Feb 10, 2025
haoqing0110 removed the bug label on Feb 10, 2025
qiujian16 added the enhancement label on Feb 11, 2025
qiujian16 moved this to To do in OCM releases on Feb 11, 2025