[Feature Request] Enhancing Unit Testing for FLA in the Context of Active Development and Diverse GPU Compatibility #209

Open
uniartisan opened this issue Mar 1, 2025 · 1 comment
Labels: enhancement (New feature or request), todo (To be implemented)

@uniartisan
Collaborator

Feature Request

Implement a comprehensive, phased unit testing strategy for FLA to ensure compatibility across a wide range of GPUs and improve the project's overall robustness. This includes refining the unit tests to cover different types of graphics cards and optimizing how tests are triggered.

Motivation

  1. Current Development State: FLA is in active development, but the existing unit tests have significant flaws. Some test samples are incorrect, and the tests fail to run on certain consumer-grade graphics cards.
  2. User Base Consideration: A large portion of the user community uses NVIDIA 30-series and 40-series graphics cards. Ensuring compatibility with these widely used GPUs is crucial for user satisfaction and adoption.
  3. GPU Scarcity: High-end GPUs such as the A100 and H100 are scarce, so testing needs a strategic approach that starts with more accessible resources such as CPUs.

Your Contribution

  1. Initial CPU-Based Testing:
    a. Run all unit tests under CPU emulation first. This helps identify and fix basic functional issues without relying on scarce GPUs.
    b. Test the changed files first. This targeted approach saves time and resources during the initial testing phase.
  2. GPU Validation:
    a. Once the CPU emulation tests are stable, run the unit tests on A100 and H100 GPUs. The goal is for all unit tests to pass on these high-end GPUs, which are common in professional and research settings related to the project.
  3. Manual Full-Test Trigger:
    a. Before any code is merged, maintainers should manually trigger the full unit test suite, so that the entire codebase, both newly changed and existing parts, is tested.
  4. Expanding to Consumer-Grade GPUs:
    a. After achieving stability on A100 and H100, gradually introduce tests for NVIDIA 30-series and 40-series GPUs. This addresses the needs of the majority of the user base.
    b. Subsequently, add tests for Intel and AMD graphics cards. This further expands the project's compatibility across hardware platforms and makes it more robust and adaptable.
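
A minimal Python sketch of two pieces of the plan above: picking the tests for changed files first, and mapping a device name to the phase in which it would be tested. Every name here (`select_tests`, `phase_for_device`, the phase labels, the example paths) is hypothetical and not part of FLA. The matching rule (a test file `test_<name>.py` covers a module or package called `<name>`) and the device-name patterns are assumptions, so treat this as an illustration rather than a proposed implementation.

```python
import re
from pathlib import PurePosixPath

# Hypothetical rollout phases following the order of the plan above.
# The regexes are assumptions about how device names are reported.
PHASES = [
    ("datacenter", re.compile(r"\b(A100|H100)\b")),
    ("consumer", re.compile(r"\bRTX (30|40)\d{2}\b")),
]


def phase_for_device(device_name):
    """Map a device name to a rollout phase.

    None means no GPU is present ("cpu"); names that match no pattern
    (for example Intel or AMD cards) fall into "other".
    """
    if device_name is None:
        return "cpu"
    for phase, pattern in PHASES:
        if pattern.search(device_name):
            return phase
    return "other"


def select_tests(changed_files, test_files):
    """Return the test files that should run first for a set of changes.

    A test is selected when it was changed itself, or when its stem minus
    the "test_" prefix equals the stem or a parent directory name of a
    changed Python file. This naming convention is an assumption.
    """
    changed = set(changed_files)
    keys = set()
    for path in changed_files:
        p = PurePosixPath(path)
        if p.suffix != ".py":
            continue
        keys.add(p.stem)
        keys.update(p.parts[:-1])

    selected = []
    for test in test_files:
        stem = PurePosixPath(test).stem
        covers_change = stem.startswith("test_") and stem[len("test_"):] in keys
        if test in changed or covers_change:
            selected.append(test)
    return selected


if __name__ == "__main__":
    # Example paths are made up for illustration.
    tests = ["tests/ops/test_gla.py", "tests/ops/test_delta_rule.py"]
    print(select_tests(["fla/ops/gla/chunk.py"], tests))
    print(phase_for_device("NVIDIA GeForce RTX 4090"))
```

In CI, `changed_files` could come from `git diff --name-only` against the target branch, and the device name from `torch.cuda.get_device_name()`. A manually triggered full run would skip `select_tests` and execute every test file.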
@uniartisan added the enhancement and todo labels on Mar 1, 2025
@uniartisan self-assigned this on Mar 1, 2025
@Triang-jyed-driung
Contributor

Currently, chunk_dplr fails on both the H800 and the 4090 with Triton 3.1.0 and 3.2.0; only the Triton 3.0.0 nightly runs. It fails exactly as the FAQ describes.
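
One way a test suite could record this kind of version-specific failure is a small version gate. The sketch below is an assumption, not FLA code: `parse_version`, `KNOWN_FAILING`, and `chunk_dplr_expected_to_fail` are hypothetical names, and the failing set contains only the two releases reported above.

```python
import re

# Triton releases reported above as failing for chunk_dplr.
KNOWN_FAILING = {(3, 1, 0), (3, 2, 0)}


def parse_version(text):
    """Parse the leading "major.minor.patch" of a version string.

    Anything after the patch number (such as a build tag) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def chunk_dplr_expected_to_fail(triton_version):
    """Return True if this Triton version is in the known-failing set."""
    return parse_version(triton_version) in KNOWN_FAILING
```

With pytest, the result could feed `pytest.mark.xfail(condition, reason=...)`, using `triton.__version__` as the input, so the known failure stays visible without blocking the rest of the suite.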
