Trial_timeout not working properly for hyperparameter sweep in pipeline #39549

Open
morrissharp opened this issue Feb 4, 2025 · 2 comments
Labels
  • bug: This issue requires a change to an existing behavior in the product in order to be resolved.
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Machine Learning
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team
  • Service Attention: Workflow: This issue is responsible by Azure service team.

Comments

@morrissharp

  • Package Name: azure-ai-ml
  • Package Version: 1.24.0
  • Operating System: Windows
  • Python Version: 3.11

Describe the bug
I am trying to submit a hyperparameter sweep job in a pipeline with a trial_timeout, but the trial_timeout is not applied: trials keep running past the configured limit. When the job is submitted from the Python SDK, maxRunDurationSeconds does not appear to be set on the individual trials (see the screenshots below).

To Reproduce
Steps to reproduce the behavior:

Run the code below:

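from azure.ai.ml import load_component
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.sweep import Choice

# load the two components used by the pipeline
benchmark_run_component_func = load_component(source="./benchmark.yml")
score_component_func = load_component(source="./aggregation.yml")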

# define a pipeline
@pipeline()
def pipeline_with_hyperparameter_sweep():
    """Tune hyperparameters using sample components."""
    benchmark_run = benchmark_run_component_func(
        scenario=Choice(
            [
                44,
                45,
                46,
            ]
        ),
    )

    sweep_step = benchmark_run.sweep(
        primary_metric="correctness_score",
        goal="maximize",
        sampling_algorithm="grid",
        compute="small-cluster-low-priority",
    )

    sweep_step.set_limits(max_total_trials=200, max_concurrent_trials=3, timeout=100000, trial_timeout=1800)

    score_data = score_component_func(results_dir=sweep_step.outputs.benchmark_results_dir)


pipeline_job = pipeline_with_hyperparameter_sweep()

# ml_client is assumed to be an already authenticated azure.ai.ml.MLClient
ml_client.jobs.create_or_update(pipeline_job)

Expected behavior
A trial should get cancelled if it goes past the trial_timeout time.

Screenshots
I have checked the raw json for the individual trials, and it appears that maxRunDurationSeconds is not set.

[Screenshot: raw JSON of an individual trial]
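For anyone who wants to repeat this check without reading the raw JSON by eye, here is a minimal sketch that searches a parsed JSON document for a key at any depth. The layout of the real trial JSON is not shown in this issue, so the sample payload and the `find_key` helper below are hypothetical; only the key name `maxRunDurationSeconds` comes from the report.

```python
import json
from typing import Any, Iterator, Tuple


def find_key(node: Any, key: str, path: str = "$") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every occurrence of `key` in nested JSON data."""
    if isinstance(node, dict):
        for name, value in node.items():
            child = f"{path}.{name}"
            if name == key:
                yield child, value
            # Keep descending: the key may also appear deeper in the tree.
            yield from find_key(value, key, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_key(item, key, f"{path}[{index}]")


# Hypothetical payload standing in for a trial's raw JSON (real layout unknown).
raw_trial_json = '{"settings": {"nodeCount": 1}, "children": [{"settings": {}}]}'

hits = list(find_key(json.loads(raw_trial_json), "maxRunDurationSeconds"))
if hits:
    for location, value in hits:
        print(f"{location} = {value}")
else:
    print("maxRunDurationSeconds not found")
```

Searching recursively avoids guessing where the service nests the field, which matters here because the exact schema is not part of the report.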

Job running for 2+ hrs when trial_timeout is set to 1800 seconds (30 min)

[Screenshot: the running job]
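To put a number on the overrun described above, here is a small sketch that compares a trial's elapsed time with its trial_timeout. The start timestamp is made up for illustration and `seconds_over_limit` is a hypothetical helper; the 1800-second limit and the two-hour runtime are the figures from this report.

```python
from datetime import datetime, timedelta, timezone


def seconds_over_limit(started_at: datetime, now: datetime, trial_timeout_seconds: float) -> float:
    """Return how many seconds a trial has run past its timeout (0.0 if within it)."""
    elapsed = (now - started_at).total_seconds()
    return max(0.0, elapsed - trial_timeout_seconds)


# Hypothetical start time; only the durations below come from the issue.
started_at = datetime(2025, 2, 4, 9, 0, tzinfo=timezone.utc)
now = started_at + timedelta(hours=2)

overrun = seconds_over_limit(started_at, now, trial_timeout_seconds=1800)
print(f"Trial is {overrun:.0f} s past its trial_timeout")
```

With a two-hour runtime and a 30-minute limit, the trial is at least 5400 seconds (90 minutes) past the point where it was expected to be cancelled.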


@github-actions bot added the customer-reported, needs-triage, and question labels on Feb 4, 2025
@xiangyan99 added the bug, Machine Learning, and Service Attention labels and removed the question and needs-triage labels on Feb 5, 2025
@github-actions bot added the needs-team-attention label on Feb 5, 2025

github-actions bot commented Feb 5, 2025

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.

@kristapratico added the Client label on Feb 5, 2025
@kingernupur self-assigned this on Feb 5, 2025
@kingernupur
Member

Thank you for reporting the issue @morrissharp. Our team is looking into it.
