Solving mlflow=2.12.2=py39hb3b8efb_0 takes a lot of time #684

Open
JeanChristopheMorinPerso opened this issue May 27, 2024 · 5 comments

@JeanChristopheMorinPerso
Contributor

Hello, while trying rattler (via py-rattler), I noticed that some packages take ages to resolve. For example:

>>> import rattler
>>> import asyncio
>>> asyncio.run(rattler.solve(['main'], ['mlflow=2.12.2=py39hb3b8efb_0'], platforms=['osx-arm64', 'noarch']))

mlflow=2.12.2=py39hb3b8efb_0 is fast to resolve with conda-libmamba-solver.

The same is true with all these packages (on osx-arm64):

  • orange3=3.36.2=py39h46d7db6_0
  • ray-dashboard=2.6.3=py39hca03da5_2
  • ray-default=2.6.3=py39hca03da5_2
  • spark-nlp=5.1.2=py39hca03da5_0
  • spyder=5.5.1=py38hca03da5_0
  • spyder=5.5.1=py39hca03da5_0
  • spyder=5.5.1=py310hca03da5_0
  • streamlit-faker=0.0.2=py39hca03da5_0
@baszalmstra
Collaborator

Thanks for these! I can confirm that these are horribly slow and instant with libsolv. We will investigate!

@JeanChristopheMorinPerso
Contributor Author

Ok, thanks for confirming! If you are curious, I found these by iterating through every single record in the channel and trying to resolve them individually.

@baszalmstra
Collaborator

I've found a performance improvement that fixes these cases (they resolve instantly) but makes all other solves about 50% slower... Need more investigation. 😅

@wolfv
Contributor

wolfv commented May 31, 2024

We found a solution for this case by changing our selection heuristic. Currently we decide packages with few options first, in order to minimize backtracking. But that does not seem to play well with Python constraints: apparently it ends up causing excessive backtracking.

If we reverse the heuristic (i.e. decide packages with many options first), things are super fast.

Did you write some kind of benchmark script @JeanChristopheMorinPerso? I think we'll adjust the heuristic, but it would be great to have more extensive benchmarks to check that we don't introduce big regressions.

For the "fix", I just changed the < to > in this line: https://github.com/mamba-org/resolvo/blob/56fa93de027f483223c6525671e243d89cd805fe/src/solver/mod.rs#L761
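To make the `<` versus `>` change concrete, here is a toy sketch of a "which package do we decide next" heuristic. It is not resolvo's actual code: the `pick_next` function, the `prefer_fewest` flag, and the candidate data are all made up for illustration, and a real solver tracks far more state than a dictionary of option lists.

```python
from typing import Dict, List, Optional


def pick_next(candidates: Dict[str, List[str]], prefer_fewest: bool) -> Optional[str]:
    """Return the undecided package to branch on next (hypothetical helper).

    prefer_fewest=True mimics the "few options first" heuristic,
    prefer_fewest=False mimics the reversed one ("many options first").
    """
    best_name: Optional[str] = None
    best_count: Optional[int] = None
    for name, options in candidates.items():
        count = len(options)
        if best_count is None:
            better = True
        elif prefer_fewest:
            better = count < best_count  # the original comparison
        else:
            better = count > best_count  # the flipped comparison
        if better:
            best_name, best_count = name, count
    return best_name


# Invented candidate lists, only meant to show the ordering effect.
undecided = {
    "python": ["3.8", "3.9", "3.10", "3.11", "3.12"],
    "mlflow": ["2.12.2"],
    "numpy": ["1.24", "1.26"],
}

fewest_first = pick_next(undecided, prefer_fewest=True)
most_first = pick_next(undecided, prefer_fewest=False)
print(fewest_first, most_first)
```

With the flipped comparison the toy picks `python` (the package with the most candidates) before anything else, which matches the description above of deciding packages with many options first.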

@JeanChristopheMorinPerso
Contributor Author

JeanChristopheMorinPerso commented May 31, 2024

Great news!

> Did you write some kind of benchmark script @JeanChristopheMorinPerso? I think we'll adjust the heuristic, but it would be great to have more extensive benchmarks to check that we don't introduce big regressions.

I don't think what I have can be called a benchmark... I basically take repodata and try to resolve every single "latest" record one by one (including all its variants).

Something like

import asyncio
from datetime import timedelta

import rattler


async def main():
    channels = [rattler.Channel("main")]
    platforms = [rattler.Platform("osx-arm64"), rattler.Platform("noarch")]
    virtual_packages = [p.into_generic() for p in rattler.VirtualPackage.current()]

    repo_datas = await rattler.fetch_repo_data(
        channels=channels,
        platforms=platforms,
        cache_path="/tmp/py-rattler-cache/repodata",
        callback=None,
    )

    for subdir in repo_datas:
        for package_name in subdir.package_names():
            # Sort so that the newest version/build ends up last.
            all_records = sorted(
                subdir.load_records(rattler.PackageName(package_name)),
                key=lambda x: (x.version, x.build_number),
            )

            # Keep every variant that shares the latest version and build number.
            latest = all_records[-1]
            records = [
                r
                for r in all_records
                if r.version == latest.version and r.build_number == latest.build_number
            ]

            # Queue one solve per variant, then wait for all of them.
            futures = []
            for record in records:
                futures.append(
                    rattler.solve(
                        channels,
                        [rattler.MatchSpec(str(record))],
                        platforms=platforms,
                        virtual_packages=virtual_packages,
                        timeout=timedelta(seconds=30),
                    )
                )
            await asyncio.gather(*futures)


asyncio.run(main())

(I did not test this specific code)

I'm doing this to see if a channel is solvable and also to do some channel analysis, and thought it would be simpler and faster to use rattler instead of either calling conda create in subprocesses or using libmamba.
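For anyone wanting to turn the loop above into something closer to a benchmark, a rough sketch of a timing wrapper follows. Everything in it is an assumption made for illustration: `time_solves`, `solve_one`, and `fake_solve` are hypothetical names, and the stand-in solver only sleeps. With rattler, `solve_one` would presumably wrap the `rattler.solve(...)` call shown above; whether cancelling on timeout actually interrupts a native solve is not something this sketch checks.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List, Optional


async def time_solves(
    specs: List[str],
    solve_one: Callable[[str], Awaitable[object]],
    timeout_s: float,
) -> Dict[str, Optional[float]]:
    """Solve each spec on its own and record wall-clock seconds.

    A value of None means the solve did not finish within timeout_s.
    """
    timings: Dict[str, Optional[float]] = {}
    for spec in specs:
        start = time.perf_counter()
        try:
            await asyncio.wait_for(solve_one(spec), timeout=timeout_s)
            timings[spec] = time.perf_counter() - start
        except asyncio.TimeoutError:
            timings[spec] = None
    return timings


async def fake_solve(spec: str) -> None:
    # Stand-in for a real solver call: specs starting with "slow" hang.
    await asyncio.sleep(5 if spec.startswith("slow") else 0)


timings = asyncio.run(
    time_solves(["fast-pkg=1.0", "slow-pkg=2.0"], fake_solve, timeout_s=0.2)
)
slow_specs = [spec for spec, seconds in timings.items() if seconds is None]
print(slow_specs)
```

Running the specs one at a time keeps the per-spec timings from interfering with each other, at the cost of a longer total run than the `asyncio.gather` approach.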
