Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeds up the Path and derived Parameters #890

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

MarcSkovMadsen
Copy link
Collaborator

@MarcSkovMadsen MarcSkovMadsen commented Nov 12, 2023

Closes #889

  • Avoiding to use Parameterized resolve_path function as there is big, big overhead compared to using simple function!
  • Caching the actual resolve to not do this over and over and over again in Panel apps.

If I run pytest tests/testpathparam.py, I see.

main: 57 passed in 0.21s
this: 57 passed in 0.19s

So this is not where you see the benefit. That is because you are only testing new instances and new paths. But that is not how Path is being used in practice. For example in Panel the pn.config is rather static and holds paths to css etc. which is being requested every time a page is loaded.

Performance tests

My setup

Python 3.10 on Linux.

Profiling script

You can use the script below.

import pytest
from pyinstrument import Profiler
import param
from pathlib import Path

N = 100000

TESTS_ROOT = Path.cwd()

@pytest.fixture(autouse=True)
def auto_profile(request):
    PROFILE_ROOT = (TESTS_ROOT / ".profiles")
    # Turn profiling on
    profiler = Profiler()
    profiler.start()

    yield  # Run test

    profiler.stop()
    PROFILE_ROOT.mkdir(exist_ok=True)
    results_file = PROFILE_ROOT / f"{request.node.name}.html"
    profiler.write_html(results_file)

class Custom(param.Parameterized):
    path = param.Path("script.py")


def test_class():
    for i in range(N):
        Custom.path

custom = Custom()
def test_existing_instance():
    for i in range(N):
        custom.path

def test_new_instance():
    for i in range(N):
        Custom().path

try:
    from param.parameters import _resolve_path
    def test_no_instance():
        for i in range(N):
            _resolve_path("script.py")
except:
    pass

Results

N=100000

branch test_class test_existing_instance test_new_instance test_no_instance
main 11.5s 11.5s 15.8s
this - pathlib 0.7s 0.7s 4.9s 0.3s
this - os 0.6s 0.7s 4.9s 0.3s

It makes you think about the cost of using Parameterized classes and ParameterizedFunctions! You can compare this to FastApi and Pydantic. Pydantic 2.0 was a reimplementation in Rust to make applications like FastApi fast.

@MarcSkovMadsen
Copy link
Collaborator Author

MarcSkovMadsen commented Nov 12, 2023

Additional Speed up if using faster attribute lookup.

If the implementation of Path would not use costly self._resolve_path and self.search_paths attribute look ups then it would again be much faster.

The below was run on this branch and shows this.

image

The implementation would look like

image

Its a matter of whether you want an 8% speed up over code readability for humans and static analysis tools in your editor.

Additional speed up by not looking up cwd.

Path._resolve look up the cwd on each run because the default value of search_paths is None.

At least in Panel we should probably provide the search_paths once on instantiation of pn.config to avoid having this cost over and over again.

Maybe we should even change the default behaviour of Path.search_paths from None to the cwd when instantiated?


def _get_default_search_paths_as_list()->typing.List[str]:
return [str(path) for path in _get_default_search_paths()]

class resolve_path(ParameterizedFunction):
Copy link
Collaborator Author

@MarcSkovMadsen MarcSkovMadsen Nov 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere we should tell users that there is a overhead cost of using this class instead of the _resolve_path function. I don't know if we should even deprecate it?

@maximlt
Copy link
Member

maximlt commented Nov 12, 2023

There are broken tests so I assume this is still work in progress?

A test suite is not a benchmark so I'm not surprised we don't see performance penalties or regressions from the test suite itself. We now leverage asv to run some benchmark, feel free to contribute to it https://github.com/holoviz/param/tree/main/benchmarks.

@MarcSkovMadsen
Copy link
Collaborator Author

MarcSkovMadsen commented Nov 12, 2023

I hoped to get some feedback first. For example its seems Param does not utilize caching today and the focus is minimum of code. This is a change in favor of speed.

Thus the first step is really to spark the discussion with this PR. And I would like to know if you would like to accept this kind of change before I spend time finalizing.

The tests are breaking on macOS and Windows. To me it looks like the path strings are resolved in different But equivalent ways by os and pathlib.

Having thought about it, I assumed pathlib was better as its "modern" Python. But now I'm not sure. At least I should profile that too.

@MarcSkovMadsen
Copy link
Collaborator Author

MarcSkovMadsen commented Nov 12, 2023

Ok. Using pathlib is slow. The image below slows the new implementation without caching on left and old implementation without caching on right.

image

I will revert to using os methods and keep the cache.

@maximlt maximlt self-requested a review November 12, 2023 18:27
@maximlt
Copy link
Member

maximlt commented Nov 12, 2023

I need to spend more time to review this (not sure I'll have time this week!). My feedback from working on Param 2.0:

  • I'm not surprised there are performance improvements to gain, performance hasn't been a major focus, even if with the addition of the asv benchmark we've been able to catch some regressions and mitigate them.
  • don't trust the test suite! It has many gaps, I can't count how many times I broke Panel or HoloViews without breaking the test suite.
  • Param is so old that many of its bugs have turned into features, changing it requires great care. It's often worth adding tests before making any change, to capture the previous behavior and see how it's affected.

This piece of code breaks with the suggested changes, the last line used to raise an error and it no longer does. It shows that the difficulty with resources like file/directory paths is that they can be removed without Param knowing about it, which is why I think it always resolves. Maybe the way forward is to overload the newly added check_exists attribute (#800) with a new 'once' mode (better name welcome) that only checks whether the resource exists when the Parameter is instantiated.

import os
import param
import pathlib

fp = pathlib.Path('exists.txt')
fp.write_text('')

class P(param.Parameterized):
    path = param.Path('exists.txt')

p = P()
print(p.path)

fp.unlink()

print(p.path)

@MarcSkovMadsen
Copy link
Collaborator Author

MarcSkovMadsen commented Nov 13, 2023

Thanks for the feedback. I plan to give this a second iteration when I have gotten panel-chemistry working for Panel 1.x. First thing would be adding the use case you describe to the tests.

@maximlt
Copy link
Member

maximlt commented Dec 22, 2023

I quickly played around with another implementation (compared to main, not to this branch), (ab)using check_exists.

diff --git a/param/parameters.py b/param/parameters.py
index ed11b53b..8852b79b 100644
--- a/param/parameters.py
+++ b/param/parameters.py
@@ -2706,9 +2706,11 @@ class Path(Parameter):
     ----------
     search_paths : list, default=[os.getcwd()]
         List of paths to search the path from
-    check_exists: boolean, default=True
-        If True (default) the path must exist on instantiation and set,
-        otherwise the path can optionally exist.
+    check_exists: boolean or 'once', default=True
+        If True (default) the path must exist on instantiation and whenever
+        get/set is called, if 'once' the path must only exist on instantiation
+        and on get the first resolved path is returned, if False the path can
+        optionally exist.
     """
 
     __slots__ = ['search_paths', 'check_exists']
@@ -2733,8 +2735,8 @@ class Path(Parameter):
             search_paths = []
 
         self.search_paths = search_paths
-        if check_exists is not Undefined and not isinstance(check_exists, bool):
-            raise ValueError("'check_exists' attribute value must be a boolean")
+        if check_exists is not Undefined and check_exists not in (True, False, 'once'):
+            raise ValueError("'check_exists' attribute value must be a boolean or 'once'")
         self.check_exists = check_exists
         super().__init__(default,**params)
         self._validate(self.default)
@@ -2749,16 +2751,23 @@ class Path(Parameter):
         else:
             if not isinstance(val, (str, pathlib.Path)):
                 raise ValueError(f'{_validate_error_prefix(self)} only take str or pathlib.Path types')
+            path = None
             try:
-                self._resolve(val)
:+                path = self._resolve(val)
             except OSError as e:
-                if self.check_exists:
+                if self.check_exists is True:
                     raise OSError(e.args[0]) from None
+            if path is not None and self.check_exists == 'once':
+                # Abusing check_exists to store the resolved path on parameter
+                # instantiation.
+                self.check_exists = ('once', path)
 
     def __get__(self, obj, objtype):
         """
         Return an absolute, normalized path (see resolve_path).
         """
+        if isinstance(self.check_exists, tuple):
+            return self.check_exists[1]
         raw_path = super().__get__(obj,objtype)
         if raw_path is None:
             path = None

With these benchmarks:

class ParameterPathResolveSuite:

    def setup(self):
        import pathlib
        p = pathlib.Path('dummy')
        p.write_text('dummy')

        class P(param.Parameterized):
            x0 = param.Path(str(p))

        self.P = P
        self.p = P()

    def time_resolve_new(self):
        p = self.P()
        p.x0

    def time_resolve_same(self):
        self.p.x0


class ParameterPathResolveOnceSuite:

    def setup(self):
        import pathlib
        p = pathlib.Path('dummy')
        p.write_text('dummy')

        class P(param.Parameterized):
            x0 = param.Path(str(p), check_exists='once')

        self.P = P
        self.p = P()


    def time_resolve_new(self):
        p = self.P()
        p.x0

    def time_resolve_same(self):
        self.p.x0

Then I started to think about how this might be related to readonly and constant, to Parameter attribute inheritance and to __set__. To finally decide this could wait for January 🙃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up derived applications like Panel by speeding up Path Parameter
2 participants