Recommendations for including data in WASM notebook #3194

Open
1 of 4 tasks
gabrielgrant opened this issue Dec 17, 2024 · 9 comments
Labels
documentation Improvements or additions to documentation

Comments

@gabrielgrant
Contributor

Documentation is

  • Missing
  • Outdated
  • Confusing
  • Not sure?

Explain in Detail

The most common pattern in my notebooks is to read a file (CSV or JSON) into a pandas DataFrame and then do some manipulations. When exporting to WASM, this fails with a FileNotFoundError:

Traceback (most recent call last):
  File "/lib/python3.12/site-packages/marimo/_runtime/executor.py", line 157, in execute_cell
    exec(cell.body, glbls)
  Cell marimo:///home/gabriel/repos/rxfood/data-notebooks/gabriel/nutrient_estimation_evals/nutrient_generation_manual_eval.py#cell=cell-3, line 2, in <module>
    mi_df = pd.read_json(e2e_comparison_dir + 'mi.json', orient="records", lines=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 791, in read_json
    json_reader = JsonReader(
                  ^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 904, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 960, in _get_data_from_filepath
    raise FileNotFoundError(f"File {filepath_or_buffer} does not exist")
FileNotFoundError: File mi.json does not exist

Not so surprising that marimo has no concept of what external files I'm depending on. Is there a recommended path for how to include data along with the notebook when publishing? This seems like a pretty common need, but not seeing any mention of this in the export docs https://docs.marimo.io/guides/exporting.html#export-to-wasm-powered-html

Your Suggestion for Changes

It would be amazing if this just worked out of the box (by watching for opened files and auto-including them as deps, I guess?), but just having some recommended way to do this (even with extra work) would be nice. Maybe there's something I should be doing with marimo's built-in caching, for instance?

Did try just wrapping with a simple persistent cache:

with mo.persistent_cache(name="my_cache"):
    mi_df = pd.read_json('mi.json', orient="records", lines=True)

But seems that the cache doesn't get included in the WASM export assets, so it still fails trying to open the file (same error as with no cache):

marimo._save.cache.CacheException: Failure during save.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/lib/python3.12/site-packages/marimo/_runtime/executor.py", line 157, in execute_cell
    exec(cell.body, glbls)
  Cell marimo:///home/gabriel/repos/rxfood/data-notebooks/gabriel/nutrient_estimation_evals/nutrient_generation_manual_eval.py#cell=cell-3, line 1, in <module>
    with mo.persistent_cache(name="e2e_comparison_cache"):
  File "/lib/python3.12/site-packages/marimo/_save/save.py", line 500, in __exit__
    raise instance from CacheException("Failure during save.")
  Cell marimo:///home/gabriel/repos/rxfood/data-notebooks/gabriel/nutrient_estimation_evals/nutrient_generation_manual_eval.py#cell=cell-3, line 3, in <module>
    mi_df = pd.read_json(e2e_comparison_dir + 'mi.json', orient="records", lines=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 791, in read_json
    json_reader = JsonReader(
                  ^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 904, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 960, in _get_data_from_filepath
    raise FileNotFoundError(f"File {filepath_or_buffer} does not exist")
FileNotFoundError: File mi.json does not exist

gabrielgrant added the documentation label Dec 17, 2024

@dmadisetti
Collaborator

+1

Persistent cache loading over network would be the ideal way to do this.
Otherwise manually loading a file over the network is the current best option.


What's your current WASM export workflow? Are you using wasm export, islands, or mkdocs?
Would it be easy enough to statically serve the __marimo__ folder from a relative path?

@gabrielgrant
Contributor Author

gabrielgrant commented Dec 17, 2024

I was just trying to use direct command line wasm export. Have in the past used static html exports as a way to share notebook outputs/results with non-technical users as a report, without them having to install anything. Github's direct display of notebook outputs is one advantage of ipynb format (maybe the only one?) that is still keeping some team members from marimo (other than just the usual inertia), because there's no workflow at all needed. --watch output unfortunately isn't quite feasible, since it has to re-render entirely (and afaict matplotlib outputs can't be cached?)

But allowing those non-technical users to easily have interactive content via wasm export would be amazing and worth some amount of workflow change

My ideal would probably be having some way to package everything into a single file (embedding everything as base64 strings, i guess?). That seems likely to be infeasible, though, so an output dir that can be zipped and shared or otherwise easily deployed somewhere would be alright too.
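
For small data files, the base64 idea can be approximated inside the notebook itself. This is a minimal sketch, assuming the data is small enough to paste into a cell as a string; `encode_file` and `decode_to_text_buffer` are hypothetical helper names, not marimo APIs:

```python
import base64
import io
from pathlib import Path


def encode_file(path: str) -> str:
    """Run locally: return the file's bytes as a base64 string to paste into a cell."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def decode_to_text_buffer(encoded: str, encoding: str = "utf-8") -> io.StringIO:
    """Run anywhere, including WASM: rebuild an in-memory text file from the string."""
    return io.StringIO(base64.b64decode(encoded).decode(encoding))
```

The resulting buffer can be passed where a path would normally go, e.g. `pd.read_json(decode_to_text_buffer(EMBEDDED), orient="records", lines=True)`, where `EMBEDDED` is the pasted string. Base64 inflates the data by roughly a third, so this only seems reasonable for small files.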

Generally I'm doing dev on a local laptop, so wouldn't want to serve __marimo__ directly from this machine, but if the relevant bits could be included in the output dir, seems perhaps pushing that to github and having it served from GH pages might work. We're dealing with health data, though, so some reports need to be kept more private. right now they just get snapshotted as HTML and stored somewhere in our MS Sharepoint instance/Loop (MS' crappy Notion clone), then people just open them as files on their local machine. IIUC, tho, that exact workflow can't work with WASM outputs, since they need to be served from a real HTTP server, right?

@gabrielgrant
Contributor Author

gabrielgrant commented Dec 18, 2024

@dmadisetti Can you share a bit more about what you have in mind regarding manually loading a file over the network?

What's the best way to detect whether we're running in WASM if I want to load data differently depending on environment (or skip some WASM-incompatible steps entirely)?

The overarching need behind this issue is a pattern I've found very common: I do some complex/heavy analysis in a notebook and then also have code to visualize or interact with the results. I want to share the latter portion, but have it run off a snapshot of the earlier analysis results.

Currently my options seem to be:

  1. One analysis notebook and one presentation notebook, but this requires manually rerunning the presentation notebook when results change, partially losing the benefits of reactivity
  2. Duplicate the latter part of the notebook
  3. Export the whole notebook to HTML (results in a bunch of weird blank spaces for outputs that require a kernel)
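
Option 1 could be sketched roughly as follows: the analysis notebook writes a snapshot file, and the presentation notebook only reads it. This assumes JSON Lines as the snapshot format (matching the `lines=True` reads earlier in the thread); `write_snapshot` and `read_snapshot` are hypothetical helpers, not part of marimo:

```python
import json
from pathlib import Path


def write_snapshot(records: list[dict], path: str) -> None:
    """Analysis notebook: write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_snapshot(path: str) -> list[dict]:
    """Presentation notebook: load the snapshot without redoing the analysis."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

pandas should be able to read the same file with `pd.read_json(path, orient="records", lines=True)`. The manual rerun problem remains, since nothing here tells the presentation notebook that the snapshot changed.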

Not sure this is the place to discuss it, but would be great to have more explicit support (or documentation, if there's already a path I'm not using?) for this workflow

@dmadisetti
Collaborator

So here are the options as I see them with cache:

  • Embedding content in the webpage / export
  • Expectation of a static remote source
  • A 3rd party remote cache (hosting done for you, no additional concern for where your cache data goes)

Outside of cache, wasm could leverage the virtual filesystem

  • Embedding data in the export
  • Tweak the virtual filesystem to make static file requests
  • As is, the filesystem can be configured to integrate with something like S3 or cloud storage. I don't think this is a public option, but it is available on marimo.io (@mscolnick is that right?)

As is, I don't think this is a documentation issue so much as a missing feature

@gabrielgrant
Contributor Author

gabrielgrant commented Dec 18, 2024

Seems there are kinda two parts to this:

  1. what the feature should look like long-term (the options in your list directly above all seem viable)
  2. what can be done today as a workaround (potentially to go in the docs until a longer-term solution is available)

Re: 2. -- you'd mentioned further up in the thread that "manually loading a file over the network is the current best option" - that sounded to me like you had a specific (albeit more manual) workaround in mind that would work today?

@gabrielgrant
Contributor Author

gabrielgrant commented Dec 19, 2024

@dmadisetti I do think overall you're right that it sounds like this is certainly at least as much a feature request as a docs issue (probably more so). Unfortunately I don't seem to be able to change the tags (i think the issue template auto-adds them when creating)?

@mscolnick
Contributor

I have some thoughts, but not fully fleshed out on which approach is the best, nor which we want to recommend at the moment.

  1. Hosting the assets is the easiest, but I know it's not always possible. For example, put your assets on S3 or in a public directory in your GitHub Pages site (e.g. public/data); then you can fetch from https://my-org.github.io/my-repo/public/data/my.csv. This should work because it is same-origin (when the notebook is served from the same site), but if you load from another origin, make sure that server allows CORS.

  2. We could inline all the files in the HTML <marimo-file data-mime-type="text/csv">some_data</marimo-file> and then when the page loads, we shove all the files in the emscripten/wasm filesystem. This will likely fail horribly later so probably not this.

  3. We have some logic in the Community Cloud where we build our own FUSE implementation (looks a lot like https://filesystem-spec.readthedocs.io/en/latest/features.html), such that we can query endpoints like https://<domain>/data/cars.csv when running Python's open("data/cars.csv"). This logic is quite hairy and hard to pull out, but it's something we can do down the road with more time. (I also haven't tried fsspec, which may work.)
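
To make the same-origin point in option 1 concrete: two URLs share an origin when scheme, host, and port all match. The following is an illustrative sketch using only the standard library; `origin` and `is_same_origin` are hypothetical helpers, and the browser performs the real check itself:

```python
from urllib.parse import urlsplit

# Default ports implied when a URL does not spell one out.
_DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> tuple:
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or _DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def is_same_origin(page_url: str, data_url: str) -> bool:
    """True when the data URL needs no CORS headers to be fetched from the page."""
    return origin(page_url) == origin(data_url)
```

Under this rule, a notebook served from `https://my-org.github.io/my-repo/` is same-origin with `https://my-org.github.io/my-repo/public/data/my.csv`, while data on a different host is cross-origin and depends on that host sending CORS headers.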

@gabrielgrant
Contributor Author

gabrielgrant commented Dec 20, 2024

Ah, didn't register that i could drop df = pd.read_csv('https://raw.githubusercontent.com/marimo-team/marimo/refs/heads/main/tests/_plugins/ui/_impl/tables/snapshots/pandas.csv') in a WASM notebook and have it just work.

That's awesome as a workaround! If I'm using cookie-based or HTTP basic auth for the notebook, I imagine putting the data behind the same auth scheme should work, right? Haven't actually tested this yet, but can't see why not (famous last words...)

One remaining question is: what's the best way to detect whether we're running in WASM if I want to load data differently depending on environment (or skip some WASM-incompatible steps entirely)? I guess just try loading the local file, catch the FileNotFoundError, and then try the HTTP URL?
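
That try-local-then-remote idea could look roughly like this. It is a sketch only: `load_with_fallback` is a hypothetical helper, and the reader is passed in so the same function works with `pd.read_csv`, `pd.read_json`, or anything else that raises `FileNotFoundError` for a missing local file:

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_fallback(local_path: str, remote_url: str, reader: Callable[[str], T]) -> T:
    """Try the local file first; if it is missing, read from the URL instead."""
    try:
        return reader(local_path)
    except FileNotFoundError:
        # Expected in a WASM export, where the local file was never shipped.
        return reader(remote_url)
```

Usage might be `mi_df = load_with_fallback("mi.json", "https://example.com/mi.json", lambda src: pd.read_json(src, orient="records", lines=True))`, where the URL is a placeholder for wherever the data is actually hosted.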

This is def workable as a step in the right direction. But still not ideal that the whole computation needs to be redone -- for heavy computations that could take quite a while in WASM.

If I also deploy/serve the __marimo__/cache dir will that also work over HTTP today? I guess if so it would be by setting save_path to a URL? Would be awesome if this worked by default for read using the dir deployed at ./__marimo__/cache relative to the notebook. Only working for read and not write might be confusing...but could maybe just write to the in-browser filesystem? (haven't looked at the implementation -- is this OPFS?). @dmadisetti is this what you had in mind when you asked above about serving the cache dir from a relative path?

@mscolnick
Contributor

You can create a util in your notebooks:

import sys

def is_wasm() -> bool:
    return "pyodide" in sys.modules
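
One way such a check might be combined with the hosted-data approach from earlier in the thread is sketched below. `data_source`, the `data` directory, and the base URL are placeholders for illustration, not marimo APIs:

```python
import sys
from pathlib import Path


def is_wasm() -> bool:
    # Pyodide is the in-browser Python runtime; its module is normally only imported there.
    return "pyodide" in sys.modules


def data_source(
    filename: str,
    local_dir: str = "data",  # placeholder local folder
    base_url: str = "https://my-org.github.io/my-repo/public/data",  # placeholder URL
) -> str:
    """Return a URL when running in WASM, otherwise a local path."""
    if is_wasm():
        return f"{base_url.rstrip('/')}/{filename}"
    return str(Path(local_dir) / filename)
```

A cell could then call something like `pd.read_csv(data_source("my.csv"))` and read the local file during development and the hosted copy in the exported notebook.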

I don't expect __marimo__/cache to work over HTTP at the moment. Remote caching is planned, but not yet supported.

3 participants