Recommendations for including data in WASM notebook #3194

gabrielgrant · 2024-12-17T00:09:25Z

Documentation is

Missing
Outdated
Confusing
Not sure?

Explain in Detail

The most common pattern of my notebooks is to read a file (CSV or JSON) into a pandas DF and then do some manipulations. When exporting to WASM, this fails with a FileNotFoundError:

Traceback (most recent call last):
  File "/lib/python3.12/site-packages/marimo/_runtime/executor.py", line 157, in execute_cell
    exec(cell.body, glbls)
  Cell marimo:///home/gabriel/repos/rxfood/data-notebooks/gabriel/nutrient_estimation_evals/nutrient_generation_manual_eval.py#cell=cell-3, line 2, in <module>
    mi_df = pd.read_json(e2e_comparison_dir + 'mi.json', orient="records", lines=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 791, in read_json
    json_reader = JsonReader(
                  ^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 904, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 960, in _get_data_from_filepath
    raise FileNotFoundError(f"File {filepath_or_buffer} does not exist")
FileNotFoundError: File mi.json does not exist

Not so surprising that marimo has no concept of what external files I'm depending on. Is there a recommended path for how to include data along with the notebook when publishing? This seems like a pretty common need, but not seeing any mention of this in the export docs https://docs.marimo.io/guides/exporting.html#export-to-wasm-powered-html

Your Suggestion for Changes

It would be amazing if this just worked out of the box (by watching for open files and auto-including them as deps, i guess?), but just having some recommended way to do this (even with extra work) would be nice. Maybe there's something I should be doing with Marimo's built-in caching, for instance?

Did try just wrapping with a simple persistent cache:

with mo.persistent_cache(name="my_cache"):
    mi_df = pd.read_json('mi.json', orient="records", lines=True)

But seems that the cache doesn't get included in the WASM export assets, so it still fails trying to open the file (same error as with no cache):

marimo._save.cache.CacheException: Failure during save.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/lib/python3.12/site-packages/marimo/_runtime/executor.py", line 157, in execute_cell
    exec(cell.body, glbls)
  Cell marimo:///home/gabriel/repos/rxfood/data-notebooks/gabriel/nutrient_estimation_evals/nutrient_generation_manual_eval.py#cell=cell-3, line 1, in <module>
    with mo.persistent_cache(name="e2e_comparison_cache"):
  File "/lib/python3.12/site-packages/marimo/_save/save.py", line 500, in __exit__
    raise instance from CacheException("Failure during save.")
  Cell marimo:///home/gabriel/repos/rxfood/data-notebooks/gabriel/nutrient_estimation_evals/nutrient_generation_manual_eval.py#cell=cell-3, line 3, in <module>
    mi_df = pd.read_json(e2e_comparison_dir + 'mi.json', orient="records", lines=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 791, in read_json
    json_reader = JsonReader(
                  ^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 904, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.12/site-packages/pandas/io/json/_json.py", line 960, in _get_data_from_filepath
    raise FileNotFoundError(f"File {filepath_or_buffer} does not exist")
FileNotFoundError: File mi.json does not exist

dmadisetti · 2024-12-17T10:17:31Z

+1

Persistent cache loading over network would be the ideal way to do this.
Otherwise manually loading a file over the network is the current best option.

What's your current WASM export workflow? Are you using wasm export, islands, or mkdocs?
Would it be easy enough to statically serve the __marimo__ folder from a relative path?

gabrielgrant · 2024-12-17T16:17:04Z

I was just trying to use direct command line wasm export. Have in the past used static html exports as a way to share notebook outputs/results with non-technical users as a report, without them having to install anything. Github's direct display of notebook outputs is one advantage of ipynb format (maybe the only one?) that is still keeping some team members from marimo (other than just the usual inertia), because there's no workflow at all needed. --watch output unfortunately isn't quite feasible, since it has to re-render entirely (and afaict matplotlib outputs can't be cached?)

But allowing those non-technical users to easily have interactive content via wasm export would be amazing and worth some amount of workflow change

My ideal would probably be having some way to package everything into a single file (embedding everything as base64 strings, i guess?). That seems likely to be infeasible, though, so an output dir that can be zipped and shared or otherwise easily deployed somewhere would be alright too.
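As a rough illustration of the base64 idea above, here is a minimal sketch that embeds a small file in the notebook source and reads it back from memory. It is not something marimo's export does; the helper names (`encode_file_bytes`, `open_embedded`) and the toy CSV are made up for the example, and base64 grows the payload by roughly a third, so this only seems practical for small files.

```python
import base64
import csv
import io


def encode_file_bytes(raw: bytes) -> str:
    """Turn raw file bytes into an ASCII string that can be pasted into a cell."""
    return base64.b64encode(raw).decode("ascii")


def open_embedded(encoded: str) -> io.BytesIO:
    """Decode the embedded string back into an in-memory, file-like object."""
    return io.BytesIO(base64.b64decode(encoded))


# Hypothetical toy data standing in for a real CSV file's contents.
EMBEDDED_CSV = encode_file_bytes(b"food,kcal\napple,52\nbanana,89\n")

# Any reader that accepts a file-like object can consume the decoded buffer.
text_stream = io.TextIOWrapper(open_embedded(EMBEDDED_CSV), encoding="utf-8")
rows = list(csv.DictReader(text_stream))
```

The same in-memory buffer could be passed to `pd.read_csv`, which accepts file-like objects.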

Generally I'm doing dev on a local laptop, so wouldn't want to serve __marimo__ directly from this machine, but if the relevant bits could be included in the output dir, seems perhaps pushing that to github and having it served from GH pages might work. We're dealing with health data, though, so some reports need to be kept more private. right now they just get snapshotted as HTML and stored somewhere in our MS Sharepoint instance/Loop (MS' crappy Notion clone), then people just open them as files on their local machine. IIUC, tho, that exact workflow can't work with WASM outputs, since they need to be served from a real HTTP server, right?

gabrielgrant · 2024-12-18T09:15:07Z

@dmadisetti Can you share a bit more about what you have in mind regarding manually loading a file over the network?

What's the best way to detect whether we're running in WASM if I want to load data differently depending on environment (or skip some WASM-incompatible steps entirely)?

The overarching need behind this issue is something I've found is a very common pattern: i do some complex/heavy analysis in a notebook and then also have code to visualize or interact with the results. The latter portion i want to share, but want it running off a snapshot of the earlier analysis results

Currently my options seem to be:

One analysis notebook, one presentation notebook, but this requires manually running the presentation notebook when results change, partially losing the benefits of reactivity
to duplicate the latter part of the notebook
Export the whole notebook to HTML (results in a bunch of weird blank spaces for outputs that require a kernel)

Not sure this is the place to discuss it, but would be great to have more explicit support (or documentation, if there's already a path I'm not using?) for this workflow
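For reference, the snapshot pattern described above (heavy analysis in one place, presentation running off saved results) could be sketched roughly as follows. This is only an assumption about how one might wire it up by hand, not a marimo feature; `save_snapshot`, `load_snapshot`, the file name, and the result values are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(results: dict, path: Path) -> None:
    """Analysis side: write JSON-serializable results to a snapshot file."""
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")


def load_snapshot(path: Path) -> dict:
    """Presentation side: read the snapshot back without redoing the analysis."""
    return json.loads(path.read_text(encoding="utf-8"))


# A temporary directory keeps the example self-contained; in practice the
# snapshot would live next to the notebook or on a static host.
with tempfile.TemporaryDirectory() as tmp:
    snapshot_path = Path(tmp) / "results_snapshot.json"
    save_snapshot({"mean_error": 0.12, "n_items": 40}, snapshot_path)
    restored = load_snapshot(snapshot_path)
```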

dmadisetti · 2024-12-18T16:29:30Z

So here are the options as I see them with cache:

Embedding content in the webpage / export
Expectation of a static remote source
A 3rd party remote cache (hosting done for you, no additional concern for where your cache data goes)

Outside of cache, wasm could leverage the virtual filesystem

Embedding data in the export
Tweak the virtual filesystem to make static file requests
As is, the filesystem can be configured to be integrated with something like S3 or cloud storage. I don't think this is a public option, but it is something available on marimo.io (@mscolnick is that right?)

As is, I don't think this is a documentation issue so much as a missing feature

gabrielgrant · 2024-12-18T19:21:41Z

Seems there are kinda two parts to this:

1. what the feature should look like long-term (the items in your list directly above all seem like viable options)
2. what can be done today as a workaround (potentially to go in the docs until a longer-term solution is available)

Re: 2. -- you'd mentioned further up in the thread that "manually loading a file over the network is the current best option" - that sounded to me like you had a specific (albeit more manual) workaround in mind that would work today?

gabrielgrant · 2024-12-19T18:50:43Z

@dmadisetti I do think overall you're right that it sounds like this is certainly at least as much a feature request as a docs issue (probably more so). Unfortunately I don't seem to be able to change the tags (i think the issue template auto-adds them when creating)?

mscolnick · 2024-12-20T00:06:06Z

I have some thoughts, but not fully fleshed out on which approach is the best, nor which we want to recommend at the moment.

Hosting the assets is the easiest, but I know it's not always possible. For example, you can put your assets on S3 or in a public directory in your github pages (e.g. public/data), and then fetch from https://my-org.github.io/my-repo/public/data/my.csv. This should work because it is same-origin, but if you load from another URL, you should make sure to allow CORS.
We could inline all the files in the HTML <marimo-file data-mime-type="text/csv">some_data</marimo-file> and then when the page loads, we shove all the files in the emscripten/wasm filesystem. This will likely fail horribly later so probably not this.
We have some logic in the Community Cloud where we build our own FUSE implementation (looks a lot like https://filesystem-spec.readthedocs.io/en/latest/features.html), such that we can query endpoints like https://<domain>/data/cars.csv when running python open("data/cars.csv"). This logic is quite hairy and hard to bring out, but something we can do down the road with more time. (I also haven't tried fsspec, which may work)

gabrielgrant · 2024-12-20T06:10:08Z

Ah, didn't register that i could drop df = pd.read_csv('https://raw.githubusercontent.com/marimo-team/marimo/refs/heads/main/tests/_plugins/ui/_impl/tables/snapshots/pandas.csv') in a WASM notebook and have it just work.

That's awesome as a workaround! If I'm using cookie-based or HTTP basic auth for the notebook, then I imagine putting the data behind the same auth scheme should work, right? Haven't actually tested this yet, but can't see why not (famous last words...)

One remaining question is: what's the best way to detect whether we're running in WASM if I want to load data differently depending on environment (or skip some WASM-incompatible steps entirely)? I guess just try loading the local file, catch the FileNotFoundError, and then try the HTTP URL?
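The try-local-then-HTTP idea in the question above might look something like this minimal sketch. The helper name `read_local_or_remote` is made up for illustration, and the reader is passed in as a callable so the same fallback works with `pd.read_csv`, `pd.read_json`, or anything else that raises FileNotFoundError for a missing path.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def read_local_or_remote(
    local_path: str, remote_url: str, reader: Callable[[str], T]
) -> T:
    """Try the local file first; fall back to the URL if the file is missing."""
    try:
        return reader(local_path)
    except FileNotFoundError:
        # Expected in the WASM export, where the local file was never bundled.
        return reader(remote_url)
```

In a notebook this could be called as `read_local_or_remote("mi.json", "https://example.com/data/mi.json", lambda src: pd.read_json(src, orient="records", lines=True))`, where the URL is a placeholder. Only FileNotFoundError triggers the fallback; any other error from the reader propagates unchanged.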

This is def workable as a step in the right direction. But still not ideal that the whole computation needs to be redone -- for heavy computations that could take quite a while in WASM.

If I also deploy/serve the __marimo__/cache dir will that also work over HTTP today? I guess if so it would be by setting save_path to a URL? Would be awesome if this worked by default for read using the dir deployed at ./__marimo__/cache relative to the notebook. Only working for read and not write might be confusing...but could maybe just write to the in-browser filesystem? (haven't looked at the implementation -- is this OPFS?). @dmadisetti is this what you had in mind when you asked above about serving the cache dir from a relative path?

mscolnick · 2024-12-20T17:05:21Z

You can create a util in your notebooks:

import sys

def is_wasm() -> bool:
    return "pyodide" in sys.modules

I don't expect __marimo__/cache to work over http at the moment. Something that is planned (remote caching), but not yet supported.
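Building on the `is_wasm()` util above, one possible way to switch data sources by environment is sketched below. The `data_source` helper and the URL in the usage comment are hypothetical, not part of marimo.

```python
import sys


def is_wasm() -> bool:
    """True when running under Pyodide, i.e. in a WASM export."""
    return "pyodide" in sys.modules


def data_source(local_path: str, remote_url: str) -> str:
    """Pick the hosted copy in WASM and the local file everywhere else."""
    return remote_url if is_wasm() else local_path


# Hypothetical usage in a notebook cell:
# df = pd.read_csv(data_source("data/my.csv", "https://example.com/data/my.csv"))
```

The same check can guard WASM-incompatible steps, for example wrapping them in `if not is_wasm():`.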

gabrielgrant added the documentation Improvements or additions to documentation label Dec 17, 2024
