API kiara.list_all_values endpoint takes several minutes to process #73

Open
MariellaCC opened this issue Jun 4, 2024 · 8 comments

@MariellaCC

Describe the bug

I cleared the data store recently but ran a few operations since then without saving anything in the meantime (at least, not that I recall). I need to display the values present in the data store in a Jupyter notebook. When trying to run kiara.list_all_values() the cell takes more than 5 minutes to run, so I stopped the process. However, when I try to run kiara.list_all_value_ids() it works in a few seconds.

Additional info

partial output of kiara.list_all_value_ids():
(I just copied/pasted a few lines)

UUID('000537a8-4368-484c-ad6b-bd6c26bde2f6'),
 UUID('02484e0a-95cb-44bd-bcca-c0ae9701477a'),
 UUID('03065ed1-e927-48c4-9970-5f02fc71c74a'),
 UUID('04879195-1253-437a-ba19-ca16ffffa424'),
 UUID('04e2a9ac-9e13-4769-8fbb-f1793a52ceb1'),
 UUID('050dee08-ada7-455e-b15e-c72d745ac92d'),
 UUID('0555b482-f9f7-4c1f-bc1b-a7b8e5fbb83b'),
 UUID('062f7605-4f1d-4ba1-b999-fa202832207c'),
 UUID('063fad22-0fca-4485-ae2c-83612b95ca34'),
 UUID('06869012-ef7c-4c3d-9e66-905f48238ecc'),
 UUID('0706bb51-025a-400d-9ccc-8657325af591'),
 UUID('080788cd-3da4-43a0-977e-daf870faa56d'),

Output of kiara context explain:

╭─ Context 'default' ──────────────────────────╮
│                                              │
│   context name   default                     │
│   kiara_id       6978810c-3382-4953-a3cc-…   │
│   size on disk   0 bytes                     │
│   values                                     │
│                    no. values    356         │
│                    combined      5.82 GB     │
│                    size                      │
│                                              │
│   aliases                                    │
│                    no. aliases   0           │
│                    combined      0 bytes     │
│                    size                      │
│                    aliases                   │
│                                              │
│   archives                                   │
│                          it…   ar…           │
│                    al…   ty…   ty…   co…     │
│                   ───────────────────────    │
│                    de…   me…   sq…   {       │
│                                        …     │
│                                        …     │
│                                      }       │
│                    de…   da…   sq…   {       │
│                                        …     │
│                                        …     │
│                                        …     │
│                                      }       │
│                    de…   al…   sq…   {       │
│                                        …     │
│                                        …     │
│                                      }       │
│                    de…   jo…   sq…   {       │
│                                        …     │
│                                        …     │
│                                      }       │
│                    de…   wo…   fi…   {       │
│                                        …     │
│                                      }       │
│                                              │
│                                              │
╰──────────────────────────────────────────────╯

Expected behavior
List all values in the current context, incl. internal ones.

Environment, versions (please complete the following information):

  • OS: MacOS Sonoma 14.5
  • Python: 3.12
  • Versions of Kiara and Kiara plugins used in the related project
kiara                        0.5.10
kiara_plugin.core_types      0.5.1
kiara_plugin.develop         0.5.2
kiara_plugin.onboarding      0.5.1
kiara_plugin.tabular         0.5.4
kiara_plugin.topic_modelling 0.1.dev53+gff7ac02 /Users/mariella.decrouychan/Documents/GitHub/kiara_plugin.topic_modelling
  • Python environment used (e.g. system, conda, ...)
    conda
@makkus
Contributor

makkus commented Jun 4, 2024

Thanks for the report. I'll have to set up some test cases with a context of comparable size and number of values. I'd think that 356 values (which is what you seem to have) should not be prohibitive in terms of getting metadata for all of them, so there is a good chance this can be optimized away, but I'll have to spend some time trying to replicate the problem...

Anyway, please let me know if this happens again, queries like that should never take more than a few seconds, except if there is actually data loading (compared to metadata loading) involved....

@makkus
Contributor

makkus commented Jun 4, 2024

Ah, also, if that happens again, it would be interesting to see whether this only happens with Jupyter, or, using the same context, also via 'pure Python' and/or the command-line.... Jupyter does have some quirks that could have caused this...
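To compare the same calls in Jupyter, plain Python, and on the command line, a small timing helper like the one below could be pasted into each environment. This is only a sketch: `time_call` and `compare` are hypothetical helper names, and the two lambdas are stand-ins so the script runs without kiara installed. With a kiara API object at hand, one would pass `kiara.list_all_value_ids` and `kiara.list_all_values` (the two calls named in this issue) instead.

```python
import time
from typing import Any, Callable, Dict, Tuple


def time_call(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run a zero-argument callable once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare(calls: Dict[str, Callable[[], Any]]) -> Dict[str, float]:
    """Time each named callable and return a mapping of name -> seconds."""
    timings = {}
    for name, fn in calls.items():
        _, elapsed = time_call(fn)
        timings[name] = elapsed
        print(f"{name}: {elapsed:.3f}s")
    return timings


if __name__ == "__main__":
    # Stand-in workloads (assumption): replace with the kiara calls
    # from this issue when running against a real context.
    compare({
        "cheap": lambda: list(range(1000)),
        "expensive": lambda: [str(i) * 10 for i in range(100_000)],
    })
```

Running the same script in all three environments against the same context would show whether the slowdown is specific to Jupyter.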

@MariellaCC
Author

As I hadn't intentionally stored anything in the data store, I didn't know that there were as many values. I don't know if users will be aware of these values being there when they don't intentionally save elements in the data store.

@MariellaCC
Author

> Anyway, please let me know if this happens again, queries like that should never take more than a few seconds, except if there is actually data loading (compared to metadata loading) involved....

Something I don't understand though: why does the size of the data matter to list the values, why does it take more time to do kiara.list_all_values() than to do kiara.list_all_value_ids() in such a case?

@makkus
Contributor

makkus commented Jun 5, 2024

> As I hadn't intentionally stored anything in the data store, I didn't know that there were as many values. I don't know if users will be aware of these values being there when they don't intentionally save elements in the data store.

It was decided to store every value in every job run, which was mostly a consequence of the requirement of having a comment associated with every job run. Manually storing values is not necessary anymore, since everything gets stored anyway: #71 (comment) -- I did point out the potential issues that could arise in that meeting, esp. that I'm not sure about performance, since kiara wasn't designed with a pattern like that in mind. As I said, I think I can probably improve it in this instance, but I can't guarantee that we won't run into other similar issues.

> Something I don't understand though: why does the size of the data matter to list the values, why does it take more time to do kiara.list_all_values() than to do kiara.list_all_value_ids() in such a case?

In some cases kiara needs to read the data (or parts of the data). It shouldn't have to for list_all_values, so maybe there is a stray call somewhere that causes the issue. But list_all_values will always be much slower than list_all_value_ids: the latter only needs to query the database for all unique ids, whereas the former also needs to access all metadata for each id. There is a lot of metadata associated with our values, Python objects need to be created, and some de-serialization happens. This is definitely not a trivial operation.
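The difference between the two kinds of query can be illustrated with a toy store. This is a minimal sketch and not kiara's actual schema or code: the `vals` table, the JSON metadata layout, and the `ValueInfo` class are all made up for illustration. The id listing reads a single column, while the value listing has to fetch, parse, and wrap the metadata for every row.

```python
import json
import sqlite3
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ValueInfo:
    """Hypothetical stand-in for a value object built from stored metadata."""
    value_id: uuid.UUID
    data_type: str
    size: int


def build_store(n: int) -> sqlite3.Connection:
    """Create an in-memory store with n rows of (id, JSON metadata)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vals (value_id TEXT PRIMARY KEY, metadata TEXT)")
    rows = [
        (str(uuid.uuid4()), json.dumps({"data_type": "table", "size": i}))
        for i in range(n)
    ]
    conn.executemany("INSERT INTO vals VALUES (?, ?)", rows)
    return conn


def list_ids(conn: sqlite3.Connection) -> List[uuid.UUID]:
    """Cheap path: one column, no metadata parsing."""
    return [uuid.UUID(row[0]) for row in conn.execute("SELECT value_id FROM vals")]


def list_values(conn: sqlite3.Connection) -> Dict[uuid.UUID, ValueInfo]:
    """Expensive path: de-serialize metadata and build an object per row."""
    result = {}
    for raw_id, raw_meta in conn.execute("SELECT value_id, metadata FROM vals"):
        meta = json.loads(raw_meta)
        value_id = uuid.UUID(raw_id)
        result[value_id] = ValueInfo(value_id, meta["data_type"], meta["size"])
    return result
```

In this toy version the per-row work is small, so the gap would be modest. With richer metadata per value the per-row cost grows, which is the effect described above, though it would not by itself explain a runtime of several minutes.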

@MariellaCC
Author

I will then look into using another operation for my needs, but it was interesting to see the impact of metadata auto-storing. I'm not sure, but I think this also had an impact on the performance of my laptop (the fan was triggered a lot without me understanding why), and things are much better since I re-emptied the data store. Maybe it's just a coincidence, but in any case, when we get to front-end considerations, it may be important that users are aware of the amount of data in their data store (even if they didn't intentionally store anything).
Since the fact that this operation is time-consuming is normal/proportionate to the volume of data stored, I am closing this issue.

@makkus
Contributor

makkus commented Jun 5, 2024

No, this needs to stay open, I need to investigate, as I said, we can't have that operation taking minutes....

@makkus makkus reopened this Jun 5, 2024
@MariellaCC
Author

MariellaCC commented Jun 5, 2024

ah ok :-) my bad sorry, I thought that the time that it takes was because of too much data in my store, thanks for re-opening it

2 participants