What is the kiara store? #12
Ok, one thing in advance: not all of this is implemented (or even thought through) yet, so I'll only speak to the things I have already implemented and am sure will mostly stay that way. I might add some comments in other areas (and mark them clearly), but please don't mistake that for a plan or strategy; those are things that need to be discussed/finalized. They can still be included in docs, but we need to make sure they don't become stale and outdated. Also, I'll be answering in fairly technical terms, so ideally someone would filter and translate the relevant parts into non-dev-speak.
The main reason is really that kiara needs to have full control over the byte blobs it deals with, and needs to make sure not a single byte has changed. If a byte did change because some external entity did something to the file, all the metadata kiara has about it would be invalid, but kiara would have no way of knowing that. In theory, kiara could hash files every time it touches them to make sure that didn't happen, but that would be very inefficient, and it wouldn't help with the fact that all downstream operations that have been done with a particular input file would be invalid or out of date, and we would have no way of 'proving' that with that input we get that output (because we lost access to the original data). One thing that is important to note is that if you store a value, kiara will also store every input and intermediate value that was used to create that value. This is important because otherwise we'd lose the integrity of the value's lineage (basically a broken cold-chain for data). In addition, kiara also records metadata about the operations that were used, and the environment the operations ran in (but that is less important for this question I guess).
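To make the "hash it every time" alternative concrete, here is a minimal sketch using only the Python standard library. It is not kiara code and does not reflect how kiara works internally; `file_digest`, `is_unchanged` and `demo` are hypothetical names made up for this illustration. It shows that a recorded digest can detect an external edit, but also that every check has to re-read the whole file, and that once a change is detected the original bytes are already gone.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_unchanged(path, recorded_digest):
    """True if the file's current bytes still match the recorded digest."""
    return file_digest(path) == recorded_digest


def demo():
    """Record a digest, simulate an external edit, and check before and after."""
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "data.csv"
        data_file.write_bytes(b"a,b\n1,2\n")
        recorded = file_digest(data_file)  # what would be kept as metadata

        before = is_unchanged(data_file, recorded)
        data_file.write_bytes(b"a,b\n1,3\n")  # some external entity edits the file
        after = is_unchanged(data_file, recorded)
    return before, after
```

Copying the bytes into a store that nothing else writes to avoids both problems: no repeated re-hashing is needed, and the original input stays available.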
All the operations with 'import' in their names basically copy data that is external to kiara into a temporary location (often memory) and subsequently into the internal kiara data store (if the API user chooses to 'store' a value), so what I wrote above doesn't happen.
Every import incurs some storage cost (duplication of the bytes that get imported). In some cases that is not relevant because we download a file from a remote location so we would have to pay that cost anyway. But for local files, it means doubling the amount of storage a dataset takes up (unless the user manually deletes the original file on their filesystem or doesn't use 'store').
This answer very much depends on what you try to do, and whether you use the Python API as an end user or as a client developer. I'd imagine values would only be stored if a user wants to keep a specific result they are happy with, or that is relevant in some other way. The same goes for imports of external files, though that would depend a bit on how that part of the app is designed. kiara stores any parent values of a value that is requested to be stored via the API, which would also include the value at 'import' time, so we might not need/require an explicit store command for that.
At the moment, only aliases are supported as non-automatically collected (aka user-specified) metadata when storing a value. We probably want to have more options here in the future (comments, notes, authors, ...). Even aliases are not really 'fixed' yet, since it's an area where I was waiting for frontend developers to share their opinions/ideas. Currently, an alias is a string value (no special chars except '.', '_', '-') that makes sense to a user so they can find the value they aliased later on. Otherwise, users would have to deal with uuids, which are impossible for humans to keep track of. Aliases can be overridden by the user, to point to a new/updated value. Currently, aliases are not versioned, but there is some placeholder code to make that possible in the future if there is a requirement. Also, multiple aliases can point to the same value. The reason aliases are not finalized yet is that I think this is one of the central UX 'themes' in kiara (how to pick/reference/manage datasets), and I can see several different options that would have to be implemented in non-compatible ways (flat, hierarchical, namespaced, ...). I'm hoping that having to implement a real-world GUI will spark some ideas or point to the best way of doing references to datasets. Via Python, the easiest way to store a value with one or multiple aliases is via the
Currently, each context has its own data store. There is some code to prepare for multiple data stores per context, but that's not implemented in any usable way at the moment. In the future, it might be important to be able to access multiple data stores (either within a single context or across multiple contexts; in the latter case those would probably be read-only), but that would have to be designed and implemented once we come across a use-case.
This is sub-API level, so anything here should go into a different section of the docs. At the moment a kiara data store is a Python interface/base class that can be extended to store data in specific ways. Technically, we have archives (read-only) and stores (read-write). The only implementation that is used at the moment is one that uses a folder in the user's home directory to store the actual serialized bytes of each value. Location is OS-dependent (use

Currently, there is no way of deleting single values/aliases from a kiara data store. That is on my TODO list, but it is non-trivial and I'd prefer to figure out data export first before I tackle this.

One important technical detail is serialization: how data is turned into bytes before it is stored. kiara implements this in a type-dependent way, which means that every data type has to implement its own serialization (or inherit it from a parent). It's fairly important to implement that in an efficient way, so the kiara store can de-duplicate data that is the same. This is a much larger topic to talk about, and probably needs its own section in our docs (how to create your own data type). That's something I need to write myself, but I'd prefer to wait until we have a basic structure of docs because I expect I'll need to link to a lot of other stuff, and also it's not something I expect anyone but myself to be doing in the near future.
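As a rough illustration of why serialization matters for de-duplication, here is a toy content-addressed store in plain Python. This is a sketch under assumptions, not kiara's actual store or serialization interface: `BlobStore` and `serialize_rows` are hypothetical names, and the on-disk layout (one file per SHA-256 digest) is made up for the example. The point is that identical serialized bytes map to the same key and are written only once, which only works if a data type's serializer produces the same bytes for the same data.

```python
import hashlib
import tempfile
from pathlib import Path


class BlobStore:
    """Toy content-addressed store: one file per unique byte sequence."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data):
        """Store bytes under their SHA-256 digest; skip the write if already present."""
        key = hashlib.sha256(data).hexdigest()
        target = self.root / key
        if not target.exists():
            target.write_bytes(data)
        return key

    def get(self, key):
        """Return the bytes stored under the given key."""
        return (self.root / key).read_bytes()

    def blob_count(self):
        """Number of distinct blobs currently on disk."""
        return sum(1 for _ in self.root.iterdir())


def serialize_rows(rows):
    """Hypothetical type-specific serializer: deterministic bytes for rows of strings."""
    return "\n".join(",".join(row) for row in rows).encode("utf-8")


def demo():
    """Store the same table twice and a different table once."""
    with tempfile.TemporaryDirectory() as tmp:
        store = BlobStore(Path(tmp) / "blobs")
        key_a = store.put(serialize_rows([["a", "b"], ["1", "2"]]))
        key_b = store.put(serialize_rows([["a", "b"], ["1", "2"]]))  # same data again
        key_c = store.put(serialize_rows([["a", "b"], ["1", "3"]]))  # different data
        return key_a == key_b, key_a == key_c, store.blob_count()
```

If a serializer embedded something like a timestamp or an unordered dict dump in its output, equal values would produce different bytes and the store could no longer recognize them as duplicates.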
Just a lingo clarification on my front: are aliases the equivalent of variables in Python? Or something different? (Trying to translate into my 'knowns' to also try and explain beyond this)
Yeah, roughly equivalent I'd say.
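To spell out the variable analogy, here is a small sketch of aliases as a flat name-to-id mapping. It is not kiara's API; `AliasRegistry`, `set_alias` and `resolve` are hypothetical names, and the character rule is one reading of "no special chars except '.', '_', '-'" (assumed here to mean ASCII letters and digits plus those three characters). It mirrors the behaviour described earlier in this thread: an alias can be re-pointed to a new value, there is no version history, and several aliases can point to the same value.

```python
import re
import uuid

# Assumption: ASCII letters and digits plus '.', '_' and '-' are allowed.
ALIAS_PATTERN = re.compile(r"[A-Za-z0-9._-]+")


class AliasRegistry:
    """Toy flat alias -> value-id mapping; not kiara's implementation."""

    def __init__(self):
        self._aliases = {}

    def set_alias(self, alias, value_id):
        """Bind an alias to a value id, replacing any previous binding."""
        if not ALIAS_PATTERN.fullmatch(alias):
            raise ValueError(f"invalid alias: {alias!r}")
        self._aliases[alias] = value_id  # overwrites; no version history is kept

    def resolve(self, alias):
        """Return the value id an alias currently points to."""
        return self._aliases[alias]


registry = AliasRegistry()
first_value = uuid.uuid4()
second_value = uuid.uuid4()

# Two aliases can point to the same value, like two variable names for one object.
registry.set_alias("journals.raw", first_value)
registry.set_alias("my_import", first_value)

# Re-binding an alias works like re-assigning a variable.
registry.set_alias("my_import", second_value)
```

As with Python variables, re-binding `my_import` does not touch `journals.raw`, and the value it previously pointed to still exists under its id.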
Ultimately, I think this discussion needs to end up in a big conceptual docs page which addresses things including but not limited to
There also needs to be a small summary and a link to this on the glossary page.
see also #11 (review)
I don't know the answer to any of these questions. Are there any other things I didn't know to ask about? Are there any existing docs about this that I didn't find?