Skip to content

Questions (re. interop with indexed.el) #29

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
meedstrom opened this issue Mar 23, 2025 · 3 comments
Open

Questions (re. interop with indexed.el) #29

meedstrom opened this issue Mar 23, 2025 · 3 comments

Comments

@meedstrom
Copy link

meedstrom commented Mar 23, 2025

Hi! I recently wrote the library https://github.com/meedstrom/indexed. Afterwards, I began considering how to design an all-new DB, and glad to see prior art here!

It seems great if I can make Indexed just copy the org-sql schemata, so we have interop.

Some questions if you don't mind:

  1. Do you scan a huge list of files at some point, or just incrementally for every file visited?
    1. How long does the scan take? I assume it takes a while if the buffer was not already open, because org-ml relies on org-element-parse-buffer and therefore org-mode itself, but I could be wrong and you managed to optimize it.
  2. What do you do with properties? How would a query look, to e.g. find an entry that has property :CREATED: greater than [2020-12-31 Wed 15:00]?
    1. I'm wondering if you create a new table automatically for every unique property found; that's the idea I was going to go with.
    2. Also I was thinking of handling multivalued properties by splitting them according to some customizable "property-splitters-alist", where the default splitter just does it the same way as org-entry-get-multivalued-property. There exist other properties that must be split in a different way. ROAM_ALIASES is split with split-string-and-unquote; and ROAM_REFS is even more complicated - I have a function for that called indexed-roam-split-refs-field.
  3. You do not (yet) record the "zeroth-level" entry, correct? (I.e. content before first heading)
    1. Would you store it in the exact same way as other entries, or do you think it makes more sense to have a separate table?
  4. Have you considered storing the entire text contents of each entry and file, instead of a MD5 hash of it?
@meedstrom meedstrom changed the title Questions Questions (re. interop with indexed.el) Mar 23, 2025
@ndwarshuis
Copy link
Owner

I'm not sure what you mean by "incrementally"; in order to init the database every file you want needs to be parsed, but this process also notes the hash so that subsequent updates only consider files that have changed (ie have different hashes).

The parser is basically just org-element-parse-buffer as you noted, but the other consideration for speed is that the parsed results need to be formatted as a SQL query and sent to the DB. This is obviously dependent on the amount of data in question and the DB (SQLite being really slow, pgSQL being relatively fast).

The short answer is they go in the properties table so the query would just be SELECT * FROM properties (see erd). If you only care about the headline properties you would need to inner join with the headline_properties table which only includes the stuff under a headline (vs #+PROPERTY at the top of the file).

What happens after that would be up to you, since the output will just be a string. In the case of :CREATED: this isn't an actual timestamp since timestamps are not (technically) supposed to be part of properties (despite this happening all the time). This also means there is no separate "table" for each property. If I understand what you are describing with splitting multivalued properties this is somewhat analogous to pgSQL's enums, assuming that the range of values is finite and rather small.

No, except for whatever I described here. There would probably just need to be a new table for keywords, since everything else is captured somewhere unless I missed something.

Use case being to store the files not as raw text on disk but in a db instead? I haven't considered this since that wasn't the intended use case. I suppose it wouldn't be that hard to add if you wanted, and for those who don't need it then there would just be a column filled with NULL in the db.

@meedstrom
Copy link
Author

meedstrom commented Mar 25, 2025

Re. multivalued properties. I'm not a huge SQL user, at all, so I may be sailing the wrong direction entirely. But to explain my idea.

Let's say we have an Org entry that looks like

* Node 1
:PROPERTIES:
:ID: v7o8aew0qik0
:ROAM_ALIASES: "An alias for Node 1" "Another alias for Node 1"
:ROAM_REFS: [cite:@citekey1 blah blah @citekey2]
:CREATED: [2025-03-25 Wed 10:00]
:END:

Now you see those properties have to be treated differently. So my idea of a splitters-alist:

(defvar splitters-alist
  '(("ROAM_ALIASES" . split-string-and-unquote)
    ("ROAM_REFS" . indexed-roam-split-refs-field)
    ("CREATED" . list))
  "Alist specifying how to split a property value.

Elements are (PROPERTY-NAME . SPLITTER), where:
- PROPERTY-NAME is a string in all-uppercase.
- SPLITTER is a function taking one string argument that is the whole
  value of a property, and should return a list of strings.

Properties not in this list use the default splitter:
`split-like-official-org'.")

(defun split-like-official-org (value)
  "Split VALUE the same way as `org-entry-get-multivalued-property'."
  (mapcar #'org-entry-restore-space (split-string value)))

(defvar all-properties-ever (make-hash-table :test 'equal))
(dolist (entry (SOME-QUERY-GETTING-ALL-ENTRIES))
  (cl-loop
   for (name . value) in (entry-properties entry)
   do (dolist (atom (funcall (or (cdr (assoc name splitters-alist))
                                 #'split-like-official-org)
                             value))
        (push (list id atom)
              (gethash (downcase name) all-properties-ever)))))

Now `all-properties-ever' is a table that looks like e.g.

"roam_aliases": (("v7o8aew0qik0" "An alias for Node 1")
                 ("v7o8aew0qik0" "Another alias for Node 1")
"roam_refs": (("v7o8aew0qik0" "@citekey1")
              ("v7o8aew0qik0" "@citekey2")
"created":   (("v7o8aew0qik0" "[2025-03-25 Wed 10:00]"))

which could then be used for CREATE TABLE queries. You'd perhaps have them auto-named as prop_roam_aliases, prop_roam_refs, prop_created etc.

I figured that this would make it easier to write pure SQL queries like

SELECT * FROM entries WHERE created > "[2024-01-01";

(never mind the ugly method of date comparison, one could conceive of a splitter that transforms the date into something more suitable, but that's for another time)

@ndwarshuis
Copy link
Owner

It sounds like what you really want is a way to store all possible values of a property ever seen in/across org files. The only difference with this and what I currently do with org-sql is that you have extra splitting logic to access a property's "atomic" values.

which could then be used for CREATE TABLE queries

If it were me I would make two tables for keys and values. Something like:

prop_keys:
| index | key          | id           | splitter                      |
|-------+--------------+--------------+-------------------------------|
|     1 | roam_aliases | v7o8aew0qik0 | split-string-and-unquote      |
|     2 | roam_refs    | v7o8aew0qik0 | indexed-roam-split-refs-field |
|     3 | created      | v7o8aew0qik0 | list                          |

prop_values:
| index | key_id | value                    |
|-------+--------+--------------------------|
|     1 |      1 | An alias for Node 1      |
|     2 |      1 | Another alias for Node 1 |
|     3 |      2 | @citekey1                |
|     4 |      2 | @citekey2                |
|     5 |      3 | [2025-03-25 Tue 10:00]   |

index is the primary key in each, arguably good practice to just make this an integer for efficiency. In theory the splitter column would let you reverse what you put in if you wanted; obviously this would require the application to know what these mean. key_id in the second field is a foreign key that references index in prop_keys. I'm also not sure what id means but this mirrors the data you had.

Then query like SELECT * FROM prop_keys INNER JOIN prop_values ON prop_keys.index = prop_values.key_id WHERE key = "created" ... (more conditions) ... (etc etc);

Making new tables for each property doesn't seem necessary or advantageous. It also has lots of problems (the schema is undefined, unclear how foreign keys would work, hard to predict which tables to query, performance, different permissions, etc). Its true you wouldn't necessarily know the property key/value pairs without querying the database initially, but this is easy: SELECT DISTINCT key, value FROM prop_keys INNER JOIN prop_values ON prop_keys.index = prop_values.key_id; (using my example above).

Does this all make sense?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants