Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP/experimental: add job-sql module #5847

Closed
wants to merge 8 commits into from

Conversation

garlick
Copy link
Member

@garlick garlick commented Apr 1, 2024

This is a quick prototype of an in-memory sqlite database of jobs that can be queried directly with SQL.

As noted in #4336 the sqlite JSON1 extensions allow queries based on fields inside a JSON object stored as a regular TEXT column. I thought we might want to play with that in a standalone context, so this adds a job manager journal consumer and a front end command flux sql QUERY .... A textual SQL query is sent in a request, and streaming responses are returned, one per row in the result.

The schema is just

CREATE TABLE jobs(
    id INT PRIMARY KEY,
    eventlog JSON,
    jobspec JSON,
    R JSON
)

Example query: what nodes did job 57562628096 run on?

$ flux sql select "json_extract(R,'$.execution.nodelist')" as nodelist from jobs where id = 57562628096
{"nodelist":["system76-pc"]}

Which jobs ran the hostname command:

$ flux sql select id from jobs where "json_extract(jobspec, '$.tasks[0].command[0]') = 'hostname'" 
{"id":63954747392}
{"id":33151778816}
{"id":50667192320}
{"id":57562628096}

I'm not a SQL aficionado but they exist :-) Just wondering if this might be useful in some way. With the recent job manager journal changes, it was super easy to prototype.

src/modules/job-sql/job-sql.c Fixed Show fixed Hide fixed
src/modules/job-sql/job-sql.c Fixed Show fixed Hide fixed
@chu11
Copy link
Member

chu11 commented Apr 1, 2024

ya know, when I first worked on that PR (almost 2 years ago) I didn't utilize the JSON features b/c it wasn't supported in many sqlite versions. But I guess it's in enough versions today that it can be used!

@garlick
Copy link
Member Author

garlick commented Apr 1, 2024

Sounds like 3.38 has it enabled by default, and earlier versions would only have JSON enabled if the packager opted in.

I'm testing on Ubuntu 22.04 which has sqlite3-3.37.2-2ubuntu0.3, and it seems to work.
Just tested on TOSS4 which has sqlite-3.26.0-19.el8_9.x86_64, and it does not 😢

@chu11
Copy link
Member

chu11 commented Apr 1, 2024

Just tested on TOSS4 which has sqlite-3.26.0-19.el8_9.x86_64, and it does not 😢

This sounds familiar. It may have been why I didn't even try :-(

@garlick
Copy link
Member Author

garlick commented Apr 2, 2024

Heh, there is this if we were sufficiently motivated.

https://sqlite.org/amalgamation.html

garlick added 4 commits April 2, 2024 09:52
Problem: the JSON1 extension to sqlite3 is not available
in RHEL 8 based distros.

They are enabled by default in sqlite3-3.38, or earlier with
an opt-in build option.  While Ubuntu 22.04 ships
sqlite3-3.37.2-2ubuntu0.3 which does include the JSON1 extension,
RHEL 8 / TOSS 4 ships sqlite-3.26.0-19.el8_9.x86_64 which does not.

Pull in the sqlite3 amalgomated source for 3.45.2 from here:
https://sqlite.org/download.html

License: "public domain".  See https://sqlite.org/copyright.html
Problem: content-sqlite and job-archive use external libsqlite3
but we now have an internal copy of the library.

Drop configure requirement and change the build rules for those
broker modules to use the internal copy of libsqlite3.
Problem: the debian control file and scripts that install
dependencies pull in libsqlite3-dev packages.

Drop sqlite from those scripts.
Problem: docker test images pull in libsqlite3 but this
is no longer a build requirement for flux-core.

Update Dockerfiles.
garlick added 4 commits April 2, 2024 10:45
Problem: libsqlite3 is included in coverage.

Add it to CODE_COVERAGE_IGNORE_PATTERN.
Problem: libsqlite3 is spell checked in CI

Add it to the .typos.toml extend-include list.
Problem: Flux doesn't have a raw SQL interface to job data that
can utilize the sqlite JSON1 extensions.

Add a service that consumes the job manager journal and populates
an in-memory sqlite database with all jobs (active and inactive jobs).
The schema simply stores the jobid, eventlog, jobspec, and R.
The last three are kept in JSON format so sqlite JSON1 extensions
can be used to construct queries.

https://www.sqlite.org/json1.html
Problem: the job-sql service has no command line client.

Add one.
@grondo
Copy link
Contributor

grondo commented Apr 5, 2024

Just bringing up the related #4234 before I forget about it.

@garlick
Copy link
Member Author

garlick commented Apr 5, 2024

Oh thanks, looks like I missed that one and could have learned a thing or two. :-(

@grondo
Copy link
Contributor

grondo commented Apr 5, 2024

Just wondering if this might be useful in some way.

My $0.02 is that this would be most useful as a persistent backend that captured historical job information including purged jobs, i.e. a replacement for the archive. I'm a bit worried about duplicating job data in memory a third (fourth?) time in case a user wants to do sql queries. Could the sqlite db go to disk instead? (not sure of the drawbacks there)

I guess we need to decide if we're going to throw away our recent work on job-list and the constraint syntax and adopt sql instead. In that case it seems like job-list should be replaced or augmented with the sql backend? Instead of having a parallel, equivalent implementation?

There are probably some huge benefits to keeping jobs in sqlite like this and now would be the time to go down this road instead of our current constraint matching approach (maybe a wrong turn there?)

@garlick
Copy link
Member Author

garlick commented Apr 5, 2024

Good thoughts.

One issue with this service as proposed is that it is restricted to the instance owner because it allows arbitrary SQL to be flung at the database engine. Very powerful but no way to limit the data people can read or write to. Possibly a read-only database handle could be provided for guests but limiting their queries to only their own job data seems hard if they can construct the SQL query and transform the query result however they like.

That would make replacing job-list's current interface with SQL problematic.

I think the schema as proposed is obviously inadequate. Try to select on the job state when all you have is an eventlog JSON object :-)

My $0.02 is that this would be most useful as a persistent backend that captured historical job information including purged jobs, i.e. a replacement for the archive. I'm a bit worried about duplicating job data in memory a third (fourth?) time in case a user wants to do sql queries. Could the sqlite db go to disk instead? (not sure of the drawbacks there)

Agreed. Yes we could certainly use the disk instead of memory - trivial if we could just recreate the db on every restart. Somewhat less trivial if we have to sync the journal and the existing data on restart, but doable.

Let's not consider this a competitor to job-list for now. It's very little code that enables an industry standard interface to our job data. Its usefulness remains to be seen, though IMHO.

@chu11
Copy link
Member

chu11 commented Apr 5, 2024

Random thoughts

My $0.02 is that this would be most useful as a persistent backend that captured historical job information including purged jobs, i.e. a replacement for the archive.

Yup, that's my belief too. I was going down that path with #4336 but gave up when it was clear json was not supported in sqlite as widely as I would have liked. Putting sqlite into flux-core would solve that problem of course.

(side note, to remove the job-archive, also need to look at getting flux-framework/flux-accounting#357 merged.)

I guess we need to decide if we're going to throw away our recent work on job-list and the constraint syntax and adopt sql instead. In that case it seems like job-list should be replaced or augmented with the sql backend?

some thoughts

would all job data go into sqlite vs only inactive jobs? if only inactive jobs, supporting sql query on active jobs is probably going to be hard. In contrast, going from "constraint syntax" to sql query should be quite doable.

are sql queries easier for the average user to do? in my head they are not, but that's me. We maybe could do a casual survey of foks.

Also, I am not an expert in sql injection attacks, but one benefit of not giving folks SQL query access is eliminating that possibility altogether. I'm personally a little frightened of them. Granted, modifying historical job data is much less dangerous than username/password info and the like. (Edit: I see garlick's comments that it's isolated to instance owner ... so maybe this is even less of concern than I had before).

@garlick
Copy link
Member Author

garlick commented Apr 5, 2024

Also, I am not an expert in sql injection attacks, but one benefit of not giving folks SQL query access is eliminating that possibility altogether. I'm personally a little frightened of them. Granted, modifying historical job data is much less dangerous than username/password info and the like.

I think if you have SQL access you don't need an injection attack :-)

https://xkcd.com/327/

@grondo
Copy link
Contributor

grondo commented Apr 5, 2024

Yeah, maybe a job-list frontend converts more controllable queries to sql. For example there is this (which probably isn't usable, but is a proof of concept)

https://github.com/fabioespinosa/json-logic-to-sql-compiler

RFC 31 was inspired by JSON Logic so maybe we wouldn't have to throw that out.

@garlick
Copy link
Member Author

garlick commented Apr 5, 2024

I think if you have SQL access you don't need an injection attack :-)

😕

I just meant we're already offering direct SQL access in this PR as proposed. If we opened that up to guests, it's already game over.

@garlick
Copy link
Member Author

garlick commented May 16, 2024

This was fun experiment in which I learned

  • there is no clear way to sandbox/restrict guest access to a job db if the interface is SQL+JSON1
  • it's hard to build useful JSON1 queries for Rv1 and jobspec due to our use of custom data types like idset and hostlist
  • JSON1 queries for things like job state given an eventlog would be pretty hard
  • we we can easily vendor sqlite3 using the amagomated release if we want a newer version (and incidentally, the amalgomated compile purports to be better optimized)

I'll keep this branch around but there's no reason for an open PR at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants