Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Architecture notes #108

Merged
merged 26 commits into from
Apr 1, 2024
Merged

Architecture notes #108

merged 26 commits into from
Apr 1, 2024

Conversation

mathemancer
Copy link
Contributor

@mathemancer mathemancer commented Feb 19, 2024

This adds architecture notes regarding the Models, Users and Permissions, and a quick overview.

Rendered index | Rendered models | Rendered permissions

Copy link
Contributor

@seancolsen seancolsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool. Thanks for sending this, @mathemancer!

Comments on models.md

  • I'm a little confused about the Database, DatabaseServerCredential, and UserDatabaseMap models. Do you think you could spell these out more explicitly using markdown tables that show the name and type of each column? In particular, the UserDatabaseMap model seems to have a database_id field and and database field which I find weird. Perhaps this is a typo. Or maybe I'm not understanding. Seeing all columns and their types would help me. Also, the Database model, on its own seems strange. For it to have only an OID and name without any other fields seems like it's kind of weird and detached. I guess I would expect these fields to be present directly in UserDatabaseMap instead of in a separate table. I think I'm missing something.

  • This stores the template defining what should be shown in a referencing fkey column for this table

    Within this wiki page, I'd suggest changing the above sentence to:

    This stores the record summary template for a table.

    "Record summary template" is the term that we use within the UI and I think it would be nice to use that consistently within our internal docs too.

    Additionally, if you do find yourself making changes to this model, then I'd suggest trying to adopt that "record summary template" terminology in the codebase too.

    "Record summary" is the text that gets rendered within the pill.

    "Record summary template" is the user-defined template that configures the fields to render.

    You also said:

    This would be much better as a ma_catalog table for efficiency reasons.

    Note that in my vision for record summaries within worksheets, the record summary template would be stored within the worksheet definition. I think we ought to consider this idea more thoroughly before making any big changes to record summaries. Likewise I have been ruminating on some more radical changes to record summaries which would make them much more powerful. I want to chat about that too before we make any big changes. As currently implemented, record summaries have some annoying limitations in terms of functionality. And I understand they are also a culprit of poor performance.

  • I'd be curious to see your proposal for what, specifically, the TableMetadata and ColumnMetadata models would look like. Would all fields be in one JSON blob? Or would each field be in a separate column? One reason I ask is that Mukesh and I had quite a bit of back and forth about updating column widths. He summarized it well in this comment. He wanted the API to accept a full JSON blob of display options for a column, all at once. I wanted the API to accept a partial JSON blob of some display options and then to merge that data received from the API request with the data already saved. For some reason that proved to be tricky. I think this use-case is worth considering as we consider architectural changes.

Comments on index.md

  • Generally, this all looks great.

  • At some point before implementation I'd like to discuss the list of RPC functions that we plan to implement. I want to assess them as a cohesive body. Here's an example of some weirdness we currently have: the GET /api/db/v0/tables/ endpoint you mention returns column data. But the front end doesn't actually use that column data — it relies on the columns endpoint for that. If we can find ways to make the RPC functions more mutually exclusive in terms of purpose, I think that would help to simplify things. Personally, I would prefer us to use API call that returns the entire structure of a given schema — all tables, columns, constraints, indexes, etc. Pavish and I have debated this. Even if we did agree on moving in this direction, I don't think the beta would necessarily be the most opportune time to make a large change like that. But I think at the very least it's worth having a discussion to plan out the set of RPC endpoints so that we all agree they work well as a set, with consistent patterns.

Comments on permissions.md

  • I'm withholding review of this part until I have time to read Pavish's Permissions doc as well.

@seancolsen seancolsen removed their assignment Feb 22, 2024
Copy link
Contributor

@kgodey kgodey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mathemancer Here's my initial review, I'll review the permissions after I talk to Pavish about them tomorrow.


## A Motivating Example: Getting the table info in a schema

To get the table info in a schema,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this under the current architecture?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this under the current architecture?

That's correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify in the spec as well?


## Bird's Eye View

The main goals of this architectural redesign are to improve the speed and reduce the complexity of the back end. Secondary goals are to improve the convenience of the API for the use case of our front end, and make it easier for users and contributors to identify which back end code supports a given front end feature. We plan to do this by:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you format these goals as bullet points, just so it's easier to identify the goals quickly?


To achieve this, we will install the following on the user database(s)

- Some custom Mathesar types.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add details about these types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will do.

To achieve this, we will install the following on the user database(s)

- Some custom Mathesar types.
- Some tables holding metadata that needs to be kept near the tables.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add details about these tables and why they need to be near "the tables"? What are "the tables"? Also, does this mean Mathesar won't work without adding new tables to the user's schemas? Do we need to add new tables into every schema, or will they be segregated into a Mathesar schema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This is mostly referring to the record summary template metadata. The current algorithm requires lots of back and forth between the table definition and the template definition, which (with our current architecture) is a serious performance drag.
  • We will put these into a separate mathesar schema (msar_cat probably).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I assume you'll add this to the spec as well.


- Some custom Mathesar types.
- Some tables holding metadata that needs to be kept near the tables.
- A set of functions that provide the bulk of Mathesar's back end logic and functionality.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar questions as the above bullet point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll flesh out this section in a new page, I think.


This service should provide an JSON-RPC API for use by the front end. When an API function is called, the service should:

- Grab an appropriate engine from the `DatabaseConnection` model by combining the user associated with the request with the `database`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What DatabaseConnection model? Could you describe the new model structure somewhere and compare it to the current structure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, found the models file, but can you link it somewhere from here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I found the models file but I am finding it very hard to follow with the ad-hoc notes, can we do some sort of comparison of old structure vs. new structure?

@mathemancer
Copy link
Contributor Author

@seancolsen @kgodey This is ready for re-review. The main change is to add a section to models.md describing the models for beta. I've tried to add enough information that it should be possible to follow.

I've also made some more minor changes as per the review comments from @kgodey

Still TODO is to add more info about the setup of the DB-layer functions (and the DB-layer schemata and eventual schemata in general).

@mathemancer
Copy link
Contributor Author

@kgodey @seancolsen @pavish This should probably have another round of re-review.

The main changes are:

  • added a field to the UserDBRoleMap to note metadata permissions for a given user on a given DB.
  • added a default credential fkey to the Database model. Any admin connecting to that Database will use the default credential unless configured otherwise in the UserDBRoleMap.
  • Removed the no-longer-needed UIQueryDatabaseRole model. All needed functionality should be handled by logic around the metadata_role field in the UserDBRoleMap.

Minor changes are:

  • all foreign keys are now named for the table they refer to, with no _id suffix.
  • Changed "Database" to DB in most places for brevity.
  • Removed an oid field from the Database model, since it wouldn't be used by beta, and might never be used.
  • noted not-null constraints for all models.
  • Changed UIQuery to Exploration

I also moved the old model notes to their own page, since I kept find/replacing the wrong things.

@mathemancer
Copy link
Contributor Author

@kgodey To answer your remaining question:

(1) @mathemancer, we're not installing any additional Mathesar schemas as part of this, right? I see we're adding new types, but I assume they'll be installed in the existing schemas?

Correct. However, at some point we should rename the current schemas for consistency. We should consider whether to go through that pain before or after beta.

@kgodey
Copy link
Contributor

kgodey commented Mar 27, 2024

@mathemancer If we need to rename them, doing it while we're still in alpha seems preferable to later.

Copy link
Contributor

@seancolsen seancolsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant. This is looking really good now!

I'm a visual person, so I made this as a reference for myself. Perhaps it will help others.

Screenshot_20240329_112252

It's not comprehensive, but it gives you a picture of the relationships. It's also saved here: https://dbdiagram.io/d/Mathesar-internal-6604b63cae072629ce23f20a

@seancolsen seancolsen removed their assignment Mar 28, 2024
@mathemancer
Copy link
Contributor Author

mathemancer commented Mar 28, 2024

Brilliant. This is looking really good now!

I'm a visual person, so I made this as a reference for myself. Perhaps it will help others.

That's really great; thank you!

Copy link
Member

@pavish pavish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mathemancer I went through a finer comb this time, I think it looks great overall. I have a few questions/comments which require clarification.


With the new architecture, to get the same info,

1. The front end calls an RPC endpoint `/api/v0/rpc/`, calling a function `get_schema_table_details` with `database` and `schema` parameters. The database is identified by a Django id referring to a database model, and the schema is identified by an OID.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this endpoint going to remain /api/v0/rpc/ for all requests?

I'd rather have the function name being part of the endpoint url, i.e. /api/v0/rpc/get_schema_table_details.

  1. This makes it easier to debug during development.
  2. This will help us identify which function is being called when looking at request logs which usually only contain the url. This is particularly useful for analytics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is to have everything at the same endpoint. This is typically how such things are organized. It also lets us use the inbuilt documentation, batching, and error-handling that come with the framework. It also aligns with the JSON-RPC 2.0 spec.

How strongly do you feel about this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this is how it's done typically and everytime I've used RPC style endpoints, I've found it annoying since it's been a pain debugging user reported issues by looking at their logs and when trying to analyze frequent requests by noticing the server request logs.

I don't feel strongly about this right now, however, I'd like to know if there's a possibility of doing it in the future if we decide to. I do think my concerns can also be addressed by improving our logging infrastructure.

Comment on lines +62 to +64
| username | character varying | not null |
| password | character varying | encrypted; not null |
| db\_server | integer | not null; references DBServer(id) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to be adding a unique key for (username, db_server)?

It's not mentioned as unique here, but I wanted to clarify this. I would like for us NOT to make this pair unique.

If we end up with an UX where the credentials are configured within a database page and there are multiple databases on the same database server, we would still expect the user to manually configure the credentials within each database page. This will cause a scenario where the same username is present in multiple rows in this table, one for each database, even though they all have the same db_server.

I want to ensure that this flow (if we were to take it) is not blocked by the architecture.

Copy link
Contributor Author

@mathemancer mathemancer Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't planning for it to be unique. As you mention, we may have reason to have more than copy of such a pair.

So, I think we're aligned there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Perhaps I'm confused, but I'm not seeing a compelling use case for having two DBServerCredential records with the same username and db_server. Rather my inclination would be to to make (username, db_server) unique for the purpose of keeping the data clean. I'm not sure why we would require the user to manually configure the credentials when we could allow them to select from an already saved credential. If there is some sort of security purpose here, then I'd imagine we'd actually want to validate that the newly-entered password agrees with the stored password. Another approach would be to test whether these passwords match and prompt the user to update the saved password. To me, these are UX considerations that I'd be inclined to make based on the models. And I'd want to approach the models from a perspective of data integrity and normalization. But I could be missing something here. I support merging this as-is though and perhaps discussing this at the next meeting, just to make sure we're all on the same page.

Copy link
Member

@pavish pavish Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

configure the credentials when we could allow them to select from an already saved credential

Assuming that credentials are configured within the database page on the UX, this would break the scope of what it means to save a new connection since it'll break the boundaries of a database on the UI. It would also be a security concern since we don't want credentials configured within one database to be leaked to another.

I'd imagine we'd actually want to validate that the newly-entered password agrees with the stored password

It'll introduce new complexities such as figuring out which is the correct password. The stored password might be outdated. If we were to re-validate, do we now also update the old one? All these scenarios would complicate UX.

To me, these are UX considerations that I'd be inclined to make based on the models
I'd want to approach the models from a perspective of data integrity and normalization

I think the usecase should determine the modal structure and not the other way around. The purpose of this model is also to support the UX. Keeing it flexible would help us experiment with the UX more than having to restrict our UX affecting usability for little functional benefit.

I do think we can talk about this later, and merge the PR in.

Copy link
Contributor

@seancolsen seancolsen Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mathemancer, @pavish, and I discussed this on a call today. On the call I suggested two different directions to go, which I outline below as (B) and (C).

(In these diagrams I'm using ⭐ to indicate multi-column unique constraints.)

  • (A) This is what the PR currently has (listed here for reference)

    Screenshot_20240329_130345

  • (B) Databases and credentials are siblings within a DB server.

    Here we've added a unique constraint.

    image

    • Credentials would be shared across separate databases.
    • Updating a password in one place would apply to multiple databases.
    • The same is true for port numbers. To edit a port number, we'd need to make it clear to users That they'd be editing the port number of the DBServer object which could apply to many databases.
    • Databases, servers, and credentials could all be maintained separately from one another.
    • It would be possible to keep a credential stored within Mathesar while deleting all databases. Then you could connect to a new database using that same credential.
  • (C) Each database owns a set of credentials.

    image

    • Credentials and servers are both scoped underneath databases.
    • Since a database only has one server, the sever details are inlined into the database model.
    • Separate databases require separate credentials, and those credentials must be duplicated.
    • Any changes you make to server or credential settings within a database will only apply to that database.
    • Deleting a database from Mathesar would delete all the credentials associated with it.
    • One small note: we'd need some triggers to enforce the rule that each database has one and only one default role. There are potentially other ways to enforce this structure too, but the is_default approach seemed cleanest to me.

The point I'm trying to make is that (B) and (C) are both sound approaches, while (A) is error-prone and hacky (in my opinion). With (A) the DB structure does not provide clarity on whether a credential should be shared or duplicated across multiple databases. That's the sort of ambiguity that I strive to minimize when designing database schemas. Ideally any database state should be valid application state. So if our application expects separate sets of credentials per-database, then we should enforce that logic at the data layer to avoid bugs. If we're not sure if we want separate credential sets or not, then we should figure that out now instead of allowing for both structures to coexist in the data layer.

From a product perspective, I think it's worth considering (B) and (C) — and I think we ought to do this before getting into anything with Figma (for the sake of time). To me, (C) seems simplest which is a strong point in its favor. Design and implementation would be much more straightforward than with (C). Approach (B) seems more powerful. But perhaps power users would be better served by additional features (post-beta) which allow for configuration code (e.g. in yaml). Honestly I don't think I have a strong opinion on (B) vs (C). I just would prefer that we avoid (A).

Apologies that I failed to notice the lack of unique constraint. I would have mentioned it sooner had I noticed.

Copy link
Contributor Author

@mathemancer mathemancer Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking options (B) and (C) as a given, I strongly prefer (B), since it's more flexible and more representative of the actual underlying structure of the objects involved. The only reason I still support (A) is that we already have a proposed UI that needs that flexibility (or ambiguity), and (A) can easily be modified to (B) if we go that route.

I'm suspicious of (C), and any UI that depends on (C), since it introduces a misconception that we (and the user) will have to maintain, i.e., that a role on the PostgreSQL server is namespaced under a Database on that server. We have typically regretted creating simplifying abstractions that misrepresent the way PostgreSQL actually works on this project.

Regarding the concerns of @pavish w.r.t. organizational multitenancy, I think it's perfectly reasonable for that to be on a Mathesar installation level for now (rather than on a Database level).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. And if you both want to go with (A) I'll stop fussing about it, though I might have some more pointed questions and critique once we have designs in Figma.

Copy link
Contributor

@kgodey kgodey Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This discussion seems like bikeshedding to me at this point, we can always add a unique constraint later and I don't see why we need to make a decision about whether or how to share credentials between databases right now. I'm going to make the executive decision that we're going with (A).

If the UX we end up with means that (A) can end up with invalid DB states, then we can implement (B) on the backend before implementing the UX in the frontend. It's not like backend changes are verboten after this gets merged.

I do not think we should do (C) in any situation because it's not accurate to how credentials actually work and trying to build on an incorrect abstraction will probably cause us problems later.

Copy link
Contributor

@kgodey kgodey Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were to make a decision about whether to share credentials now, I'd still argue for (A) anyway because implementing sharing credentials well seems like it would add scope to the beta (extra UX flows, security considerations, etc.) and I'm not sure that the feature of sharing credentials is valuable enough to add scope. I assume the majority of users will only use one database anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As already stated, I'm in agreement with @mathemancer and @kgodey to go with (A).


## Adding Connections (backend perspective)

Regardless of UI, the backend should receive a `POST` request to the new RPC endpoint defining a new connection. We'll use the same functions set up in [PR \#3348](https://github.com/mathesar-foundation/mathesar/pull/3348). The relevant info should be stored in the `Database`, `DatabaseServer`, and `DatabaseServerCredential` models. Also, if the user does not already have an entry defining their role on the given database, we could create such an entry automatically in the `UserDatabaseMap` model. This is optional, and doesn't really affect the architecture. Then, to let some other users access that connection, we will provide an RPC function that lets an admin set different users' credentials for the given database by creating or updating `UserDatabaseMap` resources. Note that this does _not_ directly modify anything to do with permissions on actual database objects (e.g., schemata or tables).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does 'connection' refer to, in the context of the new models?

The ways I see connections being added, are as follows:

  1. The first time, where the admin enters the database name, db server info, and db server credential.
    • In this case, we'll add an entry in the UserDatabaseMap model for the admin.
    • We'll also configure this db_server_credential to be the default for the database.
  2. Once the db is already configured, the admin will only need to specify new db server credentials to add more connections to that db.
    • In this case, we'll only be adding more entries to the DatabaseServerCredential table.

Do the existing functions satisfy these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Connection" here is referring to the front end concept. It needs a username, password (or other auth), host, port, and database. That fully specifies a connection string. The info is actually stored over various models for reasons you indicated, e.g., adding a second connection to a given server doesn't require storing that server's host and port again, and adding a second connection to a given database on a given server doesn't require anything except another username and password.

(1) This is mostly the way I see that working, except we don't need to store an entry in the UserDatabaseMap for the admin (or any admin) user to use the default credential. I don't care too much about that distinction, though.

(2) I agree this is the way that should work.

!!! danger "Potential Confusion"
In case the admin or DBA want multiple Mathesar users to be able to modify various DB objects, three options are available:
- Give all relevant Mathesar users access to connect as the owning DB user.
- `GRANT` the owning DB role (ostensibly a user) to DB users connectable by the relevant Mathesar users.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the UX thoughts was to have a configurable product setting which grants access to DB objects created via the Mathesar UI automatically to the roles belonging to all Mathesar users that have access to the database/schema.

Bringing this up so it's on your radar to mull over.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know.

Comment on lines 42 to 44
## Metadata object privileges

Examples of these are table properties like preview columns. Permissions for these are derived from permissions on relevant User DB objects. So, if a user wants to modify the preview template for a table, they would be checked against the permission to modify the underlying table structure. Thus, modifying this metadata will occur in a request to modify the associated table, and the privilege check will be whether the relevant DB user is allowed to `ALTER` that table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated based on our last discussion where we decided that all metadata permissions will now be identified using the metadata\_role column in UserDBRoleMap .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, you're right

@pavish pavish self-requested a review March 28, 2024 21:20
@mathemancer
Copy link
Contributor Author

Okay. I've updated the permissions page to match the new models. Also, it seems like we're at a stage of temporary agreement, or at least a truce. So, I'm going to merge this before anyone changes their mind.

@mathemancer mathemancer merged commit 43a26ad into master Apr 1, 2024
2 checks passed
@mathemancer mathemancer deleted the architecture_notes branch April 1, 2024 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-status: review A PR awaiting review
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

5 participants