Database Schema #36
-
It seems, from talking in the Discord server, there's reason to add back the package version table. @Digitalone1 has previously created a layout that I will link to here: versions
One concern I have here is that this was brought up recently as a way to quickly find the newest version, but without a SemVer parser this will still be difficult, and I find it the same complexity as extracting that from the JSON with the current setup. Still, this can be added if we want to add another column to help determine that.
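Purely as an illustration of that extra-column idea (nothing here is in the current schema; the table and column names are assumptions), storing the parsed SemVer parts would let "newest" be answered with a plain ORDER BY instead of a SemVer parser:

-- Hypothetical helper columns alongside the raw version string.
ALTER TABLE package_versions
    ADD COLUMN semver_major integer,
    ADD COLUMN semver_minor integer,
    ADD COLUMN semver_patch integer;

-- Newest version of a given package by SemVer rather than by insertion order.
SELECT *
FROM package_versions
WHERE package_id = '00000000-0000-0000-0000-000000000000'
ORDER BY semver_major DESC, semver_minor DESC, semver_patch DESC
LIMIT 1;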
-
@mauricioszabo Not easy. I'd say something like this. How would you do it?

SELECT packages.name
FROM packages, package_versions
WHERE packages.pointer = package_versions.package_id
  AND package_versions.engine ->> 'keywords' ILIKE '%plank%'
  AND package_versions.id IN (
    -- only consider the newest (highest id) version of each package
    SELECT MAX(package_versions.id)
    FROM package_versions
    GROUP BY package_versions.package_id
  )
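For what it's worth, an equivalent way to write that in Postgres uses DISTINCT ON, still assuming the highest id is the newest version of a package:

SELECT name
FROM (
    -- pick the newest (highest id) row per package first...
    SELECT DISTINCT ON (pv.package_id) p.name, pv.engine
    FROM packages p
    JOIN package_versions pv ON p.pointer = pv.package_id
    ORDER BY pv.package_id, pv.id DESC
) latest
-- ...then only keep packages whose newest version matches the keyword
WHERE latest.engine ->> 'keywords' ILIKE '%plank%';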
-
Is the
-
So I've gone ahead and updated this document to touch on some conversations that have been had here. If there are no other major points of contention, we can begin looking at the create scripts. I'd appreciate any feedback on this. Thanks!
-
@Digitalone1 considering the pros and cons of storing OAuth token data in our database, I do believe the advantages of speed, or keeping everything self-contained, may be worth some extra effort on our part to keep it safe. Since you are more familiar with SQL queries, I wanted to ask if using native decryption and encryption on the tokens would be a good idea at all. Note that encrypt() and decrypt() come from the pgcrypto extension, so it would need to be enabled first.

Create example table:

CREATE TABLE public.test_crypto
(
    id bigserial NOT NULL,
    plain_text text COLLATE pg_catalog."default",
    crypted_text bytea,
    CONSTRAINT test_crypto_pkey PRIMARY KEY (id)
)
TABLESPACE pg_default;

Then encrypting data like so (this example also stores the decrypted data, which we would have to not do):

insert into public.test_crypto (plain_text, crypted_text)
values ('plaintext', encrypt('plaintext', 'salty', 'aes'));

Then to decrypt data:

select id, plain_text,
    convert_from(decrypt(crypted_text, 'salty', 'aes'), 'SQL_ASCII')
from test_crypto;

Or if there is a better method to keep tokens encrypted at rest, it may be best to do that, since the decrypted token is only needed when querying GitHub. Really, if there is a sufficiently advanced one-way hash or something like that, we can store the data hashed, and just compare a new hash against the token newly provided by the user: query for the hashed token in the database, and if it matches, the backend API server holds onto the token for the lifetime of the call.
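If we go that one-way-hash route, a rough sketch of how it could look with pgcrypto's digest() follows; the table and column names here are made up for the example:

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Hypothetical table: only the hash of the PAT is ever stored.
CREATE TABLE user_tokens (
    user_id bigint PRIMARY KEY,
    token_sha256 text NOT NULL
);

-- At sign-up, hash the provided token before storing it.
INSERT INTO user_tokens (user_id, token_sha256)
VALUES (1, encode(digest('the-users-pat-token', 'sha256'), 'hex'));

-- On each request, hash the token passed by the user and look for a match;
-- the plain token then only lives in backend memory for that one request.
SELECT user_id
FROM user_tokens
WHERE token_sha256 = encode(digest('the-users-pat-token', 'sha256'), 'hex');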
-
Are there any plans (in a future release?) to support namespaces in the fashion of npm (
If a package gets deleted, then I think it's a terrible idea from a security perspective to make it available for someone else. This must never happen!
-
EDIT::
Alright, it seems we have reached a decision on the final schema of the database, and any further modifications will likely need to come during a 2.0 release of the backend, to allow this version to be completed.
Below is the finalized version of this schema:
- names
- packages
- stars
- users: references the user's packages and stars, and is updated when modified.
- versions: references the packages uuid; a version can be marked latest or published, but really that data will currently be determined by the id of the field, since its auto-incrementing nature allows easy historical sorting. Once beta packages are supported, this could indicate that as well.

Beyond this, some other choices have been made via or since the creation of this thread.
Authentication
While originally the token was stored in the DB unencrypted, intended to be kept safe by the DB's protections, the plan then became to force users to pass PAT tokens during each request. Finally, the best method (or at least the one decided on for now) is to hash these tokens in the DB, and compare the hash of a token passed to us against the hashes in the DB. If it matches we can confirm a login, and then operate with the in-memory token to complete any requests needed.
Otherwise feel free to update this with any other needed data that fits this schema. Thanks to everyone that helped to make these choices, and for the fantastic PRs that already exist to integrate with them. If those more familiar with SQL want to try to replace all data.js and users.js functions with queries that return the same information, I can begin the consumption of them. If how the data will be returned is not obvious, a comment detailing such would be highly appreciated. Otherwise we are close to finishing a beta version 1.0.0; fantastic work, everyone.

So our discussion over on #33 has become quite long, and I feel we are reaching a conclusion and wanted an easier format to track the communication. @Digitalone1 @mauricioszabo
Thought that it may be a good idea to lay out how the database schema is currently created.
- packages
- pointers
- stars
- users
Obviously this format caused a lot of concern and sparked a lot of discussion. I kept modifying and recreating it as we talked, and kept changing the code, but I have since realized just how big of an overhaul this type of change will be to the codebase, and I want to definitively decide on the schema before continuing, so that we are all on the same page and can ideally work together to migrate the codebase to whatever schema we eventually decide on.
Listed below are certain considerations and interactions with the Database that have been discussed and are likely the best going forward:
- While the status row of the package_versions table is currently redundant, I feel it will be important to keep once we decide to support alpha and beta tags of packages. In the meantime, while these should still be updated, for finding the most recent package or creating a history of package updates, the AUTO-INCREMENTING ID should be prioritized as the newest version.
- The stargazers_count row of packages will do its best to be accurate, but otherwise is more an estimation of the total stars a package has received. When a package is starred, an entry is added to the stars table, then all rows of that table that apply to the specific package are counted, and that number replaces whatever is present in stargazers_count (a rough sketch of this follows right after this list). While this may have some edge cases where the count is slightly off, it doesn't seem all that important for this number to always be exactly correct.
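Coming back to that star count, the recount it describes might look roughly like this; the stars.package column name and the literal uuid are assumptions for the example:

-- Recount the stars for one package and overwrite its cached stargazers_count.
UPDATE packages
SET stargazers_count = (
    SELECT COUNT(*)
    FROM stars
    WHERE stars.package = packages.pointer
)
WHERE packages.pointer = '00000000-0000-0000-0000-000000000000';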
While I know my decision to use the pointer system may seem foreign, its goal is to keep true to upstream functionality that I would like to preserve. That will be laid out below, as well as some other considerations we should take going forward.
If someone creates todo-package one day, and the next day decides to rename that package to todo-helper, anybody that tries to install, or follows a link to, todo-package will be automatically linked to todo-helper. Forever and always. Now of course package deletion should free up the namespace, but that's a consideration for how to delete a package fully.
With all of that said, I've looked over previous suggestions and current implementations, and want to propose a final schema for the database.
The following tables have been modified since the original posting to include new additions & decisions:
- names
- stars
- users
- packages
- versions: references the packages uuid, serving as the foreign key. A version's status can be latest, published, or removed, or for future use beta and dev.
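To make the relationships concrete, here is a rough sketch of what a few of these tables might look like in Postgres. Only the relationships described above come from this discussion; every column name and type beyond those is an assumption for illustration (stars and users are omitted for brevity):

CREATE TABLE packages (
    pointer uuid PRIMARY KEY,          -- the package's permanent uuid
    original_name text NOT NULL,
    stargazers_count integer NOT NULL DEFAULT 0,
    data jsonb                         -- remaining package metadata as JSON
);

CREATE TABLE names (
    name text PRIMARY KEY,             -- every name the package has ever had
    pointer uuid NOT NULL REFERENCES packages (pointer)
);

CREATE TABLE versions (
    id bigserial PRIMARY KEY,          -- auto-incrementing: highest id = newest version
    package uuid NOT NULL REFERENCES packages (pointer),
    status text NOT NULL               -- beta/dev could be added to this list later
        CHECK (status IN ('latest', 'published', 'removed')),
    semver text NOT NULL,
    meta jsonb                         -- the version's package.json contents
);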
Why:
To outline why I feel this would be the best schema going forward, a few points. When verifying a user, the unique ID returned from the service is checked against each stored service_id; if found, we ensure that their auth method aligns with the service we have checked. Now this may need to be reworked once we support other methods, like providing a username as well, but this would work for just GitHub for now, and would allow us to never store any auth information for the user while still having unique user accounts.

Considerations:
The biggest consideration is about Auth. If we never store any auth info, how do we create user accounts?
Well, we could still set up OAuth on the website and collect our information during that time, but that would still require the user to pass their PAT token during subsequent requests, and it negates the functionality of revoking access via OAuth. Plus it makes staying logged in, or logging in again in the browser, a pain, since we can't store the logged-in state unless we have potentially problematic session storage specifying a user ID, which could be maliciously used in the future.
Then we may need the user to sign up via OAuth and immediately provide a PAT token, which we could implement as part of the sign-up flow. That may not be too bad, and could really be explained as more of a feature, specifying that it is needed because we don't want to leak any sensitive credentials; then we just store the PAT token in the local session storage.
Lastly, I know it was mentioned to normalize the package versions, and while there is a single endpoint on the backend that does sort through these, I feel it may not be worth it to normalize this field for that single endpoint (which I never saw used in the actual APM code) while then having to recreate the JSON data from it for EVERY endpoint. It seems more cost-effective to parse through it the single time it's needed, and otherwise just return it as JSON data the rest of the time.
Common Scenarios
Now I know there are some questions about common flows and how they are handled, and while they may have been answered previously, I thought it'd be good to summarize some of that info here.
Create a Package:
When a user attempts to create a package, first we verify the auth, the same way as mentioned above: using the provided PAT token to check the user identity. Once the unique ID is found on the service, that's checked against all unique IDs in our db, and if the service used matches, then we have our user. From there we take the owner/repo provided in the query parameter and check it against GitHub; using GitHub's APIs we check if they have write permissions to the repo. If they do, we can generate the UUID, take whatever the package's name is from the package.json within the repo, and assign that as a new item in the pointers table. Then the data would be uploaded to the packages table, normalized.
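A sketch of just the database side of this flow, reusing the illustrative packages/names tables from earlier (the auth check and the GitHub write-permission check happen in the backend before any of this runs, and every name here is only an example):

-- Create: generate the permanent uuid and point the initial name at it.
-- gen_random_uuid() is built in from PostgreSQL 13 (pgcrypto before that).
WITH new_package AS (
    INSERT INTO packages (pointer, original_name, data)
    VALUES (gen_random_uuid(), 'todo-package', '{"description": "example"}'::jsonb)
    RETURNING pointer
)
INSERT INTO names (name, pointer)
SELECT 'todo-package', pointer FROM new_package;

-- Resolve: any name the package has ever had leads back to the same row.
SELECT p.*
FROM names n
JOIN packages p ON p.pointer = n.pointer
WHERE n.name = 'todo-package';

-- Rename: simply add another name row pointing at the same uuid, so both
-- 'todo-package' and 'todo-helper' keep resolving to the same package.
INSERT INTO names (name, pointer)
SELECT 'todo-helper', pointer FROM names WHERE name = 'todo-package';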
Change a Package Name:
Really when changing a name, apart from everything else described above (like the authentication methodology), we just add another row to the pointers database, with the same UUID, since that hasn't changed, and the new package name. Meaning that querying for either package name will resolve to the same UUID, and thus the same package.

Please feel free to delve further into any aspect of this, but I am hoping to get the schema finalized for the short term before beginning the modification of the codebase to support it. Thanks for any and all assistance, previous and continuing.