Skip to content
This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

Clean up db schema #1549

Closed
zbynekwinkler opened this issue Oct 5, 2013 · 85 comments
Closed

Clean up db schema #1549

zbynekwinkler opened this issue Oct 5, 2013 · 85 comments

Comments

@zbynekwinkler
Copy link
Contributor

Currently we have a mix of different approaches. Decide on a single way and implement it.


There is a $15 open bounty on this issue. Add to the bounty at Bountysource.

There is a $15 open bounty on this issue. Add to the bounty at Bountysource.

@zbynekwinkler
Copy link
Contributor Author

My preferred solution would involve

  • no magic - everthing is explicit, nothing happens behind the back
  • read optimized - no explains with 9 levels and more
  • keeps history - stores user actions

No magic means slightly more work when implementing but results in an approachable code base, allows for easier review (requires less context from the reviewer), enables straight forward reasoning about performance (big O).

Read optimized (big O) helps us to ignore the performance details of postgres. For anyone interested I suggest reading several chapters in Postgres 9.0 High Performace to see everything that can go wrong for example in query planning without proper maintenance of the db. The more complex the query the more can (will?) go wrong.

Storing user actions allows us to restore the current state into a whatever schema we choose. Guards us against errors in the code, business logic etc. A generic table with a timestamp, participant and data (probably hstore) columns would be my preferred way.

On the other hand I do not want to throw out the baby with the bath water so let's discuss 😄.

@chadwhitacre
Copy link
Contributor

I've asked @zwn to lead the charge on this one. IRC

Keep scrolling on that IRC link, basically the single events table design sounds sensible to me. The reason we have multiple logging tables (see #1502 for details) is historical accident, not intentional design.

Anyway, @zwn is in charge of this now.

@zbynekwinkler
Copy link
Contributor Author

I have gone over all (or maybe just most?) pages we have and noted what data is required where (just data reading not writing). Would something like this fit somewhere in the developer documentation? Feel free to comment/improve the list. It should form a list of requirements for the data layer. This is what should be optimized for reading. Did I miss anything?


@zbynekwinkler
Copy link
Contributor Author

I've studied a bit materialized views to see what could be done using them. The problem seems to be that update of the view cannot be done concurrently yet. This means that the update locks the view. If we used materialized views for the homepage tables, we would not be able to query them while the update is running. It is the same behavior as using TRUNCATE while rebuilding the tables (#1541 (comment)). Refreshing the view concurrently is planed only for Postgres 9.4.

@chadwhitacre
Copy link
Contributor

Refreshing the view concurrently is planned only for Postgres 9.4.

Oh well. I guess we wait. Can you tell from past history when 9.4 is likely to be out?

@chadwhitacre
Copy link
Contributor

@zbynekwinkler
Copy link
Contributor Author

I've been exploring what can be done with triggers and rules system in 50e298c and e9572b5.

I see several possible ways to implement the log table:

  • on the db level by guarding each table with ON UPDATE or ON INSERT rule(s) and/or trigger(s) creating the log table when anything is modified
  • on the db level by guarding the log table ON INSERT and build the proper tables in trigger behind the scenes
  • on application level by creating python module to handle all db modifications

The advantage of doing this on db level is that also handmade changes done through psql are logged. The disadvantage is that we cannot tie the action to a signed in user. Further disadvantage is that writing triggers in plpgsql language is yet another language to deal with. This could be somewhat mitigated by using plpython but that is not possible on current heroku offering. So I am starting to lean towards the application level implementation on the grounds that it is going to be easier to maintain.

I'd appreciate any feedback for and/or against any of the variants.

@chadwhitacre
Copy link
Contributor

@zwn What does it look like if we invert the pattern and write to the single event table first as the canonical source, and then work outwards from there to update the normalized tables as downstream consumers?

@zbynekwinkler
Copy link
Contributor Author

@whit537 I haven't tried to implement it yet. I got somewhat scared by plpgsql and EXECUTE that would have to be used to build and execute the queries. I am much more comfortable in python.

Do you think it is worth exploring in more depth?

@chadwhitacre
Copy link
Contributor

IRC

@chadwhitacre
Copy link
Contributor

Here's a few scaling posts from Facebook:

@chadwhitacre
Copy link
Contributor

We decided to pursue the idea of writing to a single master event log, with read-optimized tables computed from there. We're thinking in terms of eventually having Python workers asynchronously updating the downstream tables (with UI support for this async updating). As a step on the way, though, we're going to look at instrumenting all the places in the app where we currently write to the database to write to a single master event log as an additional part of the transaction. Then once we've lived with this for a while we'll start inverting the app bit by bit to write to the event log in one transaction and update computed tables in a separate transaction. IRC

@zbynekwinkler
Copy link
Contributor Author

I am thinking about how to structure the log table. From one point of view it would be good to structure it as close to the user as possible since this gives us the most resilience against bugs. If the user 'bob' says he wants to tip user 'alice' amount $1 we should record it as such. On the other hand if we support participant renaming we will not be able to support a general history page like 'I am bob, show me everything I did on gittip'. That leads us to #835. If we translated 'bob' and 'alice' to its ids and recorded that instead, it will be valid in the future no matter how are 'bob' and 'alice' called in the future.

I've decided to make the api be user facing (taking 'alice' and 'bob' as params) but storing the internal ids in the log table. It might make sense to reopen #835.

Also I'd like to convert the different 'toggle' operations to explicit idempotent set operations. This allows us not to search for all the previous 'toggles' all the way to the default at the beginning. Fortunately there isn't that many of them.

@zbynekwinkler
Copy link
Contributor Author

Should we do #412 and #224 before or after?

@zbynekwinkler
Copy link
Contributor Author

What about branch.sql? Some transforms are not that easy to describe in SQL. What about some branch.py? Instead of doing \i branch.sql in the psql prompt we could do heroku run branch.py.

@zbynekwinkler
Copy link
Contributor Author

What would be the best way to log take_over action? Within takeover a lot of tips manipulation is going on. I am thinking about having the action logged per partes. So if only one of the elsewheres were taken, it would result in one log item. If it was the last one, the tips zeroing would happen one by one by generic set_tip operation, as well as setting up the new tips. The same goes for absorptions. This will have the advantage of being able reconstruct all the tips in a generic way no matter if they come from takeover or not.

@chadwhitacre
Copy link
Contributor

Should we do #412 and #224 before or after?

After, I think. Those two are not as critical as this ticket.

@chadwhitacre
Copy link
Contributor

Some transforms are not that easy to describe in SQL.

E.g.? I haven't felt this limitation yet, but I expect the changes we're making here will be our most significant yet.

@chadwhitacre
Copy link
Contributor

What about branch.sql? Some transforms are not that easy to describe in SQL. What about some branch.py? Instead of doing \i branch.sql in the psql prompt we could do heroku run branch.py.

This makes our dev build more complicated, because we can't just run schema.sql. We should take that complication into account and try to avoid a branch.py if possible.

@zbynekwinkler
Copy link
Contributor Author

This makes our dev build more complicated, because we can't just run schema.sql. We should take that complication into account and try to avoid a branch.py if possible.

I am so far managing without it.

@seanlinsley
Copy link
Contributor

Seems like implementing at the application layer would give us flexibility to swap out the underlying datastore as we scale

I'm not convinced there's a better database to move to, even if we did need to "scale"

@chadwhitacre
Copy link
Contributor

I think we can get a long way with Postgres.

@adewes
Copy link

adewes commented Apr 14, 2014

OK so following our discussion, here's how I would organize the DB schema. Please add your comments and remarks, I'm just familiarizing myself with the codebase and I probably don't see all the use cases, so please correct me if I got it wrong.

In general, I would try to keep the current state of objects (i.e. participants, communities, teams and their respective data and membership) and the history of this state (i.e. when has a user joined a given community) in separate tables. In my opinion, this has several advantages:

  1. Updating state and aggregating data (i.e. how many users are in a given community) becomes more straightforward
  2. No materialized or calculated views are necessary for generating state information
  3. The number of rows in the state table does not grow linearly with the number of user actions (e.g. each time a user joins a community) but only with the actual number of objects and/or relationships.
  4. Event logging can be done in a more straightforward way using Postgres triggers, without modifying the Python code or application logic (see below)

So here's the proposed layout (all arrows represent m:n relations):

              +––––––––––––––––––+                
      +–––––––+participant_teams +–+              
      |       +––––––––––––––––––+ |              
      |                            |              
      |                            |              
+–––––+–––––+                  +–––++             
|participant|                  |team+–––+         
+––––––––+––+                  +––––+   |         
         |                              |         
         |                              |         
         |                              |         
 +–––––––+–––––––––––––––+    +–––––––––+––––––––+
 |participant_communities|    | team_communities |
 +––––––––––+––––––––––––+    +–––––––––+––––––––+
            |                           |         
            |                           |         
            |                           |         
            |       +–––––––––+         |         
            +–––––––+community+–––––––––+         
                    +–––––––––+                   

The M2M tables (participant_communities, team_communities, participant_teams) would roughly look like this:

    create table participant_communities (
    participant_id unsigned int REFERENCES participants,
    community_id unsigned int REFERENCES communities,
    suspicious boolean,
    state unsigned int, # e.g. 0 = pending, 1 = active, 2 = banned, ...
    ctime datetime,
    mtime datetime
    #possibly more status information
    )

participant_id and community_id are foreign keys pointing to the primary keys of the communities and participants tables, respectively. If a participant joins a community, the SQL operation would look like this:

insert into participant_communities (participant_id,community_id,suspicious) values (1,100, suspicious = false)

To get all participants in a given community xxx, the query would be

select * from participant where participant_id in (select participant_id from participant_communities where community_id = xxx)

To get all communities for a given participant yyy

select * form communities where community_id in (select community_id from participant_communities where participant_id = yyy)

For the other relations between participants and teams as well as teams and communities it could work in the same way. Instead of using numerical ids for referencing to other tables one could use strings as well (e.g. usernames), this would make updates of these values more difficult though.

##Event Logging

For logging event data, we could create automatic Postgres triggers that create rows in an event table each time a row in another table is inserted, updated or deleted.

CREATE TRIGGER log_community_updates
    BEFORE UPDATE ON user_communities
    FOR EACH ROW
    EXECUTE PROCEDURE log_event('community');


#Not checked if this works actually...
CREATE OR REPLACE FUNCTION log_event() RETURNS TRIGGER AS $log_event$
    BEGIN
        --
        -- Create a row in event_log to reflect the operation performed on the state table,
        -- make use of the special variable TG_OP to work out the operation.
        --
        IF (TG_OP = 'DELETE') THEN
            INSERT INTO event_log SELECT 'delete', now(), row_to_json(OLD.*);
            RETURN OLD;
        ELSIF (TG_OP = 'UPDATE') THEN
            INSERT INTO event_log SELECT 'update', now(), row_to_json(NEW.*);
            RETURN NEW;
        ELSIF (TG_OP = 'INSERT') THEN
            INSERT INTO event_log SELECT 'insert', now(), row_to_json(NEW.*);
            RETURN NEW;
        END IF;
        RETURN NULL;
    END;
$log_event$ LANGUAGE plpgsql;

the log_event function receives the old and new row values (http://www.postgresql.org/docs/9.2/static/plpgsql-trigger.html) and can insert them in a logging table. We could also define a separate logging function for each table or even operation, in order to e.g. store table-specific information in the logging table or have different logging tables for different state tables. This approach has the advantage of completely eliminating all calls to create_event in the Python code as well.

What do you think?

@chadwhitacre
Copy link
Contributor

Another blog post! 💃

!m @adewes

@clone1018
Copy link
Contributor

For logging event data, we could create automatic Postgres triggers that create rows in an event table each time a row in another table is inserted, updated or deleted.

Yes! Yes! Yes!

@arothenberg
Copy link

Someone was using my github as a private repo(with my permission) so I just changed my name and kicked him off.

Impressive - adewes. I'd hire you if I had a business. It looks like a good ole normalized rdbs schema.

My 2 cents(I don't know how the $$$contributions are given so take this with a grain of salt) -
The application works well right now so there is no pressing need to reinvent it. If it's possible, I would first try and rename the membership table to Team. Membership is not a very accurate name. But since there are only currently a few tables it's not crucial that the schema be perfect. It works and that's more important.

One thing I don't see is groups addressed. Is there a further purpose in identifying a participant as a group other than as a permission for team creation or as just a descriptor? iow are there tables/process/functions that are specific to group(not team) participants? Also, are you tracking or planning to track group(not team) members? If so then adewes schema should fold groups into Team and flag the record as "group" or "team".

Anyway, it was great meeting all of you and it is very impressive that you have a working crowd source project up. I can't imagine how hard it would be to get something like this going. I know I couldn't.

@adewes
Copy link

adewes commented Apr 15, 2014

@arothenberg @whit537 Yeah would be interesting to discuss how groups, teams and participants are related to each other. So far my understanding is the following:

-Participants can either be "singular" (i.e. representing one person) or "plural" (i.e. representing a group of persons). In the latter case they will be considered as a "group" and be able to add other participants to the group, thus creating a team. Is that correct @whit537 ?

@arothenberg
Copy link

Since I'm not very informed about this - this should be my last post.

Anyway, as per groups, my understanding is that it is a permissions flag as you stated. But in reality a group is also composed of multiple people(some may even be participants?) and contributions to groups may be handled differently. Your schema does not address the contributions and how they are allocated/recorded. What I'm trying to convey is that the business logic could alter your schema. From a superficial reading I think your schema looks perfect. But ours is a superficial gleaning of the Business logic. @whit537 would be the one to weigh in on this.

Take care and good luck.

@zbynekwinkler
Copy link
Contributor Author

For logging event data, we could create automatic Postgres triggers that create rows in an event table each time a row in another table is inserted, updated or deleted.

Yes! Yes! Yes!

No, no, no 😉 With a structure like that everyone is going to be afraid to even touch the db because of the hidden stuff. And when they do, they break it. Using the add_event call is simple enough, isn't it? Also you don't have the right kind of info at the db level (like who is doing the action). BTW: I am really sorry I am not able to participate more at this time. At least I try to monitor what is happening.

@chadwhitacre
Copy link
Contributor

So far my understanding is the following: [...] Is that correct @whit537 ?

@adewes Yup! Some participants are groups. Some groups are teams.

@chadwhitacre
Copy link
Contributor

With a structure like that everyone is going to be afraid to even touch the db because of the hidden stuff. And when they do, they break it.

If someone is afraid to touch the db that could be a sign that they're self-limited as a developer. Data is at the heart of software, and Postgres is a robust, mature, well-understood system for managing data. Why not use it? Why implement core data logic in Python instead of in Postgres?

That's question number one, and I don't think we've properly addressed it head-on yet. I see us as having taken one small step down each path, app-centric and db-centric. Should we take two steps down each path? What can we say based on our current level of experience with each pattern?

Question number two is whether our source of truth is granular or coarse. Do we update the participants table and then log to events (coarse)? Or do we insert into a low-level table and then bubble that up into a participants view/table (granular)?

Question number three is how questions number one and two are related. :-)

@chadwhitacre
Copy link
Contributor

Having lived with the add_event pattern for a while now, I prefer the multiple log tables approach. When a separate add_event call is required ("coarse-grained source of truth"), then it introduces a chance for drift between how we log the event and the change we actually made. By contrast, if the fine-grained log is our source of truth, then there is no chance for drift.

I'm looking at the current update_email implementation on #2752. We log a set action with a current_email value, but what about the nonce and ctime? Maybe we don't care to keep those values around. I don't like the idea that we lose that info. Mightn't we want that for debugging? And more to the point: I want to change how we're doing email verification, and with the add_event pattern I have to make sure to modify the add_event call to stay in sync with the actual event itself.

@chadwhitacre
Copy link
Contributor

@Changaco in IRC:

I'm quite happy that we removed the multiple log tables, it lightens the DB schema, however we could use DB triggers to populate the events table in some cases if we don't want/need to do it in the python code.

@chadwhitacre
Copy link
Contributor

Closing in light of our decision to shut down Gratipay.

Thank you all for a great run, and I'm sorry it didn't work out! 😞 💃

@adewes
Copy link

adewes commented Nov 3, 2017

@whit537 really sorry to hear this, it was a great project and as well as team behind it! I was very happy meeting you at the PyCon three years ago and I wish you good luck for your next project whatever that may be!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants