Clean up db schema #1549
My preferred solution would involve:

- No magic: this means slightly more work when implementing, but it results in an approachable code base, allows for easier review (requires less context from the reviewer), and enables straightforward reasoning about performance (big O).
- Read optimized (big O): this helps us ignore the performance details of Postgres. For anyone interested I suggest reading several chapters of PostgreSQL 9.0 High Performance to see everything that can go wrong, for example in query planning without proper maintenance of the db. The more complex the query, the more can (will?) go wrong.
- Storing user actions: this allows us to restore the current state into whatever schema we choose, and guards us against errors in the code, business logic, etc. A generic table with timestamp, participant and data (probably hstore) columns would be my preferred way; see the sketch below.

On the other hand I do not want to throw out the baby with the bath water, so let's discuss 😄.
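A minimal sketch of that generic table (column names and types are my guesses, not a settled design):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE log (
    ts          timestamptz NOT NULL DEFAULT now(),  -- when the action happened
    participant bigint      NOT NULL,                -- who did it
    data        hstore      NOT NULL                 -- what they did, as key/value pairs
);
```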
I've asked @zwn to lead the charge on this one (IRC; keep scrolling on that link). Basically, the single events table design sounds sensible to me. The reason we have multiple logging tables (see #1502 for details) is historical accident, not intentional design. Anyway, @zwn is in charge of this now.
I have gone over all (or maybe just most?) pages we have and noted what data is required where (just data reading, not writing). Would something like this fit somewhere in the developer documentation? Feel free to comment on/improve the list. It should form the list of requirements for the data layer; this is what should be optimized for reading. Did I miss anything?
I've studied materialized views a bit to see what could be done with them. The problem seems to be that a refresh of the view cannot be done concurrently yet, which means that the refresh locks the view. If we used materialized views for the homepage tables, we would not be able to query them while the refresh is running. It is the same behavior as using TRUNCATE while rebuilding the tables (#1541 (comment)). Refreshing the view concurrently is planned only for Postgres 9.4.
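To make the limitation concrete, here is a sketch (the view definition and all names are made up):

```sql
CREATE MATERIALIZED VIEW homepage_givers AS
    SELECT tipper AS username, sum(amount) AS giving
      FROM tips
  GROUP BY tipper;

-- On 9.3 this takes an exclusive lock, so the homepage cannot read
-- from the view while the refresh runs:
REFRESH MATERIALIZED VIEW homepage_givers;

-- Planned for 9.4: requires a unique index on the view, but lets
-- readers proceed during the refresh.
-- CREATE UNIQUE INDEX ON homepage_givers (username);
-- REFRESH MATERIALIZED VIEW CONCURRENTLY homepage_givers;
```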
Oh well. I guess we wait. Can you tell from past history when 9.4 is likely to be out?
I've been exploring what can be done with the trigger and rule systems in 50e298c and e9572b5. I see several possible ways to implement the log table: at the db level with triggers, at the db level with rules, or at the application level.
The advantage of doing this at the db level is that handmade changes made through psql are also logged. The disadvantage is that we cannot tie the action to a signed-in user. A further disadvantage is that writing triggers in plpgsql means yet another language to deal with. This could be somewhat mitigated by using plpython, but that is not possible on the current Heroku offering. So I am starting to lean towards the application-level implementation, on the grounds that it is going to be easier to maintain. I'd appreciate any feedback for and/or against any of the variants.
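For context, a sketch of what the rule-based db-level variant could look like (illustrative names, not the actual code from those commits; it reuses the log table sketched above):

```sql
-- Log a row whenever a participant's username actually changes.
CREATE RULE log_username_change AS
    ON UPDATE TO participants
    WHERE NEW.username IS DISTINCT FROM OLD.username
    DO ALSO
       INSERT INTO log (participant, data)
            VALUES (NEW.id, hstore(ARRAY['action',   'set_username',
                                         'username', NEW.username]));
```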
@zwn What does it look like if we invert the pattern and write to the single event table first as the canonical source, and then work outwards from there to update the normalized tables as downstream consumers?
Here are a few scaling posts from Facebook:
We decided to pursue the idea of writing to a single master event log, with read-optimized tables computed from there. We're thinking in terms of eventually having Python workers asynchronously updating the downstream tables (with UI support for this async updating). As a step on the way, though, we're going to look at instrumenting all the places in the app where we currently write to the database so that they also write to a single master event log as part of the same transaction. Then, once we've lived with this for a while, we'll start inverting the app bit by bit to write to the event log in one transaction and update computed tables in a separate transaction. IRC
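A sketch of that interim step, reusing the illustrative log table from above (the column names, payload shape, and ids are assumptions):

```sql
BEGIN;
    -- The normalized write the app already does today...
    UPDATE participants
       SET statement = 'Building Gittip'
     WHERE username = 'alice';

    -- ...plus the new event-log write, in the same transaction.
    INSERT INTO log (participant, data)
         VALUES (17,  -- alice's internal id
                 hstore(ARRAY['action',    'set_statement',
                              'statement', 'Building Gittip']));
COMMIT;
```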
I am thinking about how to structure the log table. From one point of view it would be good to structure it as close to the user as possible, since this gives us the most resilience against bugs. If the user 'bob' says he wants to tip user 'alice' the amount of $1, we should record it as such. On the other hand, if we support participant renaming, we will not be able to support a general history page like 'I am bob, show me everything I did on gittip'. That leads us to #835. If we translated 'bob' and 'alice' to their ids and recorded those instead, the record will remain valid no matter what 'bob' and 'alice' are called in the future. I've decided to make the API user-facing (taking 'alice' and 'bob' as params) but to store the internal ids in the log table. It might make sense to reopen #835. Also, I'd like to convert the different 'toggle' operations to explicit idempotent set operations. This means we don't have to search through all the previous 'toggles' all the way back to the default at the beginning. Fortunately there aren't that many of them.
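For example, a tip would be recorded as an explicit set operation (a sketch; the ids and payload keys are illustrative):

```sql
-- Replaying 'set_tip' events only needs the latest row per
-- (tipper, tippee) pair; a 'toggle' event would force a search
-- back through every previous toggle to the default.
INSERT INTO log (participant, data)
     VALUES (42,  -- bob's internal id
             hstore(ARRAY['action', 'set_tip',
                          'tippee', '17',     -- alice's internal id
                          'amount', '1.00']));
```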
What about branch.sql? Some transforms are not that easy to describe in SQL. What about some branch.py? Instead of doing
What would be the best way to log
E.g.? I haven't felt this limitation yet, but I expect the changes we're making here will be our most significant yet.
This makes our dev build more complicated, because we can't just run schema.sql. We should take that complication into account and try to avoid a branch.py if possible.
I am so far managing without it.
I'm not convinced there's a better database to move to, even if we did need to "scale".
I think we can get a long way with Postgres.
OK, so following our discussion, here's how I would organize the DB schema. Please add your comments and remarks; I'm just familiarizing myself with the codebase and I probably don't see all the use cases, so please correct me if I got it wrong. In general, I would try to keep the current state of objects (i.e. participants, communities, teams and their respective data and membership) and the history of this state (i.e. when a user joined a given community) in separate tables. In my opinion, this has several advantages:
So here's the proposed layout (all arrows represent m:n relations):

- participants ↔ communities
- participants ↔ teams
- teams ↔ communities
The M2M tables (participant_communities, team_communities, participant_teams) would roughly look like this:
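A sketch of one of them, assuming integer surrogate keys named id on the referenced tables (the other two tables are analogous):

```sql
CREATE TABLE participant_communities (
    participant_id bigint NOT NULL REFERENCES participants (id),
    community_id   bigint NOT NULL REFERENCES communities (id),
    PRIMARY KEY (participant_id, community_id)  -- one membership row per pair
);
```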
participant_id and community_id are foreign keys pointing to the primary keys of the participants and communities tables, respectively. If a participant joins a community, the SQL operation would look like this:
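(A sketch; the literal ids stand in for the participant's and community's primary keys.)

```sql
INSERT INTO participant_communities (participant_id, community_id)
     VALUES (123, 456);
```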
To get all participants in a given community xxx, the query would be:
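(A sketch, assuming communities have a name column:)

```sql
SELECT p.*
  FROM participants p
  JOIN participant_communities pc ON pc.participant_id = p.id
  JOIN communities c             ON c.id = pc.community_id
 WHERE c.name = 'xxx';
```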
To get all communities for a given participant yyy:
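(Again a sketch, assuming a username column on participants:)

```sql
SELECT c.*
  FROM communities c
  JOIN participant_communities pc ON pc.community_id = c.id
  JOIN participants p             ON p.id = pc.participant_id
 WHERE p.username = 'yyy';
```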
For the other relations between participants and teams, as well as teams and communities, it could work in the same way. Instead of using numerical ids for referencing other tables one could use strings as well (e.g. usernames), though this would make updates of these values more difficult.

## Event Logging

For logging event data, we could create automatic Postgres triggers that create rows in an event table each time a row in another table is inserted, updated or deleted.
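A sketch of what such an audit trigger could look like (all names are illustrative; a single trigger function can be attached to many tables):

```sql
CREATE TABLE events (
    id         bigserial   PRIMARY KEY,
    ts         timestamptz NOT NULL DEFAULT now(),
    table_name text        NOT NULL,
    action     text        NOT NULL,  -- 'INSERT', 'UPDATE' or 'DELETE'
    payload    hstore      NOT NULL   -- the affected row, as key/value pairs
);

CREATE FUNCTION log_event() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO events (table_name, action, payload)
             VALUES (TG_TABLE_NAME, TG_OP, hstore(OLD.*));
        RETURN OLD;
    ELSE
        INSERT INTO events (table_name, action, payload)
             VALUES (TG_TABLE_NAME, TG_OP, hstore(NEW.*));
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER participants_audit
    AFTER INSERT OR UPDATE OR DELETE ON participants
    FOR EACH ROW EXECUTE PROCEDURE log_event();
```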
What do you think?
Another blog post! 💃 !m @adewes
Yes! Yes! Yes!
Someone was using my GitHub as a private repo (with my permission) so I just changed my name and kicked him off. Impressive, @adewes. I'd hire you if I had a business. It looks like a good ole normalized RDBMS schema. My 2 cents (I don't know how the $$$ contributions are given, so take this with a grain of salt): one thing I don't see addressed is groups. Is there a further purpose in identifying a participant as a group other than as a permission for team creation or as just a descriptor? IOW, are there tables/processes/functions that are specific to group (not team) participants? Also, are you tracking or planning to track group (not team) members? If so then @adewes's schema should fold groups into Team and flag the record as "group" or "team". Anyway, it was great meeting all of you and it is very impressive that you have a working crowdsource project up. I can't imagine how hard it would be to get something like this going. I know I couldn't.
@arothenberg @whit537 Yeah, it would be interesting to discuss how groups, teams and participants are related to each other. So far my understanding is the following:

- Participants can either be "singular" (i.e. representing one person) or "plural" (i.e. representing a group of persons). In the latter case they will be considered a "group" and be able to add other participants to the group, thus creating a team.

Is that correct, @whit537?
Since I'm not very informed about this, this should be my last post. Anyway, as for groups, my understanding is that it is a permissions flag, as you stated. But in reality a group is also composed of multiple people (some may even be participants?) and contributions to groups may be handled differently. Your schema does not address the contributions and how they are allocated/recorded. What I'm trying to convey is that the business logic could alter your schema. From a superficial reading I think your schema looks perfect. But ours is a superficial gleaning of the business logic. @whit537 would be the one to weigh in on this. Take care and good luck.
No, no, no 😉 With a structure like that everyone is going to be afraid to even touch the db because of the hidden stuff. And when they do, they break it. Using the
If someone is afraid to touch the db, that could be a sign that they're self-limited as a developer. Data is at the heart of software, and Postgres is a robust, mature, well-understood system for managing data. Why not use it? Why implement core data logic in Python instead of in Postgres? That's question number one, and I don't think we've properly addressed it head-on yet. I see us as having taken one small step down each path, app-centric and db-centric. Should we take two steps down each path? What can we say based on our current level of experience with each pattern?

Question number two is whether our source of truth is granular or coarse. Do we update the participants table and then log to events (coarse)? Or do we insert into a low-level table and then bubble that up into a participants view/table (granular)?

Question number three is how questions number one and two are related. :-)
Having lived with the I'm looking at the current
Closing in light of our decision to shut down Gratipay. Thank you all for a great run, and I'm sorry it didn't work out! 😞 💃
@whit537 Really sorry to hear this! It was a great project, as was the team behind it. I was very happy to meet you at PyCon three years ago, and I wish you good luck with your next project, whatever that may be!
Currently we have a mix of different approaches. Decide on a single way and implement it.
There is a $15 open bounty on this issue. Add to the bounty at Bountysource.