Clean up db schema #1549
My preferred solution would involve:

- No magic: this means slightly more work when implementing, but it results in an approachable code base, allows for easier review (requires less context from the reviewer), and enables straightforward reasoning about performance (big O).
- Read optimized (big O): this helps us ignore the performance details of Postgres. For anyone interested I suggest reading several chapters of PostgreSQL 9.0 High Performance to see everything that can go wrong, for example in query planning without proper maintenance of the db. The more complex the query, the more can (will?) go wrong.
- Storing user actions: this allows us to restore the current state into whatever schema we choose, and guards us against errors in the code, business logic, etc. A generic table with timestamp, participant and data (probably hstore) columns would be my preferred way; see the sketch below.

On the other hand I do not want to throw out the baby with the bath water, so let's discuss 😄.
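A minimal sketch of that generic table (column names and types are my guesses, not a settled design):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE log (
    ts          timestamptz NOT NULL DEFAULT now(),  -- when the action happened
    participant bigint      NOT NULL,                -- who did it
    data        hstore      NOT NULL                 -- what they did, as key/value pairs
);
```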
I've asked @zwn to lead the charge on this one (IRC; keep scrolling on that link). Basically, the single events table design sounds sensible to me. The reason we have multiple logging tables (see #1502 for details) is historical accident, not intentional design. Anyway, @zwn is in charge of this now.
I have gone over all (or maybe just most?) pages we have and noted what data is required where (just data reading, not writing). Would something like this fit somewhere in the developer documentation? Feel free to comment on/improve the list. It should form the list of requirements for the data layer; this is what should be optimized for reading. Did I miss anything?
I've studied materialized views a bit to see what could be done with them. The problem seems to be that a refresh of the view cannot be done concurrently yet, which means that the refresh locks the view. If we used materialized views for the homepage tables, we would not be able to query them while the refresh is running. It is the same behavior as using TRUNCATE while rebuilding the tables (#1541 (comment)). Refreshing the view concurrently is planned only for Postgres 9.4.
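To make the limitation concrete, here is a sketch (the view definition and all names are made up):

```sql
CREATE MATERIALIZED VIEW homepage_givers AS
    SELECT tipper AS username, sum(amount) AS giving
      FROM tips
  GROUP BY tipper;

-- On 9.3 this takes an exclusive lock, so the homepage cannot read
-- from the view while the refresh runs:
REFRESH MATERIALIZED VIEW homepage_givers;

-- Planned for 9.4: requires a unique index on the view, but lets
-- readers proceed during the refresh.
-- CREATE UNIQUE INDEX ON homepage_givers (username);
-- REFRESH MATERIALIZED VIEW CONCURRENTLY homepage_givers;
```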
Oh well. I guess we wait. Can you tell from past history when 9.4 is likely to be out?
I've been exploring what can be done with the trigger and rule systems in 50e298c and e9572b5. I see several possible ways to implement the log table: at the db level with triggers, at the db level with rules, or at the application level.
The advantage of doing this at the db level is that handmade changes made through psql are also logged. The disadvantage is that we cannot tie the action to a signed-in user. A further disadvantage is that writing triggers in plpgsql means yet another language to deal with. This could be somewhat mitigated by using plpython, but that is not possible on the current Heroku offering. So I am starting to lean towards the application-level implementation, on the grounds that it is going to be easier to maintain. I'd appreciate any feedback for and/or against any of the variants.
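For context, a sketch of what the rule-based db-level variant could look like (illustrative names, not the actual code from those commits; it reuses the log table sketched above):

```sql
-- Log a row whenever a participant's username actually changes.
CREATE RULE log_username_change AS
    ON UPDATE TO participants
    WHERE NEW.username IS DISTINCT FROM OLD.username
    DO ALSO
       INSERT INTO log (participant, data)
            VALUES (NEW.id, hstore(ARRAY['action',   'set_username',
                                         'username', NEW.username]));
```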
@zwn What does it look like if we invert the pattern and write to the single event table first as the canonical source, and then work outwards from there to update the normalized tables as downstream consumers?
Here are a few scaling posts from Facebook:
We decided to pursue the idea of writing to a single master event log, with read-optimized tables computed from there. We're thinking in terms of eventually having Python workers asynchronously updating the downstream tables (with UI support for this async updating). As a step on the way, though, we're going to look at instrumenting all the places in the app where we currently write to the database so that they also write to a single master event log as part of the same transaction. Then, once we've lived with this for a while, we'll start inverting the app bit by bit to write to the event log in one transaction and update computed tables in a separate transaction. IRC
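A sketch of that interim step, reusing the illustrative log table from above (the column names, payload shape, and ids are assumptions):

```sql
BEGIN;
    -- The normalized write the app already does today...
    UPDATE participants
       SET statement = 'Building Gittip'
     WHERE username = 'alice';

    -- ...plus the new event-log write, in the same transaction.
    INSERT INTO log (participant, data)
         VALUES (17,  -- alice's internal id
                 hstore(ARRAY['action',    'set_statement',
                              'statement', 'Building Gittip']));
COMMIT;
```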
I am thinking about how to structure the log table. From one point of view it would be good to structure it as close to the user as possible, since this gives us the most resilience against bugs. If the user 'bob' says he wants to tip user 'alice' the amount of $1, we should record it as such. On the other hand, if we support participant renaming, we will not be able to support a general history page like 'I am bob, show me everything I did on gittip'. That leads us to #835. If we translated 'bob' and 'alice' to their ids and recorded those instead, the record will remain valid no matter what 'bob' and 'alice' are called in the future. I've decided to make the API user-facing (taking 'alice' and 'bob' as params) but to store the internal ids in the log table. It might make sense to reopen #835. Also, I'd like to convert the different 'toggle' operations to explicit idempotent set operations. This means we don't have to search through all the previous 'toggles' all the way back to the default at the beginning. Fortunately there aren't that many of them.
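For example, a tip would be recorded as an explicit set operation (a sketch; the ids and payload keys are illustrative):

```sql
-- Replaying 'set_tip' events only needs the latest row per
-- (tipper, tippee) pair; a 'toggle' event would force a search
-- back through every previous toggle to the default.
INSERT INTO log (participant, data)
     VALUES (42,  -- bob's internal id
             hstore(ARRAY['action', 'set_tip',
                          'tippee', '17',     -- alice's internal id
                          'amount', '1.00']));
```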
What about branch.sql? Some transforms are not that easy to describe in SQL. What about some branch.py? Instead of doing
What would be the best way to log
E.g.? I haven't felt this limitation yet, but I expect the changes we're making here will be our most significant yet.
This makes our dev build more complicated, because we can't just run schema.sql. We should take that complication into account and try to avoid a branch.py if possible.
I am so far managing without it.
I'm not convinced there's a better database to move to, even if we did need to "scale".
I think we can get a long way with Postgres.
OK, so following our discussion, here's how I would organize the DB schema. Please add your comments and remarks; I'm just familiarizing myself with the codebase and I probably don't see all the use cases, so please correct me if I got it wrong. In general, I would try to keep the current state of objects (i.e. participants, communities, teams and their respective data and membership) and the history of this state (i.e. when a user joined a given community) in separate tables. In my opinion, this has several advantages:
So here's the proposed layout (all arrows represent m:n relations):

- participants ↔ communities
- participants ↔ teams
- teams ↔ communities
The M2M tables (participant_communities, team_communities, participant_teams) would roughly look like this:
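A sketch of one of them, assuming integer surrogate keys named id on the referenced tables (the other two tables are analogous):

```sql
CREATE TABLE participant_communities (
    participant_id bigint NOT NULL REFERENCES participants (id),
    community_id   bigint NOT NULL REFERENCES communities (id),
    PRIMARY KEY (participant_id, community_id)  -- one membership row per pair
);
```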
participant_id and community_id are foreign keys pointing to the primary keys of the participants and communities tables, respectively. If a participant joins a community, the SQL operation would look like this:
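(A sketch; the literal ids stand in for the participant's and community's primary keys.)

```sql
INSERT INTO participant_communities (participant_id, community_id)
     VALUES (123, 456);
```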
To get all participants in a given community xxx, the query would be:
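(A sketch, assuming communities have a name column:)

```sql
SELECT p.*
  FROM participants p
  JOIN participant_communities pc ON pc.participant_id = p.id
  JOIN communities c             ON c.id = pc.community_id
 WHERE c.name = 'xxx';
```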
To get all communities for a given participant yyy:
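(Again a sketch, assuming a username column on participants:)

```sql
SELECT c.*
  FROM communities c
  JOIN participant_communities pc ON pc.community_id = c.id
  JOIN participants p             ON p.id = pc.participant_id
 WHERE p.username = 'yyy';
```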
For the other relations between participants and teams, as well as teams and communities, it could work in the same way. Instead of using numerical ids for referencing other tables one could use strings as well (e.g. usernames), though this would make updates of these values more difficult.

## Event Logging

For logging event data, we could create automatic Postgres triggers that create rows in an event table each time a row in another table is inserted, updated or deleted.
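A sketch of what such an audit trigger could look like (all names are illustrative; a single trigger function can be attached to many tables):

```sql
CREATE TABLE events (
    id         bigserial   PRIMARY KEY,
    ts         timestamptz NOT NULL DEFAULT now(),
    table_name text        NOT NULL,
    action     text        NOT NULL,  -- 'INSERT', 'UPDATE' or 'DELETE'
    payload    hstore      NOT NULL   -- the affected row, as key/value pairs
);

CREATE FUNCTION log_event() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO events (table_name, action, payload)
             VALUES (TG_TABLE_NAME, TG_OP, hstore(OLD.*));
        RETURN OLD;
    ELSE
        INSERT INTO events (table_name, action, payload)
             VALUES (TG_TABLE_NAME, TG_OP, hstore(NEW.*));
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER participants_audit
    AFTER INSERT OR UPDATE OR DELETE ON participants
    FOR EACH ROW EXECUTE PROCEDURE log_event();
```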
What do you think?
Another blog post! 💃 !m @adewes
Yes! Yes! Yes!
Someone was using my GitHub as a private repo (with my permission) so I just changed my name and kicked him off. Impressive, @adewes. I'd hire you if I had a business. It looks like a good ole normalized RDBMS schema. My 2 cents (I don't know how the $$$ contributions are given, so take this with a grain of salt): one thing I don't see addressed is groups. Is there a further purpose in identifying a participant as a group other than as a permission for team creation or as just a descriptor? IOW, are there tables/processes/functions that are specific to group (not team) participants? Also, are you tracking or planning to track group (not team) members? If so then @adewes's schema should fold groups into Team and flag the record as "group" or "team". Anyway, it was great meeting all of you and it is very impressive that you have a working crowdsource project up. I can't imagine how hard it would be to get something like this going. I know I couldn't.
@arothenberg @whit537 Yeah, it would be interesting to discuss how groups, teams and participants are related to each other. So far my understanding is the following:

- Participants can either be "singular" (i.e. representing one person) or "plural" (i.e. representing a group of persons). In the latter case they will be considered a "group" and be able to add other participants to the group, thus creating a team.

Is that correct, @whit537?
Since I'm not very informed about this, this should be my last post. Anyway, as for groups, my understanding is that it is a permissions flag, as you stated. But in reality a group is also composed of multiple people (some may even be participants?) and contributions to groups may be handled differently. Your schema does not address the contributions and how they are allocated/recorded. What I'm trying to convey is that the business logic could alter your schema. From a superficial reading I think your schema looks perfect. But ours is a superficial gleaning of the business logic. @whit537 would be the one to weigh in on this. Take care and good luck.
No, no, no 😉 With a structure like that everyone is going to be afraid to even touch the db because of the hidden stuff. And when they do, they break it. Using the
If someone is afraid to touch the db, that could be a sign that they're self-limited as a developer. Data is at the heart of software, and Postgres is a robust, mature, well-understood system for managing data. Why not use it? Why implement core data logic in Python instead of in Postgres? That's question number one, and I don't think we've properly addressed it head-on yet. I see us as having taken one small step down each path, app-centric and db-centric. Should we take two steps down each path? What can we say based on our current level of experience with each pattern?

Question number two is whether our source of truth is granular or coarse. Do we update the participants table and then log to events (coarse)? Or do we insert into a low-level table and then bubble that up into a participants view/table (granular)?

Question number three is how questions number one and two are related. :-)
Having lived with the I'm looking at the current
Closing in light of our decision to shut down Gratipay. Thank you all for a great run, and I'm sorry it didn't work out! 😞 💃
@whit537 Really sorry to hear this! It was a great project, as was the team behind it. I was very happy to meet you at PyCon three years ago, and I wish you good luck with your next project, whatever that may be!
Currently we have a mix of different approaches. Decide on a single way and implement it.
There is a $15 open bounty on this issue. Add to the bounty at Bountysource.