
Eating elephant, first step (events, db schema) #2006

Merged
merged 7 commits into master from events on Feb 27, 2014

Conversation

@zbynekwinkler (Contributor)

As discussed in #1549, we need to pick the place where the elephant is softest and start there. Tracking claimed_time and username appears to be it.

The events table is our log. The ultimate goal is that any db state at any point in the past should be reconstructable. After several tries at this, it seems prudent to keep two participant ids in this table: the first is "the participant" (tipper, claimer, etc.), the second is "the other" (tippee, team, etc.). This arrangement allows somewhat sane filtering. Using ids instead of usernames makes the past immutable, which is what we need for any log.
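A hedged sketch of the table this describes, stitched together from this paragraph and the pid1 fragment reviewed below; the pid2 and payload names are assumptions, and db is a postgres.py handle:

    db.run("""
        CREATE TABLE events
        ( id      serial    PRIMARY KEY
        , ctime   timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
        , pid1    bigint    NOT NULL  -- "the participant": tipper, claimer, ...
        , pid2    bigint              -- "the other": tippee, team, ...
        , payload json      NOT NULL  -- hypothetical; event details go here
        )
    """)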

The next steps are to fold goal, number, statement, bitcoin and other attributes into this architecture.

The more chewy parts of the elephant include user creation semantics, tipping and absorbing (take_over).

The ultimate test will be reconstructing the current state of the db just from the events table.

Please review & merge.

"WHERE username=%s "
"RETURNING username, username_lower"
, (suggested, lowercased, self.username)
)
Contributor

Did you intentionally remove the back_as=tuple argument here? Why isn't it needed any more?

Contributor Author

Not only is it not needed, it is not allowed. The way it works is that back_as is a property of the cursor. Since I now create the cursor explicitly, I need to supply it as a param in the get_cursor call.
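A minimal sketch of the pattern described, assuming postgres.py's get_cursor forwards back_as to the cursor it creates; the UPDATE is reconstructed around the quoted fragment:

    # back_as now belongs to the cursor, not to the individual query call.
    with self.db.get_cursor(back_as=tuple) as cursor:
        actual = cursor.one( "UPDATE participants"
                             " SET username=%s, username_lower=%s"
                             " WHERE username=%s"
                             " RETURNING username, username_lower"
                           , (suggested, lowercased, self.username)
                            )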

Contributor

Ah, now I see that. Makes sense.

@seanlinsley (Contributor)

@zwn is #1939 (comment) related to this PR?

@zbynekwinkler (Contributor Author)

Yes, it is. Actually, that write-up is pretty good. Thinking about showing the history to users forces us to be honest and think everything through.

@zbynekwinkler (Contributor Author)

> I think we should think of the events table as something quite low level—just id and payload—and we should expect to be writing significant abstractions atop it for day-to-day use.

@whit537 I must say I am pretty surprised (not in a bad way) to see you arguing this side. You might wonder why, but have you looked at the gittip schema recently? 😉 You have used perhaps all the standard and nonstandard postgres features there are.

Ok, you are making sense. I'll see what I can do about it.

CREATE TABLE events
( id serial PRIMARY KEY
, ctime timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
, pid1 bigint NOT NULL
Contributor

Is there a reason not to reference the participants table here?

Contributor Author

Yes, because it is the other way around: participants should reference events (if anything). Participants should be recreatable from scratch from the events.

chadwhitacre added a commit to gratipay/inside.gratipay.com that referenced this pull request Feb 10, 2014
 - Shortening "Branding Guidelines" to "Brand Guidelines" for punchiness
 - Broadening "Data Model" to "System Architecture." More parallel to
   "Information Architecture," and informs things like
   gratipay/gratipay.com#2006 and
   gratipay/gratipay.com#1939.
@zbynekwinkler mentioned this pull request Feb 10, 2014
47 tasks
@chadwhitacre (Contributor)

> You have used perhaps all the standard and nonstandard postgres features there are.

:-)

Postgres is a pretty powerful system.

@zbynekwinkler (Contributor Author)

Some inspiration from what github stores and publishes: http://developer.github.com/v3/activity/events/

@zbynekwinkler (Contributor Author)

If anyone knows how to reasonably (i.e., not with string-manipulation functions) build complex (i.e., two levels deep) JSON objects in SQL, let me know. So far I have found only https://github.com/pgexperts/json_build, but that does not work on Heroku. I need it to fill the events table from the current logging tables.
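For later readers: Postgres 9.4 eventually shipped json_build_object built in, which handles exactly this kind of nesting; a hypothetical sketch of building the payload described further down:

    # json_build_object is built into Postgres 9.4+; at the time of this PR
    # it was only available via the json_build extension, absent on Heroku.
    payload = db.one("""
        SELECT json_build_object(
            'id', %s
          , 'action', 'set'
          , 'values', json_build_object('username', %s)
        )
    """, (1234, 'joe'))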

@zbynekwinkler (Contributor Author)

I am trying to put the timestamp inside the JSON payload and add an index on it, but that is failing me. Postgres says the cast from text to timestamp is not immutable. Google says that's because of time zones. I am trying to say "damn the time zones", but Postgres won't listen 😟

CREATE INDEX events_ts ON events( cast(payload->>'ts' as timestamp) ASC)

is still not immutable (why? why?). Another bump is that the same expression must be used in every query that wants to use the index. So we can either index the timestamp as text or keep the ts column. I am keeping the ts column. It is just not worth fighting.
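A hedged sketch of the index-the-text alternative mentioned above; it is only sane if every ts is stored in one uniform ISO-8601 format, so lexicographic order matches chronological order:

    # Extracting text with ->> is immutable, so this index builds fine, but
    # queries must also compare payload->>'ts' as text to make use of it.
    db.run("CREATE INDEX events_ts_text ON events ((payload->>'ts'))")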

@zbynekwinkler (Contributor Author)

Oh well, the time of the last commit is off, probably because of my use of stgit.

Now I have an events table with the schema id, ts, type, payload, where type can only be participant (so far). When it is, the payload is expected to contain an id identifying the actor and an action saying what the participant wanted to do. If action is set, the participant wants to set some of her properties, like username, goal, or api_key (so far).

An example of an event where participant is changing her username:

id: 9384
ts: 2014-02-15T....
type: participant
payload:
    id: 1234
    action: set
    values: { username: joe }

The data conversion from the current tables to events is in branch.py, since I don't know how to create two-level-deep JSON in SQL.
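To make "recreatable from events" concrete, a hypothetical replay sketch over events of the shape above (all names here are made up for illustration):

    def replay_participants(events):
        """Fold 'set' events, in ts order, into per-participant dicts."""
        participants = {}
        for event in events:
            payload = event['payload']
            state = participants.setdefault(payload['id'], {})
            if payload.get('action') == 'set':
                state.update(payload['values'])
        return participants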

@chadwhitacre (Contributor)

> If action is set, the participant wants to set some of her properties, like username, goal, or api_key (so far).

This reminds me a lot of MongoDB, which I used a fair bit at my last job. Do you have any experience with Mongo? It gives you JSON "documents" (rows) in "collections" (tables) in "databases" on "servers." And the $set operation of the update command works a lot like you describe here. What's more, Mongo has an "oplog," a record of all data-modifying operations ordered by time, which is queryable just like any other collection (it's a "capped collection," meaning configurably size-capped).
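For comparison, a hypothetical sketch of that $set operation in the old pymongo update API:

    # Updates only the named fields in place, much like the 'set' action
    # described for the events payload above.
    db.participants.update({'_id': 1234}, {'$set': {'username': 'joe'}})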

Do you think we should explore MongoDB as an option?

@zbynekwinkler (Contributor Author)

Interesting. I've never done anything with MongoDB. As you describe it, it is definitely worth checking out.

@chadwhitacre (Contributor)

> As you describe it, it is definitely worth checking out.

I was afraid you would be like, "Chad! Enough with the rabbit holes!" But since you weren't ... ;-)

The problem is that I have a bad impression of MongoDB from an engineering perspective. It's like the difference between MySQL and PostgreSQL. After watching engineers at my last company deal with data loss due to MongoDB design decisions, I don't feel fully comfortable trusting my data to it. Basically, this:

> But, the real problem:
>
> You might object, my information is out of date; they've fixed these problems or intend to fix them in the next version; problem X can be mitigated by optional practice Y.
>
> Unfortunately, it doesn't matter.
>
> The real problem is that so many of these problems existed in the first place.
>
> Database developers must be held to a higher standard than your average developer. Namely, your priority list should typically be something like:
>
>   1. Don't lose data, be very deterministic with data
>   2. Employ practices to stay available
>   3. Multi-node scalability
>   4. Minimize latency at 99% and 95%
>   5. Raw req/s per resource
>
> 10gen's order seems to be, #5, then everything else in some order. #1 ain't in the top 3.
>
> These failings, and the implied priorities of the company, indicate a basic cultural problem, irrespective of whatever problems exist in any single release: a lack of the requisite discipline to design database systems businesses should bet on.

@chadwhitacre (Contributor)

So if we're going to evaluate alternate databases (is that what we're going to do?), I think we should also look at other options besides MongoDB. We should probably start a new ticket, as well. :-)

@zbynekwinkler (Contributor Author)

Oh, I am not ready to evaluate alternate databases. I wanted to explore MongoDB for interesting ideas about how to work with data, what kinds of APIs work for what, etc. We don't have the bandwidth to switch db technology. But that is a tangent anyway.

Are we ready to merge this PR? What else do we need here? My goal is to have an events table and gradually add event types as we work through them on building.gittip.com, IA. So far I see a clear path forward for the participants table. I'd like to do "weekend-sized" PRs and ideally have them merged during the week, so I'm ready for another weekend's work.

ORDER BY ts
""")

for event in heapq.merge(claimed, usernames, api_keys, goals):
Contributor

heapq.merge

Nice. I've never used that before. :-)
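For reference, heapq.merge lazily merges already-sorted iterables into one sorted stream, which is why each source query needs its ORDER BY ts:

    import heapq
    list(heapq.merge([1, 4], [2, 3]))  # -> [1, 2, 3, 4]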

CREATE TABLE events
( id serial PRIMARY KEY
, ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
, type text NOT NULL
Contributor

Would it be better to use kind instead of type, since (as we've mentioned somewhere) type is a builtin function in Python?

for event in heapq.merge(claimed, usernames, api_keys, goals):
payload = dict(action=event.action, id=event.id)
if event.action == 'set':
payload['values'] = { event.__dict__.keys()[-1]: event[-1] }
Contributor

I'm getting a Traceback here:

[gittip] $ honcho run -e defaults.env,local.env ./env/bin/python branch.py
Traceback (most recent call last):
  File "branch.py", line 47, in <module>
    main()
  File "branch.py", line 39, in main
    payload['values'] = { event.__dict__.keys()[-1]: event[-1] }
AttributeError: 'Record' object has no attribute '__dict__'
[gittip] $
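A possible fix, assuming the rows are psycopg2 namedtuple records (which expose column names via _fields rather than __dict__):

    for event in heapq.merge(claimed, usernames, api_keys, goals):
        payload = dict(action=event.action, id=event.id)
        if event.action == 'set':
            # namedtuples keep their column names in _fields, not __dict__
            payload['values'] = {event._fields[-1]: event[-1]}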

Conflicts:
	gittip/models/__init__.py
	gittip/models/participant.py
Not sure why this API changed. Something we did on master in the meantime?
The need for this import entered with #1369.

/* run branch.py before this
DROP RULE log_api_key_changes ON participants;
DROP RULE log_goal_changes ON participants;
Contributor

We have four rules on the participants table right now:

  1. bitcoin_addresses logs into bitcoin_addresses
  2. log_api_key_changes logs into api_keys
  3. log_goal_changes logs into goals
  4. log_participant_number logs into log_participant_number

This PR replaces (2) and (3) with calls to the new add_event function, and it also uses add_event to start logging changes to username and claimed_time, which were not previously logged. (1) and (4) are unaffected by this PR.
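A hedged sketch of what such an add_event helper might look like; the signature is assumed, not taken from the diff:

    import json

    def add_event(c, type, payload):
        # Append one row to the events log inside the caller's transaction.
        c.run( "INSERT INTO events (type, payload) VALUES (%s, %s)"
             , (type, json.dumps(payload))
              )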

Contributor Author

That is correct. Now that we are somewhat unclogged I'll look into adding those over the weekend.

Contributor

Cool. Be sure to give me bite-sized PRs, eh? :-)

@chadwhitacre (Contributor)

I'm going to merge this PR, because I want to land #1976, and account disconnects are something we should really be logging. The logging system in this PR is not perfect, and I'm having tons of ideas about what sort of system would be better. But that's the point of this PR: not to be the final word by any means, but to move us forward incrementally.

chadwhitacre added a commit that referenced this pull request Feb 27, 2014
Eating elephant, first step (events, db schema)
@chadwhitacre merged commit f251a65 into master Feb 27, 2014
@chadwhitacre deleted the events branch February 27, 2014 19:54
@chadwhitacre (Contributor)

Note for the future: would've been good to optimize the branch.py script, or at least log its progress. I'm sitting here waiting for it to finish, and I have no idea how long that's going to take. The app is in maintenance mode in the meantime. :-/

@chadwhitacre (Contributor)

Yeah, that sucked. We were down for, like, an hour. :-(

IRC

chadwhitacre added a commit that referenced this pull request Feb 27, 2014
This time we also had a branch.py. I've noted where that was run.
@zbynekwinkler (Contributor Author)

Sorry. 😟 Locally it was so fast that I never felt the need to add progress tracking or further optimizations. Were you running it locally or on Heroku? I was imagining it running like heroku run python branch.py, so it runs as close to the database as possible (similar to \i branch.sql). Running it on a laptop might have slowed it down.

chadwhitacre added a commit that referenced this pull request Feb 28, 2014
It broke in #2006, not sure how we missed it. Turns out we did have a
test for the logging rule after all. :-)
@chadwhitacre mentioned this pull request Feb 28, 2014
@chadwhitacre (Contributor)

> Were you running it locally or on Heroku?

Locally.

> I was imagining it running like heroku run python branch.py, so it runs as close to the database as possible (similar to \i branch.sql).

Good idea.

@konklone

konklone commented Mar 8, 2014

Ringing in super late, after seeing the brief MongoDB back-and-forth above: I think your instinct that a MongoDB-like model is helpful is right on. If it's not Mongo itself, maybe RethinkDB? I haven't used them, but they have some pretty compelling ideas.
