
Eating elephant, first step (events, db schema) #2006

Merged
merged 7 commits into master from events on Feb 27, 2014

Conversation

@zbynekwinkler (Contributor)

As discussed in #1549, we need to pick the place where the elephant is softest and start there. Tracking claimed_time and username appears to be it.

The events table is our log. The ultimate goal is that any db state at any point in the past should be reconstructable. After several tries at this, it seems prudent to keep two participant ids in this table: the first is "the participant" (tipper, claimer, etc.), the second is "the other" (tippee, team, etc.). This arrangement allows somewhat sane filtering. Using ids instead of usernames makes the past immutable, which is what we need for any log.
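A hedged sketch of the table this describes, stitched together from this paragraph and the pid1 fragment reviewed below; the pid2 and payload names are assumptions, and db is a postgres.py handle:

    db.run("""
        CREATE TABLE events
        ( id      serial    PRIMARY KEY
        , ctime   timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
        , pid1    bigint    NOT NULL  -- "the participant": tipper, claimer, ...
        , pid2    bigint              -- "the other": tippee, team, ...
        , payload json      NOT NULL  -- hypothetical; event details go here
        )
    """)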

The next steps are to fold goal, number, statement, bitcoin and other attributes into this architecture.

The more chewy parts of the elephant include user creation semantics, tipping and absorbing (take_over).

The ultimate test will be reconstructing the current state of the db just from the events table.

Please review & merge.

"WHERE username=%s "
"RETURNING username, username_lower"
, (suggested, lowercased, self.username)
)
Contributor

Did you intentionally remove the back_as=tuple argument here? Why isn't it needed any more?

Contributor Author

Not only is it not needed, it is not allowed. The way it works is that back_as is a property of the cursor. Since I now create the cursor explicitly, I need to supply it as a param in the get_cursor call.
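A minimal sketch of the pattern described, assuming postgres.py's get_cursor forwards back_as to the cursor it creates; the UPDATE is reconstructed around the quoted fragment:

    # back_as now belongs to the cursor, not to the individual query call.
    with self.db.get_cursor(back_as=tuple) as cursor:
        actual = cursor.one( "UPDATE participants"
                             " SET username=%s, username_lower=%s"
                             " WHERE username=%s"
                             " RETURNING username, username_lower"
                           , (suggested, lowercased, self.username)
                            )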

Contributor

Ah, now I see that. Makes sense.

@seanlinsley (Contributor)

@zwn is #1939 (comment) related to this PR?

@zbynekwinkler (Contributor Author)

Yes, it is. Actually, that write-up is pretty good. Thinking about showing the history to users forces us to be honest and think everything through.

@zbynekwinkler (Contributor Author)

> I think we should think of the events table as something quite low level—just id and payload—and we should expect to be writing significant abstractions atop it for day-to-day use.

@whit537 I must say I am pretty surprised (not in a bad way) to see you arguing this side. You might wonder why, but have you looked at the gittip schema recently? 😉 You have used perhaps all the standard and nonstandard postgres features there are.

Ok, you are making sense. I'll see what I can do about it.

CREATE TABLE events
( id serial PRIMARY KEY
, ctime timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
, pid1 bigint NOT NULL
Contributor

Is there a reason not to reference the participants table here?

Contributor Author

Yes, because it is the other way around: participants should reference events (if anything). Participants should be recreatable from scratch from the events.

chadwhitacre added a commit to gratipay/inside.gratipay.com that referenced this pull request Feb 10, 2014
 - Shortening "Branding Guidelines" to "Brand Guidelines" for punchiness
 - Broadening "Data Model" to "System Architecture." More parallel to
   "Information Architecture," and informs things like
   gratipay/gratipay.com#2006 and
   gratipay/gratipay.com#1939.
@zbynekwinkler mentioned this pull request Feb 10, 2014
47 tasks
@chadwhitacre (Contributor)

> You have used perhaps all the standard and nonstandard postgres features there are.

:-)

Postgres is a pretty powerful system.

@zbynekwinkler (Contributor Author)

Some inspiration from what github stores and publishes: http://developer.github.com/v3/activity/events/

@zbynekwinkler (Contributor Author)

If anyone knows how to reasonably (i.e., not with string-manipulation functions) build complex (i.e., two levels deep) JSON objects in SQL, let me know. So far I have found only https://github.com/pgexperts/json_build, but that does not work on Heroku. I need it to fill the events table from the current logging tables.
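For later readers: Postgres 9.4 eventually shipped json_build_object built in, which handles exactly this kind of nesting; a hypothetical sketch of building the payload described further down:

    # json_build_object is built into Postgres 9.4+; at the time of this PR
    # it was only available via the json_build extension, absent on Heroku.
    payload = db.one("""
        SELECT json_build_object(
            'id', %s
          , 'action', 'set'
          , 'values', json_build_object('username', %s)
        )
    """, (1234, 'joe'))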

@zbynekwinkler (Contributor Author)

I am trying to put the timestamp inside the JSON payload and add an index on it, but that is failing me. Postgres says the cast from text to timestamp is not immutable. Google says that's because of time zones. I am trying to say "damn the time zones", but Postgres won't listen 😟

CREATE INDEX events_ts ON events( cast(payload->>'ts' as timestamp) ASC)

is still not immutable (why? why?). Another bump is that the same expression must be used in every query that wants to use the index. So we can either index the timestamp as text or keep the ts column. I am keeping the ts column. It is just not worth fighting.
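A hedged sketch of the index-the-text alternative mentioned above; it is only sane if every ts is stored in one uniform ISO-8601 format, so lexicographic order matches chronological order:

    # Extracting text with ->> is immutable, so this index builds fine, but
    # queries must also compare payload->>'ts' as text to make use of it.
    db.run("CREATE INDEX events_ts_text ON events ((payload->>'ts'))")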

@zbynekwinkler (Contributor Author)

Oh well, the time of the last commit is off, probably because of my use of stgit.

Now I have an events table with the schema id, ts, type, payload, where type can only be participant (so far). When it is, the payload is expected to contain an id identifying the actor and an action saying what the participant wanted to do. If action is set, the participant wants to set some of her properties, like username, goal, or api_key (so far).

An example of an event where participant is changing her username:

id: 9384
ts: 2014-02-15T....
type: participant
payload:
    id: 1234
    action: set
    values: { username: joe }

The data conversion from the current tables to events is in branch.py, since I don't know how to create two-level-deep JSON in SQL.
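To make "recreatable from events" concrete, a hypothetical replay sketch over events of the shape above (all names here are made up for illustration):

    def replay_participants(events):
        """Fold 'set' events, in ts order, into per-participant dicts."""
        participants = {}
        for event in events:
            payload = event['payload']
            state = participants.setdefault(payload['id'], {})
            if payload.get('action') == 'set':
                state.update(payload['values'])
        return participants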

@chadwhitacre (Contributor)

> If action is set, the participant wants to set some of her properties, like username, goal, or api_key (so far).

This reminds me a lot of MongoDB, which I used a fair bit at my last job. Do you have any experience with Mongo? It gives you JSON "documents" (rows) in "collections" (tables) in "databases" on "servers." And the $set operation of the update command works a lot like you describe here. What's more, Mongo has an "oplog," a record of all data-modifying operations ordered by time, which is queryable just like any other collection (it's a "capped collection," meaning configurably size-capped).
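For comparison, a hypothetical sketch of that $set operation in the old pymongo update API:

    # Updates only the named fields in place, much like the 'set' action
    # described for the events payload above.
    db.participants.update({'_id': 1234}, {'$set': {'username': 'joe'}})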

Do you think we should explore MongoDB as an option?

@zbynekwinkler (Contributor Author)

Interesting. I've never done anything with MongoDB. As you describe it, it is definitely worth checking out.

@chadwhitacre (Contributor)

> As you describe it, it is definitely worth checking out.

I was afraid you would be like, "Chad! Enough with the rabbit holes!" But since you weren't ... ;-)

The problem is that I have a bad impression of MongoDB from an engineering perspective. It's like the difference between MySQL and PostgreSQL. After watching engineers at my last company deal with data loss due to MongoDB design decisions, I don't feel fully comfortable trusting my data to it. Basically, this:

> But, the real problem:
>
> You might object, my information is out of date; they've fixed these problems or intend to fix them in the next version; problem X can be mitigated by optional practice Y.
>
> Unfortunately, it doesn't matter.
>
> The real problem is that so many of these problems existed in the first place.
>
> Database developers must be held to a higher standard than your average developer. Namely, your priority list should typically be something like:
>
>   1. Don't lose data, be very deterministic with data
>   2. Employ practices to stay available
>   3. Multi-node scalability
>   4. Minimize latency at 99% and 95%
>   5. Raw req/s per resource
>
> 10gen's order seems to be, #5, then everything else in some order. #1 ain't in the top 3.
>
> These failings, and the implied priorities of the company, indicate a basic cultural problem, irrespective of whatever problems exist in any single release: a lack of the requisite discipline to design database systems businesses should bet on.

@chadwhitacre (Contributor)

So if we're going to evaluate alternate databases (is that what we're going to do?), I think we should also look at other options besides MongoDB. We should probably start a new ticket, as well. :-)

@zbynekwinkler (Contributor Author)

Oh, I am not ready to evaluate alternate databases. I wanted to explore MongoDB for interesting ideas about how to work with data, what kinds of APIs work for what, etc. We don't have the bandwidth to switch db technology. But that is a tangent anyway.

Are we ready to merge this PR? What else do we need here? My goal is to have an events table and gradually add event types as we work through them on building.gittip.com, IA. So far I see a clear path forward for the participants table. I'd like to do "weekend-sized" PRs and ideally have them merged during the week, so I'm ready for another weekend's work.

ORDER BY ts
""")

for event in heapq.merge(claimed, usernames, api_keys, goals):
Contributor

heapq.merge

Nice. I've never used that before. :-)
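For reference, heapq.merge lazily merges already-sorted iterables into one sorted stream, which is why each source query needs its ORDER BY ts:

    import heapq
    list(heapq.merge([1, 4], [2, 3]))  # -> [1, 2, 3, 4]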

CREATE TABLE events
( id serial PRIMARY KEY
, ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
, type text NOT NULL
Contributor

Would it be better to use kind instead of type, since (as we've mentioned somewhere) type is a builtin function in Python?

for event in heapq.merge(claimed, usernames, api_keys, goals):
payload = dict(action=event.action, id=event.id)
if event.action == 'set':
payload['values'] = { event.__dict__.keys()[-1]: event[-1] }
Contributor

I'm getting a Traceback here:

[gittip] $ honcho run -e defaults.env,local.env ./env/bin/python branch.py
Traceback (most recent call last):
  File "branch.py", line 47, in <module>
    main()
  File "branch.py", line 39, in main
    payload['values'] = { event.__dict__.keys()[-1]: event[-1] }
AttributeError: 'Record' object has no attribute '__dict__'
[gittip] $
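A possible fix, assuming the rows are psycopg2 namedtuple records (which expose column names via _fields rather than __dict__):

    for event in heapq.merge(claimed, usernames, api_keys, goals):
        payload = dict(action=event.action, id=event.id)
        if event.action == 'set':
            # namedtuples keep their column names in _fields, not __dict__
            payload['values'] = {event._fields[-1]: event[-1]}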

Conflicts:
	gittip/models/__init__.py
	gittip/models/participant.py
Not sure why this API changed. Something we did on master in the meantime?
The need for this import entered with #1369.

/* run branch.py before this
DROP RULE log_api_key_changes ON participants;
DROP RULE log_goal_changes ON participants;
Contributor

We have four rules on the participants table right now:

  1. bitcoin_addresses logs into bitcoin_addresses
  2. log_api_key_changes logs into api_keys
  3. log_goal_changes logs into goals
  4. log_participant_number logs into log_participant_number

This PR replaces (2) and (3) with calls to the new add_event function, and it also uses add_event to start logging changes to username and claimed_time, which were not previously logged. (1) and (4) are unaffected by this PR.
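A hedged sketch of what such an add_event helper might look like; the signature is assumed, not taken from the diff:

    import json

    def add_event(c, type, payload):
        # Append one row to the events log inside the caller's transaction.
        c.run( "INSERT INTO events (type, payload) VALUES (%s, %s)"
             , (type, json.dumps(payload))
              )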

Contributor Author

That is correct. Now that we are somewhat unclogged I'll look into adding those over the weekend.

Contributor

Cool. Be sure to give me bite-sized PRs, eh? :-)

@chadwhitacre (Contributor)

I'm going to merge this PR, because I want to land #1976, and account disconnects are something we should really be logging. The logging system in this PR is not perfect, and I'm having tons of ideas about what sort of system would be better. But that's the point of this PR: not to be the final word by any means, but to move us forward incrementally.

chadwhitacre added a commit that referenced this pull request Feb 27, 2014
Eating elephant, first step (events, db schema)
@chadwhitacre merged commit f251a65 into master Feb 27, 2014
@chadwhitacre deleted the events branch February 27, 2014 19:54
@chadwhitacre (Contributor)

Note for the future: would've been good to optimize the branch.py script, or at least log its progress. I'm sitting here waiting for it to finish, and I have no idea how long that's going to take. The app is in maintenance mode in the meantime. :-/

@chadwhitacre (Contributor)

Yeah, that sucked. We were down for, like, an hour. :-(

IRC

chadwhitacre added a commit that referenced this pull request Feb 27, 2014
This time we also had a branch.py. I've noted where that was run.
@zbynekwinkler (Contributor Author)

Sorry. 😟 Locally it was so fast that I never felt the need to add progress tracking or further optimizations. Were you running it locally or on Heroku? I was imagining it running like heroku run python branch.py, so it runs as close to the database as possible (similar to \i branch.sql). Running it on a laptop might have slowed it down.

chadwhitacre added a commit that referenced this pull request Feb 28, 2014
It broke in #2006, not sure how we missed it. Turns out we did have a
test for the logging rule after all. :-)
@chadwhitacre mentioned this pull request Feb 28, 2014
@chadwhitacre (Contributor)

> Were you running it locally or on Heroku?

Locally.

> I was imagining it running like heroku run python branch.py, so it runs as close to the database as possible (similar to \i branch.sql).

Good idea.

@konklone

konklone commented Mar 8, 2014

Ringing in super late, after seeing the brief MongoDB back-and-forth above: I think your instinct that a MongoDB-like model is helpful is right on. If it's not Mongo itself, maybe RethinkDB? I haven't used them, but they have some pretty compelling ideas.
