Commit 4bc51f7: Initial import

git-svn-id: http://s3storage.googlecode.com/svn/trunk@2 50179d6d-2b1f-0410-9cf2-fbaf5ad40ad0

lrowe committed Oct 20, 2006
Showing 9 changed files with 1,022 additions and 0 deletions.
200 changes: 200 additions & 0 deletions README.txt
@@ -0,0 +1,200 @@
s3storage - zope in the cloud

s3storage is an interesting experiment. Amazon's S3 storage service provides an
eventual consistency guarantee, but propagation delays mean that one must be
careful when trying to cram transactional semantics on top of the model. That
said, it has the potential (along with the elastic compute cloud) to provide
a massively scalable object system.

Benefits of S3:

* Simple model, it's just a massive hash table
* Cheap. $0.15 per GB per month
* Scalable
* Reliable (eventually)

Drawbacks:

* Indeterminate propagation time (the time between when a write occurs and
  a read on the resource can be relied upon)
* No built-in transaction support
* No renaming (so we can't reuse Directory Storage)

So that leaves EC2 (the elastic compute cloud) to fill in the gaps.

Benefits of EC2:

* Powerful virtual machines on demand
* Cheap. $0.10 per vm per hour
* It's Linux (pretty much any distribution you want)

Drawbacks:

* Individual VMs are not reliable; this runs on commodity hardware, so expect
  them to fail occasionally

Bandwidth for both services is charged at $0.20 per GB, but only outside the
AWS system (so it's free between S3 and EC2).

So the challenge is to build a scalable zope system that leverages S3's
scalability while making up for its shortcomings with EC2.

Vision
======

Build an S3 storage backend that can scale to many readers (lots of zope
clients) but delegate locking for writes to a known system such as zeo or
postgres. Realtime systems are not the target here; think content and asset
management.

Model
-----

S3 provides no locking, so locking must be provided in EC2.

After a commit, an object will spend an indeterminate time before it becomes
reliably readable. Until that time has passed the view is not reliable. For
the sake of argument let's say Tp = 5 minutes. This means that the last five
minutes of object state must be managed through a transactional system (think
zeo or postgres). There's no use trying to solve the read-only case if the
write case is the bottleneck...

This gives us three pools of object revisions:

* Uncommitted pool.

* Committed, but not to be relied upon. The centrally accounted pool.

* Propagated and relied upon. The distributed pool.

And two distribution methods:

* Scalable but limited S3

* The central transactional system (zeo or postgres), which holds roughly the
  last Tp of writes
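
A minimal sketch of how a reader might choose between the two distribution
methods, assuming the commit time of the wanted revision is known (the helper
name and the seconds-based Tp are made up for illustration, not part of this
code):

    import time

    Tp = 5 * 60  # assumed propagation window in seconds, per the text above

    def choose_source(commit_time, now=None):
        """Pick which pool a committed revision should be read from."""
        if now is None:
            now = time.time()
        if now - commit_time < Tp:
            return 'central'  # too recent: only the transactional system is reliable
        return 's3'           # old enough: the propagated S3 copy can be trusted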

Performance considerations
--------------------------

Read performance encourages us to opt for the distributed path. But this incurs
a time gap that is potentially disastrous for write performance.
Potentially we get a conflict error if any object is updated within Tp.

TODO: Understand commit retries. Could selected transactions be upgraded to a
more current view? Must avoid this central hit on every read.


Possible approaches to scalability
----------------------------------

* Data partitioning. i.e. spread load between several locking servers. e.g.
folder A on one server, folder B on another. Complicated to manage. Though
it would be necessary for very large systems.

* Something like memcached? Potentially simpler to manage.

* Or Amazon SQS for locking? Use a queue as a lock, and make oids a client id
  prefix (16 bits) followed by a 48-bit per-client increasing counter (see the
  sketch after this list)
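
A minimal sketch of that oid layout (the helper names are hypothetical, not
part of this code):

    import struct

    def make_oid(client_id, counter):
        """Pack a 16-bit client id and a 48-bit per-client counter into an
        8-byte oid (big-endian, so one client's oids sort together)."""
        assert 0 <= client_id < 2**16 and 0 <= counter < 2**48
        return struct.pack('>HHI', client_id, counter >> 32,
                           counter & 0xffffffff)

    def split_oid(oid):
        """Recover (client_id, counter) from an oid built by make_oid."""
        client_id, hi, lo = struct.unpack('>HHI', oid)
        return client_id, (hi << 32) | lo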

Reliability considerations
--------------------------

* EC2 is not reliable. We must consider the possibility of transaction manager
  node failure. This should not be disastrous. As S3 gives a guarantee of
  eventual consistency, potentially no more than ~Tp of down time would be
  necessary on node failure, and then only for writes. The storage scheme must
  prepare for this.

  NOTE: this assumes that a successful write is guaranteed to eventually
  propagate. I _think_ S3 gives this guarantee.

* But overhead of disaster preparedness must not be so onerous as to overly
limit the common case.


Filesystem Structure
====================

* S3 is a giant hash table, not a hierarchical filesystem

* S3 provides an efficient list operation. Listing a bucket can be filtered by
prefix, limited by max-results and also have results rolled up with a
delimiter (such as '/'). A marker can be provided to list only results that
are alphabetically after a particular key. List results are always returned
in alphabetical order.

* S3 is atomic only for single file puts.

* Only existence is certain

* Existing filestorage strategies do not map well

* There are three places to store information: the key, the file contents and
  the metadata. Only the key is returned on a list (actually some S3 system
  metadata is returned there too). Keys must be UTF-8 and up to 1024 bytes.
  If a key contains data you don't already know, then two requests are
  required to read the contents (see the sketch after this list).
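
A minimal sketch of the list operation, using the boto library purely for
illustration (this code may use a different S3 client; the bucket name,
credentials and prefix below are made up):

    from boto.s3.connection import S3Connection

    conn = S3Connection('<aws_access_key_id>', '<aws_secret_access_key>')
    bucket = conn.get_bucket('mybucket')

    # Keys sharing a prefix come back in alphabetical order, starting after
    # the marker and capped by max_keys.
    for key in bucket.get_all_keys(prefix='type:transaction,', marker='',
                                   max_keys=100):
        print key.name                        # only the key (plus system metadata)
        data = key.get_contents_as_string()   # second request for the contents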

Transaction Log
---------------

* Only single-file operations are atomic. They either succeed or fail. The
  latest timestamp wins for operations on the same resource.

On a transaction commit, a transaction log entry is written with a key such as

type:transaction,serial:<serial repr>,prev:<prev serial repr>,tid:<tid repr>,ltid:<prev tid repr>

It contains a list of modified oids

Precondition:
all pickle writes must have successfully completed before this file is written

The prev: information is there so that readers can be certain they are seeing a
complete list (only existence is certain).
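
For illustration, a hypothetical helper that builds and parses such keys
(assuming the serial and tid reprs contain no commas or colons):

    def transaction_key(serial, prev_serial, tid, ltid):
        """Build the S3 key for a transaction log entry."""
        return 'type:transaction,serial:%s,prev:%s,tid:%s,ltid:%s' % (
            serial, prev_serial, tid, ltid)

    def parse_key(key):
        """Split an s3storage key back into a dict of its fields."""
        return dict(part.split(':', 1) for part in key.split(','))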

Efficiency of reading the transaction log
-----------------------------------------

We have a design decision here. We can choose to read efficiently from the end
of the key list or from a certain point in the list, depending on the
representation format used for tids, i.e. whether tid=1 ends up alphabetically
lower or higher than tid=2.

(Note we could use another index: field but it would add complication)
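
One possible encoding (a sketch only; not necessarily what this code uses) is
to subtract the tid from its maximum value and format the result as fixed-width
hex, so that newer tids sort alphabetically first. This matches the
reverse-alphabetical ordering required in the Indexing section below:

    MAX_TID = 2**64 - 1

    def encode_tid(tid):
        """Encode a 64-bit tid so that later tids sort earlier."""
        return '%016x' % (MAX_TID - tid)

    # encode_tid(2) < encode_tid(1), so a plain alphabetical listing
    # returns the newest transactions first.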

Records
-------

* A record is the state of an object at a particular transaction

On a store a record is written with a key

type:record,oid:<oid repr>,serial:<serial repr>

Note that records can exist for data in aborted or in-progress transactions.

Indexing
--------

We want to avoid having to read through the entire transaction log in order
to find

* The record for an oid, either currently or as of a particular transaction
  (load, loadBefore)

This should be implemented as a list operation. So we write a 1-byte file with the key

type:index_commit,oid:<oid repr>,serial:<serial repr>,prev:<prev serial repr>,tid:<tid repr>,ltid:<ltid repr>

and the serial for a tid

type:index_tid,tid:<tid repr>,serial:<serial repr>,prev:<prev serial repr>,ltid:<ltid repr>

So this means that the serial and tid sequences must sort in reverse
alphabetical order.

These are starting to look very much like RDF data structures.
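
As a sketch of how load() could then become a single list request (again using
a boto-style bucket purely for illustration, and assuming the reverse-sorted
serial encoding sketched earlier):

    def load_current(bucket, oid_repr):
        """Find the current record for an oid with one list operation.

        Because serial reprs sort newest-first, the first key under this
        prefix names the most recently committed record for the oid.
        """
        prefix = 'type:index_commit,oid:%s,' % oid_repr
        keys = bucket.get_all_keys(prefix=prefix, max_keys=1)
        if not keys:
            raise KeyError(oid_repr)
        fields = dict(part.split(':', 1) for part in keys[0].name.split(','))
        return fields['serial'], fields['tid']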

References
==========

* http://www.zope.org/Wikis/DevSite/Proposals/DirectoryTreeStorage

2 changes: 2 additions & 0 deletions __init__.py
@@ -0,0 +1,2 @@


22 changes: 22 additions & 0 deletions config.py
@@ -0,0 +1,22 @@
from ZODB.config import BaseConfig, ZODBDatabase
from storage import ConnectionSpecificStorage
from connection import StoragePerConnection

class S3StorageDatatype(BaseConfig):
    """Open a storage configured via ZConfig"""
    def open(self):
        return ConnectionSpecificStorage(
            name=self.name,
            aws_access_key_id=self.config.aws_access_key_id,
            aws_secret_access_key=self.config.aws_secret_access_key,
            bucket_name=self.config.bucket_name,
            memcache_servers=self.config.memcache_servers)

class S3ZODBDatabaseType(ZODBDatabase):
    """Open a database with storage per connection customizations"""
    def open(self):
        db = ZODBDatabase.open(self)
        assert hasattr(db, 'klass')
        # Use the per-connection storage Connection subclass for new connections.
        db.klass = StoragePerConnection
        return db
96 changes: 96 additions & 0 deletions connection.py
@@ -0,0 +1,96 @@
# Copyright (c) 2006 Shane Hathaway.
# Made available under the MIT license; see LICENSE.txt.

"""PGStorage-aware Connection.
Currently, ZODB connections provide MVCC, but PGStorage
provides MVCC in the storage. PGConnection overrides
certain behaviors in the connection so that the storage can
implement MVCC.
"""

from ZODB.Connection import Connection


class StoragePerConnection(Connection):
    """Connection enhanced with per-connection storage instances."""

    # The connection's cache is current as of self._last_tid.
    _last_tid = None

    # Connection-specific database wrapper
    _wrapped_db = None

    def _wrap_database(self):
        """Link a database to this connection"""
        if self._wrapped_db is None:
            my_storage = self._db._storage.get_instance()
            self._wrapped_db = DBWrapper(self._db, my_storage)
            self._normal_storage = self._storage = my_storage
            self.new_oid = my_storage.new_oid
        self._db = self._wrapped_db

    def _check_invalidations(self, committed_tid=None):
        storage = self._db._storage
        invalid, new_tid = storage.check_invalidations(
            self._last_tid, committed_tid)
        if invalid is None:
            self._resetCache()
        elif invalid:
            self._invalidated.update(invalid)
            self._flush_invalidations()
        self._last_tid = new_tid

    def _setDB(self, odb, *args, **kw):
        """Set up the storage and invalidate before _setDB."""
        self._db = odb
        self._wrap_database()
        self._check_invalidations()
        super(StoragePerConnection, self)._setDB(self._db, *args, **kw)

    def open(self, *args, **kw):
        """Set up the storage and invalidate before open."""
        self._wrap_database()
        self._check_invalidations()
        super(StoragePerConnection, self).open(*args, **kw)

    def tpc_finish(self, transaction):
        """Invalidate upon commit."""
        committed_tids = []
        self._storage.tpc_finish(transaction, committed_tids.append)
        self._check_invalidations(committed_tids[0])
        self._tpc_cleanup()

    def _storage_sync(self, *ignored):
        self._check_invalidations()


class DBWrapper(object):
    """Override the _storage attribute of a ZODB.DB.DB.
    The Connection object uses this wrapper instead of the
    real DB object. This wrapper allows each connection to
    have its own storage instance.
    """

    def __init__(self, db, storage):
        self._db = db
        self._storage = storage

    def __getattr__(self, name):
        return getattr(self._db, name)

    def _returnToPool(self, connection):
        # This override is for a newer ZODB.
        # Reset transaction state in the storage instance.
        self._storage.closing_connection()
        connection._db = self._db  # satisfy an assertion
        self._db._returnToPool(connection)

    def _closeConnection(self, connection):
        # This override is for an older ZODB.
        # Reset transaction state in the storage instance.
        self._storage.closing_connection()
        connection._db = self._db  # satisfy an assertion
        self._db._closeConnection(connection)
