commit 4bc51f7
git-svn-id: http://s3storage.googlecode.com/svn/trunk@2 50179d6d-2b1f-0410-9cf2-fbaf5ad40ad0
9 changed files with 1,022 additions and 0 deletions.
s3storage - zope in the cloud

s3storage is an interesting experiment. Amazon's S3 storage service provides an
eventual consistency guarantee, but propagation delays mean that one must be
careful when trying to cram transactional semantics on top of the model. That
said, it has the potential (along with the Elastic Compute Cloud) to provide
a massively scalable object system.

Benefits of S3:

* Simple model: it's just a massive hash table
* Cheap: $0.15 per GB per month
* Scalable
* Reliable (eventually)

Drawbacks:

* Indeterminate propagation time (the time between when a write occurs and
  when a read of the resource can be relied upon)
* No built-in transaction support
* No renaming (so we can't reuse Directory Storage)

So that leaves EC2 (the Elastic Compute Cloud) to fill in the gaps.

Benefits of EC2:

* Powerful virtual machines on demand
* Cheap: $0.10 per VM per hour
* It's Linux (pretty much any distribution you want)

Drawbacks:

* Individual VMs are not reliable; this runs on commodity hardware, so expect
  them to fail occasionally

Bandwidth for both services is charged at $0.20 per GB, but only outside the
AWS system (so it's free between S3 and EC2).

So the challenge is to build a scalable zope system that leverages S3's
scalability while making up for its shortcomings with EC2.

Vision
======

Build an S3 storage backend that can scale to many readers (lots of zope
clients) but delegates locking for writes to a known system such as zeo or
postgres. Realtime systems are not the target here; think content and asset
management.

Model
-----

S3 provides no locking, so this must be provided for in EC2.

After a commit, an object will spend an indeterminate time before it becomes
reliably readable. Until that time has passed, the view is not reliable. For
the sake of argument, let's say Tp = 5 minutes. This means that the last five
minutes of object state must be managed through a transactional system (think
zeo or postgres). There's no use trying to solve the read-only case if the
write case is the bottleneck.

This gives us three pools of object revisions:

* The uncommitted pool.

* Committed, but not yet to be relied upon: the centrally accounted pool.

* Propagated and relied upon: the distributed pool.

And two distribution methods:

* Scalable but limited S3

* The central locking service (zeo or postgres)

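The three pools can be sketched as a simple classification by commit age. This is an illustrative sketch only: the function name, pool labels, and the hard-coded Tp = 5 minutes are assumptions of this example, not part of the original design.

```python
import time

# Propagation window: commits younger than this are not yet reliably
# readable from S3 (Tp = 5 minutes, per the model above).
T_P = 300.0

def classify_revision(commit_time, committed, now=None):
    """Place an object revision into one of the three pools.

    commit_time -- wall-clock time the revision was committed
                   (None if the transaction is still in progress)
    committed   -- whether the transaction has committed at all
    """
    if now is None:
        now = time.time()
    if not committed:
        return "uncommitted"   # still in the writer's hands
    if now - commit_time < T_P:
        return "central"       # committed, but served by zeo/postgres
    return "distributed"       # propagated, safe to read from S3
```

Reads within Tp of a commit hit the central service; everything older goes to S3.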
Performance considerations
--------------------------

Read performance encourages us to opt for the distributed path. But this incurs
a time gap that is potentially disastrous for write performance: potentially we
get a conflict error if any object is updated within Tp.

TODO: Understand commit retries. Could selected transactions be upgraded to a
more current view? We must avoid this central hit on every read.

Possible approaches to scalability
----------------------------------

* Data partitioning, i.e. spread load between several locking servers: folder A
  on one server, folder B on another. Complicated to manage, though it would be
  necessary for very large systems.

* Something like memcached? Potentially simpler to manage.

* Or Amazon SQS for locking? Use a queue as a lock, and make oids a client id
  prefix (16 bits) followed by 48 bits of a client-local increasing counter.

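The oid scheme in the last bullet can be sketched as 64-bit integer packing. The function names are hypothetical; only the 16/48-bit split comes from the proposal above.

```python
def make_oid(client_id, counter):
    """Pack a 64-bit oid: a 16-bit client id in the high bits,
    followed by a 48-bit client-local increasing counter."""
    assert 0 <= client_id < (1 << 16)
    assert 0 <= counter < (1 << 48)
    return (client_id << 48) | counter

def split_oid(oid):
    """Recover (client_id, counter) from a packed oid."""
    return oid >> 48, oid & ((1 << 48) - 1)
```

Because each client only ever increments its own counter, clients can allocate oids without coordinating through a central lock.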
Reliability considerations
--------------------------

* EC2 is not reliable. We must consider the possibility of transaction manager
  node failure. This should not be disastrous: as S3 has a guarantee of
  eventual consistency, potentially no more than ~Tp of down time would be
  necessary on node failure, and then only for writes. The storage scheme must
  prepare for this.

  NOTE: this assumes a guarantee of a successful write eventually succeeding.
  I _think_ S3 gives this guarantee.

* But the overhead of disaster preparedness must not be so onerous as to
  overly limit the common case.

Filesystem Structure
====================

* S3 is a giant hash table, not a hierarchical filesystem.

* S3 provides an efficient list operation. Listing a bucket can be filtered by
  prefix, limited by max-results, and can also have results rolled up with a
  delimiter (such as '/'). A marker can be provided to list only results that
  are alphabetically after a particular key. List results are always returned
  in alphabetical order.

* S3 is atomic only for single-file puts.

* Only existence is certain.

* Existing filestorage strategies do not map well.

* There are three places to store information: the key, the file contents and
  the metadata. Only the key is returned on a list (actually some S3 system
  metadata is returned here too). Keys must be UTF-8 and up to 1024 bytes.
  If keys contain data you don't already know, then two requests are required
  to read the contents.

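The list semantics described above can be modelled in pure Python over an in-memory key set. This is a behavioural sketch of prefix/marker/delimiter listing, not the real S3 API (real listings also page results and return system metadata).

```python
def s3_list(keys, prefix="", marker="", delimiter=None, max_results=1000):
    """Simulate an S3 bucket listing: filter by prefix, start strictly
    after marker, roll up keys sharing a delimiter-terminated prefix,
    and cap the number of results. Keys come back in sorted order."""
    results = []
    seen_prefixes = set()
    for key in sorted(keys):
        if key <= marker or not key.startswith(prefix):
            continue
        if delimiter is not None:
            rest = key[len(prefix):]
            if delimiter in rest:
                # Roll the key up into its common prefix, once.
                common = prefix + rest.split(delimiter, 1)[0] + delimiter
                if common not in seen_prefixes:
                    seen_prefixes.add(common)
                    results.append(common)
                continue
        results.append(key)
        if len(results) >= max_results:
            break
    return results
```

For example, listing `["a/1", "a/2", "b"]` with `delimiter="/"` rolls the `a/` keys up into a single common prefix, which is how a flat hash table is made to look hierarchical.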
Transaction Log
---------------

* Only single-file operations are atomic: they either succeed or fail. The
  latest timestamp wins for operations on the same resource.

On a transaction commit, a transaction log entry is written such as

    type:transaction,serial:<serial repr>,prev:<prev serial repr>,tid:<tid repr>,ltid:<prev tid repr>

It contains a list of modified oids.

Precondition: all pickle writes must have successfully completed before this
file is written.

The prev: information is there so that readers can be certain they are seeing
a complete list (only existence is certain).

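A minimal sketch of building and checking these keys, assuming serial reprs are plain strings containing no `,` or `:` separators (the helper names are illustrative, not from the codebase):

```python
def txn_key(serial, prev, tid, ltid):
    """Build a transaction-log key in the format sketched above."""
    return "type:transaction,serial:%s,prev:%s,tid:%s,ltid:%s" % (
        serial, prev, tid, ltid)

def parse_key(key):
    """Parse a comma-separated key back into a field dict."""
    return dict(field.split(":", 1) for field in key.split(","))

def chain_is_complete(entries):
    """Check that each entry's prev: field matches the serial of the
    entry before it, so a reader knows no intermediate log entry is
    still propagating (only existence is certain)."""
    for older, newer in zip(entries, entries[1:]):
        if parse_key(newer)["prev"] != parse_key(older)["serial"]:
            return False
    return True
```

A reader that finds a gap in the prev chain knows an earlier log entry has not yet propagated and must fall back to the central service.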
Efficiency of reading the transaction log
-----------------------------------------

We have a design decision here. We can choose to efficiently read from the end
of the file list or from a certain point in the list, depending on the
representation format used for tids, i.e. whether tid=1 ends up alphabetically
lower or higher than tid=2.

(Note we could use another index: field, but it would add complication.)

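One way to make newer tids sort alphabetically first (so a single prefix listing returns the most recent entry at the top) is to encode the complement of the tid in fixed-width hex. This is a sketch of the technique, not necessarily the repr the storage uses:

```python
MAX_TID = (1 << 64) - 1

def tid_repr(tid):
    """Encode a 64-bit tid so that newer tids sort alphabetically
    earlier: fixed-width hex of the complement."""
    return "%016x" % (MAX_TID - tid)

def tid_from_repr(r):
    """Invert the encoding back to the integer tid."""
    return MAX_TID - int(r, 16)
```

With this repr, S3's always-alphabetical list order doubles as newest-first order.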
Records
-------

* A record is the state of an object at a particular transaction.

On a store, a record is written with a key

    type:record,oid:<oid repr>,serial:<serial repr>

Note that records can exist for data in aborted or in-progress transactions.

Indexing
--------

We want to avoid having to read through the entire transaction log in order to
find:

* The record for an oid currently or at a particular transaction (loadBefore,
  load).

This should be implemented as a list operation, so we write a 1-byte file

    type:index_commit,oid:<oid repr>,serial:<serial repr>,prev:<prev serial repr>,tid:<tid repr>,ltid:<ltid repr>

and, for the serial of a tid,

    type:index_tid,tid:<tid repr>,serial:<serial repr>,prev:<prev serial repr>,ltid:<ltid repr>

So this means that the sequence must be reverse alphabetically sorted.

These are starting to look very like RDF data structures.

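The load path can then be a single prefix listing. The sketch below uses a trimmed version of the index_commit key (oid and serial only) and assumes serials, like tids, are complement-encoded so newer entries sort first; all names here are illustrative.

```python
MAX_SERIAL = (1 << 64) - 1

def serial_repr(serial):
    # Complement-encode so newer serials sort alphabetically earlier.
    return "%016x" % (MAX_SERIAL - serial)

def index_commit_key(oid, serial):
    """Index marker keyed by oid, then reverse-sorted serial
    (a trimmed form of the index_commit key format above)."""
    return "type:index_commit,oid:%s,serial:%s" % (oid, serial_repr(serial))

def load_current(keys, oid):
    """Answer load(oid) with one prefix listing: the alphabetically
    first match under the oid prefix is the newest committed record."""
    prefix = "type:index_commit,oid:%s,serial:" % oid
    matches = sorted(k for k in keys if k.startswith(prefix))
    return matches[0] if matches else None
```

loadBefore works the same way, with the marker parameter positioned at the repr of the cutoff serial.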
References
==========

* http://www.zope.org/Wikis/DevSite/Proposals/DirectoryTreeStorage
from ZODB.config import BaseConfig, ZODBDatabase
from storage import ConnectionSpecificStorage
from connection import StoragePerConnection


class S3StorageDatatype(BaseConfig):
    """Open a storage configured via ZConfig"""
    def open(self):
        return ConnectionSpecificStorage(
            name=self.name,
            aws_access_key_id=self.config.aws_access_key_id,
            aws_secret_access_key=self.config.aws_secret_access_key,
            bucket_name=self.config.bucket_name,
            memcache_servers=self.config.memcache_servers)


class S3ZODBDatabaseType(ZODBDatabase):
    """Open a database with storage-per-connection customizations"""
    def open(self):
        db = ZODBDatabase.open(self)
        assert hasattr(db, 'klass')
        db.klass = StoragePerConnection
        return db
# Copyright (c) 2006 Shane Hathaway.
# Made available under the MIT license; see LICENSE.txt.

"""PGStorage-aware Connection.

Currently, ZODB connections provide MVCC, but PGStorage
provides MVCC in the storage. PGConnection overrides
certain behaviors in the connection so that the storage can
implement MVCC.
"""

from ZODB.Connection import Connection


class StoragePerConnection(Connection):
    """Connection enhanced with per-connection storage instances."""

    # The connection's cache is current as of self._last_tid.
    _last_tid = None

    # Connection-specific database wrapper
    _wrapped_db = None

    def _wrap_database(self):
        """Link a database to this connection"""
        if self._wrapped_db is None:
            my_storage = self._db._storage.get_instance()
            self._wrapped_db = DBWrapper(self._db, my_storage)
            self._normal_storage = self._storage = my_storage
            self.new_oid = my_storage.new_oid
            self._db = self._wrapped_db

    def _check_invalidations(self, committed_tid=None):
        storage = self._db._storage
        invalid, new_tid = storage.check_invalidations(
            self._last_tid, committed_tid)
        if invalid is None:
            self._resetCache()
        elif invalid:
            self._invalidated.update(invalid)
            self._flush_invalidations()
        self._last_tid = new_tid

    def _setDB(self, odb, *args, **kw):
        """Set up the storage and invalidate before _setDB."""
        self._db = odb
        self._wrap_database()
        self._check_invalidations()
        super(StoragePerConnection, self)._setDB(self._db, *args, **kw)

    def open(self, *args, **kw):
        """Set up the storage and invalidate before open."""
        self._wrap_database()
        self._check_invalidations()
        super(StoragePerConnection, self).open(*args, **kw)

    def tpc_finish(self, transaction):
        """Invalidate upon commit."""
        committed_tids = []
        self._storage.tpc_finish(transaction, committed_tids.append)
        self._check_invalidations(committed_tids[0])
        self._tpc_cleanup()

    def _storage_sync(self, *ignored):
        self._check_invalidations()


class DBWrapper(object):
    """Override the _storage attribute of a ZODB.DB.DB.

    The Connection object uses this wrapper instead of the
    real DB object. This wrapper allows each connection to
    have its own storage instance.
    """

    def __init__(self, db, storage):
        self._db = db
        self._storage = storage

    def __getattr__(self, name):
        return getattr(self._db, name)

    def _returnToPool(self, connection):
        # This override is for a newer ZODB.
        # Reset transaction state in the storage instance.
        self._storage.closing_connection()
        connection._db = self._db  # satisfy an assertion
        self._db._returnToPool(connection)

    def _closeConnection(self, connection):
        # This override is for an older ZODB.
        # Reset transaction state in the storage instance.
        self._storage.closing_connection()
        connection._db = self._db  # satisfy an assertion
        self._db._closeConnection(connection)