Commit 4bc51f7: Initial import

git-svn-id: http://s3storage.googlecode.com/svn/trunk@2 50179d6d-2b1f-0410-9cf2-fbaf5ad40ad0

lrowe committed Oct 20, 2006
Showing 9 changed files with 1,022 additions and 0 deletions.
200 changes: 200 additions & 0 deletions README.txt
@@ -0,0 +1,200 @@
s3storage - zope in the cloud

s3storage is an interesting experiment. Amazon's S3 storage service provides an
eventual consistency guarantee, but propagation delays mean that one must be
careful when trying to cram transactional semantics on top of the model. That
said, it has the potential (along with the elastic compute cloud) to provide
a massively scalable object system.

Benefits of S3:

* Simple model, it's just a massive hash table
* Cheap. $0.15 per GB per month
* Scalable
* Reliable (eventually)

Drawbacks:

* Indeterminate propagation time (the time between when a write occurs and
  a read on the resource can be relied upon)
* No built-in transaction support
* No renaming (so we can't reuse Directory Storage)

So that leaves EC2 (the elastic compute cloud) to fill in the gaps.

Benefits of EC2:

* Powerful virtual machines on demand
* Cheap. $0.10 per vm per hour
* It's Linux (pretty much any distribution you want)

Drawbacks:

* Individual VMs are not reliable; this runs on commodity hardware, so expect
  them to fail occasionally

Bandwidth for both services is charged at $0.20 per GB, but only outside the
AWS system (so it's free between S3 and EC2).

So the challenge is to build a scalable zope system that leverages S3's
scalability while making up for its shortcomings with EC2.

Vision
======

Build an S3 storage backend that can scale to many readers (lots of zope
clients) but delegate locking for writes to a known system such as zeo or
postgres. Realtime systems are not the target here; think content and asset
management.

Model
-----

S3 provides no locking, so locking must be provided in EC2.

After a commit, an object will spend an indeterminate time before it becomes
reliably readable. Until that time has passed the view is not reliable. For
the sake of argument let's say Tp = 5 minutes. This means that the last five
minutes of object state must be managed through a transactional system (think
zeo or postgres). There's no use trying to solve the read-only case if the
write case is the bottleneck...

This gives us three pools of object revisions:

* Uncommitted pool.

* Committed, but not to be relied upon. The centrally accounted pool.

* Propagated and relied upon. The distributed pool.

And two distribution methods:

* Scalable but limited S3

* The central transactional system (zeo or postgres), which holds roughly the
  last Tp of writes
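
A minimal sketch of how a reader might choose between the two distribution
methods, assuming the commit time of the wanted revision is known (the helper
name and the seconds-based Tp are made up for illustration, not part of this
code):

    import time

    Tp = 5 * 60  # assumed propagation window in seconds, per the text above

    def choose_source(commit_time, now=None):
        """Pick which pool a committed revision should be read from."""
        if now is None:
            now = time.time()
        if now - commit_time < Tp:
            return 'central'  # too recent: only the transactional system is reliable
        return 's3'           # old enough: the propagated S3 copy can be trusted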

Performance considerations
--------------------------

Read performance encourages us to opt for the distributed path. But this incurs
a time gap that is potentially disastrous for write performance.
Potentially we get a conflict error if any object is updated within Tp.

TODO: Understand commit retries. Could selected transactions be upgraded to a
more current view? Must avoid this central hit on every read.


Possible approaches to scalability
----------------------------------

* Data partitioning. i.e. spread load between several locking servers. e.g.
folder A on one server, folder B on another. Complicated to manage. Though
it would be necessary for very large systems.

* Something like memcached? Potentially simpler to manage.

* Or Amazon SQS for locking? Use a queue as a lock, and make oids a client id
  prefix (16 bits) followed by a 48-bit per-client increasing counter (see the
  sketch after this list)
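
A minimal sketch of that oid layout (the helper names are hypothetical, not
part of this code):

    import struct

    def make_oid(client_id, counter):
        """Pack a 16-bit client id and a 48-bit per-client counter into an
        8-byte oid (big-endian, so one client's oids sort together)."""
        assert 0 <= client_id < 2**16 and 0 <= counter < 2**48
        return struct.pack('>HHI', client_id, counter >> 32,
                           counter & 0xffffffff)

    def split_oid(oid):
        """Recover (client_id, counter) from an oid built by make_oid."""
        client_id, hi, lo = struct.unpack('>HHI', oid)
        return client_id, (hi << 32) | lo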

Reliability considerations
--------------------------

* EC2 is not reliable. We must consider the possibility of transaction manager
  node failure. This should not be disastrous. As S3 gives a guarantee of
  eventual consistency, potentially no more than ~Tp of down time would be
  necessary on node failure, and then only for writes. The storage scheme must
  prepare for this.

  NOTE: this assumes that a successful write is guaranteed to eventually
  propagate. I _think_ S3 gives this guarantee.

* But overhead of disaster preparedness must not be so onerous as to overly
limit the common case.


Filesystem Structure
====================

* S3 is a giant hash table, not a hierarchical filesystem

* S3 provides an efficient list operation. Listing a bucket can be filtered by
prefix, limited by max-results and also have results rolled up with a
delimiter (such as '/'). A marker can be provided to list only results that
are alphabetically after a particular key. List results are always returned
in alphabetical order.

* S3 is atomic only for single file puts.

* Only existence is certain

* Existing filestorage strategies do not map well

* There are three places to store information: the key, the file contents and
  the metadata. Only the key is returned on a list (actually some S3 system
  metadata is returned there too). Keys must be UTF-8 and up to 1024 bytes.
  If a key contains data you don't already know, then two requests are
  required to read the contents (see the sketch after this list).
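
A minimal sketch of the list operation, using the boto library purely for
illustration (this code may use a different S3 client; the bucket name,
credentials and prefix below are made up):

    from boto.s3.connection import S3Connection

    conn = S3Connection('<aws_access_key_id>', '<aws_secret_access_key>')
    bucket = conn.get_bucket('mybucket')

    # Keys sharing a prefix come back in alphabetical order, starting after
    # the marker and capped by max_keys.
    for key in bucket.get_all_keys(prefix='type:transaction,', marker='',
                                   max_keys=100):
        print key.name                        # only the key (plus system metadata)
        data = key.get_contents_as_string()   # second request for the contents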

Transaction Log
---------------

* Only single-file operations are atomic. They either succeed or fail. The
  latest timestamp wins for operations on the same resource.

On a transaction commit, a transaction log entry is written with a key such as

type:transaction,serial:<serial repr>,prev:<prev serial repr>,tid:<tid repr>,ltid:<prev tid repr>

It contains a list of modified oids

Precondition:
all pickle writes must have successfully completed before this file is written

The prev: information is there so that readers can be certain they are seeing a
complete list (only existence is certain).
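
For illustration, a hypothetical helper that builds and parses such keys
(assuming the serial and tid reprs contain no commas or colons):

    def transaction_key(serial, prev_serial, tid, ltid):
        """Build the S3 key for a transaction log entry."""
        return 'type:transaction,serial:%s,prev:%s,tid:%s,ltid:%s' % (
            serial, prev_serial, tid, ltid)

    def parse_key(key):
        """Split an s3storage key back into a dict of its fields."""
        return dict(part.split(':', 1) for part in key.split(','))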

Efficiency of reading the transaction log
-----------------------------------------

We have a design decision here. We can choose to read efficiently from the end
of the key list or from a certain point in the list, depending on the
representation format used for tids, i.e. whether tid=1 ends up alphabetically
lower or higher than tid=2.

(Note we could use another index: field but it would add complication)
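
One possible encoding (a sketch only; not necessarily what this code uses) is
to subtract the tid from its maximum value and format the result as fixed-width
hex, so that newer tids sort alphabetically first. This matches the
reverse-alphabetical ordering required in the Indexing section below:

    MAX_TID = 2**64 - 1

    def encode_tid(tid):
        """Encode a 64-bit tid so that later tids sort earlier."""
        return '%016x' % (MAX_TID - tid)

    # encode_tid(2) < encode_tid(1), so a plain alphabetical listing
    # returns the newest transactions first.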

Records
-------

* A record is the state of an object at a particular transaction

On a store a record is written with a key

type:record,oid:<oid repr>,serial:<serial repr>

Note that records can exist for data in aborted or in-progress transactions.

Indexing
--------

We want to avoid having to read through the entire transaction log in order
to find

* The record for an oid, either currently or as of a particular transaction
  (load, loadBefore)

This should be implemented as a list operation. So we write a 1-byte file with the key

type:index_commit,oid:<oid repr>,serial:<serial repr>,prev:<prev serial repr>,tid:<tid repr>,ltid:<ltid repr>

and the serial for a tid

type:index_tid,tid:<tid repr>,serial:<serial repr>,prev:<prev serial repr>,ltid:<ltid repr>

So this means that the serial and tid sequences must sort in reverse
alphabetical order.

These are starting to look very much like RDF data structures.
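
As a sketch of how load() could then become a single list request (again using
a boto-style bucket purely for illustration, and assuming the reverse-sorted
serial encoding sketched earlier):

    def load_current(bucket, oid_repr):
        """Find the current record for an oid with one list operation.

        Because serial reprs sort newest-first, the first key under this
        prefix names the most recently committed record for the oid.
        """
        prefix = 'type:index_commit,oid:%s,' % oid_repr
        keys = bucket.get_all_keys(prefix=prefix, max_keys=1)
        if not keys:
            raise KeyError(oid_repr)
        fields = dict(part.split(':', 1) for part in keys[0].name.split(','))
        return fields['serial'], fields['tid']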

References
==========

* http://www.zope.org/Wikis/DevSite/Proposals/DirectoryTreeStorage

2 changes: 2 additions & 0 deletions __init__.py
@@ -0,0 +1,2 @@


22 changes: 22 additions & 0 deletions config.py
@@ -0,0 +1,22 @@
from ZODB.config import BaseConfig, ZODBDatabase
from storage import ConnectionSpecificStorage
from connection import StoragePerConnection

class S3StorageDatatype(BaseConfig):
    """Open a storage configured via ZConfig"""
    def open(self):
        return ConnectionSpecificStorage(
            name=self.name,
            aws_access_key_id=self.config.aws_access_key_id,
            aws_secret_access_key=self.config.aws_secret_access_key,
            bucket_name=self.config.bucket_name,
            memcache_servers=self.config.memcache_servers)

class S3ZODBDatabaseType(ZODBDatabase):
    """Open a database with storage per connection customizations"""
    def open(self):
        db = ZODBDatabase.open(self)
        assert hasattr(db, 'klass')
        # Use the per-connection storage Connection subclass for new connections.
        db.klass = StoragePerConnection
        return db
96 changes: 96 additions & 0 deletions connection.py
@@ -0,0 +1,96 @@
# Copyright (c) 2006 Shane Hathaway.
# Made available under the MIT license; see LICENSE.txt.

"""PGStorage-aware Connection.
Currently, ZODB connections provide MVCC, but PGStorage
provides MVCC in the storage. PGConnection overrides
certain behaviors in the connection so that the storage can
implement MVCC.
"""

from ZODB.Connection import Connection


class StoragePerConnection(Connection):
    """Connection enhanced with per-connection storage instances."""

    # The connection's cache is current as of self._last_tid.
    _last_tid = None

    # Connection-specific database wrapper
    _wrapped_db = None

    def _wrap_database(self):
        """Link a database to this connection"""
        if self._wrapped_db is None:
            my_storage = self._db._storage.get_instance()
            self._wrapped_db = DBWrapper(self._db, my_storage)
            self._normal_storage = self._storage = my_storage
            self.new_oid = my_storage.new_oid
        self._db = self._wrapped_db

    def _check_invalidations(self, committed_tid=None):
        storage = self._db._storage
        invalid, new_tid = storage.check_invalidations(
            self._last_tid, committed_tid)
        if invalid is None:
            self._resetCache()
        elif invalid:
            self._invalidated.update(invalid)
            self._flush_invalidations()
        self._last_tid = new_tid

    def _setDB(self, odb, *args, **kw):
        """Set up the storage and invalidate before _setDB."""
        self._db = odb
        self._wrap_database()
        self._check_invalidations()
        super(StoragePerConnection, self)._setDB(self._db, *args, **kw)

    def open(self, *args, **kw):
        """Set up the storage and invalidate before open."""
        self._wrap_database()
        self._check_invalidations()
        super(StoragePerConnection, self).open(*args, **kw)

    def tpc_finish(self, transaction):
        """Invalidate upon commit."""
        committed_tids = []
        self._storage.tpc_finish(transaction, committed_tids.append)
        self._check_invalidations(committed_tids[0])
        self._tpc_cleanup()

    def _storage_sync(self, *ignored):
        self._check_invalidations()


class DBWrapper(object):
    """Override the _storage attribute of a ZODB.DB.DB.
    The Connection object uses this wrapper instead of the
    real DB object. This wrapper allows each connection to
    have its own storage instance.
    """

    def __init__(self, db, storage):
        self._db = db
        self._storage = storage

    def __getattr__(self, name):
        return getattr(self._db, name)

    def _returnToPool(self, connection):
        # This override is for a newer ZODB.
        # Reset transaction state in the storage instance.
        self._storage.closing_connection()
        connection._db = self._db  # satisfy an assertion
        self._db._returnToPool(connection)

    def _closeConnection(self, connection):
        # This override is for an older ZODB.
        # Reset transaction state in the storage instance.
        self._storage.closing_connection()
        connection._db = self._db  # satisfy an assertion
        self._db._closeConnection(connection)
