S3Pool -- an S3 cache on local disk
Copyright (c) 2019 CK Tan
[email protected]
S3Pool can be used for free under the GNU General Public License
version 3, where anything released to the public must be open source,
or under a commercial license. The commercial license does not
cover derived or ported versions created by third parties under
GPL. To inquire about a commercial license, please send email to
[email protected].
s3pool -p port -D homedir
The executable will chdir into the homedir given on the command line. The homedir shall have the following subdirectories:
- log : where log files reside
- tmp : where temp files reside
- data : subdirs in data/ are BUCKETDIRs, which contain files in their respective buckets
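The layout above can be prepared ahead of time. The following is a minimal sketch, assuming the directories need to exist before s3pool starts (the text does not say whether s3pool creates them itself). The helper name `make_homedir` and the bucket name are hypothetical.

```python
import os
import tempfile


def make_homedir(homedir, buckets=()):
    """Create log/, tmp/ and data/ under homedir, plus one BUCKETDIR per bucket.

    Hypothetical helper: s3pool may well create some of these on its own.
    """
    for sub in ("log", "tmp", "data"):
        os.makedirs(os.path.join(homedir, sub), exist_ok=True)
    for bucket in buckets:
        # Each subdirectory of data/ is a BUCKETDIR named after its bucket.
        os.makedirs(os.path.join(homedir, "data", bucket), exist_ok=True)
    return homedir


if __name__ == "__main__":
    home = make_homedir(tempfile.mkdtemp(), buckets=["my-bucket"])
    print(sorted(os.listdir(home)))
```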
A special __list__ file is maintained in each BUCKETDIR, and is
updated when the REFRESH command is given. It contains a full listing
of all keys in the bucket, delimited by a NEWLINE character. The
DQUOTE char and NEWLINE char are not expected to be part of key names.
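As an illustration of the __list__ format, here is a small sketch that writes and reads a newline-delimited key listing. The function names are hypothetical, and whether the real file ends with a trailing NEWLINE is an assumption.

```python
def format_list(keys):
    """Render keys as __list__ content: one key per line.

    Rejects keys containing DQUOTE or NEWLINE, which the format does not expect.
    Assumption: every key, including the last, is followed by a newline.
    """
    for key in keys:
        if '"' in key or "\n" in key:
            raise ValueError("unsupported character in key: %r" % key)
    return "".join(key + "\n" for key in keys)


def parse_list(text):
    """Split __list__ content back into keys, ignoring empty lines."""
    return [line for line in text.split("\n") if line]


if __name__ == "__main__":
    content = format_list(["a/1.csv", "a/2.csv"])
    print(parse_list(content))
```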
S3 objects of a bucket are stored in its BUCKETDIR, using the slash characters in the key name as path separators. Two consecutive slashes in a key name are not handled.
For each object file FNAME, there is a corresponding meta file named FNAME__meta__. This stores the meta information that aws s3api get-object returned along with the file. Here is a sample:
{
    "LastModified": "Mon, 14 Oct 2019 19:51:18 GMT",
    "ETag": "\"83839df1582f29ada551f698b39fc3ac\"",
    "ContentLength": 555,
    "ContentType": "text/plain",
    "Metadata": {},
    "AcceptRanges": "bytes"
}
The ETag entry is used to determine if a file has been modified and needs to be downloaded from S3.
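To make the on-disk naming concrete, the sketch below maps a bucket and key to the object file and its meta file, and reads the ETag out of a meta file. It assumes the meta file name is the object path with `__meta__` appended, as described above. The helper names are hypothetical, and rejecting keys with two consecutive slashes is a choice made here, not documented s3pool behavior.

```python
import json
import os


def object_paths(homedir, bucket, key):
    """Return (object_path, meta_path) for a key cached under homedir.

    Slashes in the key become directory separators inside the BUCKETDIR.
    """
    if "//" in key:
        # Consecutive slashes are not handled, so refuse them here.
        raise ValueError("consecutive slashes in key: %r" % key)
    fname = os.path.join(homedir, "data", bucket, *key.split("/"))
    return fname, fname + "__meta__"


def read_etag(meta_path):
    """Load a meta file and return its ETag without the embedded quotes."""
    with open(meta_path) as f:
        meta = json.load(f)
    return meta["ETag"].strip('"')
```

Comparing the value from `read_etag` with the ETag currently reported by S3 is enough to decide whether the cached copy is stale.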
Requests are submitted as single-line JSON arrays. A reply consists of a status line followed by a payload. The status line is either OK or ERROR. In the case of OK, the payload is defined by the command submitted. In the case of ERROR, the payload is a pertinent error message.
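A client for this protocol might look like the following sketch. Several details are assumptions not stated in the text: that the server speaks plain TCP on the port given with -p, that a request is terminated by a newline, and that the server closes the connection after sending the reply. All function names are hypothetical.

```python
import json
import socket


def encode_request(*args):
    """Serialize a command as a single-line JSON array.

    Assumption: the request is terminated by a newline.
    """
    return (json.dumps(list(args)) + "\n").encode("utf-8")


def parse_reply(raw):
    """Split a reply into status line and payload.

    Returns the payload on OK and raises RuntimeError on ERROR.
    """
    status, _, payload = raw.decode("utf-8").partition("\n")
    status = status.strip()
    if status == "OK":
        return payload
    if status == "ERROR":
        raise RuntimeError(payload.strip())
    raise ValueError("unexpected status line: %r" % status)


def send_request(host, port, *args):
    """Send one request and read the whole reply.

    Assumption: the server closes the connection once the reply is sent.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(encode_request(*args))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_reply(b"".join(chunks))
```

With a server running locally, a call such as `send_request("localhost", port, "REFRESH", "bucket")` would submit the REFRESH command under these assumptions.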
Refresh the __list__ file of a bucket. This file will rot over time,
and the user should schedule a cron job to call REFRESH from time to
time. If there are multiple s3pools serving the same bucket, the
REFRESH needs to be scheduled on only one of the s3pool servers.
Syntax: ["REFRESH", "bucket"]
Returns a list of keys matching a glob pattern.
Syntax: ["GLOB", "bucket", "pattern"]
This will result in a call to REFRESH if the list file does not exist.
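Conceptually, GLOB filters the keys recorded in the list file. The sketch below illustrates the idea with Python's `fnmatch` module. The exact pattern semantics of s3pool are not specified here, so treating `*` as matching across slashes is an assumption of this example, and `glob_keys` is a hypothetical name.

```python
import fnmatch


def glob_keys(list_text, pattern):
    """Return the keys from __list__ content that match a glob pattern.

    Uses fnmatch semantics, where '*' also matches '/'. The real GLOB
    command may interpret patterns differently.
    """
    keys = [line for line in list_text.split("\n") if line]
    return [key for key in keys if fnmatch.fnmatchcase(key, pattern)]


if __name__ == "__main__":
    listing = "a/1.csv\na/2.txt\nb/3.csv\n"
    print(glob_keys(listing, "a/*"))
```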
If the file is cached AND is unchanged on S3, return it. Otherwise, pull the file from S3.
Syntax: ["PULL", "bucket-name", "key-name", "key-name-2"]
The key-name-2 above is optional. If supplied, pull the key-name
object and return it while firing off a background job to prefetch
key-name-2.
The reply is an absolute path in the local filesystem that the user
can use to access the S3 object corresponding to key-name.
Note: the check against S3 is skipped if the file was cached recently (within the last 2 minutes). In that case the cached copy is returned as-is.
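The PULL decision can be summarized in a few lines. This is a sketch of the logic as described, not the actual implementation: `fetch_remote_etag` is a hypothetical callable standing in for a lookup against S3, and using a "cached at" timestamp to measure recency is an assumption.

```python
import time

RECENT_SECONDS = 120  # the 2-minute window mentioned above


def needs_download(cached_etag, cached_at, fetch_remote_etag, now=None):
    """Decide whether PULL has to fetch the object from S3.

    cached_etag: ETag from the meta file, or None if the file is not cached.
    cached_at: time (seconds since the epoch) the file was last cached.
    fetch_remote_etag: hypothetical callable returning the current S3 ETag.
    """
    now = time.time() if now is None else now
    if cached_etag is None:
        return True  # not cached at all
    if now - cached_at < RECENT_SECONDS:
        return False  # cached recently, skip the check against S3
    return fetch_remote_etag() != cached_etag
```

Passing the S3 lookup as a callable means the network round trip happens only when the cached copy is older than the window.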
Push a file to S3.
Syntax: ["PUSH", "bucket", "key", "absolute-path-to-file"]
A watchdog keeps the disk utilization under 90%. Whenever this high
water mark is reached, the watchdog starts to delete files cached in
the data/ directory using some form of LRU algorithm based on access
time.
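One way such a watchdog could work is sketched below. It is only an illustration under stated assumptions: the real eviction policy is described merely as "some form of LRU", so the strict oldest-access-time ordering, the choice to leave `__list__` files alone, and all function names are assumptions of this example.

```python
import os
import shutil

HIGH_WATER = 0.90  # the 90% high water mark


def disk_utilization(path):
    """Fraction of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def eviction_order(data_dir):
    """List cached files under data_dir, least recently accessed first."""
    entries = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name == "__list__":
                continue  # assumption: bucket listings are never evicted
            path = os.path.join(root, name)
            entries.append((os.stat(path).st_atime, path))
    return [path for _atime, path in sorted(entries)]


def evict(data_dir, utilization=disk_utilization):
    """Delete files in LRU order until utilization drops below HIGH_WATER."""
    removed = []
    for path in eviction_order(data_dir):
        if utilization(data_dir) < HIGH_WATER:
            break
        os.remove(path)
        removed.append(path)
    return removed
```

The `utilization` parameter exists so the eviction loop can be exercised with a stub instead of a nearly full disk. Note that this sketch treats meta files like any other file rather than evicting them together with their objects.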