Add ability to maintain quotas and file logging #9

Draft: wants to merge 2 commits into base: master
16 changes: 14 additions & 2 deletions README.rst
@@ -44,10 +44,22 @@ The configuration file must contain the following keys:
Allow cross-origin access to all endpoints unconditionally. This is needed
to allow web clients to use the upload feature.

Setting an upload quota
-----------------------

A quota will be enforced if one is received from the XMPP server as a ``q``
URL parameter in the PUT request.

The quota is enforced on the parent directory of the UUID directory containing
the file and metadata.

If this parent directory is shared by all users, the quota is global;
if it is specific to each user, the quota is enforced per user.
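As an illustration of this scoping rule, here is a minimal sketch; the paths are hypothetical examples, not the project's actual layout:

```python
import pathlib

# With a per-user layout <DATA_ROOT>/<user>/<uuid>/<file>, the parent of
# the UUID directory is the user's directory, so the quota is per user.
dest_path = pathlib.Path("/var/lib/xhu/alice/3f2a9c/photo.jpg")
quota_root = dest_path.parent.parent
print(quota_root)  # /var/lib/xhu/alice

# With a flat layout <DATA_ROOT>/<uuid>/<file>, the parent of the UUID
# directory is DATA_ROOT itself, so the quota becomes global.
flat_path = pathlib.Path("/var/lib/xhu/3f2a9c/photo.jpg")
print(flat_path.parent.parent)  # /var/lib/xhu
```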

Issues, Bugs, Limitations
=========================

* This service **does not handle any kind of quota**.
* The format in which the files are stored is **not** compatible with ``mod_http_upload`` -- so you'll lose all uploaded files when switching.
* This blindly trusts the client's ``Content-Type``. I don't think this is a major issue, because we also tell the browser to blindly trust the client's MIME type. This, in addition to forcing all but a whitelist of MIME types to be downloaded instead of shown inline, should provide safety against XSS attacks.
* I have no idea about web security. The headers I set may be subtly wrong and circumvent all security measures I intend this to have. Please double-check for yourself and report if you find anything amiss.
@@ -87,4 +99,4 @@ Enable systemd service::
Configure your webserver:

As a final step, you need to point your external webserver to your xmpp-http-upload Flask app.
Check the ``contrib`` directory, there is an example for nginx there.
60 changes: 55 additions & 5 deletions xhu.py
@@ -22,10 +22,14 @@
import contextlib
import errno
import fnmatch
import hashlib
import hmac
import json
import logging
import os
import pathlib
import shutil
import stat
import typing

import flask
@@ -34,6 +38,14 @@
app.config.from_envvar("XMPP_HTTP_UPLOAD_CONFIG")
application = app

if app.config.get('LOGDIR'):
    logging.basicConfig(
        filename='{}/xhu.log'.format(app.config['LOGDIR']),
        filemode='a',
        level=logging.INFO
    )
logger = logging.getLogger(__name__)
Owner (horazont): I prefer to log to standard output and have the service manager deal with it. Alternatively, support the systemd journal (optionally) directly. Configuring and managing log directories is a PITA, and always opens up the question of logrotate.

Contributor (jcbrand):

> I prefer to log to standard output and have the service manager deal with it.

What service manager are you talking about here? I use Circus to run the processes, but I don't see my logs ending up in its log, so I'm not sure what you mean.

Owner (horazont): Stuff like systemd. Even sending to syslog is better than having to work with logfiles.


if app.config['ENABLE_CORS']:
    from flask_cors import CORS
    CORS(app)
@@ -49,7 +61,6 @@ def sanitized_join(path: str, root: pathlib.Path) -> pathlib.Path:
def get_paths(base_path: pathlib.Path):
    data_file = pathlib.Path(str(base_path) + ".data")
    metadata_file = pathlib.Path(str(base_path) + ".meta")
    return data_file, metadata_file


@@ -65,12 +76,45 @@ def get_info(path: str, root: pathlib.Path) -> typing.Tuple[
        path,
        pathlib.Path(app.config["DATA_ROOT"]),
    )

    data_file, metadata_file = get_paths(dest_path)

    return data_file, load_metadata(metadata_file)


def apply_quota(root: pathlib.Path, quota: int):
    """Enforce the quota: collect the stored files sorted by last
    modification date, sum their sizes, and delete the oldest ones
    until the total fits within the quota.
    """
    if not quota:
        return

    file_list = []
    total_size = 0
    # We assume a file structure whereby files are stored inside uuid()
    # directories inside the root dir and that there aren't any files
    # in the root dir itself.
Owner (horazont): Do not make assumptions about the directory structure. While there is currently only an implementation of this in Prosody, I don't see why we should rely on only Prosody being used on the other side.

Contributor (jcbrand):

> I don't see why we should rely on only prosody being used on the other side

I'm using Prosody and I'm trying to get something working and running sooner rather than later. I understand that you'd rather prefer less coupled and more generic solutions.
    for uuid_dir in os.listdir(root):
        for path, dirnames, filenames in os.walk(root / uuid_dir):
            for name in [n for n in filenames if n.endswith('.data')]:
                fp = os.path.join(path, name)
                size = os.path.getsize(fp)
                total_size += size
                modified = os.stat(fp)[stat.ST_MTIME]
                file_list.append((modified, path, name, size))

Owner (horazont): From my reading, this sums the files of all users.

Also this is O(n) in the number of stored files, which I don't like at all. This must be condensed into an O(1) operation.

See the main review comment for details.

    excess = total_size - quota
    if excess > 0:
        # Remove files (oldest first) until we're under our quota
        file_list.sort(key=lambda a: a[0])
        while excess >= 0:
            modified, path, name, size = file_list.pop(0)
            logger.info(
                "Removing file {}/{} to maintain quota of {}"
                .format(path, name, quota)
            )
            shutil.rmtree(path)
            excess -= size
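A sketch of the O(1) bookkeeping the reviewer asks for: persist the running total in a small file inside the quota root and adjust it on each upload and deletion, instead of walking the whole tree on every PUT. The file name and helper names here are hypothetical, not part of the patch:

```python
import pathlib

USAGE_FILE = ".usage"  # hypothetical bookkeeping file in the quota root


def read_usage(root: pathlib.Path) -> int:
    """Return the recorded total size in bytes, or 0 if nothing is recorded."""
    try:
        return int((root / USAGE_FILE).read_text())
    except (FileNotFoundError, ValueError):
        return 0


def update_usage(root: pathlib.Path, delta: int) -> None:
    """Adjust the recorded total by delta bytes (positive on upload,
    negative on deletion). O(1) regardless of the number of stored files."""
    (root / USAGE_FILE).write_text(str(max(0, read_usage(root) + delta)))
```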


@contextlib.contextmanager
def write_file(at: pathlib.Path):
with at.open("xb") as f:
@@ -135,6 +179,11 @@ def put_file(path):
    )

    dest_path.parent.mkdir(parents=True, exist_ok=True, mode=0o770)

    quota = flask.request.args.get("q", "")
Owner (@horazont, Jan 13, 2019): This allows a client to provide an arbitrary quota, or leave out the q entirely, or set it to 0 to delete all files uploaded so far. The quota needs to be part of the HMAC instead.

Contributor (@jcbrand, Jan 13, 2019): Yes, very good point.

> set it to 0 to delete all files uploaded so far

A quota of 0 means that no files get deleted. But that's also a potential avenue of attack.
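The fix the owner asks for could be sketched as follows: fold the quota into the string the XMPP server signs, so a client cannot pick its own `q` value. The exact set of signed fields is an assumption here, not the project's current signature scheme:

```python
import hashlib
import hmac


def sign(secret: bytes, path: str, size: int, quota: int) -> str:
    """Sign path, size, and quota together so none can be tampered with."""
    msg = "{} {} {}".format(path, size, quota).encode("utf-8")
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def verify(secret: bytes, path: str, size: int, quota: int, v: str) -> bool:
    """Server side: recompute and compare in constant time before honouring q."""
    return hmac.compare_digest(sign(secret, path, size, quota), v)
```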

    if quota:
        apply_quota(dest_path.parent.parent, int(quota))
Owner (horazont): Don't make assumptions about the directory structure.

This should also take into account the size of the uploaded file. The amount of data used must still be below the quota even after the upload.

It also needs to handle the case where the file is larger than the quota allows for: in that case, the upload should be refused without any files being deleted.

Contributor (jcbrand):

> This should also take into account the size of the uploaded file. The amount of data used must still be below the quota even after the upload.
>
> It also needs to handle the case where the file is larger than the quota allows for: in that case, the upload should be refused without any files being deleted.

Depends on whether you're enforcing a hard quota or a soft quota. I'm OK with usage slightly going over the quota. Prosody already enforces file size via `http_upload_external_file_size_limit`.

Contributor (jcbrand):

> Don't make assumptions about the directory structure.

I don't yet see how I can do it otherwise. Putting storage usage in a file like you suggested provides some decoupling, but you still need to know what files to remove.
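The hard-quota behaviour the owner describes can be condensed into one helper; this is a hypothetical sketch, not code from the patch: refuse the PUT when the file alone can never fit, otherwise report how many bytes of old files must be evicted first.

```python
import typing


def bytes_to_free(usage: int, upload_size: int,
                  quota: int) -> typing.Optional[int]:
    """Return how many bytes must be evicted before accepting the upload,
    or None when the upload must be refused because it exceeds the quota
    by itself (in which case nothing should be deleted)."""
    if upload_size > quota:
        return None  # refuse without deleting anything
    return max(0, usage + upload_size - quota)
```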


data_file, metadata_file = get_paths(dest_path)

try:
@@ -183,7 +232,8 @@ def generate_headers(response_headers, metadata_headers):

    response_headers["X-Content-Type-Options"] = "nosniff"
    response_headers["X-Frame-Options"] = "DENY"
    response_headers["Content-Security-Policy"] = \
        "default-src 'none'; frame-ancestors 'none'; sandbox"


@app.route("/<path:path>", methods=["HEAD"])