Add ability to maintain quotas and file logging #9

Draft · wants to merge 2 commits into master
Conversation

@jcbrand (Contributor) commented on Jan 13, 2019

The README diff contains more info on the changes made.

I'm planning to update Prosody's mod_http_upload_external to prepend the hashed/salted JID to the PUT URLs, so that admins can identify the files uploaded for a particular JID.

This can help with GDPR compliance (e.g. removal of a particular user's data upon request) and also makes per-user quotas possible.

If the hashed/salted JID is not in the URL, then the quota is enforced globally across all users.

Since files get removed to enforce the quotas, I figured it would be good to log that to a file.

@horazont horazont self-requested a review January 13, 2019 12:50
@horazont horazont self-assigned this Jan 13, 2019
@horazont (Owner) left a comment

I’m not entirely sure what the aim is. Given the lack of code for user-specific things, I assume that this is about a global quota for now.

This is not ready to be merged yet, and it has quite a way to go before it can be:

  • The global quota should not be managed by the XMPP server, but via configuration.
  • The global quota is a delicate issue in and of itself: it allows a single user with lots of bandwidth to cause lots of churn on the entire store and get other users' files deleted.
  • The quota implementation, as it is, is not at all multitasking-safe (while the upload implementation is). This is a hard requirement: the duration of an upload is determined by the speed of the client, so xhu must be able to run in multiple threads, coroutine tasks (possibly in the future) and/or processes concurrently. Simply decorating the thing with a few locks won’t do (due to multiprocessing); this needs to be solved at the file-system level.
  • The quota implementation is O(n) in the number of uploaded files.

In general, I think that quota data needs to be stored so that the complexity of the quota check drops from O(n) to O(1). For this, we need a file which stores the current quota use per scope (scopes are global plus one for each user) and which is read and updated using a file-level lock (flock; reading can use a shared lock, updating needs an exclusive lock). The update must be rolled back properly when the upload aborts etc.

(flocks don’t work properly with threads though.)
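A very rough sketch of what I mean, assuming one plain-text usage file per scope (names and layout are illustrative, and this is untested):

import fcntl
import os


def _open_usage(usage_path):
    # Create the per-scope usage file on first use.
    fd = os.open(usage_path, os.O_RDWR | os.O_CREAT, 0o660)
    return os.fdopen(fd, "r+")


def read_usage(usage_path):
    # Shared lock: any number of concurrent readers may check the quota.
    with _open_usage(usage_path) as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        try:
            data = f.read().strip()
            return int(data) if data else 0
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def add_usage(usage_path, delta):
    # Exclusive lock: one writer at a time.  Call with a negative delta
    # to roll back the reservation when an upload aborts.
    with _open_usage(usage_path) as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            data = f.read().strip()
            current = int(data) if data else 0
            f.seek(0)
            f.truncate()
            f.write(str(max(current + delta, 0)))
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

(As noted, flock won’t help between threads of the same process, so this is only part of the answer.)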

total_size = 0
# We assume a file structure whereby files are stored inside
# uuid() directories inside the root dir and that there aren't any
# files in the root dir itself.
@horazont (Owner)

Do not make assumptions about the directory structure. While there is currently only an implementation of this in Prosody, I don’t see why we should rely on only Prosody being used on the other side.

@jcbrand (Author)

> I don’t see why we should rely on only Prosody being used on the other side

I'm using Prosody and I'm trying to get something working and running sooner rather than later.
I understand that you'd prefer less coupled and more generic solutions.

@@ -135,6 +179,11 @@ def put_file(path):
)

dest_path.parent.mkdir(parents=True, exist_ok=True, mode=0o770)

quota = flask.request.args.get("q", "")
@horazont (Owner) commented on Jan 13, 2019

This allows a client to provide an arbitrary quota, or leave out the q entirely, or set it to 0 to delete all files uploaded so far. The quota needs to be part of the HMAC instead.
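For illustration only (the exact string both sides sign is up for discussion, and the names here are made up), something along these lines would stop the client from tampering with q:

import hashlib
import hmac


def sign_upload(secret, path, content_length, quota):
    # The quota is folded into the signed message, so dropping or
    # changing the q parameter invalidates the signature.
    msg = "{} {} {}".format(path, content_length, quota).encode("utf-8")
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def verify_upload(secret, path, content_length, quota, signature):
    expected = sign_upload(secret, path, content_length, quota)
    return hmac.compare_digest(expected, signature)

mod_http_upload_external would of course have to compute the same digest when it hands out the PUT URL.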

@jcbrand (Author) commented on Jan 13, 2019

Yes, very good point.

> set it to 0 to delete all files uploaded so far

A quota of 0 means that no files get deleted. But that's also a potential avenue of attack.


quota = flask.request.args.get("q", "")
if quota:
    apply_quota(dest_path.parent.parent, int(quota))
@horazont (Owner)

Don’t make assumptions about the directory structure.

This should also take into account the size of the uploaded file. The amount of data used must still be below the quota even after the upload.

It also needs to handle the case where the file is larger than the quota allows for: in that case, the upload should be refused without any files being deleted.
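Sketched out (error handling and the exact status code are placeholders; read_usage stands for whatever ends up providing the current usage in O(1)):

file_size = int(flask.request.headers.get("Content-Length", 0))

if file_size > quota:
    # The file can never fit: refuse without deleting anything.
    return flask.Response("quota exceeded", 413)

if read_usage(usage_path) + file_size > quota:
    # Accepting the upload would push the scope over its quota.
    return flask.Response("quota exceeded", 413)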

@jcbrand (Author)

> This should also take into account the size of the uploaded file. The amount of data used must still be below the quota even after the upload.
>
> It also needs to handle the case where the file is larger than the quota allows for: in that case, the upload should be refused without any files being deleted.

Depends on whether you're enforcing a hard quota or a soft quota. I'm OK with usage slightly going over the quota. Prosody already enforces file size via http_upload_external_file_size_limit.

@jcbrand (Author)

> Don’t make assumptions about the directory structure.

I don't yet see how I can do it otherwise. Putting storage usage in a file like you suggested provides some decoupling, but you still need to know what files to remove.
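To make it concrete, the pruning I have in mind is roughly the following (a sketch, assuming all of a user's uploads live under one top-level directory; oldest files go first):

import os


def prune_user_dir(user_dir, quota):
    # Collect (mtime, path, size) for every file under this user's
    # directory, then delete oldest-first until usage is within quota.
    entries = []
    total = 0
    for root, _dirs, names in os.walk(user_dir):
        for name in names:
            path = os.path.join(root, name)
            st = os.stat(path)
            entries.append((st.st_mtime, path, st.st_size))
            total += st.st_size

    for _mtime, path, size in sorted(entries):
        if total <= quota:
            break
        os.remove(path)
        total -= size

Which is, admittedly, still the O(n) walk you objected to.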

total_size += size
modified = os.stat(fp)[stat.ST_MTIME]
file_list.append((modified, path, name, size))

@horazont (Owner)

From my reading, this sums the files of all users.

Also this is O(n) in the number of stored files, which I don’t like at all. This must be condensed into an O(1) operation.

See the main review comment for details.

filemode='a',
level=logging.INFO
)
logger = logging.getLogger(__name__)
@horazont (Owner)

I prefer to log to standard output and have the service manager deal with it. Alternatively, support systemd journal (optionally) directly. Configuring and managing log directories is a PITA, and always opens up the question of logrotate.
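In other words, something as minimal as this (illustrative), and let whatever supervises the process capture and rotate it:

import logging
import sys

# Log to standard output; the service manager (systemd, circus, ...)
# takes care of collection and rotation.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)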

@jcbrand (Author)

> I prefer to log to standard output and have the service manager deal with it.

What service manager are you talking about here? I use Circus to run the processes, but I don't see my logs ending up in its log, so I'm not sure what you mean.

@horazont (Owner)

Stuff like systemd. Even sending to syslog is better than having to work with logfiles.
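A syslog variant needs nothing beyond the standard library either (the socket path is the usual Linux one and may differ elsewhere):

import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(address="/dev/log")
handler.setFormatter(logging.Formatter("xhu: %(levelname)s %(message)s"))

logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)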

@horazont (Owner)

Also, unrelated to the review, please ensure that you talk to @mwild1, because I think he has some plans for a v3 of the upload protocol.

@jcbrand (Author) commented on Jan 13, 2019

Hi @horazont

Thank you for taking the time to look into the PR and for providing feedback.

> I’m not entirely sure what the aim is. Given the lack of code for user-specific things, I assume that this is about a global quota for now.

That depends on the directory structure within which the files are stored. It can be global or per user.

I want to use it as a per-user quota. I've made changes to mod_http_upload_external so that a user's files are stored in a top-level directory that is a salted hash of their JID. By doing so I can apply the quota per user, easily identify the files uploaded for a given JID (but only because I know the salt value), and remove them in response to a GDPR request.

> The global quota should not be managed by the XMPP server, but via configuration.

I first had it as local configuration, but I eventually want to have more fine-grained quotas. For example, I'd like to allow friends and family, or people who pay a subscription, to have larger quotas than complete strangers.

I think this information will more likely need to be managed and stored by the XMPP server rather than the file upload service.

> The global quota is a delicate issue in and of itself: it allows a single user with lots of bandwidth to cause lots of churn on the entire store and get other users' files deleted.

Yes, so it comes with trade-offs and people need to decide whether they want to use a global quota or not. My intention is to implement a per-user quota.

> The quota implementation, as it is, is not at all multitasking-safe (while the upload implementation is).
> This is a hard requirement: the duration of an upload is determined by the speed of the client, so xhu must be able to run in multiple threads, coroutine tasks (possibly in the future) and/or processes concurrently. Simply decorating the thing with a few locks won’t do (due to multiprocessing); this needs to be solved at the file-system level.

This is interesting and not something I had considered. I'm not sure, however, what you mean by "this needs to be solved at the file-system level".

This makes me think of the ZODB Python object database.

It's transactional (atomic) and allows rollbacks. You can then use ZEO to allow multiple processes to access a single database and put a load-balancer in front of the different processes.

Something like that would pretty much solve the concurrency and thread-safety issue, but it's a big departure from the current design and I wouldn't be surprised if you're not interested in going in such a direction.

> The quota implementation is O(n) in the number of uploaded files.

Yes, or rather the number of files per user for the case I have in mind.

> In general, I think that quota data needs to be stored so that the complexity of the quota check drops from O(n) to O(1).

Yes, you're right in pointing out that looping over all files to calculate storage usage is not something that'll scale well. I like the idea of using a file to store the quota.

@horazont (Owner)

Alright, now I get the big picture here. To summarise, so that we’re on the same page: You modified (or plan to modify) mod_http_upload_external to send the quota (as query argument) and to prefix the URL with an HMAC of the JID and a secret key. Thus, the directory structure ends up being something like this:

/<hmac(jid, key)>/<random nonce>/<filename>.{data,meta}

This allows you to only look at the files in <hmac(jid, key)> to enforce a user's quota. So at least some of the code makes more sense to me now. Also, the decision to let the XMPP server hand over the quota via GET makes sense now.
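Just to pin the scheme down (illustrative Python; the real change would live in mod_http_upload_external, which is Lua):

import hashlib
import hmac


def user_prefix(jid, key):
    # Deterministic, opaque per-user directory name: hmac(jid, key).
    return hmac.new(key, jid.encode("utf-8"), hashlib.sha256).hexdigest()

# giving paths of the form
# /<user_prefix(jid, key)>/<random nonce>/<filename>.{data,meta}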

Still, there are unsolved issues. Continuing the discussion.

>> The quota implementation, as it is, is not at all multitasking-safe (while the upload implementation is).
>> This is a hard requirement: the duration of an upload is determined by the speed of the client, so xhu must be able to run in multiple threads, coroutine tasks (possibly in the future) and/or processes concurrently. Simply decorating the thing with a few locks won’t do (due to multiprocessing); this needs to be solved at the file-system level.
>
> This is interesting and not something I had considered. I'm not sure, however, what you mean by "this needs to be solved at the file-system level".

The file-system is in the end the shared resource the tasks (be it threads, processes or coroutines) operate on. Thus, the synchronisation is best handled by the very same thing. For this, file-system level locks such as some lock based on flock would be suitable (but see my remarks about using flock and multithreading).

> This makes me think of the ZODB Python object database.
>
> It's transactional (atomic) and allows rollbacks. You can then use ZEO to allow multiple processes to access a single database and put a load-balancer in front of the different processes.
>
> Something like that would pretty much solve the concurrency and thread-safety issue, but it's a big departure from the current design and I wouldn't be surprised if you're not interested in going in such a direction.

Yes, it’s a huge deviation from the current design, which is extremely slim. This will be a nasty trade-off: either we use OS specific tools (such as flock, which is BSD specific but also available on Linux) or a high-level tool such as sqlite or ZODB (I’d prefer SQLite, because I’m more familiar with it though; ideally, the transactions are all rather short and thus we don’t have to worry about the lack of concurrency in SQLite – after all, serialisation is what we need on some level).
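With SQLite this would boil down to something like the following (a sketch; one usage row per scope, updated in a short transaction):

import sqlite3


def open_db(path):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS usage ("
        " scope TEXT PRIMARY KEY,"
        " bytes INTEGER NOT NULL)"
    )
    db.commit()
    return db


def add_usage(db, scope, delta):
    # Short, serialised transaction; call with a negative delta to
    # roll back an aborted upload.
    with db:
        db.execute(
            "INSERT OR IGNORE INTO usage (scope, bytes) VALUES (?, 0)",
            (scope,),
        )
        db.execute(
            "UPDATE usage SET bytes = bytes + ? WHERE scope = ?",
            (delta, scope),
        )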

I am really not sure what would be the right way here. On the one hand, I’d like to avoid sqlite or anything like that, because I perceive it as overkill, even though it probably isn’t. On the other hand, stuff like flock won’t work in all scenarios, and will still be tricky to get right.

Actually, I think that this would be an excellent use-case for LMDB (which is an embeddable ordered key-value store with support for transactions), but never having used LMDB in a project yet, I’m not confident in that either. However, since we might want to use LMDB in JabberCat, this might be a good playground to explore it. It would also allow us to reduce the complexity of various operations by storing basic file metadata in the LMDB too.
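Very roughly, with the py-lmdb bindings (untested; path and map size are arbitrary):

import lmdb

env = lmdb.open("/var/lib/xhu/quota.lmdb", map_size=16 * 1024 * 1024)


def add_usage(env, scope, delta):
    # Transactional read-modify-write of the per-scope counter; LMDB
    # gives us a single writer plus many readers across processes.
    with env.begin(write=True) as txn:
        key = scope.encode("utf-8")
        raw = txn.get(key)
        current = int(raw) if raw else 0
        txn.put(key, str(current + delta).encode("utf-8"))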

>> The quota implementation is O(n) in the number of uploaded files.
>
> Yes, or rather the number of files per user for the case I have in mind.
>
>> In general, I think that quota data needs to be stored so that the complexity of the quota check drops from O(n) to O(1).
>
> Yes, you're right in pointing out that looping over all files to calculate storage usage is not something that'll scale well. I like the idea of using a file to store the quota.

The file to store the quota makes everything much trickier though. First, there are the aforementioned multitasking issues. Second, where would you put the file? It needs to be on the level of the user. In your current design, you cannot really be sure whether you’re working with multiple users in Prosody or not. And you cannot put the file in the user directory, because it is possible for the XMPP server to create a URL which matches that file name, and then all bets are off – unless characters which aren’t URL-safe are used in the filename… such as \t or a space. It takes a bit of consideration though. And it’s still a hack.

@horazont horazont marked this pull request as draft May 16, 2020 11:20