Add file storage option #12590

Open · wants to merge 11 commits into base: develop
Conversation

nucleogenesis
Member

@nucleogenesis nucleogenesis commented Aug 22, 2024

Summary

In order to allow us to use a cloud backend for the File Storage in Django, this adds a value to options.py which defaults to the Django FileSystemStorage (which it did anyway, but now we make sure of it).

This should make it configurable on BCK such that if there is a module providing a Google Cloud backend class that implements the Django Storage class, that class can be set as the value for the setting.

For example, if we had a new class GCloudStorage in kolibri.core.storage, then we would use that class by setting the option added here to kolibri.core.storage.GCloudStorage.
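As a rough illustration of the idea, an options.py entry along these lines could declare the backend; the key names and layout here are hypothetical, not the exact spec in this PR:

```python
# Hypothetical sketch of the option entry described above -- names and
# defaults are illustrative only, not the code merged in this PR.
base_option_spec = {
    "FileStorage": {
        "DEFAULT_FILE_STORAGE": {
            "type": "string",
            "default": "django.core.files.storage.FileSystemStorage",
            "description": "Dotted path to a class implementing Django's Storage API.",
        },
    },
}
```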


This is very much a first whack -- one thing I'm not clear on: since the option is named DEFAULT_FILE_STORAGE, the same name Django looks for, does the Kolibri options.py machinery automatically apply that setting because of the matching name?

References

Fixes #9441 (or at least begins to address it)

Reviewer guidance

@pcenov - could you please test all workflows that involve:

  • importing from CSV
  • generating a CSV
  • downloading a generated CSV

The changes I've made here should have basically no effect on the user in any of those workflows.

Once this passes regression testing, we can deploy it to a BCK instance and do final testing there.


Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@nucleogenesis nucleogenesis requested a review from rtibbles August 22, 2024 21:23
@github-actions github-actions bot added the DEV: backend Python, databases, networking, filesystem... label Aug 22, 2024
Member

@rtibbles rtibbles left a comment


I think we probably want to keep the options.py interface restricted, others can override settings if they see fit, but we should focus on enabling the specific options we need.

Hint: if you do kolibri configure list-env you will see a complete list of available env vars for configuration (which will also show how your new option can be set as an env var).

@@ -359,6 +377,16 @@ def lazy_import_callback_list(value):


base_option_spec = {
    "FileStorage": {
        "DEFAULT_FILE_STORAGE": {
Member


I think I'd want to hew closer to the pattern we have for the Cache and Database options here - and offer simple, pre-specified string options that refer to specific backends. If someone really wants to run a custom backend, they can override the settings file and do what they like.

That way, with specific backends in mind, we can then explicitly enumerate the additional things that need to be specified in each case - for example the default "file_system" backend value will need a path, if it's a GCS backend then other things might be required (or may be automagically configured in some cases).
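The enumerated-backend pattern the reviewer describes might look roughly like this; the names and dotted paths are illustrative, not the code merged here:

```python
# Map short, pre-specified option strings to dotted storage-class paths,
# mirroring the Cache and Database options pattern. Illustrative sketch only.
STORAGE_BACKENDS = {
    "file_system": "django.core.files.storage.FileSystemStorage",
    "gcs": "storages.backends.gcloud.GoogleCloudStorage",
}


def validate_storage_backend(value):
    # Reject anything outside the known set, instead of accepting
    # arbitrary dotted paths from configuration.
    if value not in STORAGE_BACKENDS:
        raise ValueError("Unknown storage backend: {!r}".format(value))
    return value
```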

    except ImportError:
        logger.error("Default file storage is not available.")
        raise VdtValueError(value)
    except Exception:
Member


Shouldn't ever be catching a bare Exception, unless for very good reason - it can hide a multitude of sins.
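A narrowed version of the validator might catch only the failures that resolving a dotted path can realistically produce; this is a sketch, with VdtValueError stubbed in for self-containment (the real one comes from configobj's validate module):

```python
import importlib


class VdtValueError(ValueError):
    """Stand-in for configobj's validate.VdtValueError, for illustration."""


def storage_option(value):
    # Only the exceptions a dotted-path import can actually raise:
    # ImportError (bad module), AttributeError (bad class name),
    # ValueError (empty module path). No bare `except Exception`.
    module_path, _, klass = value.rpartition(".")
    try:
        module = importlib.import_module(module_path)
        return getattr(module, klass)
    except (ImportError, AttributeError, ValueError):
        raise VdtValueError(value)
```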

modules = value.split(".")
klass = modules.pop()
module_path = ".".join(modules)
module = importlib.import_module(module_path)
Member


Note that Django exposes a utility called import_string which we already use in this module for loading classes by a string dot path, so this seems preferable to use here.
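For reference, a stdlib-only equivalent of the pattern Django's `django.utils.module_loading.import_string` implements looks like this (a sketch, not Django's actual code):

```python
import importlib


def import_string(dotted_path):
    # Minimal stdlib re-implementation of the behaviour of Django's
    # django.utils.module_loading.import_string: resolve a dotted path
    # like "pkg.module.ClassName" to the named attribute.
    module_path, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, attr)
    except AttributeError:
        raise ImportError("{} does not define {}".format(module_path, attr))
```

Usage: `import_string("math.sqrt")` returns the `math.sqrt` function object.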

@@ -15,6 +15,7 @@
from configobj import ConfigObj
from configobj import flatten_errors
from configobj import get_extra_values
from django.core.files.storage import Storage
Member Author


This is probably why flake8 wouldn't let pre-commit pass, but it didn't give me useful output.

@nucleogenesis nucleogenesis requested a review from jredrejo August 26, 2024 23:13
@rtibbles rtibbles self-assigned this Aug 27, 2024
@@ -737,6 +766,7 @@ def _get_validator():
"url_prefix": url_prefix,
"bytes": validate_bytes,
"multiprocess_bool": multiprocess_bool,
"storage_option": storage_option,
Member


How will this work with a database-based cache?

Member Author


Hm - I'm not sure. I assumed this was only related to validation on initialization

Member


I don't see that this would affect anything to do with the cache.

@github-actions github-actions bot added the APP: Facility Re: Facility App (user/class management, facility settings, csv import/export, etc.) label Dec 20, 2024
Comment on lines 218 to 220
logger.info("File saved - Path: {}".format(file_storage.url(file)))
logger.info("File saved - Size: {}".format(file_storage.size(file)))
Member Author


@jredrejo I've tried several things around here, but no matter what I do, the file always shows size 0... I've confirmed that there are users (usernames is full of data, for example) -- so I'm not sure why the writer isn't updating the file object here...

Member


The problem was not in reading the file info, but writing it. It had 0 bytes. I have made a commit in this PR to ensure data is flushed into the BytesIO and now it's working:

INFO     2024-12-26 20:45:45,313 Invoking command bulkexportusers
INFO     2024-12-26 20:45:45,373 Creating users csv file /home/jose/.kolibri/log_export/Kolibri en casa de admin_3da4_users.csv
  [#######################################################################################################################################################################################-------------------------------------------------------------]   75%
INFO     2024-12-26 20:45:46,767 File saved - Path: https://storage.googleapis.com/kdp-csv-reporting-develop/Kolibri_en_casa_de_admin_3da4_users.csv
INFO     2024-12-26 20:45:47,073 File saved - Size: 482
INFO     2024-12-26 20:45:47,073 Created csv file /home/jose/.kolibri/log_export/Kolibri en casa de admin_3da4_users.csv with 4 lines

(I've also rebased develop in the branch because there was already a conflict and more might appear soon)

@jredrejo jredrejo force-pushed the fix--dynamic-file-storage-backend-cloud-file-storage branch from 43309bc to 953b29b on December 26, 2024 19:48
@github-actions github-actions bot added DEV: dev-ops Continuous integration & deployment DEV: renderers HTML5 apps, videos, exercises, etc. APP: Device Re: Device App (content import/export, facility-syncing, user permissions, etc.) APP: Learn Re: Learn App (content, quizzes, lessons, etc.) APP: Coach Re: Coach App (lessons, quizzes, groups, reports, etc.) APP: User Re: User app (sign-in, sign-up, user profile, etc.) APP: Setup Wizard Re: Setup Wizard (facility import, superuser creation, settings, etc.) DEV: frontend DEV: tools Internal tooling for development labels Dec 26, 2024
@jredrejo jredrejo changed the base branch from release-v0.17.x to develop December 26, 2024 20:11
@nucleogenesis nucleogenesis marked this pull request as ready for review January 14, 2025 03:26
@nucleogenesis nucleogenesis force-pushed the fix--dynamic-file-storage-backend-cloud-file-storage branch 2 times, most recently from 123671a to ebc6d37 on January 14, 2025 04:01
# https://django-storages.readthedocs.io/en/latest/backends/gcloud.html#google-cloud-storage


class KolibriFileStorage(GoogleCloudStorage):
Member Author


I think here we can override the default values for the location if needed

Member


Not clear to me why this file is necessary - we can just directly use the GoogleCloudStorage class and use settings (with their own options in options.py to allow them to be defined).

https://django-storages.readthedocs.io/en/latest/backends/gcloud.html#settings

file_storage = DefaultStorage()


def random_filename():
Member Author


In hindsight, I think that maybe just cleaning up the files after each test could have avoided needing this.

@@ -150,18 +153,18 @@ def import_exported_csv(self):
        assert len(classroom.get_coaches()) == 1

    def test_dryrun_from_export_csv(self):
        with open_csv_for_reading(self.filepath) as source:
            header = next(csv.reader(source, strict=True))
        source = file_storage.open(self.filepath, "r")
Member Author


Looking back at this, the reassignment of source makes this a bit odd to read, in a way that it wasn't when they were created inside with statements.

I'll come back through and do some cleaning

@rtibbles
Member

There seems to be a problem with how files are being handled that is leaving file handles open - this is why things are erroring on Windows, which is much more sensitive to files not being closed:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'D:\\a\\kolibri\\kolibri\\.pytest_kolibri_home\\media\\UserImportCommandTestCase.csv'

I haven't looked through the code in detail to see where this might be happening, but it suggests that something isn't being handled entirely properly.
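One common source of this on Windows is a storage file object that gets opened but never closed; a context-manager wrapper guarantees the close. This is a sketch under assumptions: FakeStorage is a minimal in-memory stand-in for a Django Storage, and open_csv_for_reading is a hypothetical helper shape, not the one in this PR:

```python
import io
from contextlib import contextmanager


class FakeStorage:
    # Minimal in-memory stand-in for a Django Storage, for illustration only.
    def __init__(self, files):
        self.files = files

    def open(self, name, mode="rb"):
        return io.BytesIO(self.files[name])


@contextmanager
def open_csv_for_reading(storage, filename):
    f = storage.open(filename, "rb")
    # newline="" lets the csv module manage line endings itself.
    wrapper = io.TextIOWrapper(f, encoding="utf-8", newline="")
    try:
        yield wrapper
    finally:
        wrapper.close()  # also closes the underlying storage file handle


storage = FakeStorage({"users.csv": b"username\r\nlearner01\r\n"})
with open_csv_for_reading(storage, "users.csv") as src:
    rows = src.read().splitlines()
```

Because the close happens in the finally block, the handle is released even if parsing raises partway through.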

Member

@rtibbles rtibbles left a comment


The changes here seem to have completely ignored the purpose of the two helper functions open_csv_for_writing and open_csv_for_reading - I think the diff could be significantly reduced by leaning into the existence of these two functions and consolidating the logic in there, with the small tweak (as you have done) of changing filepath to filename, so that it is clear it is not a literal disk file path.

We can also probably conditionalize whether we write directly to the file or to memory. Writing large CSV files to memory in a non-cloud environment is probably not going to be a great idea.

The other alternative is we could just write directly to the storage object in either case unless we know absolutely that it is too slow when writing to GCS? Given it's happening in an asynchronous task, and we have to write to it eventually... I am not sure how writing to memory first necessarily helps.

@@ -174,7 +176,7 @@ def csv_file_generator(facility, filepath, overwrite=True, demographic=False):
column for column in db_columns if demographic or column not in DEMO_FIELDS
)

csv_file = open_csv_for_writing(filepath)
csv_file = io.StringIO()
Member


Why are we using io.StringIO() here but io.BytesIO below?
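For context on the distinction being asked about: Python's csv module writes text, while a storage backend ultimately saves bytes, so the two have to be bridged explicitly. A sketch of the general pattern (not this PR's code):

```python
import csv
import io

# csv.writer needs a text stream...
text_buffer = io.StringIO()
writer = csv.writer(text_buffer)
writer.writerow(["username", "full_name"])
writer.writerow(["learner01", "Test Learner"])

# ...but a storage backend saves bytes, so encode at the boundary.
byte_buffer = io.BytesIO(text_buffer.getvalue().encode("utf-8"))
```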

if not os.environ.get("DEFAULT_FILE_STORAGE"):
    if conf.OPTIONS["FileStorage"]["STORAGE_BACKEND"] == "gcs":
        DEFAULT_FILE_STORAGE = "kolibri.utils.file_storage.KolibriFileStorage"
        BUCKET_NAME = os.getenv("GCS_BUCKET_NAME") or "kdp-csv-reporting-develop"
Member


We should have a BUCKET_NAME option in our options.py as well - we definitely should not be hard-coding anything specific to any platform in here.

If GCS_BUCKET_NAME is an environment variable that is set by default on GCP (but I don't think it is) we can specify specific env vars in the options configuration to draw values from.

if not os.environ.get("DEFAULT_FILE_STORAGE"):
    if conf.OPTIONS["FileStorage"]["STORAGE_BACKEND"] == "gcs":
        DEFAULT_FILE_STORAGE = "kolibri.utils.file_storage.KolibriFileStorage"
        BUCKET_NAME = os.getenv("GCS_BUCKET_NAME") or "kdp-csv-reporting-develop"
Member


Rather than subclassing GoogleCloudStorage, we should just set the additional settings here for it.

"description": """
The storage backend class that Django will use when managing files. The class given here must implement
the Django files.storage.Storage class.
""",
Member


Let's add additional options here for the ACL, bucket name, etc. - we don't need to expose all the settings that are available, but anything we need set can be included here.

@rtibbles
Member

@nucleogenesis and I had a quick sync up to see how to consolidate the file handling here - seems like there are some deficiencies in the Django Storage conformance to the Python file object spec https://forum.djangoproject.com/t/file-open-to-support-different-encodings/21491 - but an idea there to wrap the file object in a TextIOWrapper to ensure we generate with the correct encoding and newlines seems promising!
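The TextIOWrapper idea can be sketched like this: wrap the binary stream so the csv module sees text with explicit encoding and newline handling. Illustrative only; the BytesIO stands in for a storage file object opened for writing:

```python
import csv
import io

binary = io.BytesIO()  # stands in for a storage file opened in binary mode
# newline="" is what the csv module expects; the encoding is made explicit
# rather than left to platform defaults.
text = io.TextIOWrapper(binary, encoding="utf-8", newline="")
writer = csv.writer(text)
writer.writerow(["username", "full_name"])
text.flush()  # push the encoded bytes through before handing off to storage
payload = binary.getvalue()
```

Without the flush, buffered text can stay in the wrapper and the storage backend receives a truncated (or empty) file, which matches the 0-byte symptom seen earlier in this thread.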

@pcenov
Member

pcenov commented Jan 27, 2025

Hi @nucleogenesis - on my end while regression testing the .csv generation I noticed that it's no longer possible to generate the same .csv file twice. For example, if I go to Facility > Data and generate a session logs file for the current day, it gets generated correctly, but if I attempt to do the same thing 5 minutes later it still says "Generated 5 minutes ago."
The same is true when generating a new user .csv file - the first time it works fine, but if I go ahead and add several new users, go back to Facility > Data, and click "Generate new user CSV file" again, it will seem to generate correctly, but when you open it you don't see the newly added users. There are no errors in the console.

Here are the logs from one of my devices though:
logs.zip

log.with.the.same.date.range.mp4

@nucleogenesis nucleogenesis force-pushed the fix--dynamic-file-storage-backend-cloud-file-storage branch 2 times, most recently from 702204e to ff63d7f on February 4, 2025 20:57
if os.path.exists(log_path):
    csv_statuses[log_type] = os.path.getmtime(log_path)

if default_storage.exists(filename):
Member


Actually had a call with @nucleogenesis earlier about his access challenges; the django-storages docs suggest using oauth2 service-account credentials when passing settings to the library, like so:

from google.oauth2 import service_account

GS_CREDENTIALS = service_account.Credentials.from_service_account_file(
    "path/to/credentials.json"
)

But I'd like to try just using Application Default Credentials; I think that might work across local dev and cloud envs:

import google.auth
credentials, project = google.auth.default()

@nucleogenesis nucleogenesis force-pushed the fix--dynamic-file-storage-backend-cloud-file-storage branch from 23d2db6 to 2e739fa on February 26, 2025 23:59
@nucleogenesis
Member Author

@pcenov I've made some significant changes here and I've been able to test it locally in both the Google Cloud & the local context.

Can you go through the various CSV downloads in Kolibri again to confirm?

I will connect with you once it is testable in a cloud context if needed.

@rtibbles things are ready for re-review code-wise, I'll annotate a few areas w/ thoughts & questions as well

# Initialize GS_CREDENTIALS as a proper google.auth.credentials.Credentials object
import google.auth

GS_CREDENTIALS, _ = google.auth.default()
Member Author


@DXCanas this is where I ended up putting the code we need to generate the credentials object. @rtibbles is there somewhere better to put this?

Member


Could we initialize this during the options parsing as a second step in validation? That way we could fail much earlier if the credentials are incorrectly configured.

# File Storage Backend
# https://docs.djangoproject.com/en/3.2/ref/files/storage/

if not os.environ.get("DEFAULT_FILE_STORAGE"):
Member


What's the extra environment variable here for? Can we not just gate this on the conf.OPTIONS?

if value == "gcs":
    try:
        from storages.backends.gcloud import GoogleCloudStorage  # noqa

Member


So my thought was we could try to initialize the credentials here - presumably it throws a predictable error which we can catch and raise the validation error.

@nucleogenesis nucleogenesis force-pushed the fix--dynamic-file-storage-backend-cloud-file-storage branch from 3df65d9 to 07db9d6 on February 27, 2025 23:38
@nucleogenesis
Member Author

@rtibbles updating the django-storages module and removing the GS_CREDENTIALS handling altogether just worked locally <3

@pcenov
Member

pcenov commented Feb 28, 2025

Hi @nucleogenesis, I confirm that the generation of the session and summary logs is working correctly now; however, when I try to import a user CSV file I repeatedly get the following error, even without having modified the just-exported CSV file in any way:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/kolibri/core/tasks/job.py", line 326, in execute
    result = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/core/tasks/registry.py", line 237, in __call__
    return self.func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/core/auth/tasks.py", line 188, in importusersfromcsv
    delete=delete,
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/management/__init__.py", line 181, in call_command
    return command.execute(*args, **defaults)
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/usr/lib/python3/dist-packages/kolibri/core/tasks/management/commands/base.py", line 28, in handle
    return self.handle_async(*args, **options)
  File "/usr/lib/python3/dist-packages/kolibri/core/auth/management/commands/bulkimportusers.py", line 873, in handle_async
    self.number_lines = self.get_number_lines(filepath)
  File "/usr/lib/python3/dist-packages/kolibri/core/auth/management/commands/bulkimportusers.py", line 717, in get_number_lines
    with open_csv_for_reading(filepath) as f:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/lib/python3/dist-packages/kolibri/core/utils/csv.py", line 50, in open_csv_for_reading
    with default_storage.open(filename, "rb") as f:
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/files/storage.py", line 38, in open
    return self._open(name, mode)
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/files/storage.py", line 243, in _open
    return File(open(self.path(name), mode))
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/files/storage.py", line 338, in path
    return safe_join(self.location, name)
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/utils/_os.py", line 31, in safe_join
    'component ({})'.format(final_path, base_path))
django.core.exceptions.SuspiciousFileOperation: The joined path (/home/osboxes/.kolibri/temp/tmpv3rj_54v.upload.csv) is located outside of the base path component (/home/osboxes/.kolibri/media)

Here's a video as well:

import.user.CSV.mp4

Logs: logs.zip
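For reference, the SuspiciousFileOperation above comes from Django's safe_join refusing any path that resolves outside the storage's base location (here, the media directory); the check is roughly equivalent to the following stdlib sketch (not Django's actual code):

```python
import os


def is_within_base(base, candidate):
    # Roughly what Django's safe_join enforces: the resolved candidate
    # path must live under the resolved base directory.
    base = os.path.realpath(base)
    candidate = os.path.realpath(candidate)
    return os.path.commonpath([base, candidate]) == base


# The uploaded temp file lives under .kolibri/temp, not .kolibri/media,
# hence FileSystemStorage raises SuspiciousFileOperation.
```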

Development

Successfully merging this pull request may close these issues.

Broken features in cloud instances when depending on temp or uploaded files
5 participants