feat: auto import course structure on course publish #43

Danyal-Faheem · 2024-08-05T13:17:23Z

Changes

Parse openedx_logs in Vector to look for "Updating course overview for <course_id>" log lines
Send this course_id to a service
Create a new service, cairn-watchcourses, that listens for incoming events from Vector
Trigger the importcoursedata script from cairn-watchcourses for the specific course whenever a course is published
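
For illustration only, the log-matching step above can be sketched in Python. In this PR the parsing is done inside Vector, not in Python; the regular expression, the function name `extract_course_id`, and the handling of a trailing period are assumptions made for this sketch.

```python
import re
from typing import Optional

# Assumption: the log message is the phrase quoted in the change list,
# followed by a whitespace-delimited course key.
COURSE_PUBLISHED_RE = re.compile(r"Updating course overview for (?P<course_id>\S+)")


def extract_course_id(log_line: str) -> Optional[str]:
    """Return the course ID from a matching log line, or None if there is no match."""
    match = COURSE_PUBLISHED_RE.search(log_line)
    if match is None:
        return None
    # Drop a trailing period in case the log sentence ends with one (hypothetical).
    return match.group("course_id").rstrip(".")


sample = "2024-08-05 13:17:23 INFO Updating course overview for course-v1:edX+DemoX+Demo_Course."
print(extract_course_id(sample))
```

Lines that do not contain the phrase return `None`, so unrelated log traffic would simply be ignored.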

Caveats

We use batch processing on the Vector sinks with a timeout of 300 seconds or a batch size of 10 events, whichever is reached first. This means that changes to the course structure may take up to 5 minutes to show up in the Superset dashboard.
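
To make the "up to 5 minutes" bound concrete, here is a minimal sketch of the flush rule described above (size limit or timeout, whichever comes first). It is not Vector's implementation; the `Batcher` class and its method names are hypothetical, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, List, Optional


class Batcher:
    """Collect events and flush on a size limit or a timeout, whichever comes first."""

    def __init__(
        self,
        max_events: int = 10,
        timeout_secs: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_events = max_events
        self.timeout_secs = timeout_secs
        self._clock = clock
        self._events: List[str] = []
        self._started_at: Optional[float] = None

    def add(self, event: str) -> Optional[List[str]]:
        """Add an event; return the batch if the size limit was reached."""
        if not self._events:
            # The timeout is measured from the first event of the batch.
            self._started_at = self._clock()
        self._events.append(event)
        if len(self._events) >= self.max_events:
            return self.flush()
        return None

    def poll(self) -> Optional[List[str]]:
        """Return the batch if the timeout elapsed for a non-empty batch."""
        if self._events and self._started_at is not None:
            if self._clock() - self._started_at >= self.timeout_secs:
                return self.flush()
        return None

    def flush(self) -> List[str]:
        """Return the pending events and reset the batch."""
        batch, self._events = self._events, []
        self._started_at = None
        return batch
```

With these settings, a single published course waits for the full timeout before it is sent, while a burst of ten publications is sent immediately.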

regisb

This is excellent work. I have many comments, because this is an important change, but I expect we'll get there.

tutorcairn/plugin.py

tutorcairn/patches/local-docker-compose-services

tutorcairn/templates/cairn/apps/openedx/scripts/main.py

regisb · 2024-08-06T04:58:02Z

tutorcairn/patches/local-docker-compose-services

@@ -85,3 +85,17 @@ cairn-postgresql:
    depends_on:
        - permissions
 {% endif %}
+cairn-watchcourses:


What are the additional CPU/memory resources that are needed by this container? (see docker stats) Given that it's a very thin wrapper, I expect that the requirements are low. But if they are not, then we'll have to gatekeep this service behind a feature flag.

My environment is the following:

Docker 4.28.0

Tutor 18.1.1

Plugins enabled: Cairn

My machine specs are the following:

macOS Sonoma

MacBook Pro M1

Resources utilized:

When the container is sitting idle, listening to requests:

CPU: 0.00 - 0.03%

Memory: ~58 MiB

When the container is executing the importcoursedata script:

CPU: 100% (on macOS, this usually signifies one core is being completely utilized)

Memory: 250 MiB - 300 MiB

It then returns to idle resource usage once the script has completed.

These results would likely vary on a Linux-based machine.

regisb · 2024-08-06T05:00:57Z

tutorcairn/patches/local-docker-compose-services

+        && uvicorn --app-dir /openedx/scripts/ main:app --host 0.0.0.0 --port {{ CAIRN_WATCHCOURSES_PORT }}"
+    restart: unless-stopped
+    environment:
+      SETTINGS: ${TUTOR_EDX_PLATFORM_SETTINGS:-tutor.production}


Can we avoid defining this environment variable here? After all, the fastapi process only needs it for its subprocess, not its main process. I suggest removing it from the service declaration and using the env option in subprocess.call(... env=...) (docs).
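
A minimal sketch of that suggestion, assuming the variable is named SETTINGS as in the diff above. The helper name `run_with_settings` and the child command are hypothetical stand-ins for the real import script, and `subprocess.run` is used here in place of `subprocess.call` because it accepts the same `env` argument.

```python
import os
import subprocess
import sys
from typing import List


def run_with_settings(command: List[str], settings: str) -> subprocess.CompletedProcess:
    """Run a command with SETTINGS set only in the child process environment."""
    # Copy the parent environment so PATH and friends survive, then add the variable.
    child_env = {**os.environ, "SETTINGS": settings}
    return subprocess.run(
        command, env=child_env, capture_output=True, text=True, check=True
    )


# Stand-in child process that just echoes the variable it received.
result = run_with_settings(
    [sys.executable, "-c", "import os; print(os.environ['SETTINGS'])"],
    "tutor.production",
)
print(result.stdout.strip())
```

The parent process environment is left unchanged, so the service declaration would no longer need its own `environment:` entry.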

It works without defining this environment variable. Checking os.environ shows that the container is already using the production settings.

This is what I found in os.environ.

'DJANGO_SETTINGS_MODULE': 'lms.envs.tutor.production'

This is going to work in production because the container inherits the settings from the image definition. But you should make sure that it's correct in dev and in Kubernetes.

tutorcairn/templates/cairn/apps/vector/partials/common-post.toml

regisb · 2024-08-09T08:38:44Z

tutorcairn/templates/cairn/apps/vector/partials/common-post.toml

+inputs = ["course_published"]
+batch.timeout_secs = 300
+batch.max_events = 10
+uri = "http://cairn-watchcourses:9282/import_course/"


The name no longer seems appropriate, now that we can pass several course IDs at once. Maybe /courses/published?

That sounds much better, I'll update it.
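
Since one request can now carry several course IDs, the receiving side may want to import each course only once per batch. A minimal sketch, assuming each event in the batch is a JSON object with a `course_id` field (the payload shape and the function name `unique_course_ids` are assumptions, not taken from the PR):

```python
from typing import Any, Dict, Iterable, List


def unique_course_ids(events: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the distinct course IDs in a batch, preserving first-seen order."""
    seen = set()
    ordered: List[str] = []
    for event in events:
        course_id = event.get("course_id")
        # Skip events without a usable course ID and IDs already collected.
        if isinstance(course_id, str) and course_id and course_id not in seen:
            seen.add(course_id)
            ordered.append(course_id)
    return ordered


batch = [
    {"course_id": "course-v1:edX+DemoX+Demo_Course"},
    {"course_id": "course-v1:edX+Other+2024"},
    {"course_id": "course-v1:edX+DemoX+Demo_Course"},
    {},
]
print(unique_course_ids(batch))
```

Deduplicating this way would keep a course that was published several times within one batch window from triggering several imports.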

regisb · 2024-08-09T08:43:03Z