Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subset of Tracking Log Events on Message Bus #28

Closed
3 tasks done
jmakowski1123 opened this issue Jan 9, 2023 · 9 comments
Closed
3 tasks done

Subset of Tracking Log Events on Message Bus #28

jmakowski1123 opened this issue Jan 9, 2023 · 9 comments
Labels
epic Large unit of work, consisting of multiple tasks

Comments

@jmakowski1123
Copy link

jmakowski1123 commented Jan 9, 2023

To extend our event handling capabilities we would like to be able to put tracking logs onto the event bus. This epic is to track that work, and the first consumer of those logs - event-routing-backends optionally using it instead of the Celery-based async backend.

Existing task that should be completed as a prerequisite to this work:
openedx/openedx-events#210

High level tasks to be groomed:

@jmakowski1123 jmakowski1123 converted this from a draft issue Jan 9, 2023
@jmakowski1123 jmakowski1123 added the epic Large unit of work, consisting of multiple tasks label Jan 9, 2023
@jmakowski1123 jmakowski1123 added this to the Events Over Message Bus milestone Jan 9, 2023
@jmakowski1123 jmakowski1123 moved this to Backlog in Data Working Group Jan 10, 2023
@jmakowski1123 jmakowski1123 moved this from Backlog to Epics - no status in Data Working Group Jan 10, 2023
@bmtcril bmtcril changed the title Events on Message Bus Subset of Tracking Log Events on Message Bus Jan 19, 2023
@bmtcril
Copy link

bmtcril commented May 5, 2023

@Ian2012 this is as far as I've gotten today on this work, still need to write the other 2 tickets

@mariajgrimaldi
Copy link
Member

mariajgrimaldi commented May 23, 2023

I have a few comments regarding Create a single “tracking log” OpenEdxPublicSignal:

  • Each Open edX Event has its own data class. In case of having a single tracking log event, we should use a generic data class capable of accepting anything inside the event field which is the one that varies among events -- please let me know if I'm wrong. I imagine our data class would look something like this:
@attr.s(frozen=True)
class ContextTrackingLogBase:
    course_id = attr.ib(type=str)
    org_id = attr.ib(type=str)
    path = attr.ib(type=str)
    user_id = attr.ib(type=str)

Class ContextTrackingLog(ContextTrackingLogBase):
    course_user_tags = attr.ib(type=Optional[dict])
    module = attr.ib(type=Optional[dict])

@attr.s(frozen=True)
class TrackingLogData:
    accept_language = attr.ib(type=str)
    agent = attr.ib(type=str)
    context = attr.ib(type=ContextTrackingLog)
    event = attr.ib(type=dict)
    event_source = attr.ib(type=str)
    event_type = attr.ib(type=str)
    host = attr.ib(type=str)
    ip = attr.ib(type=str)
    name = attr.ib(type=str)
    page = attr.ib(type=str)
    referer = attr.ib(type=str)
    session = attr.ib(type=str)
    time = attr.ib(type=str)
    username = attr.ib(type=str)

Now, the event field varies according to the event we're sending. So, we could only validate the tracking log common data using the data class we just created (which happens automatically when sending the event). But we couldn't do much with the info inside the event field -- at least as it is right now.

  • All Open edX Events are namespaced by an architectural subdomain. For example, events related to a student's enrollment belong to the subdomain learning, so they live inside the /learning submodule and have learning as part of their event_type:

org.openedx.learning.course.enrollment.created.v1

Each event_type is created according to this ADR. Now, how would this look like with a single tracking log event? 🤔 I see in the documentation of the events that some events are categorized as student or course. I don't know if this could be used to create more than one event so we can continue following the naming convention.

  • Also, tracking logs have their own event_type, which differs from the Open edX Events event_type mentioned before. I hope the tracking logs event_type does not confuse the user.
  • When producing Open edX Events to the event bus, we'll need this info. That means we must include it in the tracking log data class and specify it while sending the event -- at least the first three fields since the other ones won't change among events.

@mariajgrimaldi
Copy link
Member

mariajgrimaldi commented May 24, 2023

About subdomains

After reviewing architectural subdomains again, specifically after seeing this image:
image
Found here. I see that there's a bounded context called Analytics, which is traversal to all subdomains, now I wonder if we should create a tracking log Open edX event for each subdomain. 🤔

We might need to ask an expert 😉. What do you think @ormsbee?

@bmtcril
Copy link

bmtcril commented Jun 1, 2023

I think this is all reasonable. Sadly a lot of the time various fields on the tracking log entries will be empty strings, and if you haven't I think it would make sense to confirm these are the only keys we're going to see. There exists some possibility that people will want to replay very old tracking logs which may have different or broken formats. You all are probably in a better position than I am to check that, though. :)

For the arch domain I personally think a single cross-cutting Analytics subdomain makes the most sense as shown in the image above and this one:
image

But it's a loosely held opinion and I could be convinced to go a different direction.

@Ian2012
Copy link

Ian2012 commented Jun 5, 2023

Proper PRs has been created for this requirement:

In the proposed approach, the data format was updated to match the schema defined here because data and context are dynamic variables I think is not important to define a rigid schema for the tracking log data:

event = {
    'name': name or UNKNOWN_EVENT_TYPE,
    'timestamp': datetime.now(UTC),
    'data': data or {},
    'context': self.resolve_context()
}

@bmtcril
Copy link

bmtcril commented Jun 14, 2023

We had a reminder from @robrap that the volume of these events may require us to look at making some of the event bus audit logging configurable to avoid performance issues related to log spam. As part of testing these PRs we should make sure that the associated logging is of an acceptable volume.

@Ian2012
Copy link

Ian2012 commented Oct 25, 2023

The current state of this work is the following:

This can be tested by defining the following settings in your environment:

      EVENT_TRACKING_BACKENDS = {
        'event_bus': {
          'ENGINE':  'eventtracking.backends.event_bus.EventBusRoutingBackend',
        },
      }
      EVENT_BUS_PRODUCER = 'edx_event_bus_redis.create_producer'
      EVENT_BUS_REDIS_CONNECTION_URL = 'redis://@redis:6379/'
      EVENT_BUS_TOPIC_PREFIX = 'dev'
      EVENT_BUS_CONSUMER = 'edx_event_bus_redis.RedisEventConsumer'

@robrap can you take a look at the changes here: openedx/event-tracking#246

The workaround is basically to avoid an inifite loop there by re-emitting the signals. ERB can't be converted in an IDA as ERB depends on accesing the same database as the LMS, so it's better to consume the events in the same service.

External services still can access to the event bus data by using TRACKING_EVENT_EMITTED, but TRACKING_EVENT_EMITTED_TO_BUS is used here internally.

Also, this can be easily run as:

tutor dev exec lms ./manage.py lms consume_events -t analytics -g event_routing_backends --extra '{"consumer_name": "aspects"}'

cc @felipemontoya @mariajgrimaldi @bmtcril

@robrap
Copy link

robrap commented Oct 25, 2023

I have not reviewed any code, but simply providing openedx/openedx-events#79 with some past discussion around avoiding infinite loops.

@bmtcril bmtcril moved this from All Epics - no status to Doing in Data Working Group Feb 13, 2024
@bmtcril
Copy link

bmtcril commented Feb 21, 2024

The infinite loop problem has been addressed here: openedx/openedx-events#312 While there is still an open PR for batching events, I think the work completed is sufficient to close this out.

@bmtcril bmtcril closed this as completed Feb 21, 2024
@github-project-automation github-project-automation bot moved this from Doing to Done in Data Working Group Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic Large unit of work, consisting of multiple tasks
Projects
Development

No branches or pull requests

5 participants