feat: add basic telemetry to pipelines 2.0 #5929

Merged: 27 commits merged into main from telemetry-2.0 on Oct 13, 2023

Conversation

@ZanSara (Contributor) commented on Sep 29, 2023

Related Issues

Proposed Changes:

Adds telemetry to Pipeline.run() for Haystack 2.0 pipelines.

This PR introduces Haystack-side downsampling of Pipeline.run() events: the first event is always sent, and after that at most one event is sent per minute.
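
For illustration only, here is a minimal sketch of this kind of time-based downsampling (names are assumptions; this is not the actual implementation in haystack/preview/telemetry):

import time
from typing import Optional

# At most one event per minute after the first one (assumed constant name).
MIN_SECONDS_BETWEEN_EVENTS = 60.0

class PipelineRunSampler:
    """Decides whether a "Pipeline run (2.x)" event should be sent."""

    def __init__(self) -> None:
        self._last_sent: Optional[float] = None

    def should_send(self) -> bool:
        now = time.monotonic()
        # The very first event is always sent; afterwards at most one per minute.
        if self._last_sent is None or now - self._last_sent >= MIN_SECONDS_BETWEEN_EVENTS:
            self._last_sent = now
            return True
        return False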

In this first iteration the data collected about the pipeline is fairly limited - only a list of component names and types. Later on we can define what we want to collect about which component and adapt the telemetry code to collect it (see related epic). Note that because of this, right now we would not get ANY information about the document stores used.
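
For illustration, the per-pipeline payload described above could be built roughly like this (treating pipeline.components as a name-to-instance mapping is an assumption, not a documented API):

from typing import Any, Dict

def pipeline_components_payload(pipeline: Any) -> Dict[str, str]:
    """Map each component name to its class name, the only per-component data collected for now."""
    return {name: type(instance).__name__ for name, instance in pipeline.components.items()}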

The event sent also contains mostly the same basic system information that used to be collected in v1, namely some library versions (transformers, torch, pytest, etc.), system specs, OS, and so on.
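
As a rough sketch of how such system information can be gathered with the standard library (the helper name and the exact set of tracked libraries are assumptions):

import platform
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

TRACKED_LIBRARIES = ["transformers", "torch", "pytest"]  # example subset

def collect_system_specs() -> Dict[str, Optional[str]]:
    """Collect OS, Python, and library version information for the telemetry event."""
    specs: Dict[str, Optional[str]] = {
        "os.family": platform.system(),
        "os.machine": platform.machine(),
        "python.version": platform.python_version(),
    }
    for lib in TRACKED_LIBRARIES:
        try:
            specs[f"libraries.{lib}"] = version(lib)
        except PackageNotFoundError:
            specs[f"libraries.{lib}"] = None
    return specs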

The events sent can be found in Posthog under the name Pipeline run (2.x): https://eu.posthog.com/events#q=%7B%22kind%22%3A%22DataTableNode%22%2C%22full%22%3Atrue%2C%22source%22%3A%7B%22kind%22%3A%22EventsQuery%22%2C%22select%22%3A%5B%22*%22%2C%22event%22%2C%22person%22%2C%22coalesce(properties.%24current_url%2C%20properties.%24screen_name)%20--%20Url%20%2F%20Screen%22%2C%22properties.%24lib%22%2C%22timestamp%22%5D%2C%22orderBy%22%3A%5B%22timestamp%20DESC%22%5D%2C%22after%22%3A%22-24h%22%2C%22event%22%3A%22Pipeline%20run%20(2.x)%22%7D%2C%22propertiesViaUrl%22%3Atrue%2C%22showSavedQueries%22%3Atrue%2C%22showPersistentColumnConfigurator%22%3Atrue%7D

How did you test it?

Manual tests.

Notes for the reviewer

n/a

Checklist

@ZanSara requested a review from a team as a code owner September 29, 2023 15:10
@ZanSara requested review from anakin87 and removed the request for a team September 29, 2023 15:10
@github-actions bot added the type:documentation (Improvements on the docs) label Sep 29, 2023
@ZanSara requested a review from masci and removed the request for anakin87 September 29, 2023 15:11
@ZanSara added the topic:telemetry and 2.x (Related to Haystack v2.0) labels and removed the type:documentation (Improvements on the docs) label Sep 29, 2023
@github-actions bot added the type:documentation (Improvements on the docs) label Sep 29, 2023
@ZanSara requested a review from a team as a code owner September 29, 2023 15:14
@ZanSara requested review from dfokina and removed the request for a team September 29, 2023 15:15
@masci (Contributor) left a comment

Left a few comments. Honestly, I'm not convinced the exponential sampling will do much; I'd go with a constant sampling rate (or, if we want an "advanced" solution, an adaptive strategy based on the number of events per hour/day).

Review threads (resolved, outdated) on:
haystack/preview/telemetry/environment.py
haystack/preview/telemetry/telemetry.py (5 threads)
@github-actions github-actions bot added the type:documentation Improvements on the docs label Oct 4, 2023
@ZanSara (Contributor, Author) commented on Oct 4, 2023

I verified that, on the Posthog side, the SQL feature allows us to aggregate these values in two ways:

  • By accumulating the runs counter across all events associated with the same person_id/pipeline_id pair. In this case we would need to reset the counter after each event is sent:
WITH FilteredEvents AS (
    SELECT
        person_id AS person_id,
        properties.pipeline_id AS pipeline_id,
        sum(properties.runs) AS total_runs
    FROM
        events
    WHERE
        event = 'Pipeline run (2.x)'
    GROUP BY
        person_id,
        pipeline_id
)
SELECT *
FROM
    FilteredEvents
  • By picking only the most recent value of runs for each series of events associated with the same person_id/pipeline_id pair. In this case the SQL query is more complex and we rely on Posthog timestamps, which I have observed don't always work well when events fire very close to each other. However, this technique is still "eventually consistent", so I don't find it worrying:
WITH FilteredEvents AS (
   SELECT
       person_id AS person_id,
       properties.pipeline_id AS pipeline_id,
       properties.runs AS runs,
       ROW_NUMBER() OVER (PARTITION BY (person_id, pipeline_id) ORDER BY timestamp DESC) AS rn
   FROM
       events
   WHERE
       event = 'Pipeline run (2.x)'
)
SELECT
   person_id,
   pipeline_id,
   runs
FROM
   FilteredEvents
WHERE
   rn = 1

Right now the values of runs are consistent with the second technique. Switching would be very simple, so I'm open to feedback on which one to use.


Note that, in either case, PostHog does NOT allow making graphs out of this data; it can only be shown in table form. See the bottom of https://posthog.com/docs/product-analytics/sql

@ZanSara requested a review from julian-risch October 9, 2023 11:39
@julian-risch (Member) left a comment

Downsampling by the number of runs messes up our weekly active users metrics. Limiting the number of events sent per day per user would work, but not limiting based on the number of runs. For now, it's best not to include it. 👍
pipeline_id is currently just based on the memory address, but ideally we would have a pipeline id that doesn't change if I run another pipeline with the same configuration later, right? Or do we want to rely on the memory address to see how long a certain pipeline runs without interruption?
The docstring for dynamic_system_specs should be updated. Otherwise looks good to me. Let's talk about pipeline_id and then I'll be ready to approve.

haystack/preview/telemetry/_environment.py (review thread resolved, outdated)

# Data sent to Posthog
return "Pipeline run (2.x)", {
"pipeline_id": str(id(pipeline)),
A Member commented:

Ideally this would be the same pipeline_id if a pipeline with the exact same configuration is run later again, or even on a different machine. Can't we create a hash based on pipeline_description? If it's out of scope for this PR, let's at least create an issue.
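
A purely hypothetical sketch of that idea (pipeline_description, its exact shape, and the helper name are assumptions, not code from this PR):

import hashlib
import json
from typing import Any, Dict

def stable_pipeline_id(pipeline_description: Dict[str, Any]) -> str:
    """Derive a pipeline_id that stays the same for identical pipeline configurations."""
    # Serialize deterministically so the same configuration always hashes to the same id,
    # regardless of dict insertion order or which machine runs the pipeline.
    canonical = json.dumps(pipeline_description, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()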

The PR author (Contributor) replied:

Happy to create the new issue 😊

haystack/preview/telemetry/_telemetry.py (review thread resolved)
@Timoeller (Contributor) commented

Please also include the first, second, and third run in the exponential downsampling, @ZanSara. Reason: we want to see whether users ask at least 3 queries on a RAG pipeline, to measure Haystack 2.0 adoption.
In any case, it would be good to parametrize the downsampling so we can adjust it after we have some measurements. (I assume the current downsampling is too coarse-grained on early interactions, but I would love to see some data first.)
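
For illustration, the parametrization could look roughly like this (a sketch with assumed names; the PR eventually settled on the simpler time-based strategy described later in this thread):

import time
from typing import Optional

class ParametrizedRunSampler:
    """Always report the first `first_n` runs, then at most one run per `interval_seconds`."""

    def __init__(self, first_n: int = 3, interval_seconds: float = 60.0) -> None:
        self.first_n = first_n
        self.interval_seconds = interval_seconds
        self._runs = 0
        self._last_sent: Optional[float] = None

    def should_send(self) -> bool:
        self._runs += 1
        now = time.monotonic()
        # Runs 1..first_n are always reported; afterwards the time-based limit applies.
        if self._runs <= self.first_n or self._last_sent is None or now - self._last_sent >= self.interval_seconds:
            self._last_sent = now
            return True
        return False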

@julian-risch you wrote

Downsampling by number of runs messes up our weekly active users metrics.

Currently, we only track when pipelines are initiated; tracking how often pipelines are run adds information, and how often people use a pipeline is very important to us. Of course, we would need to adjust how WAU is computed, but it is still possible to compute WAU in the same way as before.

@julian-risch (Member) commented on Oct 9, 2023

@Timoeller Sure, nothing against counting pipeline runs. 👍 Some more context: my comment was about the idea of "pipelines will send their first, 10th, 100th, 1000th run event, and then a run event every 10,000 runs" in an earlier version of the PR. With that approach we would easily have users who are active with dozens of queries per week but who wouldn't send an event every week, and thus wouldn't be counted as WAU in every week, just in the weeks when they pass a 10,000 mark. Sara and I discussed it, and the PR and its description were updated afterwards. Having at least one event per day is much better for counting WAU. 🙂

@ZanSara (Contributor, Author) commented on Oct 9, 2023

After a quick sync with @Timoeller I will also add tracking for the 2nd and 3rd run of a Pipeline. I don't expect a lot of useful data from them, but I have no objections to trying this out for a few versions and re-assessing once we have some hard numbers 🙂

@julian-risch (Member) left a comment

Looks good to me! 👍 Thanks for adjusting the downsampling. We can iterate on it later if we notice any need for changes. And great to see that you synced with Timo and integrated the feedback. 👍

@masci (Contributor) left a comment

I'm not super happy with this PR: the logic is quite complex already, and I can imagine we will keep iterating. Since we ended up tracking timestamps, I wonder if we should give the rate limiter I mentioned before a shot: we send all the events and start buffering once the sends per minute go above a certain threshold. This could be implemented in the Telemetry class without adding any complexity to the sending functions.
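
A rough sketch of that buffering idea (the threshold value and the names are assumptions; this is not code from the PR, which kept the simpler strategy):

import time
from collections import deque
from typing import Any, Deque, Dict, Tuple

class BufferingRateLimiter:
    """Send events immediately until a per-minute threshold is hit, then buffer them."""

    def __init__(self, max_events_per_minute: int = 10) -> None:
        self.max_events_per_minute = max_events_per_minute
        self._sent_timestamps: Deque[float] = deque()
        self.buffer: Deque[Tuple[str, Dict[str, Any]]] = deque()

    def submit(self, event_name: str, properties: Dict[str, Any]) -> bool:
        """Return True if the event should be sent now; otherwise buffer it for a later flush."""
        now = time.monotonic()
        # Forget sends older than the one-minute window.
        while self._sent_timestamps and now - self._sent_timestamps[0] > 60.0:
            self._sent_timestamps.popleft()
        if len(self._sent_timestamps) < self.max_events_per_minute:
            self._sent_timestamps.append(now)
            return True
        self.buffer.append((event_name, properties))
        return False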

I'm also not sure about the server-side aggregation: we always send pipeline._telemetry_runs; did you check whether Posthog lets you compute rollups using deltas? Otherwise we would keep adding the total runs to the total at every event send.

haystack/preview/telemetry/_telemetry.py (review thread resolved, outdated)
@ZanSara (Contributor, Author) commented on Oct 10, 2023

Also not sure about server-side aggregation: we always send pipeline._telemetry_runs, did you check if posthog lets you compute rollups using deltas? Otherwise we would keep adding total runs to the total at every event send.

Yes, see #5929 (comment): with their SQL support you can do basically anything, and I have made 2 working insights (see the other comment for the links) where we:

  1. accumulate the runs count (so I would need to reset the counter)
  2. pick the last event's runs field (so that I don't need to reset the counter) --> imho this is more robust against losing messages, but otherwise there's not a lot of difference

This PR currently assumes we're doing option 2. The catch is that, for some reason, PostHog cannot (yet) turn those tables into nice graphs, but the data is there and it's aggregated properly.

we send all the events and we start buffering as the sends per minute go above a certain threshold.

What threshold do you have in mind? Would 10 do? And what's the interval?

@ZanSara (Contributor, Author) commented on Oct 10, 2023

I've adapted the strategy again: it now always sends the first event and then at most one more per minute. As discussed with @masci, this is good enough to track pipeline usage; to track user activation (for example, users who made 1, 2, or 3 pipeline runs) we can use another set of events, to be added in another PR. It will be easier to manage them in Posthog this way.

@ZanSara requested a review from masci October 10, 2023 12:54
@ZanSara merged commit 110aacd into main Oct 13, 2023
@ZanSara deleted the telemetry-2.0 branch October 13, 2023 07:31
Labels: 2.x (Related to Haystack v2.0), topic:telemetry, type:documentation (Improvements on the docs)
4 participants