feat: add basic telemetry to pipelines 2.0 #5929

Merged: 27 commits merged into main from telemetry-2.0 on Oct 13, 2023

Conversation

@ZanSara (Contributor) commented on Sep 29, 2023

Related Issues

Proposed Changes:

Adds telemetry to Pipeline.run() for Haystack 2.0 pipelines.

This PR introduces Haystack-side downsampling of Pipeline.run() events: the first event is always sent, and after that at most one event is sent per minute.
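
For illustration only, here is a minimal sketch of this kind of time-based downsampling (names are assumptions; this is not the actual implementation in haystack/preview/telemetry):

import time
from typing import Optional

# At most one event per minute after the first one (assumed constant name).
MIN_SECONDS_BETWEEN_EVENTS = 60.0

class PipelineRunSampler:
    """Decides whether a "Pipeline run (2.x)" event should be sent."""

    def __init__(self) -> None:
        self._last_sent: Optional[float] = None

    def should_send(self) -> bool:
        now = time.monotonic()
        # The very first event is always sent; afterwards at most one per minute.
        if self._last_sent is None or now - self._last_sent >= MIN_SECONDS_BETWEEN_EVENTS:
            self._last_sent = now
            return True
        return False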

In this first iteration the data collected about the pipeline is fairly limited - only a list of component names and types. Later on we can define what we want to collect about which component and adapt the telemetry code to collect it (see related epic). Note that because of this, right now we would not get ANY information about the document stores used.
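
For illustration, the per-pipeline payload described above could be built roughly like this (treating pipeline.components as a name-to-instance mapping is an assumption, not a documented API):

from typing import Any, Dict

def pipeline_components_payload(pipeline: Any) -> Dict[str, str]:
    """Map each component name to its class name, the only per-component data collected for now."""
    return {name: type(instance).__name__ for name, instance in pipeline.components.items()}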

The event sent also contains mostly the same basic system information that used to be collected in v1, namely some library versions (transformers, torch, pytest, etc.), system specs, OS, and so on.
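
As a rough sketch of how such system information can be gathered with the standard library (the helper name and the exact set of tracked libraries are assumptions):

import platform
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

TRACKED_LIBRARIES = ["transformers", "torch", "pytest"]  # example subset

def collect_system_specs() -> Dict[str, Optional[str]]:
    """Collect OS, Python, and library version information for the telemetry event."""
    specs: Dict[str, Optional[str]] = {
        "os.family": platform.system(),
        "os.machine": platform.machine(),
        "python.version": platform.python_version(),
    }
    for lib in TRACKED_LIBRARIES:
        try:
            specs[f"libraries.{lib}"] = version(lib)
        except PackageNotFoundError:
            specs[f"libraries.{lib}"] = None
    return specs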

The events sent can be found in Posthog under the name Pipeline run (2.x): https://eu.posthog.com/events#q=%7B%22kind%22%3A%22DataTableNode%22%2C%22full%22%3Atrue%2C%22source%22%3A%7B%22kind%22%3A%22EventsQuery%22%2C%22select%22%3A%5B%22*%22%2C%22event%22%2C%22person%22%2C%22coalesce(properties.%24current_url%2C%20properties.%24screen_name)%20--%20Url%20%2F%20Screen%22%2C%22properties.%24lib%22%2C%22timestamp%22%5D%2C%22orderBy%22%3A%5B%22timestamp%20DESC%22%5D%2C%22after%22%3A%22-24h%22%2C%22event%22%3A%22Pipeline%20run%20(2.x)%22%7D%2C%22propertiesViaUrl%22%3Atrue%2C%22showSavedQueries%22%3Atrue%2C%22showPersistentColumnConfigurator%22%3Atrue%7D

How did you test it?

Manual tests.

Notes for the reviewer

n/a

Checklist

@ZanSara requested a review from a team as a code owner September 29, 2023 15:10
@ZanSara requested review from anakin87 and removed the request for a team September 29, 2023 15:10
@github-actions bot added the type:documentation (Improvements on the docs) label Sep 29, 2023
@ZanSara requested a review from masci and removed the request for anakin87 September 29, 2023 15:11
@ZanSara added the topic:telemetry and 2.x (Related to Haystack v2.0) labels and removed the type:documentation (Improvements on the docs) label Sep 29, 2023
@github-actions bot added the type:documentation (Improvements on the docs) label Sep 29, 2023
@ZanSara requested a review from a team as a code owner September 29, 2023 15:14
@ZanSara requested review from dfokina and removed the request for a team September 29, 2023 15:15
@masci (Contributor) left a comment

Left a few comments. Honestly, I'm not convinced the exponential sampling will do much; I'd go with a constant sampling rate (or, if we want an "advanced" solution, an adaptive strategy based on the number of events per hour/day).

Review threads (resolved, outdated) on:
haystack/preview/telemetry/environment.py
haystack/preview/telemetry/telemetry.py (5 threads)
@github-actions github-actions bot added the type:documentation Improvements on the docs label Oct 4, 2023
@ZanSara (Contributor, Author) commented on Oct 4, 2023

I verified that, on the Posthog side, the SQL feature allows us to aggregate these values in two ways:

  • By accumulating the runs counter across all events associated with the same person_id/pipeline_id pair. In this case we would need to reset the counter after each event is sent:
WITH FilteredEvents AS (
    SELECT
        person_id AS person_id,
        properties.pipeline_id AS pipeline_id,
        sum(properties.runs) AS total_runs
    FROM
        events
    WHERE
        event = 'Pipeline run (2.x)'
    GROUP BY
        person_id,
        pipeline_id
)
SELECT *
FROM
    FilteredEvents
  • By picking only the most recent value of runs for each series of events associated with the same person_id/pipeline_id pair. In this case the SQL query is more complex and we rely on Posthog timestamps, which I have observed don't always work well when events fire very close to each other. However, this technique is still "eventually consistent", so I don't find it worrying:
WITH FilteredEvents AS (
   SELECT
       person_id AS person_id,
       properties.pipeline_id AS pipeline_id,
       properties.runs AS runs,
       ROW_NUMBER() OVER (PARTITION BY (person_id, pipeline_id) ORDER BY timestamp DESC) AS rn
   FROM
       events
   WHERE
       event = 'Pipeline run (2.x)'
)
SELECT
   person_id,
   pipeline_id,
   runs
FROM
   FilteredEvents
WHERE
   rn = 1

Right now the values of runs are consistent with the second technique. Switching would be very simple, so I'm open to feedback on which one to use.


Note that, in either case, PostHog does NOT allow making graphs out of this data; it can only be shown in table form. See the bottom of https://posthog.com/docs/product-analytics/sql

@ZanSara requested a review from julian-risch October 9, 2023 11:39
@julian-risch (Member) left a comment

Downsampling by the number of runs messes up our weekly active users metrics. Limiting the number of events sent per day per user would work, but not limiting based on the number of runs. For now, it's best not to include it. 👍
pipeline_id is currently just based on the memory address, but ideally we would have a pipeline id that doesn't change if I run another pipeline with the same configuration later, right? Or do we want to rely on the memory address to see how long a certain pipeline runs without interruption?
The docstring for dynamic_system_specs should be updated. Otherwise looks good to me. Let's talk about pipeline_id and then I'll be ready to approve.

haystack/preview/telemetry/_environment.py (review thread resolved, outdated)

# Data sent to Posthog
return "Pipeline run (2.x)", {
"pipeline_id": str(id(pipeline)),
A Member commented:

Ideally this would be the same pipeline_id if a pipeline with the exact same configuration is run later again, or even on a different machine. Can't we create a hash based on pipeline_description? If it's out of scope for this PR, let's at least create an issue.
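
A purely hypothetical sketch of that idea (pipeline_description, its exact shape, and the helper name are assumptions, not code from this PR):

import hashlib
import json
from typing import Any, Dict

def stable_pipeline_id(pipeline_description: Dict[str, Any]) -> str:
    """Derive a pipeline_id that stays the same for identical pipeline configurations."""
    # Serialize deterministically so the same configuration always hashes to the same id,
    # regardless of dict insertion order or which machine runs the pipeline.
    canonical = json.dumps(pipeline_description, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()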

The PR author (Contributor) replied:

Happy to create the new issue 😊

haystack/preview/telemetry/_telemetry.py (review thread resolved)
@Timoeller (Contributor) commented

Please also include the first, second, and third run in the exponential downsampling, @ZanSara. Reason: we want to see whether users ask at least 3 queries on a RAG pipeline, to measure Haystack 2.0 adoption.
In any case, it would be good to parametrize the downsampling so we can adjust it after we have some measurements. (I assume the current downsampling is too coarse-grained on early interactions, but I would love to see some data first.)
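
For illustration, the parametrization could look roughly like this (a sketch with assumed names; the PR eventually settled on the simpler time-based strategy described later in this thread):

import time
from typing import Optional

class ParametrizedRunSampler:
    """Always report the first `first_n` runs, then at most one run per `interval_seconds`."""

    def __init__(self, first_n: int = 3, interval_seconds: float = 60.0) -> None:
        self.first_n = first_n
        self.interval_seconds = interval_seconds
        self._runs = 0
        self._last_sent: Optional[float] = None

    def should_send(self) -> bool:
        self._runs += 1
        now = time.monotonic()
        # Runs 1..first_n are always reported; afterwards the time-based limit applies.
        if self._runs <= self.first_n or self._last_sent is None or now - self._last_sent >= self.interval_seconds:
            self._last_sent = now
            return True
        return False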

@julian-risch you wrote

Downsampling by number of runs messes up our weekly active users metrics.

Currently, we only track when pipelines are initiated; tracking how often pipelines are run adds information, and how often people use a pipeline is very important to us. Of course, we would need to adjust how WAU is computed, but it is still possible to compute WAU in the same way as before.

@julian-risch (Member) commented on Oct 9, 2023

@Timoeller Sure, nothing against counting pipeline runs. 👍 Some more context: my comment was about the idea of "pipelines will send their first, 10th, 100th, 1000th run event, and then a run event every 10,000 runs" in an earlier version of the PR. With that approach we would easily have users who are active with dozens of queries per week but who wouldn't send an event every week, and thus wouldn't be counted as WAU in every week, just in the weeks when they pass a 10,000 mark. Sara and I discussed it, and the PR and its description were updated afterwards. Having at least one event per day is much better for counting WAU. 🙂

@ZanSara (Contributor, Author) commented on Oct 9, 2023

After a quick sync with @Timoeller I will also add tracking for the 2nd and 3rd run of a Pipeline. I don't expect a lot of useful data from them, but I have no objections to trying this out for a few versions and re-assessing once we have some hard numbers 🙂

@julian-risch (Member) left a comment

Looks good to me! 👍 Thanks for adjusting the downsampling. We can iterate on it later if we notice any need for changes. And great to see that you synced with Timo and integrated the feedback. 👍

@masci (Contributor) left a comment

I'm not super happy with this PR: the logic is quite complex already, and I can imagine we will keep iterating. Since we ended up tracking timestamps, I wonder if we should give the rate limiter I mentioned before a shot: we send all the events and start buffering once the sends per minute go above a certain threshold. This could be implemented in the Telemetry class without adding any complexity to the sending functions.
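
A rough sketch of that buffering idea (the threshold value and the names are assumptions; this is not code from the PR, which kept the simpler strategy):

import time
from collections import deque
from typing import Any, Deque, Dict, Tuple

class BufferingRateLimiter:
    """Send events immediately until a per-minute threshold is hit, then buffer them."""

    def __init__(self, max_events_per_minute: int = 10) -> None:
        self.max_events_per_minute = max_events_per_minute
        self._sent_timestamps: Deque[float] = deque()
        self.buffer: Deque[Tuple[str, Dict[str, Any]]] = deque()

    def submit(self, event_name: str, properties: Dict[str, Any]) -> bool:
        """Return True if the event should be sent now; otherwise buffer it for a later flush."""
        now = time.monotonic()
        # Forget sends older than the one-minute window.
        while self._sent_timestamps and now - self._sent_timestamps[0] > 60.0:
            self._sent_timestamps.popleft()
        if len(self._sent_timestamps) < self.max_events_per_minute:
            self._sent_timestamps.append(now)
            return True
        self.buffer.append((event_name, properties))
        return False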

I'm also not sure about the server-side aggregation: we always send pipeline._telemetry_runs; did you check whether Posthog lets you compute rollups using deltas? Otherwise we would keep adding the total runs to the total at every event send.

haystack/preview/telemetry/_telemetry.py (review thread resolved, outdated)
@ZanSara (Contributor, Author) commented on Oct 10, 2023

Also not sure about server-side aggregation: we always send pipeline._telemetry_runs, did you check if posthog lets you compute rollups using deltas? Otherwise we would keep adding total runs to the total at every event send.

Yes, see #5929 (comment): with their SQL support you can do basically anything, and I have made 2 working insights (see the other comment for the links) where we:

  1. accumulate the runs count (so I would need to reset the counter)
  2. pick the last event's runs field (so that I don't need to reset the counter) --> imho this is more robust against losing messages, but otherwise there's not a lot of difference

This PR currently assumes we're doing option 2. The catch is that, for some reason, PostHog cannot (yet) turn those tables into nice graphs, but the data is there and it's aggregated properly.

we send all the events and we start buffering as the sends per minute go above a certain threshold.

What threshold do you have in mind? Would 10 do? And what's the interval?

@ZanSara (Contributor, Author) commented on Oct 10, 2023

I've adapted the strategy again: it now always sends the first event and then at most one more per minute. As discussed with @masci, this is good enough to track pipeline usage; to track user activation (for example, users who made 1, 2, or 3 pipeline runs) we can use another set of events, to be added in another PR. It will be easier to manage them in Posthog this way.

@ZanSara requested a review from masci October 10, 2023 12:54
@ZanSara merged commit 110aacd into main Oct 13, 2023
@ZanSara deleted the telemetry-2.0 branch October 13, 2023 07:31
Labels: 2.x (Related to Haystack v2.0), topic:telemetry, type:documentation (Improvements on the docs)
4 participants