Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slack ts and thread_ts inconsistent types #412

Open
zilto opened this issue Apr 4, 2024 · 4 comments
Open

Slack ts and thread_ts inconsistent types #412

zilto opened this issue Apr 4, 2024 · 4 comments

Comments

@zilto
Copy link
Collaborator

zilto commented Apr 4, 2024

dlt version

0.4.7

Source name

slack

Describe the problem

Values by get_messages() and get_thread_replies() don't return the same data types for field ts and thread_ts. Values are returned as timestamp for the first and string for the latter.

This is problematic when trying to join tables of messages and replies based on their thread_ts (thread id), which is a very common operation.

This is because get_messages() passes datetime_fields=MSG_DATETIME_FIELDS whereas get_thread_replies() doesn't.

Expected behavior

  1. ts and thread_ts should both receive the same type from MSG_DATETIME_FIELDS

  2. More importantly, according to Slack specs, ts and thread_ts are not timestamps and string is actually the proper type. (see ref)

There are a few additional fields that describe the author (such as user or bot_id), but there's also an additional ts field. The ts value is essentially the ID of the message, guaranteed unique within the context of a channel or conversation.

They look like UNIX/epoch timestamps, hence ts, with specified milliseconds. But they're actually message IDs, even if they're partially composed in seconds-since-the-epoch.

Given ts and thread_ts do not exactly represent a timestamp but rather are unique ids that can be sorted chronologically, I just removing them from the default values of MSG_DATETIME_FIELDS.

This would be a breaking change for the message tables, but not for replies tables, so it would the right time to introduce the change to defaults if accepted.

Steps to reproduce

dlt init slack

How you are using the source?

I run this source for fun.

Operating system

Linux

Runtime environment

Local

Python version

3.10.9

dlt destination

duckdb

Additional information

As a solution, I manually change type of ts and thread_ts of messages from timestamp to string

@zilto
Copy link
Collaborator Author

zilto commented Apr 12, 2024

I can make the PR for the changes, just let me know what direction to go!

@VioletM
Copy link
Contributor

VioletM commented Apr 15, 2024

Hey @zilto good point!
I think the reason it was done this way -- so we have an incremental field. But string could also be incremental, so I agree with your idea of converting both to strings.

Thanks a lot for the suggestion to create a PR, this would be wonderful! Just put me as a reviewer :)

@zilto
Copy link
Collaborator Author

zilto commented Apr 20, 2024

I started making these changed to open a PR, but I encountered another challenge.

In a few places, we have dlt.sources.incremental() throwing an exception because it can't compare ts of type str and created_at of type DateTime.

 created_at: dlt.sources.incremental[DateTime] = dlt.sources.incremental(
    "ts",
    initial_value=start_dt,
    end_value=end_dt,
    allow_external_schedulers=True,
),

line

I envision 2 solutions:

  1. pass MSG_DATETIME_FIELDS to get_thread_replies() to make everything ts and thread_ts consistently a timestamp.
  2. do post-processing (dlt.transformer()?) to change the type of ts and thread_ts to str after comparisons for incremental loading are done.

I could implement solution 1, but would need some additional guidance for solution 2. Let me know how to proceed

@VioletM
Copy link
Contributor

VioletM commented May 16, 2024

Hi @zilto!
I think what we can do -- is to create the variable in str type right away:

created_at: dlt.sources.incremental[str] = dlt.sources.incremental(
    "ts",
    initial_value=start_dt,
    end_value=end_dt,
    allow_external_schedulers=True,
),

Or make everything a timestamp, as you've suggested in 1 solution.

If you have some code ready -- could you open a PR? Or if you don't have a time to continue with this, just let me know, we'll assign someone on this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Planned
Development

No branches or pull requests

2 participants