Update docs as per default tp_index behaviour change #58

Merged: 8 commits, Jun 20, 2025


@pskrbasu pskrbasu commented Jun 18, 2025

Merge only after Tailpipe v0.5.0 is released.

@pskrbasu pskrbasu self-assigned this Jun 18, 2025

Preview Available 🚀

Commit Author: Karan Popat
Commit Message: Fix the preview builds

Preview Link: tailpipe-io-git-docs-tpindex-turbot.vercel.app

@pskrbasu pskrbasu marked this pull request as ready for review June 19, 2025 10:27
@@ -53,14 +53,14 @@ Tailpipe uses [hive partitioning](https://duckdb.org/docs/data/partitioning/hive

- The data is written to Parquet files in the workspace directory, with a prescribed directory and filename structure. Each partition is written to a separate directory.

-- For [custom tables](/docs/collect/custom-tables), you can define a `tp_index` column on which to index. For tables implemented by plugins, the index is not *user*-definable. Be aware that defining a `tp_index` does not always increase performance and may, in fact, decrease it as it can result in many small parquet files.
+- The `tp_index` is used to partition the data and defaults to `"default"` if not specified. You can configure the `tp_index` in your partition config to specify a column whose value should be used as tp_index. Be aware that defining a `tp_index` does not always increase performance and may, in fact, decrease it as it can result in many small parquet files.

Add a link here, e.g.:

... You can configure the `tp_index` in your [partition config](/docs/reference/config-files/partition)
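
A minimal Python sketch of the partitioning behaviour this hunk describes, assuming a hive-style directory layout keyed by table, partition, index, and date. The function name `hive_partition_dir`, the key names and their order, and the sample account IDs are hypothetical and chosen for illustration; this is not Tailpipe's implementation.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Mapping, Optional


def hive_partition_dir(
    row: Mapping[str, object],
    table: str,
    partition: str,
    day: date,
    tp_index_column: Optional[str] = None,
) -> PurePosixPath:
    """Build an illustrative hive-style directory for one row.

    The key names and their order are assumptions for this sketch.
    """
    # With no column configured, every row shares the literal index "default".
    tp_index = "default" if tp_index_column is None else str(row[tp_index_column])
    return PurePosixPath(
        f"tp_table={table}",
        f"tp_partition={partition}",
        f"tp_index={tp_index}",
        f"tp_date={day.isoformat()}",
    )


rows = [{"account_id": "111111111111"}, {"account_id": "222222222222"}]
day = date(2025, 6, 20)

# No index column configured: all rows map to the same directory.
default_dirs = {
    hive_partition_dir(r, "aws_cloudtrail_log", "cloudtrail_all", day) for r in rows
}
# Index on account_id: one directory per distinct account.
indexed_dirs = {
    hive_partition_dir(r, "aws_cloudtrail_log", "cloudtrail_all", day, "account_id")
    for r in rows
}
print(len(default_dirs), len(indexed_dirs))  # prints "1 2"
```

With no column configured, both rows land in one directory. Indexing on `account_id` produces one directory per account, which is how a high-cardinality column can lead to many small Parquet files.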

@@ -65,6 +65,8 @@ You may also use one or more [`column` definitions](/docs/reference/config-files

In our example, the source format does not define a field named `tp_timestamp`. Since ***`tp_timestamp` is a required column***, we will add a `tp_timestamp` column and map the `timestamp` from the source. Also, the source includes a `plugin_timestamp`, but it is parsed as a number because it is epoch milliseconds. We will transform it to a timestamp data type.

+> [!NOTE]
+> You cannot set the `tp_index` mapping in the table definition. The `tp_index` can only be configured through the partition config, where it defaults to `"default"` if not specified.

add link:

... configured through the [partition config](/docs/reference/config-files/partition)

@@ -93,4 +93,4 @@ partition "aws_cloudtrail_log" "cloudtrail_all" {

## What partition indexes are available for a table?

-That depends on how the plugin author has defined the common `tp_index` field. For AWS tables, it's the `account_id`. In the dual-partition case above, you could carve the logs by `account_id` using the common `tp_partition` field (but `tp_index` will always be the same). In the single-partition case above, you could carve the logs by `account_id` using `tp_index` (but `tp_partition` will always be the same).
+The `tp_index` value depends on how you have configured it in your partition config. By default, `tp_index` is set to `"default"`, but you can configure it to specify a column whose value should be used as the partition index, as makes sense for the data. For AWS tables, you might set it to `account_id`.

add link:

... depends on how you have configured it in your [partition config](/docs/reference/config-files/partition)

@@ -16,6 +16,7 @@ Compact multiple Parquet files per day to one per day.
| Flag | Description
|-|-
| `--help` | Help for compact
+| `--reindex` | Update the `tp_index` field to the currently configured value

The description should be something more like:

Reorganize data using the currently configured `tp_index` structure. Any data collected using a different `tp_index` value will be rewritten to new files [partitioned](/docs/reference/config-files/partition) using the current `tp_index`.
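
As a rough illustration of the reorganization described in this suggestion, the Python sketch below regroups already-collected rows under whatever index is currently configured. It is a toy model under stated assumptions: the `reindex` function, the in-memory row dictionaries, and the sample account IDs are hypothetical, and nothing here reflects how `tailpipe compact --reindex` is actually implemented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Optional


def reindex(
    rows: Iterable[Mapping[str, str]],
    tp_index_column: Optional[str] = None,
) -> Dict[str, List[dict]]:
    """Regroup rows under the currently configured index (toy model)."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        # Derive the index from the current config, falling back to "default".
        current = "default" if tp_index_column is None else row[tp_index_column]
        # Overwrite whatever tp_index the row was collected with.
        groups[current].append({**row, "tp_index": current})
    return dict(groups)


# Rows collected at different times, some before an index column was configured.
collected = [
    {"account_id": "111111111111", "tp_index": "default"},
    {"account_id": "222222222222", "tp_index": "default"},
    {"account_id": "111111111111", "tp_index": "111111111111"},
]
regrouped = reindex(collected, tp_index_column="account_id")
print(sorted(regrouped))  # prints "['111111111111', '222222222222']"
```

Rows stored under the old `"default"` index end up in the same group as rows that were already collected with the current index, so each group maps to a single set of files.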

@@ -111,7 +111,7 @@ Tailpipe supports most of the [DuckDB general-purpose data types](https://duckdb

Tailpipe tables include a set of common columns. These mappings enable queries that correlate values across different logs. If you have collected both Cloudtrail and ALB logs, for example, you could query for `tp_ips` to find IP addresses in the `aws_cloudtrail_log` and `aws_alb_access_log` tables using the same syntax.

-When creating a custom table, `tp_timestamp` is the only required column; ***you must define a `tp_timestamp` column***. This is because Tailpipe uses the timestamp to [organize the data files](/docs/collect/configure#hive-partitioning). The `tp_index` is also used in the hive partitioning scheme. You may set it if you want, but it will default to `default` if not set.
+When creating a custom table, `tp_timestamp` is the only required column; ***you must define a `tp_timestamp` column***. This is because Tailpipe uses the timestamp to [organize the data files](/docs/collect/configure#hive-partitioning). The `tp_index` is also used in the hive partitioning scheme. By default, `tp_index` is set to `"default"`, but you can configure it in your partition config to specify a column whose value should be used as the partition index.

add link:

...but you can configure it in your [partition config](/docs/reference/config-files/partition)

@johnsmyth johnsmyth merged commit 1463552 into main Jun 20, 2025
3 checks passed
@johnsmyth johnsmyth deleted the tp_index branch June 20, 2025 14:00

3 participants