Update datalake doc for clickhouse
Suresh Kumar Sivasankaran committed Jun 14, 2024
1 parent 82250b2 commit 54717a8
Showing 9 changed files with 42 additions and 65 deletions.
107 changes: 42 additions & 65 deletions docs/datalake/epilot-datalake.mdx

Welcome to the documentation for our Data Lake feature, which serves as the centralized repository for real-time event streams of entity operations and snapshots of workflow execution data. This feature empowers users to access and analyze essential data generated by the 360 portal, including changes to entities such as orders, opportunities, contacts, accounts, products, and more.

Our Data Lake is seamlessly integrated with ClickHouse for data warehousing, enabling users to leverage Business Intelligence (BI) tools and create insightful reports. This documentation will guide you through the key components of the Data Lake feature, including data schemas, usage, and credential management.

Feel free to contact our customer support or sales team for help in enabling the Data Lake feature for your organization.

The schema for entity operations is as follows:

```json
{
"id": "string", // ID for the entity operation
"detail-type": "string", // Detail type can be "EntityOperation" or "SnapshotOperation"
"operation": "string", // Operation can be "createEntity," "updateEntity," or "deleteEntity"
"source": "string", // Source of the microservice generating the event
"account": "string", // Account or organization associated with the operation
"time": "string", // Timestamp of the entity mutation
"region": "string", // Region associated with the operation
"activity_id": "string", // ID for the individual entity operation/activity
"entity_id": "string", // ID of the entity being mutated
"detail": "string", // Stringified JSON payload containing entity data
"schema": "string", // Schema of the entity
"month":"string", // Month of the entity operation
"year": "string" // Year of the entity operation
"activity_id": "string", // ID for the individual entity operation/activity
"entity_id": "string", // ID of the entity being mutated
"org_id": "string", // Organization ID
"operation": "string", // Operation can be "createEntity," "updateEntity," or "deleteEntity"
"schema": "string", // Schema of the entity
"timestamp": "DateTime", // Timestamp of the entity mutation
"detail": "string", // Stringified JSON payload containing entity data
}
```
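For downstream consumers it can help to give this record a typed shape. A minimal sketch using a Python dataclass (field names mirror the schema above; `timestamp` is kept as a plain string here for simplicity, and the sample values are made up):

```python
from dataclasses import dataclass

@dataclass
class EntityOperation:
    activity_id: str
    entity_id: str
    org_id: str
    operation: str   # "createEntity", "updateEntity", or "deleteEntity"
    schema: str      # entity schema, e.g. "opportunity"
    timestamp: str   # DateTime in the warehouse; ISO 8601 string here
    detail: str      # stringified JSON payload containing entity data

# Illustrative record, not real data.
op = EntityOperation(
    activity_id="act-1", entity_id="ent-1", org_id="org-1",
    operation="createEntity", schema="opportunity",
    timestamp="2024-05-02T10:15:00Z", detail="{}",
)
print(op.operation)  # -> createEntity
```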

Fields of interest in this schema include:


**Operation**: Describes the type of entity operation, including creation, update, or deletion.
- `createEntity`: This operation type is recorded when a new entity is created.
- `updateEntity`: This operation type is recorded when an existing entity is updated.
- `deleteEntity`: This operation type is recorded when an entity is deleted.

**timestamp**: Provides a timestamp for building time series reports.

**activity_id** and **entity_id**: These fields contain unique identifiers that can be used to track individual entity operations (activity_id) and identify the specific entity that was mutated (entity_id).

**detail**: The detail field contains a stringified JSON payload that includes the full entity data at the time of the operation. This payload can be parsed and used to access detailed information about the entity's state during the operation.
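Because `detail` arrives as a string, it must be deserialized before use. A minimal Python sketch (the payload keys shown are illustrative, based on the fields referenced elsewhere in this document):

```python
import json

# A simplified entity-operation record; values are illustrative.
event = {
    "entity_id": "e-123",
    "operation": "createEntity",
    "schema": "opportunity",
    "detail": '{"payload": {"source": {"title": "Wallbox Journey"}, '
              '"_created_at": "2024-05-02T10:15:00Z"}}',
}

# The detail field is a stringified JSON document, so parse it first.
payload = json.loads(event["detail"])["payload"]

journey_title = payload["source"]["title"]  # nested lookup into the payload
created_at = payload["_created_at"]
print(journey_title, created_at)
```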

> Additionally, we offer a simplified view called **{org_id}_current_entities_final**, which displays only the latest state of currently active entities in the organization, excluding deleted entities.
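The simplified view described above keeps only the latest state of active entities. Its semantics can be approximated in plain Python by replaying the raw operation stream, a hedged sketch rather than the actual view definition (field names follow the entity-operations schema; the sample data is made up):

```python
from typing import Dict, List

def current_entities(operations: List[dict]) -> Dict[str, dict]:
    """Return the latest record per entity, excluding deleted entities."""
    latest: Dict[str, dict] = {}
    # ISO 8601 timestamp strings sort chronologically, so a plain sort works here.
    for op in sorted(operations, key=lambda o: o["timestamp"]):
        latest[op["entity_id"]] = op
    # Entities whose most recent operation is a delete are excluded.
    return {eid: op for eid, op in latest.items()
            if op["operation"] != "deleteEntity"}

ops = [
    {"entity_id": "a", "operation": "createEntity", "timestamp": "2024-01-01T00:00:00Z"},
    {"entity_id": "a", "operation": "updateEntity", "timestamp": "2024-02-01T00:00:00Z"},
    {"entity_id": "b", "operation": "createEntity", "timestamp": "2024-01-05T00:00:00Z"},
    {"entity_id": "b", "operation": "deleteEntity", "timestamp": "2024-03-01T00:00:00Z"},
]
print(current_entities(ops))  # only entity "a" remains, in its updated state
```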
### 2: Workflow Execution Snapshots

Expand Down Expand Up @@ -98,79 +88,61 @@ To know more about workflow execution details, please refer [here](/api/workflow
- A new set of credentials will be generated. Note that the password will be visible only once at the time of creation for security reasons, so it's crucial to save it securely.
![Datalake Credentials](/img/datalake/datalake-credentials.png)

- Utilize the generated `username`, `host`, `port`, `database`, and `password` details to connect to the Data Lake from any BI tool or other data sources.

## Querying the Data
Once you've set up the necessary credentials, you can connect to the ClickHouse data warehouse and query the data. There are two primary ways to interact with the data:

### Directly Querying via SQL
You can write SQL queries to retrieve the data directly from ClickHouse using SQL functions. Below are examples illustrating how to query the data for insights.

**Example 1: Reporting Opportunities Created Over Time**
Suppose you need to create a report showing opportunities created over time, grouped by journey source, with a time granularity of months. You can use SQL to accomplish this task:

``` sql
SELECT
    count(*) AS count,
    JSONExtractString(detail, 'payload', 'source', 'title') AS journey_source,
    toStartOfYear(timestamp) AS year,
    toStartOfMonth(timestamp) AS month
FROM entity_operations
WHERE schema = 'opportunity'
  AND operation = 'createEntity'
GROUP BY
    journey_source,
    year,
    month
ORDER BY month;
```

This SQL query retrieves data about opportunities created over time, extracts relevant information from the JSON payload, and aggregates it by year and month, providing insights into opportunities created during different periods.
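The same month-by-month roll-up can be reproduced client-side on rows already fetched from the warehouse, which is handy for sanity-checking BI output. A pure-Python sketch (the sample rows are made up; `toStartOfMonth` is emulated by truncating each timestamp to the first of the month):

```python
import json
from collections import Counter
from datetime import date, datetime

# (timestamp, stringified detail payload) pairs, as for createEntity events.
rows = [
    ("2024-01-15T09:00:00", '{"payload": {"source": {"title": "Wallbox"}}}'),
    ("2024-01-20T11:30:00", '{"payload": {"source": {"title": "Wallbox"}}}'),
    ("2024-02-03T08:45:00", '{"payload": {"source": {"title": "Energieausweis"}}}'),
]

counts: Counter = Counter()
for ts, detail in rows:
    source = json.loads(detail)["payload"]["source"]["title"]
    dt = datetime.fromisoformat(ts)
    month_start = date(dt.year, dt.month, 1)  # analogous to toStartOfMonth
    counts[(source, month_start)] += 1

print(counts)
```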

![Datalake page](/img/datalake/opportunity-time-series.png)

You can use any SQL client to connect to the ClickHouse data warehouse using the credentials provided. For more detailed information, please refer to [this link](https://clickhouse.com/docs/en/integrations/datagrip).



### Connecting to BI Tools

Alternatively, you can connect to Business Intelligence (BI) tools of your choice to load the data from ClickHouse and build reports and dashboards. Many BI tools support direct integration with ClickHouse, allowing you to create visually appealing and interactive reports based on your data. You can also use the official [ClickHouse ODBC driver](https://github.com/ClickHouse/clickhouse-odbc/releases) to establish the connection with ClickHouse.

> **Example** - we will walk you through an example of connecting to Power BI to create a demo BI report for `Raven Energy GmbH`, an energy utility company with two distinct journeys and workflow processes for handling `Wallbox` and `Energieausweis` sales digitally. Each use case involves two products. We will leverage our Data Lake to set up a BI report for this scenario.
To get started with the various ways to establish a connection between ClickHouse and Power BI, please refer to [this link](https://clickhouse.com/docs/en/integrations/powerbi).

Steps to Create a Power BI Report:

- **Open Power BI**: Once you have installed all the necessary dependencies to set up an ODBC connection as specified in the link above, you can add a new data source by selecting the ODBC option from the list of supported connectors.
![Power BI ODBC](/img/datalake/powerbi-odbc.png)

- **Enter Connection Details**: Enter the required connection details, including the ODBC connection string, username, password, and the specific database. Then, click "OK" to establish the connection.

  Example ODBC connection string: `Driver={ClickHouse ODBC Driver (ANSI)};Server={replace-it-with-host};Port=8443;Database=datawarehouse;`

![Power BI Connection](/img/datalake/powerbi-connection.png)

- **Select Data**: Once connected, Power BI will display the available schemas and tables. You can easily locate the "{org_id}_entity_operations" table within the "datawarehouse" schema, along with the "{org_id}_current_entities_final" simplified view for easier data access.

![Power BI Entity Operations](/img/datalake/powerbi-entity-operations.png)
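The ODBC connection string shown above can also be assembled programmatically, which avoids typos when rotating credentials. A small sketch (the helper name and defaults are hypothetical, not part of any official client):

```python
def build_clickhouse_odbc_dsn(host: str, database: str, port: int = 8443,
                              driver: str = "ClickHouse ODBC Driver (ANSI)") -> str:
    """Assemble an ODBC connection string matching the documented format.

    Hypothetical helper for illustration; braces around Driver and Server
    follow the example connection string in this document.
    """
    return f"Driver={{{driver}}};Server={{{host}}};Port={port};Database={database};"

# Illustrative host name, not a real endpoint.
dsn = build_clickhouse_odbc_dsn("example.clickhouse.host", "datawarehouse")
print(dsn)
```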

<iframe title="raven bi" width="100%" height="600" src="https://app.powerbi.com/view?r=eyJrIjoiZDQ4MmQzNzAtODVlYy00MjdiLTg5ODMtNzVhNmMxOTU4OGUzIiwidCI6IjMzZDRmM2U1LTNkZjItNDIxZS1iOTJlLWE2M2NmYTY4MGE4OCJ9" frameborder="0" allowFullScreen="true"></iframe>
</div>



Our Data Lake feature provides a powerful way to capture and analyze real-time entity operations, enabling you to gain valuable insights from your data. By understanding the data schema and following the steps to set up credentials, you can leverage this feature to build reports, perform analytics, and make data-driven decisions for your organization.


You can also refer to the following link to establish connections to [different BI tools](https://clickhouse.com/docs/en/integrations/data-visualization).

If you have any further questions or need assistance with data queries, please reach out to our team.