---
sidebar_position: 1
---

# Epilot Datalake

Welcome to the documentation for our Data Lake feature, which serves as the centralized repository for real-time event streams of entity operations and snapshots of workflow execution data. This feature empowers users to access and analyze essential data generated by the 360 portal, including changes to entities such as orders, opportunities, contacts, accounts, products, and more.

Our Data Lake is hosted on Amazon S3 buckets and seamlessly integrated with Redshift for data warehousing, enabling users to leverage Business Intelligence (BI) tools and create insightful reports. This documentation guides you through the key components of the Data Lake feature, including data schemas, usage, and credential management.

Feel free to contact our customer support or sales team for help enabling the Data Lake feature for your organization.

## Data Schemas

Our Data Lake features two distinct datasets:

### 1. Entity Operations Event Streams

This dataset captures real-time event streams of entity operations.

The schema for entity operations is as follows:

```json
{
  "id": "string",          // ID for the entity operation
  "detail-type": "string", // Detail type: "EntityOperation" or "SnapshotOperation"
  "operation": "string",   // Operation: "createEntity", "updateEntity", or "deleteEntity"
  "source": "string",      // Source microservice generating the event
  "account": "string",     // Account or organization associated with the operation
  "time": "string",        // Timestamp of the entity mutation
  "region": "string",      // Region associated with the operation
  "activity_id": "string", // ID for the individual entity operation/activity
  "entity_id": "string",   // ID of the entity being mutated
  "detail": "string",      // Stringified JSON payload containing entity data
  "schema": "string",      // Schema of the entity
  "month": "string",       // Month of the entity operation
  "year": "string"         // Year of the entity operation
}
```

Fields of interest in this schema include:

**detail-type**: Distinguishes between `EntityOperation` and `SnapshotOperation`.
- `EntityOperation`: These events represent real-time changes to entities and are ingested immediately upon entity mutation.
- `SnapshotOperation`: These events are part of monthly roll-up operations that create snapshots of the latest activity state of active entities. These snapshots are useful for optimizing queries and performance when retrieving the current active entities within a specific month's partition.

**operation**: Describes the type of entity operation: creation, update, or deletion.
- `createEntity`: Recorded when a new entity is created.
- `updateEntity`: Recorded when an existing entity is updated.
- `deleteEntity`: Recorded when an entity is deleted.

**time**: Provides a timestamp for building time series reports.

**activity_id** and **entity_id**: Unique identifiers that can be used to track individual entity operations (`activity_id`) and identify the specific entity that was mutated (`entity_id`).

**detail**: Contains a stringified JSON payload with the full entity data at the time of the operation. This payload can be parsed to access detailed information about the entity's state during the operation.

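For instance, to read a field out of the stringified `detail` payload while taking advantage of the monthly snapshot partitions, you can combine the partition columns with Redshift's `json_extract_path_text` function. A minimal sketch, reusing the `opportunity` schema and the `_created_at` payload field from the examples further below:

```sql
-- Snapshot events for opportunities in the current month's partition,
-- extracting the creation timestamp from the stringified JSON payload
SELECT entity_id,
       "time",
       json_extract_path_text(detail, 'payload', '_created_at', TRUE) AS created_at
FROM "epilot_datalake"."epilot_datalake"."entity_operations"
WHERE "detail-type" = 'SnapshotOperation'
  AND "schema" = 'opportunity'
  AND "year" = EXTRACT(YEAR FROM CURRENT_DATE)
  AND "month" = EXTRACT(MONTH FROM CURRENT_DATE);
```
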
> Additionally, we offer a simplified view called **current_entities**, which displays only the latest state of currently active entities in the organization, excluding deleted entities.

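If you only need the present state of your entities, querying this view is simpler than rolling up the event stream yourself. A minimal sketch, assuming the view exposes the same `entity_id`, `schema`, and `detail` columns as `entity_operations`:

```sql
-- Latest state of all active opportunities, no event replay needed
SELECT entity_id, detail
FROM "epilot_datalake"."epilot_datalake"."current_entities"
WHERE "schema" = 'opportunity';
```
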
### 2. Workflow Execution Snapshots

In addition to entity operations, we ingest real-time updates from workflow executions, allowing users to access current snapshots of workflow data. This dataset comprises five tables:

- **WorkflowExecution**
  This table contains all the attributes of workflow execution data.

- **WorkflowTasks**
  This table stores attributes related to individual steps or tasks within a workflow execution.

- **WorkflowPhases**
  Here, you can find attributes of the various phases involved in a workflow execution.

- **WorkflowCancellations**
  This table records data regarding the reasons for workflow execution cancellations.

- **WorkflowContexts**
  Contains information about related entities associated with a workflow execution.

![Workflow Executions DB Schema Design](/img/datalake/workflow-executions-db-schema.png)

For an interactive DB design, please refer [here](https://drawsql.app/teams/epilot/diagrams/workflows/embed).
To learn more about workflow execution details, please refer [here](/api/workflow-execution#tag/Workflows/operation/getExecution).

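To give a feel for how these tables relate before you open the diagram, here is a sketch joining executions to their tasks. The join key and column names (`execution_id`, `name`, `status`) are illustrative assumptions; consult the schema diagram above for the actual identifiers:

```sql
-- Sketch: tasks per workflow execution (identifiers are assumptions)
SELECT e.execution_id,
       t.name   AS task_name,
       t.status AS task_status
FROM "epilot_datalake"."epilot_datalake"."workflowexecution" e
JOIN "epilot_datalake"."epilot_datalake"."workflowtasks" t
  ON t.execution_id = e.execution_id;
```
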
## Setting up the Data Lake

### Generate Data Lake credentials

- Navigate to the "Settings" menu at the top right of your organization's portal, then locate the "Data Lake" section in the drop-down menu.
  ![Datalake menu](/img/datalake/datalake-menu.png)

- The Data Lake page displays the active credentials for your Data Lake. Here, you can revoke existing credentials or create new ones as needed.
  ![Datalake page](/img/datalake/datalake-page.png)

- To create new credentials, select the `Generate Credentials` button.

- A new set of credentials will be generated. Note that, for security reasons, the password is visible only once at the time of creation, so it's crucial to save it securely.
  ![Datalake Credentials](/img/datalake/datalake-credentials.png)

- Use the generated `username`, `endpoint`, `database`, and `password` details to connect to the Data Lake from any BI tool or other data source.

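Before wiring up a BI tool, you can sanity-check the new credentials from any SQL client; `current_database()` and `current_user` are standard Redshift functions:

```sql
-- Quick connectivity check after entering the generated credentials
SELECT current_database(), current_user;
```
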
## Querying the Data

Once you've set up the necessary credentials, you can connect to the Redshift data warehouse and query the data. There are two primary ways to interact with the data:

### Directly Querying via SQL

You can write SQL queries to retrieve and manipulate the data directly from Redshift using SQL functions. Below are examples illustrating how to query the data for insights.

**Example 1: Reporting Opportunities Created Over Time**

Suppose you need to create a report showing opportunities created over time, grouped by journey source, with a time granularity of months. You can use SQL to accomplish this task:

```sql
WITH span AS
  (SELECT *
   FROM "epilot_datalake"."epilot_datalake"."entity_operations"
   WHERE "schema" = 'opportunity')
SELECT json_extract_path_text(detail, 'payload', 'source', 'title', TRUE) AS "Journey Title",
       EXTRACT(YEAR FROM json_extract_path_text(detail, 'payload', '_created_at', TRUE)::timestamp) AS year_,
       EXTRACT(MONTH FROM json_extract_path_text(detail, 'payload', '_created_at', TRUE)::timestamp) AS month_,
       COUNT(*)
FROM span
WHERE "operation" = 'createEntity'
GROUP BY "Journey Title", year_, month_
ORDER BY year_, month_;
```

This SQL query retrieves data about opportunities created over time, extracts the relevant information from the JSON payload, and aggregates it by year and month, providing insights into opportunities created during different periods.

![Opportunities created over time](/img/datalake/opportunity-time-series.png)

**Example 2: List of Current Active Opportunities**

To obtain a list of all currently active opportunities from the entity operations, you can use the following SQL query:

```sql
WITH span AS
  (SELECT *
   FROM "epilot_datalake"."epilot_datalake"."entity_operations"
   WHERE "schema" = 'opportunity'
     AND "year" = EXTRACT(YEAR FROM CURRENT_DATE)
     AND "month" = EXTRACT(MONTH FROM CURRENT_DATE))
SELECT a.entity_id, b."schema", b.detail
FROM
  (SELECT entity_id,
          MAX("time") AS updated_at,
          LISTAGG("operation") AS operations,
          COUNT(*) AS num_operations
   FROM span
   GROUP BY entity_id) a
LEFT JOIN span b ON a.entity_id = b.entity_id
  AND a.updated_at = b."time"
WHERE operations NOT LIKE '%deleteEntity%';
```

This query retrieves the list of currently active opportunities by identifying the latest update to each opportunity and checking that its operations do not include `deleteEntity`.

![Current active opportunities](/img/datalake/current-opportunities.png)

### Connecting to BI Tools

Alternatively, you can connect the Business Intelligence (BI) tool of your choice to load the data from Redshift and build reports and dashboards. Many BI tools support direct integration with Redshift, allowing you to create visually appealing and interactive reports based on your data.

> **Example**: We will walk you through connecting to Power BI to create a demo BI report for `Raven Energy GmbH`, an energy utility company with two distinct journeys and workflow processes for handling `Wallbox` and `Energieausweis` sales digitally. Each use case involves two products. We will leverage our Data Lake to set up a BI report for this scenario.

Steps to Create a Power BI Report:

- **Open Power BI**: Launch Power BI and add a new data source connection using the Redshift connector available in the list of supported connectors.
  ![Power BI Redshift](/img/datalake/powerbi-redshift.png)

- **Enter Connection Details**: Enter the required connection details, including the Data Lake's endpoint, username, password, and the specific database. Then click "OK" to establish the connection.

  ![Power BI Connection](/img/datalake/powerbi-connection.png)

- **Select Data**: Once connected, Power BI will display the available schemas and tables. You can easily locate the "entity_operations" table within the "epilot_datalake" schema, along with the "current_entities" simplified view for easier data access.

  ![Power BI Entity Operations](/img/datalake/powerbi-entity-operations.png)

- **Load or Transform Data**: You have two options for data handling. You can click "Load Data" to load all the raw data directly into Power BI, or click "Transform" to open Power BI's query editor, where you can define custom transformation logic to preprocess and shape the data to your specific requirements. If you would rather shape the data in Redshift itself, see the sketch below.

  ![Power BI Current Entities](/img/datalake/powerbi-current-entities.png)

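If you prefer to do the heavy lifting in Redshift rather than in the query editor, you can load the result of a pre-aggregated query instead of the raw tables. A sketch (the aggregation mirrors Example 1 above; adapt it to your report):

```sql
-- Pre-aggregate in Redshift so Power BI only loads the summarized rows
SELECT json_extract_path_text(detail, 'payload', 'source', 'title', TRUE) AS journey_title,
       DATE_TRUNC('month', json_extract_path_text(detail, 'payload', '_created_at', TRUE)::timestamp) AS month,
       COUNT(*) AS opportunities_created
FROM "epilot_datalake"."epilot_datalake"."entity_operations"
WHERE "schema" = 'opportunity'
  AND "operation" = 'createEntity'
GROUP BY 1, 2;
```
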
- **Data Modeling**: After loading or transforming the data, you can begin setting up data models. Create relationships between different tables to enable seamless data exploration and reporting. For example, you can establish relationships between the entity_operations data and other relevant tables in your dataset.

  ![Power BI Data Model](/img/datalake/powerbi-data-model.png)

- **Report Creation**: With the data modeling in place, you can start creating visually appealing reports within Power BI. Utilize various chart types, tables, and visuals to convey insights effectively.

  ![Power BI Report](/img/datalake/powerbi-report.png)

## Interactive Power BI report using the Epilot Data Lake

<iframe title="raven bi" width="800" height="486" src="https://app.powerbi.com/view?r=eyJrIjoiZDQ4MmQzNzAtODVlYy00MjdiLTg5ODMtNzVhNmMxOTU4OGUzIiwidCI6IjMzZDRmM2U1LTNkZjItNDIxZS1iOTJlLWE2M2NmYTY4MGE4OCJ9" frameborder="0" allowFullScreen="true"></iframe>

Our Data Lake feature provides a powerful way to capture and analyze real-time entity operations, enabling you to gain valuable insights from your data. By understanding the data schema and following the steps to set up credentials, you can leverage this feature to build reports, perform analytics, and make data-driven decisions for your organization.

If you have any further questions or need assistance with data queries, please reach out to our team.