Merge pull request #5728 from EnterpriseDB/release-2024-06-04a
Release 2024-06-04a
djw-m authored Jun 4, 2024
2 parents 8e8d651 + f85ea19 commit 5e4dbc8
Showing 56 changed files with 462 additions and 489 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/sync-and-process-files.yml
@@ -53,4 +53,4 @@ jobs:
path: destination/
reviewers: ${{ env.REVIEWERS }}
title: ${{ env.TITLE }}
token: ${{ secrets.SYNC_FILES_TOKEN }}
token: ${{ secrets.GH_TOKEN }}
9 changes: 4 additions & 5 deletions advocacy_docs/edb-postgres-ai/ai-ml/install-tech-preview.mdx
@@ -9,10 +9,9 @@ The preview release of pgai is distributed as a self-contained Docker container

## Configuring and running the container image

If you havent already, sign up for an EDB account and log in to the EDB container registry.
If you haven't already, sign up for an EDB account and log in to the EDB container registry.


Log in to docker with your the username tech-preview and your EDB Repo 2.0 Subscription Token as your password:
Log in to Docker with the username tech-preview and your EDB Repo 2.0 subscription token as your password:

```shell
docker login docker.enterprisedb.com -u tech-preview -p <your_EDB_repo_token>
@@ -65,13 +64,13 @@ docker run -d --name pgai \

## Connect to Postgres

If you havent yet, install the Postgres command-line tools. If youre on a Mac, using Homebrew, you can install it as follows:
If you haven't yet, install the Postgres command-line tools. If you're on a Mac, using Homebrew, you can install it as follows:

```shell
brew install libpq
```

Connect to the tech preview PostgreSQL running in the container. Note that this relies on $PGPASSWORD being set - if youre using a different terminal for this part, make sure you re-export the password:
Connect to the tech preview PostgreSQL running in the container. Note that this relies on $PGPASSWORD being set; if you're using a different terminal for this part, make sure you re-export the password:

```shell
psql -h localhost -p 15432 -U postgres postgres
@@ -1,5 +1,5 @@
---
title: Additional functions and stand-alone embedding in pgai
title: Additional functions and standalone embedding in pgai
navTitle: Additional functions
description: Other pgai extension functions and how to generate embeddings for images and text.
---
@@ -8,7 +8,7 @@ We recommend you to prepare your own S3 compatible object storage bucket with so

In addition, we use image data and a corresponding image encoder LLM in this example instead of text data. But you could also use plain text data on object storage, similar to the examples in the previous section.

First lets create a retriever for images stored on s3-compatible object storage as the source. We specify torsten as the bucket name and an endpoint URL where the bucket is created. We specify an empty string as prefix because we want all the objects in that bucket. We use the [`clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32) open encoder model for image data from HuggingFace. We provide a name for the retriever so that we can identify and reference it subsequent operations:
First let's create a retriever for images stored on S3-compatible object storage as the source. We specify torsten as the bucket name and an endpoint URL where the bucket is created. We specify an empty string as the prefix because we want all the objects in that bucket. We use the [`clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32) open encoder model for image data from HuggingFace. We provide a name for the retriever so that we can identify and reference it in subsequent operations:

```sql
SELECT pgai.create_s3_retriever(
@@ -39,7 +39,7 @@ __OUTPUT__
(1 row)
```

Finally, run the retrieve_via_s3 function with the required parameters to retrieve the top K most relevant (most similar) AI data items. Please be aware that the object type is currently limited to image and text files.
Finally, run the retrieve_via_s3 function with the required parameters to retrieve the top K most relevant (most similar) AI data items. Be aware that the object type is currently limited to image and text files.

```sql
SELECT data from pgai.retrieve_via_s3(
@@ -6,9 +6,9 @@ description: How to work with AI data stored in Postgres tables using the pgai e

We will first look at working with AI data stored in columns in the Postgres table.

To see how to use AI data stored in S3-compatible object storage, please skip to the next section.
To see how to use AI data stored in S3-compatible object storage, skip to the next section.

First lets create a Postgres table for some test AI data:
First let's create a Postgres table for some test AI data:

```sql
CREATE TABLE products (
@@ -22,7 +22,7 @@ CREATE TABLE
```


Now lets create a retriever with the just created products table as the source. We specify product_id as the unique key column to and we define the product_name and description columns to use for the similarity search by the retriever. We use the `all-MiniLM-L6-v2` open encoder model from HuggingFace. We set `auto_embedding` to True so that any future insert, update or delete to the source table will automatically generate, update or delete also the corresponding embedding. We provide a name for the retriever so that we can identify and reference it subsequent operations:
Now let's create a retriever with the just-created products table as the source. We specify product_id as the unique key column, and we define the product_name and description columns to use for the similarity search by the retriever. We use the `all-MiniLM-L6-v2` open encoder model from HuggingFace. We set `auto_embedding` to True so that any future insert, update, or delete to the source table will also automatically generate, update, or delete the corresponding embedding. We provide a name for the retriever so that we can identify and reference it in subsequent operations:

```sql
SELECT pgai.create_pg_retriever(
@@ -44,7 +44,7 @@ __OUTPUT__



Now lets insert some AI data records into the products table. Since we have set auto_embedding to True, the retriever will automatically generate all embeddings in real-time for each inserted record:
Now let's insert some AI data records into the products table. Since we have set auto_embedding to True, the retriever will automatically generate all embeddings in real-time for each inserted record:

```sql
INSERT INTO products (product_name, description) VALUES
@@ -80,7 +80,7 @@ __OUTPUT__
(5 rows)
```

Now lets try a retriever without auto embedding. This means that the application has control over when the embeddings are computed in a bulk fashion. For demonstration we can simply create a second retriever for the same products table that we just created above:
Now let's try a retriever without auto embedding. This means that the application has control over when the embeddings are computed in a bulk fashion. For demonstration we can simply create a second retriever for the same products table that we just created above:

```sql
SELECT pgai.create_pg_retriever(
@@ -115,7 +115,7 @@ __OUTPUT__
(0 rows)
```

Thats why we first need to run a bulk generation of embeddings. This is achieved via the `refresh_retriever()` function:
That's why we first need to run a bulk generation of embeddings. This is achieved via the `refresh_retriever()` function:

```sql
SELECT pgai.refresh_retriever(
@@ -148,7 +148,7 @@ __OUTPUT__
(5 rows)
```

Now lets see what happens if we add additional AI data records:
Now let's see what happens if we add additional AI data records:

```sql
INSERT INTO products (product_name, description) VALUES
36 changes: 18 additions & 18 deletions advocacy_docs/edb-postgres-ai/analytics/concepts.mdx
@@ -7,45 +7,45 @@ description: Learn about the ideas and terminology behind EDB Postgres Lakehouse
EDB Postgres Lakehouse is the solution for running Rapid Analytics against
operational data on the EDB Postgres® AI platform.

## Major Concepts
## Major concepts

* **Lakehouse Nodes** query **Lakehouse Tables** in **Managed Storage Locations**.
* **Lakehouse Sync** can create **Lakehouse Tables** from **Transactional Tables** in a source database.
* **Lakehouse nodes** query **Lakehouse tables** in **managed storage locations**.
* **Lakehouse Sync** can create **Lakehouse tables** from **Transactional tables** in a source database.

Here's how it fits together:

![Level 50 basic architecture](./images/level-50-architecture.png)

### Lakehouse Node
### Lakehouse node

A Postgres Lakehouse Node is Postgres, with a Vectorized Query Engine that's
optimized to query Lakehouse Tables, but still fall back to Postgres for full
A Postgres Lakehouse node is Postgres, with a Vectorized Query Engine that's
optimized to query Lakehouse tables but still falls back to Postgres for full
compatibility.

Lakehouse nodes are stateless and ephemeral. Scale them up or down based on
workload requirements.

### Lakehouse Tables
### Lakehouse tables

Lakehouse Tables are stored using highly compresible, columnar storage formats
Lakehouse Tables are stored using highly compressible, columnar storage formats
optimized for analytics and interoperable with the rest of the Analytics ecosystem.
Currently, Postgres Lakehouse Nodes can read tables stored using the Delta
Currently, Postgres Lakehouse nodes can read tables stored using the Delta
Protocol ("delta tables"), and Lakehouse Sync can write them.

### Managed Storage Location
### Managed storage location

A Managed Storage Location is where you can organize Lakehouse Tables in
A *managed storage location* is where you can organize Lakehouse tables in
object storage, so that Postgres Lakehouse can query them.

A "Managed Storage Location" is a location in object storage where we control
A managed storage location is a location in object storage where we control
the file layout and write Lakehouse Tables on your behalf. Technically, it's an
implementation detail that we store these in buckets. This is really a subset
of an upcoming "Storage Location" feature that will also support
"External Storage Locations," where you bring your own bucket.

### Lakehouse Sync

Lakehouse Sync is a Data Migration Service offered as part of the EDB
Lakehouse Sync is a data migration service offered as part of the EDB
Postgres AI platform. It can "sync" tables from a transactional database, to
Lakehouse Tables in a destination Storage Location. Currently, it supports
source databases hosted in the EDB Postgres AI Cloud Service (formerly known as
@@ -58,28 +58,28 @@ It's built using [Debezium](https://debezium.io).
### Lakehouse

The
"[Lakehouse Architecture](https://15721.courses.cs.cmu.edu/spring2023/papers/02-modern/armbrust-cidr21.pdf)"
"[Lakehouse architecture](https://15721.courses.cs.cmu.edu/spring2023/papers/02-modern/armbrust-cidr21.pdf)"
is a data engineering practice, which is a portmanteau of "Data _Lake_" and "Data
Ware_house_," offering the best of both. The central tenet of the architecture is
that data is stored in Object Storage, generally in columnar formats like
Parquet, where different query engines can process it for their own specialized
purposes, using the optimal compute resources for a given query.

### Vectorized Query Engine
### Vectorized query engine

A vectorized query engine is a query engine that's optimized for running queries
on columnar data. Most analytics engines use vectorized query execution.
Postgres Lakehouse uses [Apache DataFusion](https://datafusion.apache.org/).

### Delta Tables
### Delta tables

We use the term "Lakehouse Tables" to avoid overcommitting to a particular
We use the term "Lakehouse tables" to avoid overcommitting to a particular
format (since we might eventually support Iceberg or Hudi, for example). But
technically, we're using [Delta Tables](https://delta.io/). A Delta Table
is a well-defined container of Parquet files and JSON metadata, according to
the "Delta Lake" spec and open protocol. Delta Lake is a Linux Foundation project.

## How it Works
## How it works

Postgres Lakehouse is built using a number of technologies:

33 changes: 16 additions & 17 deletions advocacy_docs/edb-postgres-ai/analytics/index.mdx
@@ -1,6 +1,6 @@
---
title: Lakehouse Analytics
navTitle: Lakehouse Analytics
title: Lakehouse analytics
navTitle: Lakehouse analytics
indexCards: simple
iconName: Improve
navigation:
@@ -11,19 +11,19 @@ navigation:

EDB Postgres Lakehouse extends the power of Postgres to analytical workloads,
by adding a vectorized query engine and separating storage from compute. Building
a Data Lakehouse has never been easier just use Postgres.
a data Lakehouse has never been easier: just use Postgres.

## Rapid Analytics for Postgres
## Rapid analytics for Postgres

Postgres Lakehouse is a core offering of the EDB Postgres® AI platform, extending
Postgres to support analytical queries over columnar data in object storage,
while keeping the simplicity and ease of use that Postgres users love.

With Postgres Lakehouse, you can query your Postgres data with a Lakehouse Node,
an ephemeral, scale-to-zero compute resource powered by Postgres thats optimized for
With Postgres Lakehouse, you can query your Postgres data with a Lakehouse node,
an ephemeral, scale-to-zero compute resource powered by Postgres that's optimized for
vectorized query execution over columnar data.

## Postgres Native
## Postgres native

Never leave the Postgres ecosystem.

Expand All @@ -33,16 +33,16 @@ columnar tables in object storage using the open source Delta Lake protocol.

EDB Postgres Lakehouse is “just Postgres” – you can query it with any Postgres
client, and it fully supports all Postgres queries, functions and statements, so
theres no need to change existing queries or reconfigure business
there's no need to change existing queries or reconfigure business
intelligence software.

## Vectorized Execution
## Vectorized execution

Postgres Lakehouse uses Apache DataFusions vectorized SQL query engine to
Postgres Lakehouse uses Apache DataFusion's vectorized SQL query engine to
execute analytical queries 5-100x faster (30x on average) compared to native
Postgres, while still falling back to native execution when necessary.

## Columnar Storage
## Columnar storage

Postgres Lakehouse is optimized to query "Lakehouse Tables" in object storage,
extending the power of open source database to open table formats. Currently,
Expand All @@ -54,20 +54,19 @@ You can sync your own data from tables in transactional sources (initially, EDB
Postgres® AI Cloud Service databases) into Lakehouse Tables in Storage Locations
(initially, managed locations in S3 object storage).

## Fully Managed Service
## Fully managed service

You can launch Postgres Lakehouse nodes using the EDB Postgres AI Cloud
Service (formerly EDB BigAnimal). Point a Lakehouse Node at a storage bucket
Service (formerly EDB BigAnimal). Point a Lakehouse node at a storage bucket
with some Delta Tables in it, and get results of analytical (OLAP) queries in
less time than if you queried the same data in a transactional Postgres database.

Postgres Lakehouse nodes are available now for customers using
EDB Postgres AI - Hosted environments on AWS, and will be rolling out
to additional cloud environments soon.

## Try Today
## Try it today

Its easy to start using Postgres Lakehouse. Provision a Lakehouse Node in five
minutes, and start qureying pre-loaded benchmark data like TPC-H, TPC-DS,
It's easy to start using Postgres Lakehouse. Provision a Lakehouse node in five
minutes, and start querying pre-loaded benchmark data like TPC-H, TPC-DS,
Clickbench, and the 1 Billion Row challenge.

36 changes: 18 additions & 18 deletions advocacy_docs/edb-postgres-ai/analytics/quick_start.mdx
@@ -1,20 +1,20 @@
---
title: Quick Start - EDB Postgres Lakehouse
navTitle: Quick Start
description: Launch a Lakehouse Node and query sample data.
description: Launch a Lakehouse node and query sample data.
---

In this guide, you will:

1. Create a Lakehouse Node
1. Create a Lakehouse node
2. Connect to the node with your preferred Postgres client
3. Query sample data (TPC-H, TPC-DS, Clickbench, or 1BRC) in object storage

For more details and advanced use cases, see [reference](./reference).

## Introduction

Postgres Lakehouse is a new type of Postgres “cluster” (its really just one
Postgres Lakehouse is a new type of Postgres “cluster” (it's really just one
node) that you can provision in EDB Postgres® AI Cloud Services (formerly known
as "BigAnimal"). It includes a vectorized query engine (based on Apache
[DataFusion](https://github.com/apache/datafusion)) for fast queries over
@@ -39,18 +39,18 @@ restarts and will be saved as part of backup/restore operations. Otherwise,
Lakehouse tables will not be part of backups, since they are ultimately stored
in object storage.

### Basic Architecture
### Basic architecture

Here's "what's in the box of a Lakehouse Node:
Here's what's in the box of a Lakehouse node:

![Level 300 Architecture of Postgres Lakehouse Node](./images/level-300-architecture.png)
![Level 300 Architecture of Postgres Lakehouse node](./images/level-300-architecture.png)

## Getting Started
## Getting started

You will need an EDB Postgres AI account. Once youve logged in and created
You will need an EDB Postgres AI account. Once you've logged in and created
a project, you can create a cluster.

### Create a Lakehouse Node
### Create a Lakehouse node

You will see a “Lakehouse Analytics” option under the “Create New” dropdown
on your project page:
@@ -79,13 +79,13 @@ block storage device and will survive a restart or backup/restore cycle.
* Only Postgres 16 is supported.

For more notes about supported instance sizes,
see [reference - supported AWS instances](./reference/#supported-aws-instances).
see [Reference - Supported AWS instances](./reference/#supported-aws-instances).

## Operating a Lakehouse Node
## Operating a Lakehouse node

### Connect to the Node
### Connect to the node

You can connect to the Lakehouse Node with any Postgres client, in the same way
You can connect to the Lakehouse node with any Postgres client, in the same way
that you connect to any other cluster from EDB Postgres AI Cloud Service
(formerly known as BigAnimal): navigate to the cluster detail page and copy its
connection string.
@@ -121,9 +121,9 @@ remain untouched.
storage (but it supports write queries to system tables for creating users,
etc.). You cannot write directly to object storage. You cannot create new tables.
* If you want to load your own data into object storage,
see [reference - bring your own data](./reference/#advanced-bring-your-own-data).
see [Reference - Bring your own data](./reference/#advanced-bring-your-own-data).
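
To make the read-only behavior described above concrete, here's a minimal sketch; it isn't taken from the EDB docs, and the exact error text you see may differ on your node:

```sql
-- Writes to system tables are allowed, for example creating a user:
CREATE USER analyst WITH PASSWORD 'change-me';

-- Writes to user data are not; creating a new table is expected to fail,
-- because Lakehouse nodes are read-only for object storage:
CREATE TABLE my_table (id int);
```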

## Inspect the Benchmark Datasets
## Inspect the benchmark datasets

Inspect the Benchmark Datasets. Every cluster has some benchmarking data
available out of the box. If you are using pgcli, you can run `\dn` to see
@@ -137,9 +137,9 @@ The available benchmarking datasets are:
* 1 Billion Row Challenge
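
If you'd rather not use a psql meta-command, an ordinary catalog query returns the same information. This is plain Postgres rather than anything Lakehouse-specific, and the schema names you see will vary by node:

```sql
-- List non-system schemas; the preloaded benchmark datasets each live in
-- their own schema (compare with the output of \dn).
SELECT nspname AS schema_name
FROM pg_catalog.pg_namespace
WHERE nspname !~ '^pg_'
  AND nspname <> 'information_schema'
ORDER BY nspname;
```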

For more details on benchmark datasets,
see [reference - available benchmarking datasets](./reference/#available-benchmarking-datasets).
see [Reference - Available benchmarking datasets](./reference/#available-benchmarking-datasets).

## Query the Benchmark Datasets
## Query the benchmark datasets

You can try running some basic queries:

@@ -164,5 +164,5 @@ SELECT 1
Time: 0.651s
```
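
For something a bit more representative, a TPC-H style aggregation might look like the following sketch. The schema name here is an assumption (check `\dn` for the actual names on your node), and the table is fully qualified so the query doesn't depend on `search_path`:

```sql
-- Hypothetical schema name; substitute whatever \dn reports on your node.
SELECT l_returnflag,
       l_linestatus,
       count(*)             AS order_lines,
       sum(l_extendedprice) AS total_price
FROM tpch_sf_1.lineitem
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
```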

Note: Do not use `search_path`! Please read the [reference](./reference)
Note: Do not use `search_path`! Read the [reference](./reference)
page for more gotchas and information about syntax/query compatibility.