Making some changes to the docs. Better titles/headings etc. #792

Open · wants to merge 2 commits into base: main
222 changes: 3 additions & 219 deletions docs/source/aryn_cloud/get_started.md
@@ -1,226 +1,10 @@
## An Introduction to the Aryn Partitioning Service
# An Introduction to the Aryn Partitioning Service
You can use the Aryn Partitioning Service to easily chunk and extract data from complex PDFs. The Partitioning Service can extract paragraphs, tables, and images, and it returns detailed information about the components it has identified in a JSON object. The following two sections walk through examples where we segment PDF documents and extract a table and an image from them using the Python aryn-sdk.

### Extracting Tables from a PDF
- [Table Extraction from PDF](get_started_Table_extraction.md)
- [Image Extraction from PDF](get_started_Image_extraction.md)

In [this example](https://colab.research.google.com/drive/1Qpd-llPC-EPzuTwLfnguMnrQk0eclyqJ?usp=sharing), we’ll use the Partitioning Service to extract the “Supplemental Income” table (shown below) from 3M's 10-K financial document and turn it into a pandas DataFrame.

![alt text](3m_supplemental_income.png)

We’ll go through the important code snippets below to see what’s going on. (Try it out yourself in [Colab](https://colab.research.google.com/drive/1Qpd-llPC-EPzuTwLfnguMnrQk0eclyqJ?usp=sharing)!)


Let’s focus on the following code that makes a call to the Aryn Partitioning Service:

```python
import aryn_sdk
from aryn_sdk.partition import partition_file, tables_to_pandas
import pandas as pd

file = open('my-document.pdf', 'rb')
aryn_api_key = 'YOUR-KEY-HERE'

## Make a call to the Aryn Partitioning Service (APS)
## param extract_table_structure (boolean): extract tables and their structural content. default: False
## param use_ocr (boolean): extract text using an OCR model instead of extracting embedded text in PDF. default: False
## returns: JSON object with elements representing information inside the PDF
partitioned_file = partition_file(file, aryn_api_key, extract_table_structure=True, use_ocr=True)
```
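Before drilling into individual elements, it can help to survey what came back. The tally below is a sketch against a minimal hand-written sample mimicking the response shape (the sample elements are illustrative, drawn from the examples in this guide):

```python
from collections import Counter

# Minimal hand-written sample mimicking the APS response shape
partitioned_file = {
    "elements": [
        {"type": "Section-header", "bbox": [0.06, 0.08, 0.35, 0.10]},
        {"type": "table", "bbox": [0.09, 0.11, 0.89, 0.18]},
        {"type": "Image", "bbox": [0.36, 0.11, 0.63, 0.42]},
    ]
}

# Tally element types for a quick overview of what was detected
type_counts = Counter(e["type"] for e in partitioned_file["elements"])
print(type_counts)
```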

If you inspect the partitioned_file variable, you’ll notice that it’s a large JSON object with details about all the components in the PDF (check out [this page](./aps_output.md) for a detailed description of the returned JSON object's schema). Below, we highlight the ‘table’ element that contains the information about the table on the page.

```
{'type': 'table',
'bbox': [0.09080806058995863,
0.11205035122958097,
0.8889295869715074,
0.17521638350053267],
'properties': {'score': 0.9164711236953735,
'title': None,
'columns': None,
'rows': None,
'page_number': 1},
'table': {'cells': [ {'content': '(Millions)',
'rows': [0],
'cols': [0],
'is_header': True,
'bbox': {'x1': 0.09080806058995863,
'y1': 0.11341398759321733,
'x2': 0.40610217823701744,
'y2': 0.12250489668412642},
'properties': {}},
{'content': '2018',
'rows': [0],
'cols': [1],
'is_header': True,
'bbox': {'x1': 0.6113962958840763,
'y1': 0.11341398759321733,
'x2': 0.6766904135311351,
'y2': 0.12250489668412642},
'properties': {}},
{'content': '2017',
'rows': [0],
'cols': [2],
'is_header': True,
'bbox': {'x1': 0.718455119413488,
'y1': 0.11341398759321733,
'x2': 0.7825727664723116,
'y2': 0.12250489668412642},
'properties': {}},

...

]}}

```
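To see how the cell coordinates fit together, here is a small sketch that reassembles the header row from a trimmed copy of the cells array above (only the fields needed for ordering are kept):

```python
# Trimmed copy of the "cells" array from the response above
cells = [
    {"content": "(Millions)", "rows": [0], "cols": [0], "is_header": True},
    {"content": "2018", "rows": [0], "cols": [1], "is_header": True},
    {"content": "2017", "rows": [0], "cols": [2], "is_header": True},
]

# Keep header cells on row 0 and order them by column index
header_cells = [c for c in cells if c["is_header"] and 0 in c["rows"]]
header = [c["content"] for c in sorted(header_cells, key=lambda c: c["cols"][0])]
print(header)  # ['(Millions)', '2018', '2017']
```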

In particular, let's look at the “cells” field, which is an array of cell objects representing each of the cells in the table. Let’s focus on the first element of that list.

```
{'cells': [ {'content': '(Millions)',
'rows': [0],
'cols': [0],
'is_header': True,
'bbox': {'x1': 0.09080806058995863,
'y1': 0.11341398759321733,
'x2': 0.40610217823701744,
'y2': 0.12250489668412642},
'properties': {}} ... }

```

Here we have the first detected cell: its contents, its bounding box (which gives the coordinates of the cell within the PDF), and whether it’s a header cell. You can then process this JSON however you’d like for further analysis. In [the notebook](https://colab.research.google.com/drive/1Qpd-llPC-EPzuTwLfnguMnrQk0eclyqJ?usp=sharing) we use the tables_to_pandas function to turn the JSON into a pandas DataFrame and then perform some analysis on it:

```python
elements_and_tables = tables_to_pandas(partitioned_file)

tables = []
# Pull out the tables from the list of elements
for elt, dataframe in elements_and_tables:
    if elt['type'] == 'table':
        tables.append(dataframe)

supplemental_income = tables[0]
display(supplemental_income)
```

| (Millions) | 2018 | 2017 | 2016 |
| --- | --- | --- | --- |
| Interest expense | 350 | 322 | 199 |
| Interest income | (70) | (50) | (29) |
| Pension and postretirement net periodic benefi... | (73) | (128) | (196) |
| Total | 207 | 144 | (26) |
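As a quick sanity check on the extracted values, the Total row can be recomputed from the 2018 column. The sketch below parses the accounting-style strings (parentheses meaning a negative number) without pandas:

```python
# 2018 values from the extracted table; parentheses denote negative numbers
values_2018 = ["350", "(70)", "(73)"]

def to_number(s: str) -> int:
    s = s.strip()
    return -int(s[1:-1]) if s.startswith("(") else int(s)

total_2018 = sum(to_number(v) for v in values_2018)
print(total_2018)  # 207, matching the Total row above
```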


### Extracting Images from a PDF

In [this example](https://colab.research.google.com/drive/1n5zRm5hfHhxs7dA0FncC44VjlpiPJLWq?usp=sharing), we’ll use the Partitioning Service to extract an image from a battery manual. We’ll go through the important code snippets below to see what’s going on. (Try it out yourself in [Colab](https://colab.research.google.com/drive/1n5zRm5hfHhxs7dA0FncC44VjlpiPJLWq?usp=sharing)!)

Let’s focus on the following code that makes a call to the Aryn Partitioning Service:

```python
import aryn_sdk
from aryn_sdk.partition import partition_file

file = open('my-document.pdf', 'rb')
aryn_api_key = 'YOUR-KEY-HERE'

## Make a call to the Aryn Partitioning Service (APS)
## param use_ocr (boolean): extract text using an OCR model instead of extracting embedded text in PDF. default: False
## param extract_images (boolean): extract image contents. default: False
## returns: JSON object with elements representing information inside the PDF
partitioned_file = partition_file(file, aryn_api_key, extract_images=True, use_ocr=True)
```

If you inspect the partitioned_file variable, you’ll notice that it’s a large JSON object with details about all the components in the PDF (check out [this page](./aps_output.md) for a detailed description of the returned JSON object's schema). Below, we highlight the ‘Image’ element that contains the information about one of the images on the page:

```
[
{
"type": "Section-header",
"bbox": [
0.06470742618336398,
0.08396875554865056,
0.3483343505859375,
0.1039327656139027
],
"properties": {
"score": 0.7253036499023438,
"page_number": 1
},
"text_representation": "Make AC Power Connections\n"
},
{
"type": "Image",
"bbox": [
0.3593270694508272,
0.10833765896883878,
0.6269251924402574,
0.42288088711825284
],
"properties": {
"score": 0.7996300458908081,
"image_size": [
475,
712
],
"image_mode": "RGB",
"image_format": null,
"page_number": 1
},
"text_representation": "",
"binary_representation": "AAAAAA.."
}, ...
]
```

In particular, let's look at the ‘Image’ element that represents the detected image.

```
{
"type": "Image",
"bbox": [
0.3593270694508272,
0.10833765896883878,
0.6269251924402574,
0.42288088711825284
],
"properties": {
"score": 0.7996300458908081,
"image_size": [
475,
712
],
"image_mode": "RGB",
"image_format": null,
"page_number": 1
},
"text_representation": "",
"binary_representation": "AAAAAA.."
}
```

This JSON object represents one of the images in the PDF. You’ll notice that the image’s binary representation, its bounding box (which gives the coordinates of the image within the PDF), and certain other properties (image_mode, image_size, etc.) are returned. You can then process this JSON however you’d like for further analysis. In the notebook, we use the Image module from the Python Pillow library to display the extracted image on its own.

```python
from PIL import Image
import base64

## extract all the images from the JSON
images = [e for e in partitioned_file['elements'] if e['type'] == 'Image']
first_image = images[0]

## read in the image
image_width, image_height = first_image['properties']['image_size']
image_mode = first_image['properties']['image_mode']
image = Image.frombytes(image_mode, (image_width, image_height), base64.b64decode(first_image['binary_representation']))

## display the image
image
```

![alt text](board.png)
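The bounding-box values in these elements fall in the 0–1 range, which suggests they are fractions of the page dimensions (an assumption worth checking against [the output schema](./aps_output.md)). Under that assumption, converting a bbox to pixel coordinates for a rendered page looks like this:

```python
# Convert a normalized [x1, y1, x2, y2] bbox (fractions of page size --
# an assumption based on the 0-1 values above) to pixel coordinates.
def bbox_to_pixels(bbox, page_width_px, page_height_px):
    x1, y1, x2, y2 = bbox
    return (round(x1 * page_width_px), round(y1 * page_height_px),
            round(x2 * page_width_px), round(y2 * page_height_px))

# The Image element's bbox from the example, on a 612x792 px rendering
print(bbox_to_pixels([0.3593, 0.1083, 0.6269, 0.4229], 612, 792))
```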

### More examples

7 changes: 3 additions & 4 deletions docs/source/sycamore/tutorials.rst
@@ -1,10 +1,8 @@
Tutorials
Vector Database Ingestion Examples
==================================

Learn how to write Sycamore scripts
--------------------------------------

Now that you've learned about Sycamore concepts, transforms, and connectors, let's put it all together with some tutorials showing how to write Sycamore processing jobs.
Now that you've learned about Sycamore concepts, transforms, and connectors, let's put it all together with some tutorials showing how to write Sycamore processing jobs to ingest data into your data sources.

Some tutorials are included below; visit the `Aryn blog <https://www.aryn.ai/blog>`_ for more examples.

@@ -13,3 +11,4 @@

./tutorials/etl_pinecone_tutorial.md
./tutorials/etl_for_opensearch.md
./tutorials/etl_weaviate_tutorial.md
2 changes: 1 addition & 1 deletion docs/source/sycamore/tutorials/etl_for_opensearch.md
@@ -1,4 +1,4 @@
# Process and load data into an OpenSearch hybrid search index
# ETL tutorial with Sycamore and Opensearch

This tutorial provides a walkthrough of how to use Sycamore to extract, enrich, transform, and create vector embeddings from a PDF dataset in S3 and load it into OpenSearch. The way in which you run ETL on these documents is critical for the end quality of your application, and you can easily use Sycamore to facilitate this. The example below shows a few transforms Sycamore can do in a pipeline, and how to use LLMs to extract information.

5 changes: 5 additions & 0 deletions docs/source/sycamore/tutorials/etl_weaviate_tutorial.md
@@ -0,0 +1,5 @@
# ETL tutorial with Sycamore and Weaviate

This tutorial shows how to create an ETL pipeline with Sycamore to load a Weaviate vector database. It walks through an example of using Sycamore to partition, extract, clean, chunk, embed, and load data into Weaviate. You will need an [Aryn Partitioning Service API key](https://www.aryn.ai/get-started) and an [OpenAI API key](https://platform.openai.com/signup) (for LLM-powered data enrichment and for creating vector embeddings). At the time of writing, there are free trial or free tier options for all of these services.

Run this tutorial [locally with Jupyter](https://github.com/aryn-ai/sycamore/blob/main/notebooks/weaviate-writer.ipynb).
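The chunking step mentioned above can be illustrated with a toy sketch (the real tutorial uses Sycamore transforms; this standalone function only shows the overlapping-window idea, and its names are illustrative):

```python
# Toy overlapping-window chunker: each chunk shares `overlap` characters
# with the next one, so context isn't lost at chunk boundaries.
def chunk_text(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("Sycamore loads cleaned, chunked, embedded data into Weaviate.")
```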