feat: AI-OCR for PDFs #455

Closed
wants to merge 29 commits into from

@fsatsuki fsatsuki commented Jul 18, 2024

Issue #, if available:

Using AI for PDF OCR

Description of changes:

PDFs often contain images, graphs, charts, diagrams, and other objects that are easy for people to understand. Typical OCR tools, however, cannot capture the relationships between these objects. Tables, for example, are extracted as single-column strings, which makes it difficult to infer the table structure from the text. The AI-OCR feature converts each PDF page to an image and performs OCR using Claude 3's multimodal capabilities. The result is structured Markdown text that can be used as RAG knowledge.
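
As a rough illustration of the per-page OCR step described above, the sketch below builds a request body in the Anthropic Messages format that Claude 3 models on Amazon Bedrock accept for image input. The function name `buildOcrRequestBody`, the prompt text, and the `max_tokens` value are assumptions made for illustration and are not taken from this PR; the actual implementation may differ.

```typescript
// Hypothetical helper for illustration only; not code from this PR.
type ImageMediaType = "image/png" | "image/jpeg";

type ContentBlock =
  | { type: "image"; source: { type: "base64"; media_type: ImageMediaType; data: string } }
  | { type: "text"; text: string };

interface OcrRequestBody {
  anthropic_version: string;
  max_tokens: number;
  messages: Array<{ role: "user"; content: ContentBlock[] }>;
}

// Builds the JSON body for one page image that has already been base64-encoded.
function buildOcrRequestBody(
  pageImageBase64: string,
  mediaType: ImageMediaType = "image/png",
): OcrRequestBody {
  return {
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: 4096, // assumed limit; tune for long pages
    messages: [
      {
        role: "user",
        content: [
          // The page image comes first, followed by the instruction text.
          { type: "image", source: { type: "base64", media_type: mediaType, data: pageImageBase64 } },
          // Assumed prompt wording; the PR's real prompt is not shown here.
          { type: "text", text: "Transcribe this page as structured Markdown. Keep tables as Markdown tables." },
        ],
      },
    ],
  };
}
```

The serialized body would then be sent to a Claude 3 model through the Bedrock runtime, once per page image.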

The DB schema has changed: a JSON “Metadata” field has been added. In this PR, the images converted from PDFs are stored in an S3 bucket.
The image URL is stored as the source, and the URL of the original PDF is stored in metadata.parentSource.
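
To make the relationship between the page image and the original PDF concrete, here is a minimal sketch of what one stored record might look like. Only `metadata.parentSource` is named in the description above; the field names `source` and `content`, the helper `makeOcrPageRecord`, and the example URLs are hypothetical.

```typescript
// Hypothetical record shape; only metadata.parentSource is named in the PR description.
interface PageMetadata {
  parentSource: string; // URL of the original PDF
}

interface OcrPageRecord {
  source: string; // URL of the page image stored in S3 (assumed field name)
  content: string; // Markdown text produced by AI-OCR (assumed field name)
  metadata: PageMetadata;
}

// Links a page image back to the PDF it was rendered from.
function makeOcrPageRecord(imageUrl: string, pdfUrl: string, markdown: string): OcrPageRecord {
  return { source: imageUrl, content: markdown, metadata: { parentSource: pdfUrl } };
}

// Example values are placeholders, not real bucket names.
const record = makeOcrPageRecord(
  "s3://example-bucket/report/page-1.png",
  "s3://example-bucket/report.pdf",
  "| Item | Value |\n| --- | --- |\n| A | 1 |",
);
console.log(JSON.stringify(record.metadata));
```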

[Image: Screenshot 2024-07-18 at 14 04 46]
[Image: Screenshot 2024-07-18 at 14 25 35]
[Image: Screenshot 2024-07-18 at 14 26 02]
[Image: stepfunctions_graph (2)]

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

## Launch local server

```sh
pip install poetry --no-cache-dir
```

Thank you for fixing the documentation. (Small, detailed documentation fixes like this are actually a big help.)

const setUp = async (dbConfig) => {
const client = new Client(dbConfig);
// Aurora Serverless may be down, so retry until it connects
async function connectWithRetry(maxRetries = 5, retryDelay = 60000) {

Memo: Aurora Serverless v2 should not scale in all the way to zero capacity. Could there be a different cause?
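
For readers following the retry discussion, the sketch below shows one way a helper with the same defaults (`maxRetries = 5`, `retryDelay = 60000`) could behave. It is not the PR's implementation: the diff context above is truncated, so the `connect` callback parameter and the error handling here are assumptions.

```typescript
// Illustrative retry helper; the PR's real connectWithRetry body is not shown above.
async function connectWithRetry<T>(
  connect: () => Promise<T>, // assumed parameter: the operation to retry
  maxRetries = 5,
  retryDelay = 60000, // milliseconds between attempts
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final failure.
      if (attempt < maxRetries) {
        await new Promise<void>((resolve) => setTimeout(resolve, retryDelay));
      }
    }
  }
  // All attempts failed; surface the most recent error.
  throw lastError;
}
```

With a database client, the call might look like `await connectWithRetry(() => client.connect())`. A fixed delay keeps the sketch simple; exponential backoff would be a reasonable alternative if the root cause turns out to be transient.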

platform: Platform.LINUX_AMD64,
file: "Dockerfile",
exclude: [

Thanks for adding exclude!

platform: Platform.LINUX_AMD64,
file: "lambda.Dockerfile",
cmd: ["app.sqs_consumer.handler"],
exclude: [

Thanks for adding exclude!

@@ -102,6 +104,7 @@ export class BedrockChatStack extends cdk.Stack {
exclude: [
"**/node_modules/**",
"**/dist/**",
"**/dev-dist/**",

Thanks for adding exclude!


statefb commented Jul 19, 2024

Memo: If the Bedrock Knowledge Base Retrieve API later supports detailed reference chunks, this PR could serve as a good reference.

@statefb statefb marked this pull request as draft August 20, 2024 01:11
@fsatsuki fsatsuki closed this by deleting the head repository Nov 13, 2024