|
| 1 | +# Automatically Detect Text with Amazon Textract and AWS Lambda |
| 2 | + |
| 3 | +This pattern shows how to deploy an AWS SAM application that uses Amazon S3, AWS Lambda, and Amazon DynamoDB to detect text in PDF or image files. Uploading a file to the Amazon S3 bucket starts the event-driven workflow by sending an event to AWS Lambda. The Lambda function, written in Python, calls the Amazon Textract `DetectDocumentText` API using boto3. Once Textract returns the response, Lambda stores the detected text in a DynamoDB table. |
| 4 | + |
| 5 | +Learn more about this pattern at Serverless Land Patterns: [https://serverlessland.com/patterns/textract-lambda-sam-python](https://serverlessland.com/patterns/textract-lambda-sam-python) |
| 6 | + |
| 7 | +Important: this application uses various AWS services and there are costs associated with these services after the Free Tier usage - please see the [AWS Pricing page](https://aws.amazon.com/pricing/) for details. You are responsible for any AWS costs incurred. No warranty is implied in this example. |
| 8 | + |
| 9 | +## Requirements |
| 10 | + |
| 11 | +* [Create an AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) if you do not already have one and log in. The IAM user that you use must have sufficient permissions to make necessary AWS service calls and manage AWS resources. |
| 12 | +* [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) installed and configured |
| 13 | +* [Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) installed |
| 14 | +* [AWS Serverless Application Model](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html) (AWS SAM) installed |
| 15 | + |
| 16 | +## Deployment Instructions |
| 17 | + |
| 18 | +1. Create a new directory, navigate to that directory in a terminal and clone the GitHub repository: |
| 19 | + ``` |
| 20 | + git clone https://github.com/aws-samples/serverless-patterns |
| 21 | + ``` |
| 22 | +2. Change directory to the pattern directory: |
| 23 | + ``` |
| 24 | + cd serverless-patterns/textract-lambda-sam-python |
| 25 | + ``` |
| 26 | +3. From the command line, use AWS SAM to deploy the AWS resources for the pattern as specified in the template.yml file: |
| 27 | + ``` |
| 28 | + sam deploy --guided |
| 29 | + ``` |
| 30 | +4. During the prompts: |
| 31 | + * Enter a stack name |
| 32 | + * Enter the desired AWS Region |
| 33 | + * Allow SAM CLI to create IAM roles with the required permissions. |
| 34 | +
|
| 35 | + After you run `sam deploy --guided` for the first time and save the arguments to a configuration file (samconfig.toml), you can run `sam deploy` for future deployments to reuse these defaults. |
| 36 | +
|
| 37 | +5. Note the outputs from the SAM deployment process. These contain the resource names and/or ARNs which are used for testing. |
| 38 | +
|
| 39 | +## How it works |
| 40 | +
|
| 41 | +- This pattern is designed to create all services required to run this workflow. |
| 42 | +- The workflow begins with an Amazon S3 bucket. |
| 43 | +- When an object is created within the S3 bucket, it sends an event to an AWS Lambda function. |
| 44 | +- This Lambda function calls Amazon Textract's [DetectDocumentText](https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html) API, which synchronously analyses the newly stored file. |
| 45 | +- Once this task is complete, the Lambda function stores the results in an Amazon DynamoDB table. |
| 46 | +
|
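The steps above can be sketched as a small Python handler. This is a minimal illustration under stated assumptions, not the code shipped with the pattern: the `DocumentKey` attribute name, the `TABLE_NAME` environment variable, and the helper function names are hypothetical. Only the `DetectedText` key and the `TextractResultsTable` name come from this README.

```python
import os
from urllib.parse import unquote_plus


def extract_s3_object(event):
    """Return (bucket, key) from the first record of an S3 event notification."""
    s3_info = event["Records"][0]["s3"]
    # Object keys arrive URL-encoded in S3 events (spaces become '+').
    return s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


def extract_lines(textract_response):
    """Collect the text of every LINE block in a DetectDocumentText response."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def process_event(event, textract, table):
    """Run Textract on the uploaded object and store the detected lines."""
    bucket, key = extract_s3_object(event)
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    item = {
        "DocumentKey": key,  # hypothetical attribute name for this sketch
        "DetectedText": extract_lines(response),
    }
    table.put_item(Item=item)
    return item


def lambda_handler(event, context):
    # boto3 is imported here so the helpers above can be tested without AWS.
    import boto3

    textract = boto3.client("textract")
    # TABLE_NAME is a hypothetical environment variable for this sketch.
    table = boto3.resource("dynamodb").Table(
        os.environ.get("TABLE_NAME", "TextractResultsTable")
    )
    return process_event(event, textract, table)
```

Passing the Textract client and the table into `process_event` keeps the workflow logic testable with stand-in objects, without calling AWS.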
| 47 | +## Testing |
| 48 | +1. Upload a test image or PDF file containing text to the Amazon S3 bucket created during the deployment step. You can do this either via the console or by running the following command. Replace `<your_file>` with your image or PDF file name. Replace `<S3BucketName>` with the name of the S3 bucket shown in the AWS SAM deployment outputs. |
| 49 | + ``` |
| 50 | + aws s3 cp <your_file> s3://<S3BucketName> |
| 51 | + ``` |
| 52 | +2. Wait for the `lambda-start-detect-document-text-textract` Lambda function to complete, then retrieve the record from the DynamoDB table with the following command. Replace `<DDBTableName>` with the name of the DynamoDB table from the AWS SAM deployment outputs; in this case, it is `TextractResultsTable`. |
| 53 | + ``` |
| 54 | + aws dynamodb scan --table-name <DDBTableName> |
| 55 | + ``` |
| 56 | + For example: `aws dynamodb scan --table-name TextractResultsTable` |
| 57 | +3. The scan output should now include a record for the uploaded file. Depending on the file, the `DetectedText` key is either empty or contains the text found in the uploaded file. |
| 58 | +
|
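The `aws dynamodb scan` command prints items in DynamoDB's typed JSON form (for example `{"S": "text"}`), which can be awkward to read. The following sketch converts saved scan output into plain Python values. It assumes the output was saved to a file first, and it makes no assumption about whether `DetectedText` is stored as a string or a list; the function names are hypothetical.

```python
import json
from decimal import Decimal


def from_dynamodb_json(value):
    """Convert one DynamoDB-typed value, such as {"S": "text"}, to plain Python."""
    type_tag, payload = next(iter(value.items()))
    if type_tag in ("S", "BOOL"):
        return payload
    if type_tag == "N":
        return Decimal(payload)  # numbers are transmitted as strings
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_dynamodb_json(v) for v in payload]
    if type_tag == "M":
        return {k: from_dynamodb_json(v) for k, v in payload.items()}
    if type_tag == "SS":
        return set(payload)
    raise ValueError(f"Unsupported DynamoDB type tag: {type_tag}")


def detected_text_from_scan(scan_output):
    """Return the DetectedText value of every item in a scan result (JSON string)."""
    items = json.loads(scan_output).get("Items", [])
    return [
        from_dynamodb_json(item["DetectedText"])
        for item in items
        if "DetectedText" in item
    ]


if __name__ == "__main__":
    import sys

    # Usage (file name is illustrative): python read_scan.py scan.json
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for text in detected_text_from_scan(handle.read()):
                print(text)
```

To try it, redirect the scan output to a file, for example `aws dynamodb scan --table-name TextractResultsTable > scan.json`, and pass that file to the script.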
| 59 | +## Cleanup |
| 60 | + |
| 61 | +1. Delete the stack |
| 62 | + ``` |
| 63 | + sam delete |
| 64 | + ``` |
| 65 | +
|
| 66 | +⚠️ **IMPORTANT** - `sam delete` does not delete the **Amazon S3 bucket** if it still contains objects. [Empty the bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/empty-bucket.html) before running the command. |
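Emptying the bucket can also be scripted. The following is a minimal sketch using the boto3 S3 resource API, assuming credentials with permission to delete objects; the `empty_bucket` helper and the `<S3BucketName>` placeholder are illustrative and not part of the pattern.

```python
def empty_bucket(bucket):
    """Delete every object, and every object version, in a boto3 S3 Bucket resource."""
    # Removes current objects.
    bucket.objects.all().delete()
    # Removes remaining versions and delete markers if versioning was ever enabled.
    bucket.object_versions.all().delete()


if __name__ == "__main__":
    import sys

    # Usage: python empty_bucket.py <S3BucketName>
    if len(sys.argv) > 1:
        import boto3  # imported here so the helper can be tested without AWS

        empty_bucket(boto3.resource("s3").Bucket(sys.argv[1]))
```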
| 67 | +---- |
| 68 | +Copyright 2023 Amazon.com, Inc. or its affiliates. All Rights Reserved. |
| 69 | +
|
| 70 | +SPDX-License-Identifier: MIT-0 |