Merge pull request #2519 from jack-lebon/jlebon-textract-lambda-sam-python

julianwood · web-flow · commit a182db995ea9 · 2024-12-17T09:20:28.000Z
New serverless pattern - textract-lambda-sam-python
diff --git a/textract-lambda-sam-python/README.md b/textract-lambda-sam-python/README.md
@@ -0,0 +1,70 @@
+# Automatically Detect Text with Amazon Textract and AWS Lambda
+
+This pattern shows how to deploy an AWS SAM application that uses Amazon S3, AWS Lambda, and Amazon DynamoDB to detect text in PDF or image files. When a file is uploaded to the S3 bucket, S3 sends an event to a Lambda function. The function, written in Python, calls the Amazon Textract `DetectDocumentText` API using boto3. Once Textract returns the response, the function stores it in a DynamoDB table.
+
+Learn more about this pattern at Serverless Land Patterns: [https://serverlessland.com/patterns/textract-lambda-sam-python](https://serverlessland.com/patterns/textract-lambda-sam-python) 
+
+Important: this application uses various AWS services and there are costs associated with these services after the Free Tier usage - please see the [AWS Pricing page](https://aws.amazon.com/pricing/) for details. You are responsible for any AWS costs incurred. No warranty is implied in this example.
+
+## Requirements
+
+* [Create an AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) if you do not already have one and log in. The IAM user that you use must have sufficient permissions to make necessary AWS service calls and manage AWS resources.
+* [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) installed and configured
+* [Git Installed](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
+* [AWS Serverless Application Model](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html) (AWS SAM) installed
+
+## Deployment Instructions
+
+1. Create a new directory, navigate to that directory in a terminal and clone the GitHub repository:
+    ``` 
+    git clone https://github.com/aws-samples/serverless-patterns
+    ```
+2. Change directory to the pattern directory:
+    ```
+    cd serverless-patterns/textract-lambda-sam-python
+    ```
+3. From the command line, use AWS SAM to deploy the AWS resources for the pattern as specified in the template.yaml file:
+    ```
+    sam deploy --guided
+    ```
+4. During the prompts:
+    * Enter a stack name
+    * Enter the desired AWS Region
+    * Allow SAM CLI to create IAM roles with the required permissions.
+
+    After you run `sam deploy --guided` once and save the arguments to a configuration file (samconfig.toml), you can use `sam deploy` for future deployments to reuse these defaults.
+
+5. Note the outputs from the SAM deployment process. These contain the resource names and/or ARNs which are used for testing.
+
+## How it works
+
+- This pattern creates all the resources required to run this workflow.
+- The workflow begins with an Amazon S3 bucket. 
+- When an object is created within the S3 bucket, it sends an event to an AWS Lambda function.
+- This Lambda function invokes Amazon Textract's [DetectDocumentText](https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html) function, which synchronously analyses the newly stored file.
+- Once this task is complete, the Lambda function stores the results in an Amazon DynamoDB table.
+
+## Testing
+1. Upload a test image or PDF file containing text to the Amazon S3 bucket created during deployment. You can do this in the console or by running the following command. Replace `<your_file>` with the name of your image or PDF file, and `<S3BucketName>` with the bucket name shown in the AWS SAM deployment outputs.
+    ```
+    aws s3 cp <your_file> s3://<S3BucketName>
+    ```
+2. Wait for the `lambda-start-detect-document-text-textract` Lambda function to complete, then retrieve the record from the DynamoDB table. Replace `<DDBTableName>` with the table name shown in the AWS SAM deployment outputs; for this pattern it is `TextractResultsTable`. You can use the following command:
+    ```
+    aws dynamodb scan --table-name <DDBTableName>
+    ```
+    For example: `aws dynamodb scan --table-name TextractResultsTable`
+3. The scan output should now include a record for the uploaded file. Depending on the file, the `DetectedText` attribute is either empty or contains values describing what Textract found.
+
+## Cleanup
+
+1. [Empty the bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/empty-bucket.html). ⚠️ **IMPORTANT** - `sam delete` does not delete the **Amazon S3 bucket** if there are still objects stored within it.
+2. Delete the stack:
+    ```
+    sam delete
+    ```
+
+----
+Copyright 2023 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+
+SPDX-License-Identifier: MIT-0
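Note on the README's Testing section: the `aws dynamodb scan` step has a boto3 equivalent. The following is a minimal sketch, not part of the pattern. The helper name `scan_all` is hypothetical, and it assumes a boto3 `Table` resource (or any object with a compatible `scan` method) is passed in. It follows `LastEvaluatedKey` so that results spanning more than one page are not silently truncated.

```python
from typing import Any, Dict, List


def scan_all(table: Any) -> List[Dict[str, Any]]:
    """Return every item in a DynamoDB table, following scan pagination.

    `table` is assumed to behave like boto3's Table resource: scan() returns
    a dict with "Items" and, when more pages remain, "LastEvaluatedKey".
    """
    items: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        # Resume the scan from where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


# Hypothetical usage against the deployed table (requires AWS credentials):
#   import boto3
#   rows = scan_all(boto3.resource("dynamodb").Table("TextractResultsTable"))
```

A full scan reads the whole table, so this suits a test table with a handful of records rather than a large production table.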
diff --git a/textract-lambda-sam-python/example-pattern.json b/textract-lambda-sam-python/example-pattern.json
@@ -0,0 +1,67 @@
+{
+    "title": "Automatic Text Detection with Amazon Textract",
+    "description": "An event-driven workflow to automatically detect and store text found within PDF or image files by leveraging Amazon Textract, AWS Lambda, and Amazon DynamoDB.",
+    "language": "Python",
+    "level": "200",
+    "framework": "SAM",
+    "introBox": {
+      "headline": "How it works",
+      "text": [
+        "This sample project demonstrates how to deliver an event-driven architecture to detect text within PDF or image files, while storing the results in Amazon DynamoDB.",
+        "This pattern allows you to store files in an Amazon S3 bucket, which triggers the workflow. When an object is created in the S3 bucket, a Lambda function is invoked, which calls Amazon Textract's DetectDocumentText function. Once the call has finished and returned any text found within the file, the Lambda function stores this information in the DynamoDB table.",
+        "This pattern deploys 1 S3 bucket, 1 Lambda Function, and 1 DynamoDB Table."
+      ]
+    },
+    "gitHub": {
+      "template": {
+        "repoURL": "https://github.com/aws-samples/serverless-patterns/tree/main/textract-lambda-sam-python",
+        "templateURL": "https://github.com/aws-samples/serverless-patterns/blob/main/textract-lambda-sam-python/template.yaml",
+        "projectFolder": "textract-lambda-sam-python",
+        "templateFile": "template.yaml"
+      }
+    },
+    "resources": {
+      "bullets": [
+        {
+          "text": "Amazon Simple Storage Service (S3)",
+          "link": "https://aws.amazon.com/s3/"
+        },
+        {
+          "text": "AWS Lambda",
+          "link": "https://aws.amazon.com/lambda/"
+        },
+        {
+          "text": "Amazon Textract",
+          "link": "https://aws.amazon.com/textract/"
+        },
+        {
+          "text": "Amazon DynamoDB",
+          "link": "https://aws.amazon.com/dynamodb/"
+        }      
+      ]
+    },
+    "deploy": {
+      "text": [
+        "sam deploy"
+      ]
+    },
+    "testing": {
+      "text": [
+        "See the GitHub repo for detailed testing instructions."
+      ]
+    },
+    "cleanup": {
+      "text": [
+        "Delete the stack: sam delete"
+      ]
+    },
+    "authors": [
+      {
+        "name": "Jack Le Bon",
+        "image": "https://serverlessland.com/assets/images/resources/contributors/ext-jack-le-bon.jpg",
+        "bio": "AWS Solutions Architect",
+        "linkedin": "jack-le-bon"
+      }
+    ]
+  }
+  
diff --git a/textract-lambda-sam-python/src/lambda-start-detect-document-text-textract.py b/textract-lambda-sam-python/src/lambda-start-detect-document-text-textract.py
@@ -0,0 +1,31 @@
+import json
+import boto3
+import os
+import decimal
+from urllib.parse import unquote_plus
+
+client = boto3.client('textract')
+dynamodb = boto3.resource('dynamodb')
+table = dynamodb.Table(os.environ.get('dynamoDBTableName'))
+
+def lambda_handler(event, context):
+    bucket = event['Records'][0]['s3']['bucket']['name']
+    # Object keys arrive URL-encoded in S3 event notifications (e.g. spaces as '+').
+    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
+    record = {'Image': key, 'Bucket': bucket}
+    # Synchronous call: Textract reads the object directly from S3.
+    response = client.detect_document_text(
+        Document={
+            'S3Object': {
+                'Bucket': bucket,
+                'Name': key
+            }
+        }
+    )
+    record['DetectedText'] = response
+    # DynamoDB rejects Python floats, so round-trip through JSON to convert them to Decimal.
+    table.put_item(Item=json.loads(json.dumps(record), parse_float=decimal.Decimal))
+    return {
+        'statusCode': 200,
+        'body': json.dumps('Textract complete.')
+    }
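Note on the handler above: it stores the entire `DetectDocumentText` response under `DetectedText`. For documents with a lot of text, that response could approach DynamoDB's 400 KB item size limit. The following is a minimal sketch of one possible alternative, not the pattern's implementation: keep only the `LINE` blocks and their confidence scores. The function names `extract_lines` and `to_item` are hypothetical. The `Blocks`, `BlockType`, `Text`, and `Confidence` fields come from the Textract response format, and the float-to-`Decimal` conversion reuses the same JSON round-trip as the handler.

```python
import decimal
import json
from typing import Any, Dict, List


def extract_lines(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep the text and confidence of each LINE block, in response order."""
    return [
        {"Text": block["Text"], "Confidence": block.get("Confidence")}
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def to_item(key: str, bucket: str, response: Dict[str, Any]) -> Dict[str, Any]:
    """Build a compact DynamoDB item from a Textract response.

    DynamoDB does not accept Python floats, so the record is round-tripped
    through JSON with parse_float=Decimal, as the handler does.
    """
    record = {"Image": key, "Bucket": bucket, "DetectedText": extract_lines(response)}
    return json.loads(json.dumps(record), parse_float=decimal.Decimal)
```

In the handler, `table.put_item(Item=to_item(key, bucket, response))` would then replace the existing `put_item` call. Word-level blocks and geometry are dropped, which is a trade-off if bounding boxes are needed later.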
diff --git a/textract-lambda-sam-python/template.yaml b/textract-lambda-sam-python/template.yaml
@@ -0,0 +1,72 @@
+AWSTemplateFormatVersion: '2010-09-09'
+Transform: AWS::Serverless-2016-10-31
+Description: >
+  An Amazon S3 bucket that stores image files.
+  Uploading a new object triggers an AWS Lambda function, which consumes the event and uses Amazon Textract's DetectDocumentText function to find text within the image or file.
+  The Lambda function then stores the result in Amazon DynamoDB. (uksb-1tthgi812) (tag:textract-lambda-sam-python)
+
+Resources:
+  # S3 bucket to store Image files from the user.
+  ImageFileBucket:
+    Type: AWS::S3::Bucket
+
+  # Define the DynamoDB table
+  DynamoDBTable:
+    Type: AWS::DynamoDB::Table
+    Properties:
+      AttributeDefinitions:
+        - AttributeName: "Image"
+          AttributeType: "S"
+        - AttributeName: "Bucket"
+          AttributeType: "S"
+      KeySchema:
+        - AttributeName: "Image"
+          KeyType: HASH
+        - AttributeName: "Bucket"
+          KeyType: RANGE
+      BillingMode: PAY_PER_REQUEST
+      TableName: "TextractResultsTable"
+
+  # Lambda Function to call Amazon Textract's DetectDocumentText function.
+  StartProcessingFunction:
+    Type: AWS::Serverless::Function
+    Properties:
+      FunctionName: lambda-start-detect-document-text-textract
+      Runtime: python3.13
+      Handler: src/lambda-start-detect-document-text-textract.lambda_handler
+      MemorySize: 128
+      Timeout: 10
+      Policies:
+        - Version: '2012-10-17'
+          Statement:
+            - Effect: Allow
+              Action: 
+                - "s3:GetObject"
+                - "s3:PutObject"
+              Resource: "*"
+            - Effect: Allow
+              Action:
+                - "textract:DetectDocumentText"
+              Resource: "*"
+            - Effect: Allow
+              Action:
+                - "dynamodb:PutItem"
+              Resource: !GetAtt DynamoDBTable.Arn
+      Events:
+        S3Event:
+          Type: S3
+          Properties:
+            Bucket:
+              Ref: ImageFileBucket
+            Events: s3:ObjectCreated:*
+      Environment:
+        Variables:
+          dynamoDBTableName: !Ref DynamoDBTable
+
+Outputs:
+  ImageFileBucket:
+    Value: !Ref ImageFileBucket
+    Description: S3 Bucket for object storage
+  DynamoDBTable:
+    Value: !Ref DynamoDBTable
+    Description: DynamoDB table containing Textract Results
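Note on the `S3Event` trigger above: the notification it delivers carries a `Records` list, and each object key in it is URL-encoded (a space arrives as `+`). The following is a minimal sketch of how a handler could read every record and decode the keys. The function name `parse_s3_event` is hypothetical and is not part of this pattern.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def parse_s3_event(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (bucket name, decoded object key) for each record in the event."""
    pairs: List[Tuple[str, str]] = []
    for rec in event.get("Records", []):
        s3 = rec["s3"]
        # unquote_plus turns '+' into a space and decodes %XX escapes.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs
```

Looping over `Records` rather than reading only index 0 means any additional record in the same notification is also processed.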
diff --git a/textract-lambda-sam-python/textract-lambda-sam-python.json b/textract-lambda-sam-python/textract-lambda-sam-python.json
@@ -0,0 +1,107 @@
+{
+    "title": "Automatic Text Detection with Amazon Textract",
+    "description": "An event-driven workflow to automatically detect and store text found within PDF or image files by leveraging Amazon Textract, AWS Lambda, and Amazon DynamoDB.",
+    "language": "Python",
+    "level": "200",
+    "framework": "SAM",
+    "introBox": {
+        "headline": "How it works",
+        "text": [
+            "This sample project demonstrates how to deliver an event-driven architecture to detect text within PDF or image files, while storing the results in Amazon DynamoDB.",
+            "When an object is created in the S3 bucket, a Lambda function is invoked, which calls Amazon Textract's DetectDocumentText function. Textract returns the results to the Lambda function, which stores this information in the DynamoDB table.",
+            "This pattern deploys 1 S3 bucket, 1 Lambda Function, and 1 DynamoDB Table."
+        ]
+    },
+    "gitHub": {
+        "template": {
+            "repoURL": "https://github.com/aws-samples/serverless-patterns/tree/main/textract-lambda-sam-python",
+            "templateURL": "https://github.com/aws-samples/serverless-patterns/blob/main/textract-lambda-sam-python/template.yaml",
+            "projectFolder": "textract-lambda-sam-python",
+            "templateFile": "template.yaml"
+        }
+    },
+    "resources": {
+        "bullets": [
+            {
+                "text": "Amazon Simple Storage Service (S3)",
+                "link": "https://aws.amazon.com/s3/"
+            },
+            {
+                "text": "AWS Lambda",
+                "link": "https://aws.amazon.com/lambda/"
+            },
+            {
+                "text": "Amazon Textract",
+                "link": "https://aws.amazon.com/textract/"
+            },
+            {
+                "text": "Amazon DynamoDB",
+                "link": "https://aws.amazon.com/dynamodb/"
+            }
+        ]
+    },
+    "deploy": {
+        "text": [
+            "sam deploy"
+        ]
+    },
+    "testing": {
+        "text": [
+            "See the GitHub repo for detailed testing instructions."
+        ]
+    },
+    "cleanup": {
+        "text": [
+            "Delete the stack: sam delete"
+        ]
+    },
+    "authors": [
+        {
+            "name": "Jack Le Bon",
+            "image": "https://serverlessland.com/assets/images/resources/contributors/ext-jack-le-bon.jpg",
+            "bio": "AWS Solutions Architect",
+            "linkedin": "jack-le-bon"
+        }
+    ],
+    "patternArch": {
+        "icon1": {
+            "x": 10,
+            "y": 50,
+            "service": "s3",
+            "label": "Amazon S3"
+        },
+        "icon2": {
+            "x": 40,
+            "y": 50,
+            "service": "lambda",
+            "label": "AWS Lambda"
+        },
+        "icon3": {
+            "x": 80,
+            "y": 25,
+            "service": "textract",
+            "label": "Amazon Textract"
+        },
+        "icon4": {
+            "x": 80,
+            "y": 70,
+            "service": "dynamodb",
+            "label": "Amazon DynamoDB"
+        },
+        "line1": {
+            "from": "icon1",
+            "to": "icon2",
+            "label": "Object Created"
+        },
+        "line2": {
+            "from": "icon2",
+            "to": "icon3",
+            "label": "Document"
+        },
+        "line3": {
+            "from": "icon2",
+            "to": "icon4",
+            "label": "Results"
+        }
+    }
+}