Developers can run Tesseract OCR with pytesseract using Lambda container images for efficient and scalable OCR operations.
Ensure the following tools are installed on your system:
- AWS SAM
- Python 3.x
The following SAM template sets up the Lambda function triggered by EventBridge, since API Gateway has a maximum timeout limit of 29 seconds. The sample Python script execution exceeds this limit.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Tesseract OCR Sample with AWS Lambda Container Images using AWS SAM
Resources:
TesseractOcrSample:
Type: AWS::Serverless::Function
Properties:
Events:
Schedule:
Type: Schedule
Properties:
Enabled: true
Schedule: cron(0 * * * ? *)
MemorySize: 512
PackageType: Image
Timeout: 900
Metadata:
DockerTag: latest
DockerContext: ./src/
Dockerfile: Dockerfile
Create a Dockerfile
to define the runtime environment. If your application processes text in a specific language like Japanese, set the LANG
environment variable (line 3) accordingly to avoid encoding issues.
FROM public.ecr.aws/lambda/python:3.9
ENV LANG=ja_JP.UTF-8
WORKDIR ${LAMBDA_TASK_ROOT}
COPY app.py ./
COPY requirements.txt ./
COPY run-melos.pdf ./
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm \
&& yum update -y && yum install -y poppler-utils tesseract tesseract-langpack-jpn \
&& pip install -U pip && pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
CMD ["app.lambda_handler"]
Add the required libraries to requirements.txt
.
pdf2image==1.16.0
pytesseract==0.3.9
The script converts a PDF to images, performs OCR, and logs the results.
import re
from datetime import datetime
import pdf2image
import pytesseract
def lambda_handler(event: dict, context: dict) -> None:
start = datetime.now()
result = ''
images = to_images('run-melos.pdf', 1, 2)
for image in images:
result += to_string(image)
result = normalize(result)
end = datetime.now()
duration = end.timestamp() - start.timestamp()
print('----------------------------------------')
print(f'Start: {start}')
print(f'End: {end}')
print(f'Duration: {int(duration)} seconds')
print(f'Result: {result}')
print('----------------------------------------')
def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
""" Convert a PDF to a PNG image.
Args:
pdf_path (str): PDF path
first_page (int): First page starting 1 to be converted
last_page (int): Last page to be converted
Returns:
list: List of image data
"""
print(f'Convert a PDF ({pdf_path}) to a png...')
images = pdf2image.convert_from_path(
pdf_path=pdf_path,
fmt='png',
first_page=first_page,
last_page=last_page,
)
print(f'A total of converted png images is {len(images)}.')
return images
def to_string(image) -> str:
""" OCR an image data.
Args:
image: Image data
Returns:
str: OCR processed characters
"""
print(f'Extract characters from an image...')
return pytesseract.image_to_string(image, lang='jpn')
def normalize(target: str) -> str:
""" Normalize result text.
Applying the following:
- Remove newlines.
- Remove spaces between Japanese characters.
Args:
target (str): Target text to be normalized
Returns:
str: Normalized text
"""
result = re.sub('\n', '', target)
result = re.sub('([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))', r'\1\2', result)
return result
Run the following command to build the application:
sam build
Execute the following command to run the application locally:
sam local invoke
If an ECR repository does not exist, create one:
aws ecr create-repository --repository-name tesseract-ocr-lambda
Deploy the application:
sam deploy \
--stack-name aws-lambda-tesseract-ocr-sample \
--image-repository 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/tesseract-ocr-lambda \
--capabilities CAPABILITY_IAM
After deployment, the Lambda function will run hourly and the OCR results will be written to CloudWatch Logs.
To clean up the provisioned AWS resources, use the following command:
sam delete --stack-name aws-lambda-tesseract-ocr-sample
Running Tesseract OCR in AWS Lambda using container images provides an efficient, scalable way to handle complex OCR workflows.
Happy Coding! 🚀