In this article we will cover the following topics:
- Creating a project with the serverless framework.
- Building tesseract to run on AWS Lambda.
- Writing the function for AWS Lambda that will be triggered by the AWS API Gateway HTTP request.
- Creating an AWS Lambda layer.
- Deploying a serverless project.
Prerequisites
To follow this tutorial you will need:
- an AWS account,
- basic knowledge of AWS Lambda,
- basic knowledge of the Serverless framework,
- basic knowledge of Docker,
- basic knowledge of Python.
Don’t worry if you’re missing any of these. The tutorial is written step by step so you can follow along.
All the code can be downloaded from this GitHub repository.
Tesseract on AWS Lambda – where to start
What is the serverless framework?
Serverless is a framework for developing and deploying web applications on cloud platforms such as AWS without dedicated servers. Those applications use the power of services like AWS Lambda.
Therefore, we will use it to write our serverless application – OCR as a service.
Install the serverless framework
Firstly, you should install the serverless framework on your computer (follow this guide in case of any problems).
npm install -g serverless
After serverless is installed, it’s time to create a new serverless project for our OCR as a service. We can use the serverless command to create a new project.
serverless
The above command will prompt you: Do you want to create a new project? Type Y to create it and press ENTER.
Serverless: No project detected. Do you want to create a new one? (Y/n)
Then you need to select which template to use for your project. We will write our service in Python. Therefore, select – AWS Python.
Serverless: What do you want to make?
AWS Node.js
❯ AWS Python
Other
After that, you should define a name for a project.
Serverless: What do you want to call this project?
Write tesseract-aws-lambda and press ENTER.
When you are asked whether you want to enable the free account, select n.
Serverless: Would you like to enable this? (Y/n)
Do the same for the tab completion.
Serverless: Would you like to setup a command line completion? (Y/n)
So what have we done so far? We have the Serverless framework installed and a project created.
Our newly created project has 3 files:
- .gitignore
- handler.py – the file where the function that will run on AWS Lambda is located.
- serverless.yml – the configuration for serverless.
All of them already contain some boilerplate code for easier first steps. Our code lives inside handler.py. It already contains a function called hello. We register functions to AWS Lambda inside a serverless.yml file.
functions:
  hello:
    handler: handler.hello
Besides function definition, it contains other configuration for the deployment of our service. We can configure settings such as:
- AWS Lambda runtime,
- service name,
- timeout for AWS Lambda,
- …
The serverless framework contains many useful commands. One of them is invoke local, which we can use to test the hello function.
serverless invoke local -f hello
It should produce output like this:
{
    "statusCode": 200,
    "body": "{\"input\": {}, \"message\": \"Go Serverless v1.0! Your function executed successfully!\"}"
}
Using Tesseract on AWS Lambda
First steps
Now we know where to put our code and configuration, and how to test our function with the serverless command, so we can start adding tesseract.
Firstly, we should install pytesseract.
pip install pytesseract
The pytesseract library is a wrapper around the Tesseract OCR engine (you can follow this guide to install Tesseract if you don’t have it already). After pytesseract is installed, we can check the OCR results.
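Before going further, we can quickly confirm that the Tesseract binary is reachable from Python; get_tesseract_version is part of pytesseract’s public API:

import pytesseract

# Prints the version of the tesseract binary found on PATH,
# e.g. 5.3.0; raises TesseractNotFoundError if it is missing.
print(pytesseract.get_tesseract_version())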
Add an image called test.jpg to your project’s directory. You can get an example here.
After that, import pytesseract to your handler.py and use it inside function hello.
import json

import pytesseract
from PIL import Image


def hello(event, context):
    body = {
        "text": pytesseract.image_to_string(Image.open('test.jpg')),
    }
    response = {
        "statusCode": 200,
        "body": json.dumps(body)
    }
    return response
To get the full-page text, we use the image_to_string method. At this point, we should get a result for the image test.jpg, which should be located in the same directory as handler.py. Running the invoke local serverless command should produce output with the full-page text from OCR.
serverless invoke local -f hello
It should produce output like this.
{
    "statusCode": 200,
    "body": "{\"text\": \"Test document PDF\n\nLorem ipsum dolor ...\"}"
}
Although it works, it can’t be used as a service yet. Our function will always return the text from test.jpg. Also, the name of our function isn’t really descriptive, right? Let’s rename it to ocr first.
import json

import pytesseract
from PIL import Image


def ocr(event, context):
    body = {
        "text": pytesseract.image_to_string(Image.open('test.jpg')),
    }
    response = {
        "statusCode": 200,
        "body": json.dumps(body)
    }
    return response
We should also update our serverless.yml configuration. We need to update the function name and handler property.
service: tesseract-aws-lambda

provider:
  name: aws
  runtime: python3.10

functions:
  ocr:
    handler: handler.ocr
After the function is renamed, we should add the ability to upload any image. We want to expose our Lambda function through a REST API, using the AWS Lambda + API Gateway integration. It’s quite easy to set up with the Serverless framework.
We need to add an events property to our function definition inside serverless.yml. The function will be triggered by an HTTP POST request from API Gateway on the path /ocr.
service: tesseract-aws-lambda

provider:
  name: aws
  runtime: python3.10

functions:
  ocr:
    handler: handler.ocr
    events:
      - http:
          path: ocr
          method: post
Now that configuration is set up, we need to rewrite our function.
- We need to get an image from the API Gateway request.
- OCR should be done on the uploaded image.
- Finally, we return the result.
API Gateway passes the request data to Lambda inside the event function parameter. By default, serverless uses the Lambda-proxy integration for invoking the function from API Gateway. To test our function, we need an example of the event that API Gateway will send to it.
Let’s create a file called lambda_event.json inside our project’s directory. It will contain an example event from API Gateway. You can get a basic example here. We will use it in our test.
Write tests first
Firstly, install pytest.
pip install pytest
If we want to write the test first, we should define the POST body for uploading an image. To keep things simple, we will accept a JSON body with the key image and a base64 encoded string of the image as its value.
{
    "image": ""
}
Consequently, we need to update the value of the body inside our lambda_event.json. Make sure to include a real base64 encoded string. You can download the example here.
{
    "resource": "/",
    "path": "/",
    "httpMethod": "POST",
    "headers": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-GB,en-US;q=0.8,en;q=0.6,zh-CN;q=0.4",
        "cache-control": "max-age=0",
        "CloudFront-Forwarded-Proto": "https",
        "CloudFront-Is-Desktop-Viewer": "true",
        "CloudFront-Is-Mobile-Viewer": "false",
        "CloudFront-Is-SmartTV-Viewer": "false",
        "CloudFront-Is-Tablet-Viewer": "false",
        "CloudFront-Viewer-Country": "GB",
        "content-type": "application/x-www-form-urlencoded",
        "Host": "j3ap25j034.execute-api.eu-west-2.amazonaws.com",
        "origin": "https://j3ap25j034.execute-api.eu-west-2.amazonaws.com",
        "Referer": "https://j3ap25j034.execute-api.eu-west-2.amazonaws.com/dev/",
        "upgrade-insecure-requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
        "Via": "2.0 a3650115c5e21e2b5d133ce84464bea3.cloudfront.net (CloudFront)",
        "X-Amz-Cf-Id": "0nDeiXnReyHYCkv8cc150MWCFCLFPbJoTs1mexDuKe2WJwK5ANgv2A==",
        "X-Amzn-Trace-Id": "Root=1-597079de-75fec8453f6fd4812414a4cd",
        "X-Forwarded-For": "50.129.117.14, 50.112.234.94",
        "X-Forwarded-Port": "443",
        "X-Forwarded-Proto": "https"
    },
    "queryStringParameters": null,
    "pathParameters": null,
    "stageVariables": null,
    "requestContext": {
        "path": "/dev/",
        "accountId": "125002137610",
        "resourceId": "qdolsr1yhk",
        "stage": "dev",
        "requestId": "0f2431a2-6d2f-11e7-b799-5152aa497861",
        "identity": {
            "cognitoIdentityPoolId": null,
            "accountId": null,
            "cognitoIdentityId": null,
            "caller": null,
            "apiKey": "",
            "sourceIp": "50.129.117.14",
            "accessKey": null,
            "cognitoAuthenticationType": null,
            "cognitoAuthenticationProvider": null,
            "userArn": null,
            "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
            "user": null
        },
        "resourcePath": "/",
        "httpMethod": "POST",
        "apiId": "j3azlsj0c4"
    },
    "body": "{\"image\": \"/9j/4A...\"}",
    "isBase64Encoded": false
}
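If you would rather generate the body value yourself than download the example, a small helper script like the hedged sketch below can inject a fresh base64 string into lambda_event.json (the file names match the ones used in this tutorial; the script itself is not part of the project):

import base64
import json

# Encode the test image used earlier in this tutorial.
with open('test.jpg', 'rb') as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode()

# Replace the body of the example event with a real payload.
# Note that body must be a JSON *string*, exactly as API Gateway sends it.
with open('lambda_event.json') as event_file:
    event = json.load(event_file)

event['body'] = json.dumps({'image': encoded_image})

with open('lambda_event.json', 'w') as event_file:
    json.dump(event, event_file, indent=2)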
Now that we have an example, we can write a test for our function.
import json
import pathlib

import pytest

from handler import ocr


@pytest.fixture
def lambda_event(request):
    file = pathlib.Path(request.node.fspath.strpath)
    event_file = file.with_name('lambda_event.json')
    with event_file.open() as fp:
        return json.load(fp)


def test_handler(lambda_event):
    lambda_response = ocr(lambda_event, {})
    # The handler returns body as a JSON string, so parse it first.
    body = json.loads(lambda_response['body'])

    assert lambda_response['statusCode'] == 200
    assert len(body['text']) > 0
We check that the status code is 200 and that the returned text is not empty. Note that the handler serializes body with json.dumps, so the test has to parse it before reading the text key.
Run the test.
pytest
The test passes because the function returns the expected result. That’s great, right?
Let’s refactor the code to use the image from the event.
Firstly, we load an image from the request body.
request_body = json.loads(event['body'])
image = io.BytesIO(base64.b64decode(request_body['image']))
It’s loaded into BytesIO so it can be opened with PIL.
Then we do OCR.
text = pytesseract.image_to_string(Image.open(image))
Our handler.py should now look like this.
import base64
import io
import json

import pytesseract
from PIL import Image


def ocr(event, context):
    request_body = json.loads(event['body'])
    image = io.BytesIO(base64.b64decode(request_body['image']))
    text = pytesseract.image_to_string(Image.open(image))

    body = {
        "text": text
    }
    response = {
        "statusCode": 200,
        "body": json.dumps(body)
    }
    return response
Now we have a working function that will do OCR on the uploaded image. We can test it again with serverless.
serverless invoke local -f ocr -p lambda_event.json
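If everything is wired up correctly, the output should resemble the earlier run, except that the text now comes from the image embedded in lambda_event.json (the exact text depends on your base64 payload):

{
    "statusCode": 200,
    "body": "{\"text\": \"Test document PDF\n\nLorem ipsum dolor ...\"}"
}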
So far so good.
Deploy Tesseract on AWS Lambda
We’ve done quite a bit so far. The function that will do OCR is written and tested. The project is configured to invoke our Lambda from the HTTP request sent to API Gateway. What could go wrong at this point?
As I mentioned above, pytesseract is just a wrapper around tesseract, so it’s not sufficient to just run pip install. The AWS Lambda service runs on Amazon Linux, so we need to build tesseract for that platform and add it to our deployment package.
Fortunately, we can use Docker to build tesseract on the selected platform.
After we build tesseract, we can add it to the AWS Lambda layer using the serverless framework. Layers can be used to add additional dependencies to AWS Lambda functions.
Building tesseract in Docker
To build tesseract, we will use the Amazon Linux based Docker image that AWS provides for Lambda. Firstly, we create the build script.
#!/usr/bin/env bash
# leptonica
cd ~
git clone https://github.com/DanBloomberg/leptonica.git
cd leptonica/
git checkout $LEPTONICA_VERSION # newer version crashes tesseract build for now. See https://github.com/tesseract-ocr/tesseract/issues/3815
./autogen.sh
./configure
make
make install
# tesseract
cd ~
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
git checkout $TESSERACT_VERSION
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure
make
make install
cd ~
mkdir tesseract-standalone
# copy files
cd tesseract-standalone
mkdir bin
cp /usr/local/bin/tesseract ./bin
mkdir lib
cp /usr/local/lib/libtesseract.so.5 lib/
cp /lib64/libpng15.so.15 lib/
cp /lib64/libtiff.so.5 lib/
cp /lib64/libgomp.so.1 lib/
cp /lib64/libjbig.so.2.0 lib/
cp /usr/local/lib/libleptonica.so.6 lib/
cp /usr/lib64/libjpeg.so.62 lib/
cp /usr/lib64/libwebp.so.4 lib/
cp /usr/lib64/libstdc++.so.6 lib/
# copy training data
mkdir tessdata
cd tessdata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
# archive
cd ~
cd tesseract-standalone
mkdir python
cd ~
# trim unneeded ~ 15 MB
strip ./tesseract-standalone/**/*
cd tesseract-standalone
zip -r9 ../tesseract.zip *
# copy the archive to the mounted volume so it ends up on the host
cp ~/tesseract.zip /tmp/build/
The script downloads, builds, and installs Leptonica and tesseract. Then it creates a ZIP archive with the tesseract binary, its shared libraries, and the trained data file, and copies the archive to the mounted output directory.
The next step is to create a Docker image in which we can build tesseract. We will take public.ecr.aws/lambda/python:3.10-x86_64 as our base image and install the build dependencies for Leptonica and tesseract. Lastly, we add the build script to the image.
FROM public.ecr.aws/lambda/python:3.10-x86_64
ENV LEPTONICA_VERSION="1.83.0"
ENV TESSERACT_VERSION="5.3.0"
WORKDIR /tmp/
RUN yum install -y autoconf automake cmake gcc gcc-c++ freetype-devel git \
    lcms2-devel libjpeg-devel libjpeg-turbo-devel autogen libtool \
    libpng-devel libtiff-devel libwebp-devel libzip-devel make zlib-devel zip
COPY build_tesseract.sh /tmp/build_tesseract.sh
RUN chmod +x /tmp/build_tesseract.sh
CMD sh /tmp/build_tesseract.sh
Let’s build the Docker image.
docker build -t tesseract .
Now create a new directory, called tesseract, inside the project.
mkdir tesseract
Great, we are ready to build tesseract now.
Let’s run a container from the previously built image to start the tesseract build. The Lambda base images define their own ENTRYPOINT, so we override it to run the build script directly.
docker run --entrypoint /bin/sh -v $PWD/tesseract:/tmp/build tesseract /tmp/build_tesseract.sh
Now wait until it finishes …
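Before moving on, it’s worth checking that the archive contains what we expect. Here is a minimal sketch in Python, assuming the build wrote tesseract.zip into the mounted tesseract/ directory:

import zipfile

# List the contents of the build artifact; we expect bin/tesseract,
# the shared libraries under lib/, and tessdata/eng.traineddata.
with zipfile.ZipFile('tesseract/tesseract.zip') as archive:
    for name in archive.namelist():
        print(name)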
Create a Lambda layer
Tesseract is finally built for AWS Lambda. Now we can create an AWS Lambda layer.
Create a new directory, called layer, inside the project.
mkdir layer
Afterwards, extract tesseract.zip into the layer directory and create a directory for Python packages inside it.
unzip tesseract/tesseract.zip -d layer
mkdir -p layer/python/lib/python3.10/site-packages/
After that, we install pytesseract to this location.
pip install pytesseract -t layer/python/lib/python3.10/site-packages/
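At this point the layer directory should look roughly like this (the exact shared library names depend on the versions you built):

layer/
├── bin/
│   └── tesseract
├── lib/
│   ├── libtesseract.so.5
│   ├── libleptonica.so.6
│   └── ...
├── tessdata/
│   └── eng.traineddata
└── python/
    └── lib/python3.10/site-packages/
        └── pytesseract/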
We have everything that we need for our layer. Let’s edit serverless.yml: we add the layer definition and attach the layer to our function. Files and folders located inside the layer directory will be added to the AWS Lambda layer.
We also include handler.py in the deployment package of our function and exclude all other files to keep the deploy archives small.
service: tesseract-aws-lambda

provider:
  name: aws
  runtime: python3.10

package:
  exclude:
    - .idea/**
    - __pycache__/**
    - .pytest_cache/**
    - tesseract/**
    - venv/**
    - build_tesseract.sh
    - Dockerfile
    - lambda_event.json
    - requirements.txt
    - serverless.yml
    - test.jpg
    - test_handler.py
    - use_ocr_as_a_service.py
    - layer/**

layers:
  OCR:
    path: layer
    name: ocr-layer
    description: Layer with Tesseract
    compatibleRuntimes:
      - python3.10
    retain: false
    package:
      include:
        - layer/**

functions:
  ocr:
    handler: handler.ocr
    memorySize: 3008
    timeout: 15
    layers:
      - {Ref: OCRLambdaLayer}
    events:
      - http:
          path: ocr
          method: post
    package:
      include:
        - handler.py
Environment variables for running tesseract on AWS Lambda
There is one thing left to do in our project. Since we’ve added tesseract to an AWS Lambda layer, the tesseract executable, libraries, and trained data are no longer at their default locations. Consequently, we have to set a few environment variables when the code executes on AWS Lambda. Layer content is mounted under /opt/, so the binary built by our script ends up at /opt/bin/tesseract, the libraries in /opt/lib, and the trained data in /opt/tessdata. We need to update our handler.py:
import base64
import io
import json
import os

import pytesseract
from PIL import Image

if os.getenv('AWS_EXECUTION_ENV') is not None:
    # On AWS Lambda, layer content is mounted under /opt/.
    os.environ['LD_LIBRARY_PATH'] = '/opt/lib'
    os.environ['TESSDATA_PREFIX'] = '/opt/tessdata'
    # The build script placed the binary in bin/, so it lives at /opt/bin/tesseract.
    pytesseract.pytesseract.tesseract_cmd = '/opt/bin/tesseract'


def ocr(event, context):
    request_body = json.loads(event['body'])
    image = io.BytesIO(base64.b64decode(request_body['image']))
    text = pytesseract.image_to_string(Image.open(image))

    body = {
        "text": text
    }
    response = {
        "statusCode": 200,
        "body": json.dumps(body)
    }
    return response
AWS_EXECUTION_ENV is an environment variable set by the AWS Lambda execution environment. When it is present, the code is running on AWS Lambda, so we point pytesseract at the layer’s trained data, libraries, and tesseract binary.
Let’s deploy Tesseract on AWS Lambda
We are almost done. All that’s left is the actual deployment and a test of our REST API for OCR. Deployment can be done using the serverless framework CLI.
serverless deploy --stage dev
Wait until it’s finished …
Great, now we can test our OCR as a service for real. The URL at which your service is available was printed by the serverless deploy command. It should look like this:
https://zs23rtvcp9.execute-api.us-east-1.amazonaws.com/dev/ocr
Go and install the requests package.
pip install requests
We add a file called use_ocr_as_a_service.py to test our OCR service for real.
import base64

import requests

with open('test.jpg', 'rb') as file:
    base64_str = base64.b64encode(file.read()).decode()

response = requests.post(
    'URL_OF_YOUR_ENDPOINT',
    json={
        'image': base64_str
    }
)
print(response.json())
Run this script (don’t forget to set the URL of your OCR service). You should see output like this:
{'text': 'Test document PDF\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ...'}
After you’ve finished playing around, don’t forget to remove this project.
serverless remove --stage dev
Conclusion
We’ve come a long way in this article. As you can see, you can run many different things on AWS Lambda, although in cases such as tesseract you have to build the libraries yourself. Now that you know how to run tesseract on AWS Lambda, you can set up your own OCR service. And when OCR alone is not enough and you need advanced data extraction, check out typless and save yourself time and hassle.