Tesseract on AWS Lambda – OCR as a service

There are many great OCR engines out there. One of them is Tesseract, which is widely used because it’s open-source and free. In this article, we will take a look at how to run Tesseract on AWS Lambda to create OCR as a service accessible through a REST API.

In this article we will cover the following topics:

  • Creating a project with the serverless framework.
  • Building tesseract to run on AWS Lambda.
  • Writing the function for AWS Lambda that will be triggered by the AWS API Gateway HTTP request.
  • Creating an AWS Lambda layer.
  • Deploying a serverless project.

Prerequisites

To follow this tutorial you will need:

 

Don’t worry if you are missing any of these. The article is written step by step so you can follow along.

All the code can be downloaded from this GitHub repository.

Tesseract on AWS Lambda – where to start

What is the serverless framework?

Serverless is a framework for developing and deploying web applications on cloud platforms such as AWS without dedicated servers. Those applications use the power of services like AWS Lambda.

Therefore, we will use it to write our serverless application – OCR as a service.

Install the serverless framework

Firstly, you should install the serverless framework on your computer (follow this guide in case of any problems).

				
					npm install -g serverless
				
			

After serverless is installed, it’s time to create a new serverless project for our OCR service. We can use the serverless command to create a new project.

				
					serverless
				
			

The command above will ask whether you want to create a new project. Type Y and press ENTER.

				
					Serverless: No project detected. Do you want to create a new one? (Y/n)
				
			

Then you need to select which template to use for your project. We will write our service in Python. Therefore, select – AWS Python.

				
					Serverless: What do you want to make? 
  AWS Node.js 
❯ AWS Python 
  Other 
				
			

After that, you should choose a name for the project.

				
					Serverless: What do you want to call this project?
				
			

Write tesseract-aws-lambda and press ENTER.

 

When you are asked if you want to enable the free account, select n.

				
					Serverless: Would you like to enable this? (Y/n)
				
			

Do the same for the tab completion.

				
					Serverless: Would you like to setup a command line <tab> completion? (Y/n)
				
			

So what have we done so far? We have the Serverless framework installed and a project created.

Our newly created project has 3 files:

  • .gitignore
  • handler.py – the file where our function that will run on AWS Lambda is located.
  • serverless.yml – Configuration for serverless.

All of them already contain some boilerplate code to get you started. Our code lives inside handler.py, which already contains a function called hello. Functions are registered with AWS Lambda inside the serverless.yml file.

				
					functions:
  hello:
    handler: handler.hello
				
			

Besides function definitions, it contains other configuration for the deployment of our service. We can configure settings such as:

  • AWS Lambda runtime,
  • service name,
  • timeout for AWS Lambda.
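For illustration, a minimal serverless.yml touching those settings might look like this (the timeout value is just an example):

```yaml
service: tesseract-aws-lambda

provider:
  name: aws
  runtime: python3.10   # AWS Lambda runtime

functions:
  hello:
    handler: handler.hello
    timeout: 15         # seconds before AWS Lambda aborts the invocation
```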

The serverless framework contains many useful commands. One of them is invoke local. We can use it to test the hello function.

				
					serverless invoke local -f hello
				
			

It should produce output like this:

				
					{
    "statusCode": 200,
    "body": "{\"input\": {}, \"message\": \"Go Serverless v1.0! Your function executed successfully!\"}"
}
				
			

Using Tesseract on AWS Lambda

First steps

Now we know where to put our code and configuration, and how to test our function with the serverless command, so we can start adding tesseract.

Firstly, we should install pytesseract.

				
					pip install pytesseract
				
			

The pytesseract library is a wrapper around the Tesseract OCR engine (you can follow this guide to install Tesseract if you don’t have it already). After pytesseract is installed, we can check the OCR results.

Add an image called test.jpg to your project’s directory. You can get an example here.

After that, import pytesseract in your handler.py and use it inside the hello function.

				
					import json
import pytesseract
from PIL import Image


def hello(event, context):

    body = {
        "text": pytesseract.image_to_string(Image.open('test.jpg')),
    }

    response = {
        "statusCode": 200,
        "body": json.dumps(body)
    }

    return response
				
			

To get the full-page text, we use the image_to_string method. At this point, we should get a result for test.jpg, which should be located in the same directory as handler.py. Running the invoke local command should produce output with the full-page text from OCR.

				
					 serverless invoke local -f hello
				
			

It should produce output like this:

				
					{
"statusCode": 200,
"body": "{\"text\": \"Test document PDF\n\nLorem ipsum dolor ...\"}"
}
				
			

Although it works, it can’t be used as a service yet: our function will always return the text from test.jpg. Also, the name of our function isn’t really descriptive, right? Let’s rename it to ocr first.

				
					import json
import pytesseract
from PIL import Image


def ocr(event, context):

    body = {
        "text": pytesseract.image_to_string(Image.open('test.jpg')),
    }

    response = {
        "statusCode": 200,
        "body": json.dumps(body)
    }

    return response
				
			

We should also update our serverless.yml configuration. We need to update the function name and handler property.

				
					service: tesseract-aws-lambda

provider:
  name: aws
  runtime: python3.10

functions:
  ocr:
    handler: handler.ocr
				
			

After the function is renamed, we should add the ability to upload any image. We want to expose our Lambda function through a REST API, which we can do with the AWS Lambda + API Gateway integration. The serverless framework makes this quite easy.

We need to add an events property to our function definition inside serverless.yml. The function will be triggered by an HTTP POST request from API Gateway on the path /ocr.

				
					service: tesseract-aws-lambda

provider:
  name: aws
  runtime: python3.10

functions:
  ocr:
    handler: handler.ocr
    events:
      - http:
          path: ocr
          method: post
				
			

Now that the configuration is set up, we need to rewrite our function.

 

  • We need to get an image from the API Gateway request.
  • OCR should be done on the uploaded image.
  • Finally, we return the result.

 

API Gateway passes the request data to Lambda inside the event function parameter. By default, serverless uses Lambda-proxy integration for invoking the function from API Gateway. To test our function, we need an example of the event that API Gateway will send to it.
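To make that contract concrete, here is a small sketch (an illustrative echo handler, not our final function): with Lambda-proxy integration, the handler receives the whole HTTP request as a dict, and the raw request body arrives as a string that has to be parsed.

```python
import json

def echo_handler(event, context):
    # The raw request body arrives as a string under the 'body' key.
    request_body = json.loads(event["body"])
    # Proxy integration expects a dict with statusCode and a string body back.
    return {
        "statusCode": 200,
        "body": json.dumps({"received": request_body}),
    }

# Simulate an API Gateway invocation locally with a minimal fake event.
fake_event = {"httpMethod": "POST", "body": json.dumps({"image": "<base64str>"})}
result = echo_handler(fake_event, None)
```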

 

Let’s create a file called lambda_event.json inside our project’s directory. It will contain an example event from API Gateway. You can get a basic example here. We will use it in our test.

Write tests first

Firstly, install pytest.

				
					 pip install pytest
				
			

If we want to write the tests first, we should define the POST body for uploading an image. To keep things simple, we will accept a JSON body with the key image and a base64-encoded string of the image as the value.

				
					{
  "image": "<base64str>"
}
				
			

Consequently, we need to update the value of body inside our lambda_event.json. Make sure to include a real base64-encoded string. You can download the example here.
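If you’d rather generate the base64 string yourself, here is a short sketch (the byte literal stands in for real image bytes, which you would read from a file such as test.jpg):

```python
import base64
import json

def make_request_body(image_bytes: bytes) -> str:
    """Build the JSON body our endpoint expects from raw image bytes."""
    image_b64 = base64.b64encode(image_bytes).decode()
    return json.dumps({"image": image_b64})

# Stand-in bytes; in practice use open('test.jpg', 'rb').read()
body = make_request_body(b"\xff\xd8\xff\xe0fake-jpeg-bytes")

# Decoding the base64 value recovers the original bytes unchanged.
decoded = base64.b64decode(json.loads(body)["image"])
```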

				
					{
  "resource": "/",
  "path": "/",
  "httpMethod": "POST",
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-GB,en-US;q=0.8,en;q=0.6,zh-CN;q=0.4",
    "cache-control": "max-age=0",
    "CloudFront-Forwarded-Proto": "https",
    "CloudFront-Is-Desktop-Viewer": "true",
    "CloudFront-Is-Mobile-Viewer": "false",
    "CloudFront-Is-SmartTV-Viewer": "false",
    "CloudFront-Is-Tablet-Viewer": "false",
    "CloudFront-Viewer-Country": "GB",
    "content-type": "application/x-www-form-urlencoded",
    "Host": "j3ap25j034.execute-api.eu-west-2.amazonaws.com",
    "origin": "https://j3ap25j034.execute-api.eu-west-2.amazonaws.com",
    "Referer": "https://j3ap25j034.execute-api.eu-west-2.amazonaws.com/dev/",
    "upgrade-insecure-requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Via": "2.0 a3650115c5e21e2b5d133ce84464bea3.cloudfront.net (CloudFront)",
    "X-Amz-Cf-Id": "0nDeiXnReyHYCkv8cc150MWCFCLFPbJoTs1mexDuKe2WJwK5ANgv2A==",
    "X-Amzn-Trace-Id": "Root=1-597079de-75fec8453f6fd4812414a4cd",
    "X-Forwarded-For": "50.129.117.14, 50.112.234.94",
    "X-Forwarded-Port": "443",
    "X-Forwarded-Proto": "https"
  },
  "queryStringParameters": null,
  "pathParameters": null,
  "stageVariables": null,
  "requestContext": {
    "path": "/dev/",
    "accountId": "125002137610",
    "resourceId": "qdolsr1yhk",
    "stage": "dev",
    "requestId": "0f2431a2-6d2f-11e7-b799-5152aa497861",
    "identity": {
      "cognitoIdentityPoolId": null,
      "accountId": null,
      "cognitoIdentityId": null,
      "caller": null,
      "apiKey": "",
      "sourceIp": "50.129.117.14",
      "accessKey": null,
      "cognitoAuthenticationType": null,
      "cognitoAuthenticationProvider": null,
      "userArn": null,
      "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
      "user": null
    },
    "resourcePath": "/",
    "httpMethod": "POST",
    "apiId": "j3azlsj0c4"
  },
  "body": "{\"image\": \"/9j/4A...\"}",
  "isBase64Encoded": false
}
				
			

Now that we have an example, we can write a test for our function.

 


				
					import json
import pathlib

import pytest

from handler import ocr


@pytest.fixture
def lambda_event(request):
    file = pathlib.Path(request.node.fspath.strpath)
    event_file = file.with_name('lambda_event.json')
    with event_file.open() as fp:
        return json.load(fp)


def test_handler(lambda_event):
    lambda_response = ocr(lambda_event, {})
    assert lambda_response['statusCode'] == 200
    # The body is a JSON string, so parse it before checking the text.
    body = json.loads(lambda_response['body'])
    assert len(body['text']) > 0

				
			

We should check that the status code is 200 and that text is not empty.

Run the test.

				
					pytest
				
			

The test passes because the function returns the expected result. That’s great, right?

 

Let’s refactor the code to use the image from the event.

Firstly, we load an image from the request body.

				
					    request_body = json.loads(event['body'])
    image = io.BytesIO(base64.b64decode(request_body['image']))
				
			

It’s loaded into a BytesIO object so it can be opened with PIL.

Then we do OCR.

				
					text = pytesseract.image_to_string(Image.open(image))
				
			

Our handler.py should now look like this.

				
					import base64
import io
import json
import pytesseract
from PIL import Image


def ocr(event, context):

    request_body = json.loads(event['body'])
    image = io.BytesIO(base64.b64decode(request_body['image']))

    text = pytesseract.image_to_string(Image.open(image))

    body = {
        "text": text
    }

    response = {
        "statusCode": 200,
        "body": json.dumps(body)
    }

    return response
				
			

Now we have a working function that will do OCR on the uploaded image. We can test it again with serverless.

				
					serverless invoke local -f ocr -p lambda_event.json
				
			

So far so good.

Deploy Tesseract on AWS Lambda

We’ve done quite a bit so far. The function that will do OCR is written and tested. The project is configured to invoke our Lambda from the HTTP request sent to API Gateway. What could go wrong at this point?

 

As I mentioned above, pytesseract is just a wrapper around tesseract, so running pip install is not sufficient. AWS Lambda runs on Amazon Linux. Therefore, we need to build tesseract on that platform and add it to our deployment package.

 

Fortunately, we can use Docker to build tesseract for the target platform.

After we build tesseract, we can add it to an AWS Lambda layer using the serverless framework. Layers can be used to add additional dependencies to AWS Lambda functions.

Building tesseract in Docker

To build tesseract we will use the Amazon Linux Docker image. Firstly, we create a script for building.

				
					#!/usr/bin/env bash
# docker_build tesseract

cd ~
git clone https://github.com/DanBloomberg/leptonica.git
cd leptonica/
git checkout $LEPTONICA_VERSION # newer version crashes tesseract build for now. See https://github.com/tesseract-ocr/tesseract/issues/3815
./autogen.sh
./configure
make
make install


# tesseract
cd ~
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
git checkout $TESSERACT_VERSION
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure
make
make install


cd ~
mkdir tesseract-standalone
# copy files
cd tesseract-standalone


mkdir bin
cp /usr/local/bin/tesseract ./bin


mkdir lib
cp /usr/local/lib/libtesseract.so.5 lib/
cp /lib64/libpng15.so.15 lib/
cp /lib64/libtiff.so.5 lib/
cp /lib64/libgomp.so.1 lib/
cp /lib64/libjbig.so.2.0 lib/
cp /usr/local/lib/libleptonica.so.6 lib/
cp /usr/lib64/libjpeg.so.62 lib/
cp /usr/lib64/libwebp.so.4 lib/
cp /usr/lib64/libstdc++.so.6 lib/


# copy training data
mkdir tessdata
cd tessdata


wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata


# archive
cd ~
cd tesseract-standalone
mkdir python


cd ~
# trim unneeded ~ 15 MB
strip ./tesseract-standalone/**/*


cd tesseract-standalone


zip -r9 ../tesseract.zip *

# copy the archive to the mounted volume so it ends up on the host
cp ../tesseract.zip /tmp/build/
				
			

The script downloads, builds, and installs tesseract. Then it creates a ZIP archive with the tesseract binary, its libraries, and the trained data files.

The next step is to create a Docker image in which we can build tesseract. We will use public.ecr.aws/lambda/python:3.10-x86_64 as our base image and add the build dependencies. Lastly, we add the build script to the image.

				
					FROM public.ecr.aws/lambda/python:3.10-x86_64

ENV LEPTONICA_VERSION="1.83.0"
ENV TESSERACT_VERSION="5.3.0"
WORKDIR /tmp/

RUN yum install -y autoconf automake cmake gcc gcc-c++ freetype-devel git \
lcms2-devel libjpeg-devel libjpeg-turbo-devel autogen libtool \
libpng-devel libtiff-devel libwebp-devel libzip-devel make zlib-devel zip


COPY build_tesseract.sh /tmp/build_tesseract.sh
RUN chmod +x /tmp/build_tesseract.sh
CMD sh /tmp/build_tesseract.sh
				
			

Let’s build the Docker image.

				
					docker build -t tesseract .
				
			

Now create a new directory, called tesseract, inside the project.

				
					mkdir tesseract
				
			

Great, we are ready to build tesseract now.

Let’s run a container from the previously built image to start the tesseract build.

				
					docker run -v $PWD/tesseract:/tmp/build tesseract sh /tmp/build_tesseract.sh
				
			

Now wait until it finishes …

Create a Lambda layer

Tesseract is finally built for AWS Lambda. Now we can create an AWS Lambda layer.

Create a new directory, called layer, inside the project.

				
					mkdir layer
				
			

Afterwards, extract tesseract.zip to the directory layer.

				
					unzip tesseract/tesseract.zip -d layer
				
			
AWS Lambda layers are mounted at /opt/. Python packages located in /opt/python/lib/python3.10/site-packages/ are automatically added to the path inside AWS Lambda. Therefore, we create the python/lib/python3.10/site-packages/ structure inside our layer directory.
				
					mkdir -p layer/python/lib/python3.10/site-packages/
				
			

After that, we install pytesseract to this location.

				
					pip install pytesseract -t layer/python/lib/python3.10/site-packages/
				
			

We have everything we need for our layer. Let’s edit serverless.yml: we need to add the layer definition and attach it to the function. Files and folders located inside the layer directory will be added to the AWS Lambda layer.

We also include handler.py in the deployment package of our function and exclude all other files to keep the deployment archives small.

				
					service: tesseract-aws-lambda

provider:
  name: aws
  runtime: python3.10

package:
  exclude:
    - .idea/**
    - __pycache__/**
    - .pytest_cache/**
    - tesseract/**
    - venv/**
    - build_tesseract.sh
    - Dockerfile
    - lambda_event.json
    - requirements.txt
    - serverless.yml
    - test.jpg
    - test_handler.py
    - use_ocr_as_a_service.py
    - layer/**

layers:
  OCR:
    path: layer
    name: ocr-layer
    description: Layer with Tesseract
    compatibleRuntimes:
      - python3.10
    retain: false
    package:
      include:
        - layer/**

functions:
  ocr:
    handler: handler.ocr
    memorySize: 3008
    timeout: 15
    layers:
      - {Ref: OCRLambdaLayer}
    events:
      - http:
          path: ocr
          method: post
    package:
      include:
        - handler.py
				
			

Environment variables for running tesseract on AWS Lambda

There is one thing left to do in our project. Since we’ve added tesseract to an AWS Lambda layer, the tesseract executable, libraries, and trained data won’t be located at their default locations anymore. Consequently, we have to set environment variables when the code is executing on AWS Lambda. As I mentioned before, layer content is located in /opt/. We need to update our handler.py:

				
					import base64
import io
import json
import os

import pytesseract
from PIL import Image

if os.getenv('AWS_EXECUTION_ENV') is not None:
    os.environ['LD_LIBRARY_PATH'] = '/opt/lib'
    os.environ['TESSDATA_PREFIX'] = '/opt/tessdata'
    pytesseract.pytesseract.tesseract_cmd = '/opt/tesseract'


def ocr(event, context):

    request_body = json.loads(event['body'])
    image = io.BytesIO(base64.b64decode(request_body['image']))

    text = pytesseract.image_to_string(Image.open(image))

    body = {
        "text": text
    }

    response = {
        "statusCode": 200,
        "body": json.dumps(body)
    }

    return response
				
			

AWS_EXECUTION_ENV is an environment variable set in the AWS Lambda environment; when it is present, the code is running on AWS Lambda. In that case, we set the environment variables for the trained data and libraries, as well as the tesseract command path.

Let’s deploy Tesseract on AWS Lambda

We are almost done. All that’s left is the actual deployment and a test of our REST API for OCR. Deployment can be done using the serverless framework CLI.

				
					serverless deploy --stage dev
				
			

Wait until it’s finished …

 

Great, now we can test our OCR as a service for real. The URL at which your service is available was printed by the serverless deploy command. It should look like this:

				
					https://zs23rtvcp9.execute-api.us-east-1.amazonaws.com/dev/ocr
				
			

Next, install the requests package.

				
					pip install requests
				
			

We add a file called use_ocr_as_a_service.py to test our OCR service for real.

				
					import base64

import requests

with open('test.jpg', 'rb') as file:
    base64_str = base64.b64encode(file.read()).decode()


response = requests.post(
    'URL_OF_YOUR_ENDPOINT',
    json={
        'image': base64_str
    }
)

print(response.json())
				
			

Run this script – don’t forget to set the URL of your OCR service. You should see output like this:

				
					{'text': 'Test document PDF\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla est purus, ...'}
				
			

After you’ve finished playing around, don’t forget to remove this project.

				
					serverless remove --stage dev
				
			

Conclusion

We’ve come a long way in this article. As you can see, you can run many different things on AWS Lambda, although in cases such as tesseract you have to build the libraries yourself. Now that you know how to run tesseract on AWS Lambda, you can set up your own OCR service. When OCR alone is not enough and you need advanced data extraction, check out typless and save yourself time and hassle.

Read more:

Scanning best practices for OCR
