Summary

There are many great OCR engines out there. One of them is Tesseract. It’s widely used because it’s open-source and free. In this article, we will look at how to run Tesseract on AWS Lambda to create OCR as a service, accessible through a REST API.

The following topics will be covered:

  • Creating a project with the serverless framework.
  • Building Tesseract to run on AWS Lambda.
  • Writing the function for AWS Lambda that will be triggered by an AWS API Gateway HTTP request.
  • Creating an AWS Lambda layer.
  • Deploying the serverless project.

Prerequisites

To follow this tutorial you will need:

  • an AWS account (for the deployment),
  • Node.js and npm (to install the serverless framework),
  • Python 3 and pip,
  • Docker (to build Tesseract for Amazon Linux),
  • Tesseract installed locally (for testing with invoke local).

Don’t worry if you are missing any of these. The tutorial is written step by step so you can follow along.

All the code can be downloaded from this GitHub repository.

Tesseract on AWS Lambda – where to start

What is the serverless framework?

Serverless is a framework for developing and deploying web applications on cloud platforms such as AWS without dedicated servers. Those applications use the power of services like AWS Lambda.

Therefore, we will use it to write our serverless application – OCR as a service.

Install the serverless framework

Firstly, you should install the serverless framework on your computer (follow this guide in case of any problems).

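Assuming Node.js and npm are already installed, the framework can be installed globally with npm:

```bash
npm install -g serverless
```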

After serverless is installed, it’s time to create a new serverless project for our OCR as a service. We can use the serverless command to create a new project.

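Running the bare serverless command in an empty directory starts the interactive project wizard:

```bash
serverless
```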

The above command will ask whether you want to create a new project. Type Y and press ENTER.


Then you need to select a template for your project. We will write our service in Python, so select AWS Python.


After that, you should define a name for the project.


Type tesseract-aws-lambda and press ENTER.

When you are asked whether you want to enable the free account, select n.


Do the same for tab completion.


So what have we done so far? We have the Serverless framework installed and a project created.

Our newly created project has 3 files:

  • .gitignore
  • handler.py – the file where our function that will run on AWS Lambda is located.
  • serverless.yml – the configuration for serverless.

All of them already contain some boilerplate code to make the first steps easier. Our code lives inside handler.py, which already contains a function called hello. We register functions for AWS Lambda inside the serverless.yml file.

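The generated serverless.yml looks roughly like this (trimmed to the relevant parts, with the runtime set to Python):

```yaml
service: tesseract-aws-lambda

provider:
  name: aws
  runtime: python3.7

functions:
  hello:
    handler: handler.hello
```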

Besides the function definition, it contains other configuration for the deployment of our service. We can configure settings such as:

  • the AWS Lambda runtime,
  • the service name,
  • the timeout for AWS Lambda.

The serverless framework provides many useful commands. One of them is invoke local. We can use it to test the hello function.

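We pass the function name with the -f flag:

```bash
serverless invoke local -f hello
```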

It should produce output like this:

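With the boilerplate hello function the output is along these lines (the exact message may differ between template versions):

```json
{
    "statusCode": 200,
    "body": "{\"message\": \"Go Serverless v1.0! Your function executed successfully!\", \"input\": {}}"
}
```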

Using Tesseract on AWS Lambda

First steps

Now we know where to put our code and configuration, and how to test our function with the serverless command. So we can start adding Tesseract.

Firstly, we should install pytesseract.

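Install it with pip (ideally inside a virtual environment):

```bash
pip install pytesseract
```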

The pytesseract library is a wrapper around the Tesseract OCR engine (you can follow this guide to install Tesseract if you don’t have it already). After pytesseract is installed, we can check the OCR results.

Add an image called test.jpg to your project’s directory. You can get an example here.

After that, import pytesseract in your handler.py and use it inside the hello function.

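A minimal handler.py at this stage could look like this:

```python
import json

import pytesseract
from PIL import Image


def hello(event, context):
    # Run OCR on a local test image and return the recognized text
    text = pytesseract.image_to_string(Image.open("test.jpg"))

    return {
        "statusCode": 200,
        "body": json.dumps({"text": text}),
    }
```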

To get the full-page text, we use the image_to_string method. At this point, we should get a result for the image test.jpg, which should be located in the same directory as handler.py. Running the invoke local serverless command should produce output with the full-page text from OCR.

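Run it the same way as before:

```bash
serverless invoke local -f hello
```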

It should print the full-page text recognized from test.jpg.


Although it works, it can’t be used as a service yet. Our function will always return the text from test.jpg. Also, the name of our function isn’t really descriptive, right? Let’s rename it to ocr first.

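Only the name changes for now:

```python
def ocr(event, context):
    text = pytesseract.image_to_string(Image.open("test.jpg"))

    return {
        "statusCode": 200,
        "body": json.dumps({"text": text}),
    }
```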

We should also update our serverless.yml configuration. We need to update the function name and handler property.

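The functions section of serverless.yml now points at handler.ocr:

```yaml
functions:
  ocr:
    handler: handler.ocr
```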

After the function is renamed, we should add the ability to upload any file. We want to expose our Lambda function through a REST API, which we can do with the AWS Lambda + API Gateway integration. It’s quite easy with the serverless framework.

We need to add an events property to our function definition inside serverless.yml. The function will be triggered by an HTTP POST request from API Gateway on the path /ocr.

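The http event below maps POST /ocr to our function:

```yaml
functions:
  ocr:
    handler: handler.ocr
    events:
      - http:
          path: ocr
          method: post
```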

Now that configuration is set up, we need to rewrite our function.

  • We need to get an image from the API Gateway request.
  • OCR should be done on the uploaded image.
  • Finally, we return the result.

API Gateway passes request data to Lambda inside the event function parameter. By default, serverless uses Lambda-proxy integration for invoking the function from API Gateway. To test our function, we need an example of the event that API Gateway will send to it.

Let’s create a file called lambda_event.json inside our project’s directory. It will contain an example event from API Gateway. You can get a basic example here. We will use it in our test.

Write tests first

Firstly, install pytest.

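Again with pip:

```bash
pip install pytest
```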

If we want to write the test first, we should define our POST body for uploading an image. To keep things simple, we will accept a JSON body with the key image and a base64-encoded string of the image as the value.

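The request body we expect looks like this:

```json
{
    "image": "<base64 encoded image>"
}
```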

Consequently, we need to update the value of body inside our lambda_event.json. Make sure to include a real base64-encoded string. You can download an example here.

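A trimmed sketch of such a Lambda-proxy event, with our JSON payload as the (string) body:

```json
{
    "httpMethod": "POST",
    "path": "/ocr",
    "headers": {
        "Content-Type": "application/json"
    },
    "body": "{\"image\": \"<real base64 encoded string>\"}"
}
```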

Now that we have an example, we can write a test for our function.

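A simple pytest test, assuming it lives in a file such as test_handler.py next to handler.py:

```python
import json

from handler import ocr


def test_ocr_returns_text():
    # Load the example API Gateway event
    with open("lambda_event.json") as f:
        event = json.load(f)

    response = ocr(event, None)
    body = json.loads(response["body"])

    # The status code should be 200 and the recognized text should not be empty
    assert response["statusCode"] == 200
    assert body["text"] != ""
```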

We should check that the status code is 200 and that text is not empty.

Run the test.

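From the project directory:

```bash
python -m pytest
```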

The test passes because the function returns the expected result. That’s great, right?

Let’s refactor the code to use the image from the event.

Firstly, we load an image from the request body.

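Something along these lines, inside the ocr function:

```python
body = json.loads(event["body"])
image_bytes = base64.b64decode(body["image"])
image = Image.open(BytesIO(image_bytes))
```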

It’s loaded into BytesIO so it can be opened with PIL.

Then we do OCR.

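The OCR call stays the same, just with the decoded image:

```python
text = pytesseract.image_to_string(image)
```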

Our handler.py should now look like this.

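Putting it together, a complete handler.py could look like this:

```python
import base64
import json
from io import BytesIO

import pytesseract
from PIL import Image


def ocr(event, context):
    # Decode the base64 image from the API Gateway request body
    body = json.loads(event["body"])
    image_bytes = base64.b64decode(body["image"])
    image = Image.open(BytesIO(image_bytes))

    # Run OCR on the uploaded image
    text = pytesseract.image_to_string(image)

    return {
        "statusCode": 200,
        "body": json.dumps({"text": text}),
    }
```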

Now we have a working function that will do OCR on the uploaded image. We can test it again with serverless.

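This time we pass the example event with --path:

```bash
serverless invoke local -f ocr --path lambda_event.json
```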

So far so good.

Deploy Tesseract on AWS Lambda

We’ve done quite a bit so far. The function that will do OCR is written and tested. The project is configured to invoke our Lambda from the HTTP request sent to API Gateway. What could go wrong at this point?

As I mentioned above, pytesseract is just a wrapper around Tesseract, so it’s not sufficient to just run pip install. The AWS Lambda service runs on Amazon Linux. Therefore, we need to build Tesseract for this platform and add it to our deployment package.

Fortunately, we can use Docker to build Tesseract for the target platform.

After we build Tesseract, we can add it to an AWS Lambda layer using the serverless framework. Layers can be used to add additional dependencies to AWS Lambda functions.

Building tesseract in Docker

To build Tesseract, we will use an Amazon Linux Docker image. Firstly, we create a build script.

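A rough sketch of such a build script (let’s call it build_tesseract.sh). The Tesseract version, download URLs, and the /output path are assumptions and may need adjusting:

```bash
#!/bin/bash
set -e

TESSERACT_VERSION=4.1.1

# Make sure the Leptonica built in the image is found by configure
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

# Download, build and install Tesseract from source
cd /tmp
curl -L "https://github.com/tesseract-ocr/tesseract/archive/${TESSERACT_VERSION}.tar.gz" | tar xz
cd "tesseract-${TESSERACT_VERSION}"
./autogen.sh
./configure
make
make install

# Download the English trained data
mkdir -p /tmp/build/tessdata
curl -L -o /tmp/build/tessdata/eng.traineddata \
    "https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata"

# Collect the binary and the shared libraries it needs
mkdir -p /tmp/build/bin /tmp/build/lib
cp /usr/local/bin/tesseract /tmp/build/bin/
cp -P /usr/local/lib/libtesseract.so* /tmp/build/lib/
cp -P /usr/local/lib/liblept.so* /tmp/build/lib/

# Package everything into a ZIP we can copy out of the container
cd /tmp/build
zip -r /output/tesseract.zip bin lib tessdata
```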

The script downloads, builds, and installs Tesseract. Then it creates a ZIP archive with the tesseract binary, its libraries, and the trained data files.

The next step is to create a Docker image where we can build Tesseract. We will take amazonlinux:2018.03.0.20200318.1 as our base image. We add the build dependencies and Leptonica. Lastly, we add the build script to the image.

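A sketch of the Dockerfile. The exact yum package names and the Leptonica version are assumptions:

```dockerfile
FROM amazonlinux:2018.03.0.20200318.1

# Build dependencies for Leptonica and Tesseract
RUN yum -y update && \
    yum -y install gcc gcc-c++ make autoconf automake libtool pkgconfig \
                   libjpeg-turbo-devel libpng-devel libtiff-devel zlib-devel \
                   tar gzip zip wget && \
    yum clean all

# Build and install Leptonica (required by Tesseract)
RUN cd /tmp && \
    wget http://www.leptonica.org/source/leptonica-1.79.0.tar.gz && \
    tar xzf leptonica-1.79.0.tar.gz && \
    cd leptonica-1.79.0 && \
    ./configure && \
    make && \
    make install

# Add the build script
COPY build_tesseract.sh /build_tesseract.sh
RUN chmod +x /build_tesseract.sh

CMD ["/build_tesseract.sh"]
```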

Let’s build the Docker image.

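From the project directory (the image tag is arbitrary):

```bash
docker build -t tesseract-lambda-builder .
```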

Now create a new directory, called tesseract, inside the project.

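From the project root:

```bash
mkdir tesseract
```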

Great, we are ready to build Tesseract now.

Let’s run a container from the previously built image to start building Tesseract.

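We mount the tesseract directory into the container as /output, so the resulting tesseract.zip ends up in our project (the paths match the assumptions in the build script above):

```bash
docker run --rm -v "$(pwd)/tesseract:/output" tesseract-lambda-builder
```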

Now wait until it finishes …

Create a Lambda layer

Tesseract is finally built for AWS Lambda. Now we can create an AWS Lambda layer.

Create a new directory, called layer, inside the project.

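Again from the project root:

```bash
mkdir layer
```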

Afterwards, extract tesseract.zip into the layer directory.

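Assuming the ZIP ended up in the tesseract directory:

```bash
unzip tesseract/tesseract.zip -d layer/
```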

AWS Lambda layers are mounted at /opt/. Python packages located in /opt/python/lib/python3.7/site-packages/ are automatically added to the path inside AWS Lambda. Therefore, we create the python/lib/python3.7/site-packages/ structure inside our layer directory.

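One command is enough:

```bash
mkdir -p layer/python/lib/python3.7/site-packages
```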

After that, we install pytesseract to this location.

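pip can install packages into an arbitrary target directory with -t:

```bash
pip install pytesseract -t layer/python/lib/python3.7/site-packages/
```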

We have everything we need for our layer. Let’s edit serverless.yml. We need to add the layer definition and attach the layer to the function. Files and folders located inside the layer directory will be added to the AWS Lambda layer.

We also include handler.py in the deployment package of our function and exclude all other files to keep the deployment archive small.

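A sketch of the resulting serverless.yml. The { Ref: TesseractLambdaLayer } reference follows the framework’s naming convention for layers defined under layers:

```yaml
service: tesseract-aws-lambda

provider:
  name: aws
  runtime: python3.7

package:
  exclude:
    - ./**
  include:
    - handler.py

layers:
  tesseract:
    path: layer

functions:
  ocr:
    handler: handler.ocr
    layers:
      - { Ref: TesseractLambdaLayer }
    events:
      - http:
          path: ocr
          method: post
```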

Environment variables for running tesseract on AWS Lambda

There is one thing left to do in our project. Since we’ve added Tesseract to an AWS Lambda layer, the tesseract executable, libraries, and trained data won’t be located at their default locations anymore. Consequently, we have to set environment variables when the code is executing on AWS Lambda. As I mentioned before, the layer content is located in /opt/. We need to update our handler.py:

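At the top of handler.py, something like this. The exact paths depend on how the ZIP was laid out; here they match the bin/, lib/, and tessdata/ structure used above:

```python
import os

import pytesseract

# When running on AWS Lambda, point everything at the layer contents in /opt/
if "AWS_EXECUTION_ENV" in os.environ:
    os.environ["TESSDATA_PREFIX"] = "/opt/tessdata"
    os.environ["LD_LIBRARY_PATH"] = "/opt/lib"
    pytesseract.pytesseract.tesseract_cmd = "/opt/bin/tesseract"
```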

AWS_EXECUTION_ENV is an environment variable set in the AWS Lambda environment. When it is set, we know the code is running on AWS Lambda. Therefore, we set the environment variables for the trained data and libraries, and the tesseract command path.

Let’s deploy Tesseract on AWS Lambda

We are almost done. All that’s left is the actual deployment and a test of our REST API for OCR. Deployment can be done using the serverless framework CLI.

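A single command does it all:

```bash
serverless deploy
```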

Wait until it’s finished …

Great, now we can test our OCR as a service for real. The URL at which your service is available was printed by the serverless deploy command. It should look like this:

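For example (the API id and region will differ):

```
https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com/dev/ocr
```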

Go and install the requests package.

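Install it with pip:

```bash
pip install requests
```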

We add a file called use_ocr_as_a_service.py to call our OCR service.

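A small client script along these lines will do. Replace OCR_URL with the endpoint from your deploy output:

```python
import base64

import requests

# Replace with the endpoint printed by `serverless deploy`
OCR_URL = "https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com/dev/ocr"

# Read the test image and base64 encode it
with open("test.jpg", "rb") as image_file:
    encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

# Call the OCR service and print the recognized text
response = requests.post(OCR_URL, json={"image": encoded_image})
print(response.json()["text"])
```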

Run this script – don’t forget to set the URL of your OCR service. You should see the text recognized from the image printed to the console.


After you’ve finished playing around, don’t forget to remove this project.

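This tears down the whole stack (the function, the layer, and the API Gateway endpoint):

```bash
serverless remove
```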

Conclusion

We’ve come a long way in this article. As you can see, you can run many different things on AWS Lambda, although in cases such as Tesseract you have to build the libraries yourself. Now that you know how to run Tesseract on AWS Lambda, you can set up your own OCR service. When OCR alone is not enough and you need advanced data extraction, check out typless and save yourself time and hassle.