There are many great OCR engines out there. One of them is Tesseract. It’s widely used because it’s open-source and free to use. In this article, we will look at how to run Tesseract on AWS Lambda to create OCR as a service accessible through a REST API.
The following topics will be covered:
- Creating a project with the serverless framework.
- Building Tesseract to run on AWS Lambda.
- Writing the function for AWS Lambda that will be triggered by the AWS API Gateway HTTP request.
- Creating an AWS Lambda layer.
- Deploying a serverless project.
To follow this tutorial you will need:
- an AWS account,
- basic knowledge of AWS Lambda,
- basic knowledge of the Serverless framework,
- basic knowledge of Docker,
- basic knowledge of Python.
Don’t worry if you’re missing any of these. The tutorial is written step by step so you can follow along.
All the code can be downloaded from this GitHub repository.
Tesseract on AWS Lambda – where to start
What is the serverless framework?
Serverless is a framework for developing and deploying web applications on cloud platforms such as AWS without dedicated servers. Those applications use the power of services like AWS Lambda.
Therefore, we will use it to write our serverless application – OCR as a service.
Install the serverless framework
After serverless is installed, it’s time to create a new serverless project for our OCR as a service. We can use the serverless command to create a new project.
This command will ask whether you want to create a new project. Type Y and press ENTER to create it.
Then you need to select which template to use for your project. We will write our service in Python. Therefore, select – AWS Python.
After that, you need to define a name for the project.
Type tesseract-aws-lambda and press ENTER.
When you are asked whether you want to enable a free account, select n.
Do the same for tab completion.
So what have we done so far? We have the Serverless framework installed and a project created.
Our newly created project has 3 files, two of which matter here:
- handler.py – The file containing the function that will run on AWS Lambda.
- serverless.yml – Configuration for serverless.
All of them already contain some boilerplate code for easier first steps. Our code lives inside handler.py. It already contains a function called hello. We register functions to AWS Lambda inside a serverless.yml file.
Besides function definitions, serverless.yml contains other configuration for the deployment of our service. We can configure settings such as:
- AWS Lambda runtime,
- service name,
- timeout for AWS Lambda.
The serverless framework contains many useful commands. One of them is invoke local. We can use it to test the hello function.
It should produce output like this:
Using Tesseract on AWS Lambda
Now we know where to put our code, configuration, and how to test our function with the serverless command. So we can start adding tesseract.
Firstly, we should install pytesseract.
The pytesseract library is a Python wrapper around the Tesseract OCR engine (you can follow this guide to install Tesseract if you don’t have it already). After pytesseract is installed, we can check the OCR results.
Add an image called test.jpg to your project’s directory. You can get an example here.
After that, import pytesseract into your handler.py and use it inside the hello function.
To get the full-page text we need to use the image_to_string method. At this point, we should get a result for image test.jpg which should be located in the same directory as handler.py. Running invoke local serverless command should produce output with full-page text from OCR.
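As a rough sketch of this step, handler.py might look like the following. It assumes Pillow is installed alongside pytesseract, and the response shape (a JSON body with a text key) is an illustrative assumption rather than something the framework requires. The imports sit inside the function only so that the module can be loaded on a machine without Tesseract.

```python
import json


def hello(event, context):
    # Imported lazily so the module can be loaded without Tesseract installed.
    import pytesseract
    from PIL import Image

    # Full-page OCR on the sample image located next to handler.py.
    text = pytesseract.image_to_string(Image.open("test.jpg"))

    # Assumed response shape: a status code plus a JSON string body.
    return {"statusCode": 200, "body": json.dumps({"text": text})}
```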
It should produce output like this.
Although it works, it can’t be used as a service yet. Our function will always return the text from test.jpg. Also, the name of our function isn’t really descriptive, right? Let’s rename it to ocr first.
We should also update our serverless.yml configuration. We need to update the function name and handler property.
After the function is renamed, we should add the ability to upload any file. We want to expose our Lambda function through the REST API. We can use AWS Lambda + API Gateway integration. It’s quite easy to do it with our framework.
We need to add the events property to our function definition inside serverless.yml. The function will then be triggered by an HTTP POST request from API Gateway on the path /ocr.
Now that configuration is set up, we need to rewrite our function.
- We need to get an image from the API Gateway request.
- OCR should be done on the uploaded image.
- Finally, we return the result.
API Gateway passes request data to Lambda inside the event function parameter. By default, serverless uses Lambda-proxy integration to invoke the function from API Gateway. To test our function, we need an example event like the one API Gateway will send to it.
Let’s create a file called lambda_event.json inside our project’s directory. It will contain an example event from API Gateway. You can get a basic example here. We will use it in our test.
Write tests first
First, install pytest.
To write the test first, we need to define the POST body for uploading an image. To keep things simple, we will accept a JSON body with the key image and a base64-encoded string of the image as its value.
Consequently, we need to update the value of the body inside our lambda_event.json. Make sure to include a real base64 encoded string. You can download the example here.
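If you would rather generate the event than edit it by hand, a small helper along these lines could do it. This is only a sketch: the event is trimmed down to a handful of fields (a real Lambda-proxy event carries many more), and the function names and default file paths are assumptions made for illustration.

```python
import base64
import json


def build_event(image_bytes):
    """Return a trimmed-down API Gateway proxy event carrying the image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "httpMethod": "POST",
        "path": "/ocr",
        "headers": {"Content-Type": "application/json"},
        # With Lambda-proxy integration the body arrives as a JSON string.
        "body": json.dumps({"image": encoded}),
        "isBase64Encoded": False,
    }


def write_event(image_path="test.jpg", event_path="lambda_event.json"):
    """Encode the image at image_path and save the example event as JSON."""
    with open(image_path, "rb") as image_file:
        event = build_event(image_file.read())
    with open(event_path, "w") as event_file:
        json.dump(event, event_file, indent=2)
```

Calling write_event() from the project directory would overwrite lambda_event.json with an event whose body contains test.jpg.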
Now that we have an example, we can write a test for our function.
We should check that the status code is 200 and that text is not empty.
Run the test.
The test passes because the function returns the expected result. That’s great, right?
Let’s refactor the code, using the image from the event.
Firstly, we load an image from the request body.
It’s loaded to BytesIO so it can be opened with PIL.
Then we do OCR.
Our handler.py should now look like this.
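One possible shape for the file, as a sketch under the assumptions used so far (a JSON request body with an image key, a JSON response with a text key). The helper names load_image and image_to_text are illustrative, and the OCR imports are deferred so the decoding step can be exercised without Tesseract installed.

```python
import base64
import io
import json


def load_image(event):
    """Decode the base64 image from the request body into a file-like object."""
    request_body = json.loads(event["body"])
    image_bytes = base64.b64decode(request_body["image"])
    # BytesIO gives PIL a file-like object it can open.
    return io.BytesIO(image_bytes)


def image_to_text(image_file):
    # Imported lazily so load_image can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_file))


def ocr(event, context):
    text = image_to_text(load_image(event))
    return {"statusCode": 200, "body": json.dumps({"text": text})}
```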
Now we have a working function that will do OCR on the uploaded image. We can test it again with serverless.
So far so good.
Deploy Tesseract on AWS Lambda
We’ve done quite a bit so far. The function that will do OCR is written and tested. The project is configured to invoke our Lambda from the HTTP request sent to API Gateway. What could go wrong at this point?
As mentioned above, pytesseract is just a wrapper around Tesseract, so running pip install is not sufficient. AWS Lambda runs on Amazon Linux. Therefore, we need to build Tesseract for this platform and add it to our deployment package.
Fortunately, we can use Docker to build tesseract on the selected platform.
After we build tesseract, we can add it to the AWS Lambda layer using the serverless framework. Layers can be used to add additional dependencies to AWS Lambda functions.
Building tesseract in Docker
To build tesseract we will use the Amazon Linux Docker image. Firstly, we create a script for building.
The script downloads, builds, and installs Tesseract. Then it creates a ZIP archive with Tesseract, its libraries, and the trained data files.
The next step is to create a Docker image in which we can build Tesseract. We will take amazonlinux:2018.03.0.20200318.1 as our base image. We add the build dependencies and Leptonica. Lastly, we add the build script to the image.
Let’s build the Docker image.
Now create a new directory, called tesseract, inside the project.
Great, we are ready to build tesseract now.
Let’s run a container from the previously built image to start building Tesseract.
Now wait until finished …
Create a Lambda layer
Tesseract is finally built for AWS Lambda. Now we can create an AWS Lambda layer.
Create a new directory, called layer, inside the project.
Afterwards, extract tesseract.zip into the layer directory.
AWS Lambda layers are mounted to /opt/. Python packages located in /opt/python/lib/python<version>/site-packages/ are automatically added to the path inside AWS Lambda. Therefore, we create the python/lib/python3.7/site-packages/ structure inside our layer directory.
After that, we install pytesseract to this location.
We have everything that we need for our layer. Let’s edit serverless.yml. We need to add the layer definition and attach the layer to the function. Files and folders located inside the layer directory will be added to the AWS Lambda layer.
We also include handler.py in the deployment package of our function and exclude all other files to keep the deployment archive small.
Environment variables for running tesseract on AWS Lambda
There is one thing left to do in our project. Since we’ve added Tesseract to an AWS Lambda layer, the Tesseract executable, libraries, and trained data won’t be located at their default locations anymore. Consequently, we have to set environment variables when the code is executing on AWS Lambda. As mentioned before, layer content is located in /opt/. We need to update our handler.py:
AWS_EXECUTION_ENV is an environment variable set in the AWS Lambda environment. When this variable is set, the code is running on AWS Lambda. Therefore, we set environment variables for trained data, libraries, and tesseract command path.
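The check described above can be sketched as follows. The concrete locations under /opt/ (lib, tessdata, and a tesseract binary at the layer root) are assumptions that depend on how tesseract.zip is laid out, so adjust them to match your layer. The function names are illustrative.

```python
import os

LAYER_ROOT = "/opt"  # Lambda layers are mounted here


def tesseract_settings(environ, layer_root=LAYER_ROOT):
    """Return the overrides needed on AWS Lambda, or an empty dict elsewhere."""
    if environ.get("AWS_EXECUTION_ENV") is None:
        return {}
    return {
        # Assumed layout of the layer; change to match your tesseract.zip.
        "LD_LIBRARY_PATH": f"{layer_root}/lib",
        "TESSDATA_PREFIX": f"{layer_root}/tessdata",
        "TESSERACT_CMD": f"{layer_root}/tesseract",
    }


def apply_tesseract_settings(environ=os.environ):
    """Point the environment and pytesseract at the layer when on Lambda."""
    settings = tesseract_settings(environ)
    if not settings:
        return settings
    environ["LD_LIBRARY_PATH"] = settings["LD_LIBRARY_PATH"]
    environ["TESSDATA_PREFIX"] = settings["TESSDATA_PREFIX"]

    import pytesseract  # imported only when running on Lambda

    pytesseract.pytesseract.tesseract_cmd = settings["TESSERACT_CMD"]
    return settings
```

Calling apply_tesseract_settings() once at module level in handler.py would run the check at cold start; outside Lambda it does nothing.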
Let’s deploy Tesseract on AWS Lambda
We are almost done. All that’s left is the actual deployment and a test of our REST API for OCR. Deployment can be done using the serverless framework CLI.
Wait until it’s finished …
Great, now we can test our OCR as a service for real. The URL at which your service is available is printed by the serverless deploy command. It should look like this:
Go and install the requests package.
We add a file called use_ocr_as_a_service.py to test our OCR service for real.
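A minimal sketch of such a script is shown below. OCR_URL is a placeholder, not a real endpoint, and the helper names are illustrative. The script assumes the service accepts a JSON body with an image key and answers with a JSON body containing text, as set up earlier.

```python
import base64

# Placeholder: replace with the URL printed by the serverless deploy command.
OCR_URL = "https://example.invalid/ocr"


def build_payload(image_path):
    """Read the image and wrap it as the JSON body the service expects."""
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("ascii")
    return {"image": encoded}


def request_ocr(url, image_path):
    """POST the image to the OCR service and return the recognized text."""
    import requests  # third-party package installed in the previous step

    response = requests.post(url, json=build_payload(image_path), timeout=30)
    response.raise_for_status()
    return response.json()["text"]
```

With the real URL in place, print(request_ocr(OCR_URL, "test.jpg")) would send the sample image and print the recognized text.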
Run this script – don’t forget to set the URL of your OCR service. You should see output like this:
After you’ve finished playing around, don’t forget to remove this project.
We’ve come a long way in this article. As you can see, you can run many different things on AWS Lambda, although in cases such as Tesseract you have to build the libraries yourself. Now that you know how to run Tesseract on AWS Lambda, you can set up your own OCR service. When OCR alone is not enough and you need advanced data extraction, check out typless and save yourself time and hassle.