So your company has decided to introduce OCR scan recognition into its work process. The goal might simply be to digitize all documents, or your solution might also include some kind of data classification, like Typless, which helps with your other work processes and gives you deeper insight into your documents.
OCR solutions are a great way to bring your business up to date with modern technology, and they have improved a lot in the last few years.
However, like most technology, OCR comes with tradeoffs. If you previously didn’t have to scan your documents, your work process has now changed. For the OCR solution to work efficiently, you also need to pay attention to how you scan the documents. If you are too hasty, you might not be satisfied with the results, which means less business value and more frustration with technology that is not working the way you expected.
That is why it is important to have basic knowledge of how the OCR system works and what kind of output you can expect from inputs of various qualities.
How does OCR scan recognition work
To the surprise of many, OCR technologies have been around for over 100 years. The first example of a machine converting printed text dates back to 1914. The Optophone, as it was called, converted printed text to tones, which, with some practice, allowed blind people to read.
OCR as we know it today dates back to the years between 1960 and 1970, when postal services started using it to read addresses with software that could recognize multiple fonts.
OCR technology today works the following way. First, it removes artefacts from the scanned document, e.g. dust and stray graphics. Then it aligns the text properly and converts all colours to binary form: each pixel becomes either black or white, depending on a threshold. What happens next depends on the OCR implementation. Some engines just compare the characters against different fonts in a database; others also take into account the curves of the characters, pixel density, etc. Modern OCR engines use deep learning, with multiple layers that extract the characteristic features of the characters. They are trained to recognize hard-to-read typefaces and also handwritten text, which is quite challenging since every person in the world has a different font :).
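To make the binarization step concrete, here is a minimal sketch in plain Python. It assumes a fixed global threshold and a grayscale image stored as a list of rows; the function name and the sample pixel values are ours, for illustration only. Real engines usually choose the threshold adaptively rather than hard-coding it.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (black) or 255 (white)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


# A tiny 3x4 "scan": dark strokes on a light, slightly noisy background.
page = [
    [250, 30, 245, 240],
    [235, 25, 20, 238],
    [120, 140, 242, 10],
]

for row in binarize(page):
    print(row)
```

Note the pixel with value 120 in the last row: it turns black at the default threshold of 128, but white at a threshold of 100. Faint or noisy scans contain many such borderline pixels, which is one reason scan quality affects recognition.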
After recognition, OCR engines also use dictionary helpers to filter out nonsense words caused by inaccurate scanning.
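A dictionary helper can be sketched with Python's standard `difflib` module. This is only an illustration of the idea, not how any particular engine does it: the vocabulary, the `correct_words` function and the 0.75 similarity cutoff are assumptions we picked for the example.

```python
import difflib


def correct_words(text, vocabulary, cutoff=0.75):
    """Replace words missing from the vocabulary with their closest known match."""
    corrected = []
    for word in text.split():
        if word.lower() in vocabulary:
            corrected.append(word)
            continue
        # Ask for the single best candidate that is at least `cutoff` similar.
        matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the vocabulary is close enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


vocabulary = ["invoice", "total", "amount", "date"]
print(correct_words("lnvoice tota1 amount", vocabulary))
```

Typical OCR confusions such as "l" for "i" or "1" for "l" are repaired here, while a word with no close match is left untouched instead of being forced into the dictionary.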
There are a lot of OCR engines available today, ranging from paid to free. The best known is Tesseract, which is open source and serves as the base for many OCR scan solutions built on top of it.
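If you want to try Tesseract yourself, a whole-page run can be scripted roughly like this. The sketch assumes the `tesseract` binary (version 4 or later) is installed and on your PATH; the helper names `tesseract_command` and `run_ocr` and the file name `scan.png` are hypothetical.

```python
import subprocess


def tesseract_command(image_path, dpi=300, lang="eng"):
    """Build the argument list for a whole-page OCR run that prints text to stdout."""
    return ["tesseract", image_path, "stdout", "--dpi", str(dpi), "-l", lang]


def run_ocr(image_path, dpi=300, lang="eng"):
    """Run Tesseract and return the recognized text (needs the tesseract binary)."""
    result = subprocess.run(
        tesseract_command(image_path, dpi, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


# Example usage, assuming a file named scan.png exists:
# text = run_ocr("scan.png", dpi=300)
```

Passing `--dpi` tells Tesseract the resolution of the input when the image file does not carry that information itself.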
User use case
Giving users guidelines about scanning will improve the OCR results along with their satisfaction with the solution. We provide a few examples of the same document scanned in different ways. We then use base Tesseract OCR, version 4.1.1, to run whole-page OCR and compare the results and accuracy at the character level. The data is presented in a table along with character error rates. We used a standard low-priced scanner and printer, the HP Photosmart C3180, and the calculation is a simple string-similarity comparison between the OCR outputs.
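For readers who want to reproduce this kind of measurement, here is one common way to compute a character error rate: the Levenshtein edit distance between the reference text and the OCR output, divided by the length of the reference. This is an assumption on our part for illustration; it is not necessarily the exact calculation behind the numbers in this article, and the sample strings are made up.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance relative to the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


reference = "Invoice number 2041"
ocr_output = "lnvoice number 2O41"  # two typical OCR confusions: I/l and 0/O
print(f"{character_error_rate(reference, ocr_output):.2%}")  # prints 10.53%
```

Two wrong characters out of 19 already give an error rate above 10%, which shows how quickly a few misread characters add up on short fields such as invoice numbers.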
Here are the results of the test:
|300 DPI text mode||0.02%|
The results don’t surprise us. Character-level accuracy starts dropping significantly below 150 DPI. In our case, there is no difference between our reference and the 150 DPI scan because our document was freshly printed. Depending on the state of the original document, there could be a big difference if the document were blurry, slightly damaged or worn out, like receipts from a back pocket.
Document skew may make a big difference as well. That also depends on the kind of preprocessing your OCR scan solution does. Our Typless API does quite a bit of image manipulation, deskewing the document and removing a lot of artefacts, so you do not have to worry about skew. Still, we recommend keeping your scans as straight as possible.
We can also see that there is still a difference when using phone scanning software, but the latest phones with great cameras can come close to the results provided by scanners.
Recommendations for optimal data recognition
Our test was fairly simple and should give a general overview of what produces better results and what users should prioritize when scanning their documents.
There are still some tradeoffs that users must be aware of. If we want good OCR results, we need to provide good-quality scans, preferably at 300 DPI. However, that also means the scans will be larger, so more data has to be stored when archiving. If storage is not a concern, users should definitely scan at 300 DPI. OCR will still perform well down to 150 DPI, which is the minimum considered acceptable.
You should also ask your OCR scan solution provider whether they do extra processing of the image. If they do, as is the case with Typless, they might prefer a raw colour scan, which gives better results than black-and-white scans that have already been processed at the scanner level.
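To get a feeling for the storage side of this tradeoff, here is a rough back-of-the-envelope calculation. It assumes an A4 page (about 8.27 x 11.69 inches) and an uncompressed 24-bit colour image; the function name is ours. Real files saved as JPEG or PDF are compressed and much smaller, but the ratio between resolutions is what matters.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Approximate raw image size: pixel count times bytes per pixel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69  # A4 page size in inches (approximate)

for dpi in (150, 300):
    size_mb = uncompressed_scan_bytes(A4_WIDTH_IN, A4_HEIGHT_IN, dpi) / 1_000_000
    print(f"{dpi} DPI: about {size_mb:.1f} MB uncompressed")
```

Doubling the resolution from 150 to 300 DPI doubles the pixel count in both directions, so the raw image is roughly four times as large: about 26 MB instead of about 6.5 MB per page before compression.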
My general recommendations for setting up your scanner are:
- 300 DPI
- 24-bit colour scan
- No image processing, provide the raw image
- Make sure your scanner cover is closed when scanning
- Scan the document straight; skewed scans will give worse results
- Make sure that the data is visible and the document is not stained or damaged too much
Today we took a look at best practices for scanning documents to prepare them for OCR. OCR is evolving year after year, which means the results will keep improving as well. But we as OCR service providers must make sure that users are aware of the limitations of the software. Providing bad input won’t magically produce correct output :). Making sure that users are well educated will provide a better experience for both the user and the service provider.