So, your company has decided to introduce an Intelligent Document Processing (IDP) service into its work process.
Whether the goal is to digitize paper documents or to classify and extract data with a service like typless, you are about to see significant improvements. The upgrade not only streamlines your workflow but also gives you deeper insight into your documents.
However, like most technology solutions, IDP also comes with some tradeoffs.
For instance, transitioning to IDP may require adjustments in your workflow, such as incorporating document scanning processes that were previously unnecessary.
To ensure the optimal efficiency of your IDP solution, it’s crucial to consider how you scan documents. Rushing through this process may lead to unsatisfactory outcomes, undermining the business value and leaving you frustrated.
Therefore, understanding the fundamentals of IDP systems and expected outcomes from inputs of varying qualities is essential. But don’t worry, we’ll explain it as simply and clearly as possible.
A little bit of history…
Surprisingly, OCR technologies have been around for over 100 years. The first example of a machine converting printed text dates back to 1914. The Optophone, as it was called, converted printed text to tones, which allowed blind people to read after some practice.
The journey to modern OCR began between 1960 and 1970, when postal services started reading addresses with software capable of recognizing multiple fonts.
...And a little bit about IDP technologies today
Intelligent Document Processing technology today works as follows:
First, the OCR engine cleans the scanned image, removing artifacts such as dust specks and stray graphical elements. Next, it straightens the text and binarizes the image, turning every pixel either black or white depending on a threshold, and then recognizes the characters.
Finally, IDP looks for context in the recognized text and classifies the data required for the output.
This streamlined process ensures efficient document processing and helps businesses optimize their workflow for greater productivity.
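The binarization step described above can be illustrated with a minimal sketch. It assumes a simple global threshold applied to a tiny grayscale grid; real OCR engines typically use more adaptive methods, and the `binarize` function and the sample `page` are purely illustrative.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (black) or 255 (white).

    Assumption: one global threshold for the whole page. Pixels at or
    above the threshold become white, everything else becomes black.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


# A hypothetical 3x3 "scan": bright paper with a few dark ink pixels.
page = [
    [250, 240, 30],
    [245, 20, 235],
    [10, 250, 128],
]

for row in binarize(page):
    print(row)
```

A faint or low-contrast scan pushes ink and paper values closer together, which is one reason why a single threshold can fail on poor input.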
So, to Clarify: What Is the Difference Between OCR and IDP?
Every Intelligent Document Processing solution utilizes an OCR (Optical Character Recognition) engine as the foundation for text recognition. This software tool converts various document types—such as scanned paper documents, PDF files, or images containing text—into editable and searchable data. Essentially, it scans the document, recognizes characters, and converts them into machine-readable text. The specifics of OCR engine implementation vary; some merely compare characters to different fonts in a database, while others consider factors such as character curves and pixel density.
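To make the simplest approach mentioned above more concrete, here is a toy sketch of comparing a character against stored font templates. The 3x3 bitmaps, the `TEMPLATES` table, and the scoring function are assumptions made for illustration and do not reflect how any particular OCR engine is implemented.

```python
# Hypothetical 3x3 templates: "1" is an ink pixel, "0" is background.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def match_score(glyph, template):
    """Fraction of pixels that agree between two equally sized bitmaps."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def recognize(glyph):
    """Return the template character with the highest pixel agreement."""
    return max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))


# An "L" with one missing pixel, as might happen in a low-quality scan.
noisy_l = ("100", "100", "110")
print(recognize(noisy_l))
```

The more pixels a poor scan corrupts, the closer the scores of different templates become, and the more likely a misread is.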
Intelligent Document Processing (IDP), on the other hand, goes beyond basic OCR capabilities. IDP solutions, such as typless, combine OCR technology with machine learning (ML) and natural language processing (NLP) to automate document-centric processes. They can extract not only text but also key data fields, understand context, and make decisions based on the extracted information. IDP solutions handle various document types, grasp document structures, and perform tasks like classification, data extraction, validation, and integration with other systems.
In summary, while OCR focuses on converting text from images or scanned documents into an editable format, IDP represents a more advanced system that automates document processing tasks. Leveraging OCR alongside AI technologies enables deeper understanding and intelligent decision-making.
Types of OCR Engines
Today, a wide range of OCR engines is available, with Tesseract being the best known globally. Other notable OCR engines include Adobe Acrobat OCR, Microsoft OCR, Nuance OmniPage, and Readiris. If you're interested in setting up Tesseract on your system, this article explains how to do that on AWS.
User use case
Providing scanning guidelines improves OCR results and user satisfaction. To show why, we scanned the same document in several different ways, ran base Tesseract OCR (version 5.x) on the whole page, and compared the accuracy of the results at the character level. The data is presented in a table of character error rates. We used a standard low-priced scanner and printer, the HP Photosmart C3180, and calculated the error rate with a simple string-similarity comparison between each OCR output and the reference text.
These were our inputs:
The document was scanned at 150 DPI, 75 DPI, and 300 DPI in text mode. In addition, it was captured with a phone in three ways: directly with the phone camera and with the help of two apps, CamScanner and Adobe Scan.
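For readers who want to run a similar comparison, a character error rate can be sketched as follows. This uses the common definition (Levenshtein edit distance divided by the length of the reference text); it is an assumption that this matches the string-similarity comparison used in our test, and the sample strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep a match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical example: "I" misread as "l" and "0" misread as "O".
reference = "Invoice 2024"
ocr_output = "lnvoice 2O24"
print(f"{character_error_rate(reference, ocr_output):.1%}")
```

Note that this rate can exceed 100% when the OCR output contains many spurious characters.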
Results of the Test
Here are the results of the test:
Scan Type | Character error rate |
150 DPI | 0.02% |
75 DPI | 35% |
300 DPI text mode | 0.02% |
CamScanner | 12% |
Adobe Scan | 10% |
Phone Camera | 16% |
The results are not surprising. They show a sharp drop in character-level accuracy below 150 DPI, which underlines the importance of high-quality scans for good OCR outcomes. In our case, there is practically no difference between our reference and the 150 DPI scan, because the document was freshly printed. With documents that are blurry, slightly damaged, or worn (receipts carried in a back pocket, for example), the difference could be significant.
Mobile Scanning Apps
Today, many applications on the market turn a photograph into a scan or a PDF file. For this analysis, we used two of the best known, Adobe Scan and CamScanner. In our test, Adobe Scan produced slightly better scans than CamScanner (10% versus 12% character error rate), and both apps gave better recognition than a plain phone photo (16%). Of course, much also depends on the phone used to capture the documents.
While documents scanned with a dedicated scanner still offer superior quality, modern phone cameras can come close to replicating those results.
Phone Camera
When photographing documents for typless with a phone camera, take care to:
- Photograph the document as straight as possible, minimizing any skew.
- Make sure the document fills the frame as much as possible, with minimal background or surface visible in the image.
Below, you can find examples of well-photographed and poorly photographed documents taken with a phone camera.
Addressing the Skew
We conducted another analysis focused specifically on skewed documents. Because Tesseract struggles with skewed input, we developed our own algorithm in typless that straightens and enhances documents, which significantly improves recognition accuracy.
The typless API performs a lot of image manipulation: in addition to deskewing the document, it removes various artifacts and enhances the image. However, we still recommend scanning as straight as possible.
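To give a feel for what deskewing involves, here is a generic sketch that estimates a skew angle from points along a text baseline and rotates them back. It is not the typless algorithm; the least-squares fit, the function names, and the sample points are all assumptions chosen for illustration.

```python
import math


def estimate_skew_degrees(points):
    """Angle of the least-squares line through (x, y) baseline points."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(math.atan2(sxy, sxx))


def rotate(point, degrees):
    """Rotate a point around the origin by the given angle."""
    a = math.radians(degrees)
    x, y = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


# Hypothetical baseline of one text line, skewed by 5 degrees.
slope = math.tan(math.radians(5))
baseline = [(x, x * slope) for x in (0, 100, 200, 300)]

angle = estimate_skew_degrees(baseline)
straightened = [rotate(p, -angle) for p in baseline]
print(f"estimated skew: {angle:.2f} degrees")
```

On a real scan, the hard part is finding reliable baseline points in the first place, which is why large skew angles are so damaging to plain OCR.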
If you’re dealing with many received documents that aren’t scanned straight, our Intelligent Document Processing solution will significantly help speed up the process for you. You won’t have to worry about manually entering or rescanning skewed documents. It’s all seamlessly integrated within typless, making your life a whole lot easier.
The comparison below shows recognition results with Tesseract alone and with the typless API, and demonstrates the effectiveness of our solution, especially for skewed documents.
Skew angle | Character error rate without typless API | Character error rate with typless API |
2°–15° | 4% | 0.5% |
> 15° | 98% | 0.8% |
Recommendations for Optimal Data Recognition
Our test was constructed fairly simply and should provide a general overview of what yields better results and what users should prioritize when scanning their documents.
However, there are still some tradeoffs to be aware of. To achieve good OCR results, it is essential to provide high-quality scans, preferably at 300 DPI.
This means larger scan files and therefore more data to store when archiving, but if storage is not a concern, 300 DPI is well worth it. OCR still performs well down to 150 DPI, which we consider the minimum acceptable quality.
Additionally, we advise you to ask your OCR solution provider whether it applies any extra image processing. With typless, for instance, we prefer raw colour scans, which can yield better results than black-and-white scans that have already been processed at the scanner level.
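The storage tradeoff mentioned above can be estimated with some simple arithmetic. The sketch below assumes an A4 page (210 x 297 mm) and computes the uncompressed image size; actual files are usually much smaller because scanners save compressed formats such as JPEG or PDF, so treat the numbers as an upper bound.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # assumed page size


def uncompressed_scan_bytes(dpi, bits_per_pixel=24, page_mm=A4_MM):
    """Raw image size in bytes for one page at the given resolution."""
    width_px = round(page_mm[0] / MM_PER_INCH * dpi)
    height_px = round(page_mm[1] / MM_PER_INCH * dpi)
    return width_px * height_px * bits_per_pixel // 8


for dpi in (75, 150, 300):
    size_mb = uncompressed_scan_bytes(dpi) / 1_000_000
    print(f"{dpi} DPI, 24-bit colour: about {size_mb:.1f} MB uncompressed")
```

Doubling the DPI roughly quadruples the raw size, because both the width and the height in pixels double.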
Scanner and Scanning Recommendation
Our general recommendation for setting up your scanner is as follows:
- 300 DPI
- 24-bit colour scan
- No image processing; provide the raw image.
- Make sure your scanner cover is closed when scanning.
- Place the document straight on the scanner, without skew.
- Make sure that the data is visible and the document is not stained or damaged too much.
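If you scan in bulk, the measurable items from the checklist above can be verified automatically before documents are submitted. The following is a minimal sketch; the function name, its parameters, and the warning texts are hypothetical and are not part of the typless API.

```python
def check_scan_settings(dpi, bit_depth, colour, scanner_processing):
    """Return a list of warnings for settings that deviate from the checklist."""
    warnings = []
    if dpi < 150:
        warnings.append("DPI is below 150, the minimum acceptable quality")
    elif dpi < 300:
        warnings.append("DPI is acceptable, but 300 DPI is recommended")
    if not colour or bit_depth < 24:
        warnings.append("use a 24-bit colour scan")
    if scanner_processing:
        warnings.append("disable scanner image processing and send the raw image")
    return warnings


# Hypothetical settings: a fast black-and-white "text mode" preset.
for warning in check_scan_settings(dpi=75, bit_depth=1, colour=False,
                                   scanner_processing=True):
    print("-", warning)
```

Items such as a closed scanner cover or a stained document cannot be checked from settings alone and still need a quick visual inspection.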
Use Cases
If you're not sure which kinds of documents you can use typless for, check out our Use Cases (link). There you can see some examples of successful typless integrations.
If you have a different use case, don't worry: typless is a highly adaptable service that already recognizes, and can be trained to recognize, various types of documents in different languages and scripts. Contact us (link to the contact form), and together we'll find the best solution for you.
Roundup
In this article, we've covered the best ways to scan documents and get them ready for an Intelligent Document Processing service. IDP keeps getting better every year, so results will improve over time too.
But as IDP service providers, we need to make sure users know about the software's limits. If you put in poor-quality documents, you sadly won't get good results.
See you in the next blog!
The typless team