In this post, we will guide you through your first steps with typless. We will cover the following topics:
- Extracting data from documents with the typless invoice OCR REST API
- Training typless with correct invoice data, using our REST API, so that it extracts more precisely
Summary
Get API Key
So let’s get started.
First, log in to the developers hub. After logging in you will land in our showroom, but let’s skip it for now.
Next, click the Settings tab in the side navigation. You will need your API key to follow this tutorial, so keep the tab open.
Download example invoice
You have it? Nice.
Extract data from invoice
Now you have everything you need to start extracting data. Open your terminal or cmd.exe in the directory that contains the invoice you downloaded in the previous step.
curl -H "Authorization: Token YOUR-API-KEY" -F document_type_name=simple-invoice -F customer=myself -F file=@invoice.pdf https://developers.typless.com/api/document-types/extract-data/
Copy the command above, replace YOUR-API-KEY with your API key, and run it. Within a couple of seconds you will receive a response like this:
{
"file_name":"invoice.pdf",
"object_id":"726e3fb6-1645-43c9-9670-d144e9fbc0e3",
"extracted_fields":[
{
"name":"invoice_number",
"values":[
{"x":1971,"y":571,"width":324,"height":45,"value":"20190700006606","confidence_score":"0.423","page_number":1},
{"x":1971,"y":571,"width":324,"height":45,"value":"20190700006606","confidence_score":"0.415","page_number":1},
{"x":1971,"y":571,"width":324,"height":45,"value":"20190700006606","confidence_score":"0.336","page_number":1}
],
"data_type":"STRING"
},
{
"name":"supplier",
"values":[
{"x":-1,"y":-1,"width":-1,"height":-1,"value":"ScaleGrid","confidence_score":"0.951","page_number":-1}
],
"data_type":"AUTHOR"
},
{
"name":"issue_date",
"values":[
{"x":-1,"y":-1,"width":-1,"height":-1,"value":null,"confidence_score":"0.000","page_number":-1}
],
"data_type":"DATE"
},
{
"name":"total_amount",
"values":[
{"x":2145,"y":1205,"width":127,"height":45,"value":"15.84","confidence_score":"0.663","page_number":1},
{"x":2166,"y":821,"width":127,"height":45,"value":"15.84","confidence_score":"0.640","page_number":1},
{"x":1927,"y":1315,"width":340,"height":66,"value":"15.84","confidence_score":"0.438","page_number":1}
],
"data_type":"NUMBER"
}
],
"customer":"myself"
}
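If you would rather call the endpoint from a script than from curl, the request can be sketched in Python with only the standard library. The URL, the Authorization header and the form field names are taken from the curl command above. The helper names `build_multipart` and `extract_data` are our own for illustration, and the sketch has not been checked against every server-side requirement, so treat it as a starting point.

```python
import json
import os
import urllib.request
import uuid

EXTRACT_URL = "https://developers.typless.com/api/document-types/extract-data/"


def build_multipart(fields, file_field, file_name, file_bytes):
    """Encode text fields and one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract_data(api_key, pdf_path, document_type_name="simple-invoice",
                 customer="myself"):
    """Send the same form fields as the curl command; return the parsed JSON."""
    with open(pdf_path, "rb") as f:
        file_bytes = f.read()
    body, content_type = build_multipart(
        {"document_type_name": document_type_name, "customer": customer},
        "file",
        os.path.basename(pdf_path),
        file_bytes,
    )
    request = urllib.request.Request(
        EXTRACT_URL,
        data=body,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `extract_data("YOUR-API-KEY", "invoice.pdf")` from the directory with the downloaded invoice should then return the response as a Python dictionary.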
What does this response mean? Let’s take a look.
Response from data extraction
The response from the data extraction endpoint contains the following properties:
- file_name – Name of your uploaded file.
- object_id – ID of the uploaded file; you can pass it to the learn endpoint to train typless invoice OCR faster.
- customer – Identifier of the customer that requested the extraction (used to split billing); in this case, yourself.
- extracted_fields – List of fields that were extracted from the document.
Extracted fields
The fields that are extracted from a document are defined in its DocumentType. The document type is selected by name, here with -F document_type_name=simple-invoice in the curl request. The simple-invoice document type is created by default. Each element in the list has the following properties:
- name – the name of the field as defined in DocumentType
- data_type – the type of the field's data (see the API reference), one of:
  - AUTHOR – Your string identifier of the entity that issued the document.
  - NUMBER – Integer or float values such as amounts, given as a string with . as the decimal separator. For example, 12345.67.
  - DATE – Date string in YYYY-MM-DD format.
  - STRING – Generic string data type without constraints. For example, invoice number (example value: A21-123-321).
- values – List of value blocks with field value and coordinates.
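Because every value arrives as a string (or null), you will usually want to convert it according to its data_type before using it. The following is a minimal sketch of one way to do that in Python. The function name `convert_value` is our own, and choosing `Decimal` for amounts is an assumption made to avoid floating-point rounding.

```python
from datetime import date
from decimal import Decimal


def convert_value(data_type, value):
    """Convert a raw extracted value to a Python type based on data_type."""
    if value is None:
        # The field was not found in the document.
        return None
    if data_type == "NUMBER":
        # Amounts use . as the decimal separator, e.g. "12345.67".
        return Decimal(value)
    if data_type == "DATE":
        # Dates are formatted as YYYY-MM-DD.
        return date.fromisoformat(value)
    # STRING and AUTHOR values are kept as plain strings.
    return value


print(convert_value("NUMBER", "15.84"))
print(convert_value("DATE", "2019-08-13"))
print(convert_value("AUTHOR", "ScaleGrid"))
```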
Values
There are up to 5 best fits, ordered by confidence score. Each value has the following properties:
- x – The x coordinate of the field.
- y – The y coordinate of the field.
- width – The width of the field.
- height – The height of the field.
- value – The value of the field, null if not known.
- page_number – Number of the page on which the field was detected.
- confidence_score – The algorithm's confidence in the value, between 0 and 1.
Coordinates are set to -1 when the location is not known.
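To show how these properties fit together, here is a small Python sketch that picks the most confident candidate for each field. The function name `best_values` and the shape of its result are our own choices, and the sample is a shortened copy of the response shown earlier. Note that confidence_score is a string in the response, so it is converted before comparing.

```python
def best_values(response):
    """Map each field name to its highest-confidence candidate, or None."""
    result = {}
    for field in response["extracted_fields"]:
        # Skip placeholder entries whose value is null.
        candidates = [v for v in field["values"] if v["value"] is not None]
        if not candidates:
            result[field["name"]] = None
            continue
        # confidence_score is a string such as "0.663", so convert it first.
        best = max(candidates, key=lambda v: float(v["confidence_score"]))
        result[field["name"]] = {
            "value": best["value"],
            "confidence": float(best["confidence_score"]),
            # Coordinates of -1 mean the location on the page is not known.
            "located": best["x"] != -1,
        }
    return result


sample = {
    "extracted_fields": [
        {"name": "supplier", "data_type": "AUTHOR", "values": [
            {"x": -1, "y": -1, "width": -1, "height": -1, "value": "ScaleGrid",
             "confidence_score": "0.951", "page_number": -1}]},
        {"name": "issue_date", "data_type": "DATE", "values": [
            {"x": -1, "y": -1, "width": -1, "height": -1, "value": None,
             "confidence_score": "0.000", "page_number": -1}]},
        {"name": "total_amount", "data_type": "NUMBER", "values": [
            {"x": 2145, "y": 1205, "width": 127, "height": 45, "value": "15.84",
             "confidence_score": "0.663", "page_number": 1},
            {"x": 2166, "y": 821, "width": 127, "height": 45, "value": "15.84",
             "confidence_score": "0.640", "page_number": 1}]},
    ]
}

for name, best in best_values(sample).items():
    print(name, best)
```

A field that maps to None, like issue_date here, is a good candidate for correcting through the learn endpoint.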
See the API reference if you would like to find out more. That covers extraction. Next, we can start training typless.
Train typless invoice OCR with data values
What should you do when data is not extracted correctly? You can train typless invoice OCR to fit your needs. First you extract data with typless invoice OCR, then you send back the correct values. In this way you can continuously improve the precision of your data extraction.
curl -X POST -H "Authorization: Token YOUR-API-KEY" -F document_type_name=simple-invoice -F customer=myself -F learning_fields[0]='{"name": "supplier", "value": "ScaleGrid"}' -F learning_fields[1]='{"name": "invoice_number", "value": "20190700006606"}' -F learning_fields[2]='{"name": "issue_date", "value": "2019-08-13"}' -F learning_fields[3]='{"name": "total_amount", "value": "15.84"}' -F document_object_id=OBJECT-ID-FROM-PREVIOUS-STEP https://developers.typless.com/api/document-types/learn/
Training typless invoice OCR takes a single call to the REST API. Copy the command above, replace YOUR-API-KEY and OBJECT-ID-FROM-PREVIOUS-STEP with your own values, and run it. In a couple of seconds, you should receive a response like this:
{"details":"Learned successfully!"}
As with extraction, the DocumentType must be selected by name. You must also provide the learning_fields array. Its elements must be JSON-stringified objects, each containing a name (as defined in the document type) and a value. Make sure that the values follow these rules:
- DATE – string in format YYYY-MM-DD
- STRING – non-empty string
- NUMBER – decimal string with . as decimal separator, without any other separators e.g., 12345.67
- AUTHOR – non-empty string
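The rules above can be checked on your side before sending the request. The following Python sketch validates each value and builds the learning_fields[i] form entries as JSON-stringified objects, mirroring the curl command. The names `validate_value` and `build_learning_fields` are hypothetical helpers of ours. Accepting only unsigned numbers is an assumption, since the rules do not say how signs are handled.

```python
import json
import re
from datetime import datetime

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
# Assumption: digits with an optional fractional part, no sign, no separators.
NUMBER_RE = re.compile(r"\d+(\.\d+)?")


def validate_value(data_type, value):
    """Raise ValueError if value breaks the rule for its data type."""
    if not isinstance(value, str) or not value:
        raise ValueError(f"{data_type} value must be a non-empty string")
    if data_type == "DATE":
        if not DATE_RE.fullmatch(value):
            raise ValueError(f"DATE must be YYYY-MM-DD, got {value!r}")
        # Also rejects impossible dates such as 2019-02-30.
        datetime.strptime(value, "%Y-%m-%d")
    elif data_type == "NUMBER" and not NUMBER_RE.fullmatch(value):
        raise ValueError(f"NUMBER must look like 12345.67, got {value!r}")
    return value


def build_learning_fields(fields):
    """Turn (name, data_type, value) tuples into learning_fields[i] entries."""
    form = {}
    for index, (name, data_type, value) in enumerate(fields):
        validate_value(data_type, value)
        form[f"learning_fields[{index}]"] = json.dumps(
            {"name": name, "value": value}
        )
    return form


form = build_learning_fields([
    ("supplier", "AUTHOR", "ScaleGrid"),
    ("invoice_number", "STRING", "20190700006606"),
    ("issue_date", "DATE", "2019-08-13"),
    ("total_amount", "NUMBER", "15.84"),
])
for key, value in form.items():
    print(key, value)
```

The resulting dictionary holds the same learning_fields entries as the curl command; document_type_name, customer and document_object_id still need to be added to the form before posting it.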
Congratulations, you’ve extracted data from an invoice and trained typless invoice OCR.
Happy OCRing!