Optophone machine from the 1920s
To the surprise of many, OCR technology has been around for more than 100 years. The first machine that converted printed text dates back to 1914. The Optophone, as it was called, converted printed text into tones, which, with some practice, allowed blind people to read.
OCR as we know it today took shape between 1960 and 1970, when postal services started using it to read addresses with software that could recognize several different fonts.
OCR technology today works the following way: first, all the artefacts of the scanned document are removed – e.g. dust specks and other stray graphics. The text is then aligned properly and all colours are converted to binary form – each pixel becomes either black or white depending on a threshold. What happens next depends on the OCR implementation. Some engines simply compare the characters against different fonts in a database; others also take into account the curves of the characters, pixel density, and so on. Modern OCR engines use deep learning, stacking multiple layers to extract the characteristic features of each character. They are trained to recognize hard-to-read typefaces and even handwritten text – quite a challenge, since every person in the world writes in their own font :).
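The binarization step described above can be sketched in a few lines. This is a minimal illustration, not a real OCR preprocessor: the "scan" is just a list of rows of 0–255 grayscale values, and the threshold of 128 is an arbitrary choice for the example.

```python
def binarize(image, threshold=128):
    """Map each grayscale pixel to 0 (black) or 255 (white).

    Pixels darker than the threshold are treated as ink (black);
    everything else becomes background (white).
    """
    return [[0 if pixel < threshold else 255 for pixel in row]
            for row in image]

# A tiny made-up 2x3 "scan": dark ink strokes on a light background.
scan = [
    [ 30, 200,  40],
    [220,  25, 210],
]
print(binarize(scan))  # → [[0, 255, 0], [255, 0, 255]]
```

Real engines pick the threshold adaptively (e.g. per region of the page) rather than using a single global value, since lighting and paper colour vary across a scan.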
After recognition, they also run the output through dictionaries to filter out nonsense words produced by inaccurate scanning.
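A toy sketch of that dictionary step, using Python's standard-library `difflib` to snap a misread word to its closest dictionary entry. The word list, the cutoff value, and the misread inputs here are all made up for illustration:

```python
import difflib

# A tiny illustrative dictionary; a real engine would use a full word list.
DICTIONARY = ["recognition", "optical", "character", "printed"]

def correct(word):
    """Return the closest dictionary word, or the word unchanged if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), DICTIONARY, n=1, cutoff=0.7)
    return matches[0] if matches else word

print(correct("rec0gnition"))  # a typical OCR confusion: 'o' misread as '0'
print(correct("zzzzz"))        # nothing close: left as-is
```

This only fixes words that are *almost* right; production engines combine similar fuzzy matching with per-character confidence scores from the recognizer.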
There are many OCR engines out there today, ranging from paid to free. The best known is Tesseract, an open-source engine that serves as the base for many OCR solutions built on top of it.