What is OCR?

There’s a simple answer to that question: Optical Character Recognition. But that simple answer belies the true complexity of what it means to OCR a document. Extract Systems builds enterprise OCR solutions; in fact, many of our customers refer to our software as “OCR.” But the reality is we don’t produce OCR software at all; we produce solutions that enable the results of OCR to provide real-world value to our customers. Let’s dive a little deeper to understand what’s going on under the hood.

First, what does “optical character recognition” even mean? Basically, it means turning a picture of text into text. Once textual data is in machine-readable form, it opens up possibilities for using the information that simply don’t exist when a human is required to read a document image. So, the first step in any OCR-based solution is getting that text.
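In practice, that first step can be as simple to invoke as a single library call. Here is a minimal sketch using the open-source Tesseract engine via the pytesseract wrapper; the engine and file name here are illustrative stand-ins, not what any particular product uses under the hood:

```python
# Minimal sketch: turn a picture of text into text.
# Tesseract (via pytesseract) is a stand-in for whatever engine
# a production solution would actually use.
from PIL import Image
import pytesseract

image = Image.open("scanned_document.png")  # hypothetical input file
text = pytesseract.image_to_string(image)
print(text)
```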

There are two general classes of solutions to this problem. The first, referred to as matrix matching, has been around for decades. This approach involves pre-loading the OCR engine with “pictures” of letters in a wide range of fonts. When processing a document, the engine compares each character from the provided document to each of those pre-loaded character images, one by one, measuring the differences between them pixel by pixel. The pre-loaded character with the smallest difference from the one in the image is chosen as the character “read” by OCR. Simple? Yes. But no.
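As a rough sketch of that core comparison (no particular engine’s implementation; it assumes each glyph has already been isolated and scaled to the same size as the templates):

```python
import numpy as np

def match_character(glyph, templates):
    """Pick the template character with the smallest pixel-by-pixel
    difference from the glyph. `glyph` is a 2-D binary array;
    `templates` maps characters to same-sized binary arrays."""
    best_char, best_score = None, float("inf")
    for char, template in templates.items():
        # Count the pixels where glyph and template disagree.
        score = np.sum(glyph != template)
        if score < best_score:
            best_char, best_score = char, score
    return best_char
```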

This method can be extremely accurate in ideal conditions. Unfortunately, most documents that have been scanned or faxed for business use do not fall in this category.

The first challenge for a matrix matching OCR engine is identifying the appropriate regions of a source document for comparison. That is, the engine must first be able to identify where each character is before it can compare them. It turns out that’s not always easy. Documents may have been scanned in at an angle. Documents may have artifacts (such as lines) from failing scanning hardware. Documents may have color, which becomes speckled “noise” when scanned into a black and white image. All of these things not only make the comparisons less likely to produce the correct match, but also make it harder to find the characters to compare in the first place. Therefore, before even attempting letter comparisons, there is a pre-processing step that applies filters to compensate for image quality issues, re-orients images that are at an angle, and, in general, improves the chances that each character will be found and correctly compared. After these corrections, the engine breaks the document image down into zones, finding those that appear to contain text, then further segments that text into words and then characters for comparison.
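A hedged sketch of what such a pass might look like, using the OpenCV library (real engines apply far more sophisticated filters, and the deskew step here is deliberately crude):

```python
import cv2
import numpy as np

def preprocess(path):
    """Simplified clean-up pass: grayscale, despeckle, binarize, deskew."""
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # A median filter knocks out the speckled "noise" from scanning.
    image = cv2.medianBlur(image, 3)
    # Otsu's method picks a black/white threshold automatically.
    _, binary = cv2.threshold(image, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Estimate the skew angle from the tightest rectangle around all
    # ink pixels; this normalization assumes skew under 45 degrees and
    # glosses over OpenCV's version-dependent angle conventions.
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:
        angle -= 90
    # Rotate the page back toward horizontal.
    h, w = binary.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, matrix, (w, h))
```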

You’ve probably guessed that while pre-processing helps, it does not solve all matrix matching recognition problems. But while pre-processing can reduce the number of errors that occur when comparing characters, mistakes that do get made can be corrected in post-processing as well. One such method is voting: rather than having a single engine compare and select the best match, several engines, each optimized for somewhat different circumstances, run and “vote” on the best matching character; this compensates for a mismatched character in any one engine. Another frequently used post-processing step is to compare the engine output to dictionaries. These dictionaries include both proper language and industry-specific lexicon and shorthand. If the recognized characters differ only slightly from a dictionary word, particularly when the mismatched characters didn’t have strong matches in the engine, those characters may be replaced with more probable options from the dictionary.
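Both ideas fit in a few lines, with a simple majority vote and the Python standard library’s fuzzy matcher standing in for production-grade voting logic and lexicons:

```python
from collections import Counter
import difflib

def vote(readings):
    """Majority vote per character position across several engines'
    readings of the same word (assumed here to be equal length)."""
    return "".join(Counter(chars).most_common(1)[0][0]
                   for chars in zip(*readings))

def dictionary_correct(word, lexicon):
    """Swap a recognized word for a close dictionary match, if any."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=0.8)
    return matches[0] if matches else word

# Three hypothetical engines disagree on two characters; voting and
# the lexicon settle it.
word = vote(["c0ntract", "contract", "contnact"])
print(dictionary_correct(word, ["contract", "witness", "court"]))  # contract
```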

After applying both pre- and post-processing, matrix-matching OCR output can be quite good across a wide range of printed text examples. Emphasis on printed. And even that wide range still leaves a lot of examples that are not handled well. Handwriting, text at odd angles to other text on the page, unusual fonts, and just plain poor-quality scans can all go unrecognized or be misrecognized by matrix matching OCR algorithms.

This is where the second class of OCR techniques comes into play. Actually, to call it a single class isn’t very fair, because it really encompasses a wide range of techniques. But the general premise among all of them is the same: to attempt to recognize text as we do, as inter-related patterns and shapes, rather than through strict comparisons to character images. This class of OCR may be referred to in a number of different ways, but ICR (Intelligent Character Recognition) is probably the most common. An ICR algorithm will attempt to recognize a horizontal line intersecting the top of a vertical line as the capital letter “T,” instead of comparing pixel by pixel to images of “T” in a whole bunch of different fonts.
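One classical, pre-neural-network way to capture “shapes rather than pixels” is zoning: divide each glyph into a coarse grid, measure the ink density in each zone, and compare those feature vectors instead of raw pixels. A minimal sketch, where the `prototypes` dictionary is a hypothetical stand-in for per-character feature vectors learned from examples:

```python
import numpy as np

def zoning_features(glyph, grid=4):
    """Summarize a binary glyph as ink densities over a grid x grid
    partition; rough stroke placement matters, exact pixels don't."""
    h, w = glyph.shape
    features = []
    for i in range(grid):
        for j in range(grid):
            zone = glyph[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            features.append(zone.mean())
    return np.array(features)

def classify(glyph, prototypes):
    """Nearest prototype by feature distance, not pixel distance."""
    feats = zoning_features(glyph)
    return min(prototypes, key=lambda c: np.linalg.norm(feats - prototypes[c]))
```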

This problem has been approached a lot of different ways, but with ever-increasing frequency it is being addressed with machine learning. How machine learning works is well beyond the scope of this blog, but the idea is that rather than attempting to explain (via code) to the machine how to recognize characters, the code instead defines a process for the machine to teach itself to recognize text.

There are many different machine learning algorithms, but one often employed for this problem (and for image recognition problems in general) is the Convolutional Neural Network (CNN). In fact, CNNs were developed to simulate the relationships observed among the neurons of the visual cortex in mammals. The key thing to know about a CNN is that it focuses on establishing relationships among pieces of the data it is given. These relationships allow the concept of a “T” to be developed from the relationship between the two lines, and then for that character to be tied to the concept of a word. Indeed, these neural networks will often be trained to recognize not individual characters, but entire words at a time. This bears a strong resemblance to how we as humans read: words, rather than characters, at a time. If you haven’t ever received the email about “rscheearch at Cmabrigde Uinervtisy”, you can read about it here, but the crux of it is that we don’t typically parse individual characters one at a time when we read, and there is probably good reason. It also demonstrates how we (and ML that has been trained to read whole words at a time) might be less prone to difficulty with the handwriting and image quality problems that affect matrix matching OCR.
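As a hedged illustration, and nothing like a production engine, here is what a toy CNN for classifying 32x32 character images might look like in PyTorch; real word- or line-level readers typically add recurrent or transformer layers on top of the convolutional features:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Toy CNN for 32x32 grayscale character images."""
    def __init__(self, num_classes=62):  # e.g., a-z, A-Z, 0-9
        super().__init__()
        self.features = nn.Sequential(
            # Convolutions learn local relationships among pixels
            # (strokes, corners) rather than matching whole templates.
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(start_dim=1))

# A batch of 8 fake character images, just to show the shapes work.
logits = CharCNN()(torch.randn(8, 1, 32, 32))
print(logits.shape)  # torch.Size([8, 62])
```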

I hope this gives you a better understanding of what is happening when a machine OCRs a document. Yet it barely scratches the surface when it comes to extracting value from an OCR solution. Getting text from an image is a critical first step, but before it can be put to use, the software must derive an understanding of the text as a whole. It must identify the type of document. It must identify vital entities in the document and their roles-- the name of a witness in a court document, for example. It must allow a customer to efficiently verify and add data not identified automatically. It must be able to associate that information with data in the customer’s system so that the proper records can be updated. Finally, it must integrate seamlessly into the customer’s existing workflows so as to add value, not complication.

Sound complicated? Not to worry-- at Extract Systems, we’ve got you covered. Enterprise OCR solutions are what we do. Give us a call to find out what we can do for you.


Written By:

Steve Kurth, Software Development Manager