What if the OCR misses a field or value?

Software relying on OCR

You have a software package that relies on optical character recognition (OCR) to classify, pick up words, numbers or phrases from a document.  As long as the quality of the document is mostly clean, everything works well.  However, what happens when the document arrives and the quality is simply, not good?  Does the software give up and run away with its tail between its legs? Are there any options to classify or capture anything on these documents?  

Fixing Poor Quality Data

Fortunately, there are techniques that Extract uses to combat poor quality.  Properly classifying the document can be a big help to determine the next steps.  If you are able to determine the correct document type, this can enable the ability to know whether the potential exists for sensitive items in a redaction project.  Knowing the document type can also allow you to focus on a particular region of a document and perform special OCR techniques.  These special techniques can help mitigate problems caused by shaded or speckled regions and underexposed or overexposed text. 

Machine Learning and Artificial Intelligence

Another special technique that can aid in finding items on documents is the use of machine learning and artificial intelligence (AI).  You can train a machine to learn which particular document types may or may not contain sensitive data.  The use of machine learning and AI has become a critical method of data identification and capture.

What about simple OCR mistakes? 

Document handling

Take 1ike…or is it like.  0r that one…which should be Or. 

Depending on the font used and the document quality, there are many simple OCR mistakes that happen and knowing the context of what you are capturing can also enable you to auto-correct these mishaps very easily.  If you need to correct larger phrases or more sophisticated problems, the use of a synonym table can greatly enhance capture rates.  This methodology works over time as new instances of words/phrases are automatically captured, but then corrected by a user.  The key here is tracking those corrections so that the next time, the software knows how to apply the synonym in place of the incorrectly OCR’d data.


As another option for redaction projects, our platform can look at the overall character confidence of a document (or page).  Based upon a pre-determined threshold, we can flag documents that fall below this threshold for OCR enhancement and/or manual review.  This is generally a last resort as it could potentially cause a large number of documents to be selected for review.  Ideally, there is some combination of using the character confidence with the document type and have the software flag only those that fall below the threshold. 

The Perfect Solution for Document Capture

OCR alone will not solve your indexing and redaction needs. You need to add an intelligent layer on top of the OCR that takes the OCR data as a variable. One that is capable of reading that data like a human being would and acting intelligently with that information. Extract has been doing this since the 90’s. We’ve processed billions of documents and have innovated our platform to be on the cutting edge of document capture and redaction.

About the Author: Mike Beles

Mike Beles, Customer Support Specialist & Project Manager at Extract, has been involved with implementing and leading software projects for over 18 years. He has a well-versed background in software training, support, quality assurance as well as both product and project management.