5 Examples of how document type and quality can affect accuracy

Document formatting and quality can have a dramatic effect on OCR and rules accuracy where data capture is concerned. The five examples below are meant to show a potential or current data capture user what can cause accuracy to rise or fall. Although it is sometimes hard or impossible to correct the issues that cause accuracy to fall, there are generally steps that can be taken to help prevent them.

Example 1: Vertical Scan Lines

The image below is a section of a scanned page that has a set of vertical lines running all the way down the left side. One of the data elements can be used to demonstrate the effect these vertical lines have on the whole image. Below is the raw OCR data captured where NEUTROPHILS is present in the image.

OCR output: NEUTROP I /111

kevphoto1.png

This problem is generally attributed to a poor scanner and can be remedied by either fixing or replacing the device.
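One way to catch a failing scanner early is to check incoming images for vertical lines before they reach the OCR engine. The following is only a minimal sketch of that idea, not a feature of any particular capture product. It assumes the page has already been binarized into rows of 0 (light) and 1 (dark) pixels; the function name `find_vertical_lines` and the `min_fill` threshold are hypothetical and chosen for illustration.

```python
def find_vertical_lines(image, min_fill=0.9):
    """Return indices of columns whose dark-pixel fraction is at least min_fill.

    `image` is a list of equal-length rows, each a list of 0 (light) or 1 (dark).
    """
    if not image:
        return []
    height = len(image)
    width = len(image[0])
    suspects = []
    for x in range(width):
        dark = sum(row[x] for row in image)  # count dark pixels in column x
        if dark / height >= min_fill:
            suspects.append(x)
    return suspects


# Tiny made-up page: column 1 is dark on every row (a scan line),
# while the other columns hold scattered "text" pixels.
page = [
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0],
]
print(find_vertical_lines(page))  # [1]
```

Real text rarely fills an entire column from top to bottom, so a column that is dark over almost the full page height is a reasonable hint that the scanner needs attention.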

Example 2: Shaded Lines

Shaded lines are great for creating contrast between rows in a spreadsheet. They make reading the data incredibly easy for a human being. Unfortunately, as easy as they make it for a human, they make it just as difficult for a computer. Take the example below: each alternating row is easy to tell apart from the next, and you could easily follow one row across the sheet.

kevinphoto2.png

When zooming in on the line containing UREA NITROGEN it becomes obvious why an OCR engine might have a problem determining what letters are contained within. All of the little lines used to create the grey effect overlap with and get in the way of the letters themselves.

kevinphoto3.png

Digging into the OCR output from that line and the line below highlights how this can affect capture accuracy. Creatinine is easily recognized, but there is no way to determine that the value above it is urea nitrogen. See below for the OCR output of those two lines.

OCR output:

°I kg%#.0000K05*

CREATININE

This particular capture issue can be very difficult to work around, but increasing the resolution of incoming documents or of your scanner can go a long way toward making this text easier for an OCR engine to read.
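To see why the garbled line defeats a capture rule while CREATININE does not, consider a rule that looks for known field labels in the OCR text. The sketch below is an illustration only: it assumes a hypothetical rule that fuzzy-matches each OCR line against a list of expected labels using `difflib` from the Python standard library. The label list, the `best_label` helper, and the 0.8 threshold are made up for this example, and real rules engines may work quite differently.

```python
from difflib import SequenceMatcher

# Hypothetical list of labels a capture rule expects to find on the page.
EXPECTED_LABELS = ["UREA NITROGEN", "CREATININE"]


def best_label(ocr_line, labels=EXPECTED_LABELS, threshold=0.8):
    """Return the closest expected label, or None if no label scores >= threshold."""
    best, best_score = None, 0.0
    for label in labels:
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        score = SequenceMatcher(None, ocr_line.upper(), label).ratio()
        if score > best_score:
            best, best_score = label, score
    return best if best_score >= threshold else None


# The two OCR lines from the example above.
for line in ["°I kg%#.0000K05*", "CREATININE"]:
    print(repr(line), "->", best_label(line))
```

The second line matches its label exactly, while the first shares almost no characters with either label, so the rule has nothing to anchor the value to.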

Example 3: Dithering

Dithering is intentionally applied noise used to randomize quantization error, intended to prevent patterns such as color banding in images. This works well for human vision as it smooths out transitions between colors, but has extremely negative effects on OCR quality and therefore accuracy. In the example below the word Cholesterol is clearly visible:

kevinphoto4.png

Zooming into the ‘tero’ section shows why dithering can be so problematic. The letters begin to blend into one another, and it’s hard for the OCR engine to determine where one letter stops and another begins.

kevinphoto5.png

The raw OCR output can be seen below. In this instance it would be very difficult for a rules engine to correctly capture the value.

OCR Output: LDL Ct101eNara

As in Example 2, often the best way around this problem is to ensure that the resolution settings of your scanner or incoming fax are set as high as they can be.
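The definition of dithering given at the start of this example can be made concrete with a few lines of code. This is a simplified sketch that assumes plain random-noise dithering on a single row of greyscale pixels (0 = black, 255 = white). Scanners and fax machines often use other dithering methods, so treat the function names and parameters here as illustrative rather than as what any device actually does.

```python
import random


def threshold_row(row, cutoff=128):
    """Plain quantization: every pixel becomes 0 (black) or 255 (white)."""
    return [255 if p >= cutoff else 0 for p in row]


def dither_row(row, cutoff=128, noise=64, seed=0):
    """Random dithering: add noise to each pixel before quantizing it."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [255 if p + rng.uniform(-noise, noise) >= cutoff else 0 for p in row]


# A flat mid-grey strip, similar to a shaded background.
grey = [120] * 64
print(threshold_row(grey))  # every pixel falls below the cutoff and turns black
print(dither_row(grey))     # scattered black and white dots that read as grey from a distance
```

Those scattered dots are what sit next to and between the letter strokes in the zoomed image, which is why the OCR engine struggles to tell where one character ends.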

Example 4: Third and Fourth generation documents

Each time a document is printed and rescanned, image quality is lost. By the third or fourth time this process is repeated, a document can become illegible. While the overall effect can differ depending on the font, type size, and printer/scanner, the result is always an incredibly hard-to-read image.

The example below illustrates the effects of repeated rescans.

In this example we see that the letters have been eroded to the point that only the largest font could be interpreted by a human, and even then it’s difficult without context. The lines below it are so degraded that they don’t actually show up as OCR’d characters at all.

kevinphoto6.png

This example shows the effect of multiple generations on a bold typeface. Again, the letters blend together so badly that most of them don’t even show up in the raw OCR data.

kevinphoto7.png

The easiest and most effective way to prevent capture issues with multi-generational images is to revise document workflows. It is often possible to optimize a workflow so that a document is intercepted earlier and captured before it has been printed and rescanned.

Example 5: First generation image

Below is simply an example of a first-generation image (in this case, a PDF) that OCRs with 100% accuracy. While it is generally not possible to obtain one, when a first-generation image is available it should always be used for OCR and data capture workflows.

kevinphoto8.png