Best Practices for OCR Usage
Optical Character Recognition (OCR) detects and recognizes the text in a scanned image or scanned document. OCR technologies cannot convert 100% of the text found in such documents. However, certain properties of the scanned document help the OCR technology perform better.
This document provides a list of properties that a quality scan should possess for the OCR present in Astera ReportMiner to recognize and process the data with at least 95% accuracy.
Scanned Document Properties
1. It is recommended to maintain a scan quality of at least 250 dots per inch (DPI). Improving the camera quality can increase the DPI of a captured document, and better lighting helps keep the text legible. PDF readers can detect the DPI of a scanned document.
2. Avoid tilted document scans. Whenever you scan a document, ensure the orientation is set to portrait, and the image is aligned with the borders.
3. Avoid pencil or pen marks near the text. If you have to sign a document, it is recommended to sign at the bottom corner, as far as possible from the central text.
4. Avoid watermarks on a document.
5. The original document should have adequate spacing between columns and records so that text does not overlap when it is scanned.
6. Black-and-white documents are recommended, with black text on a white background.
7. Table headers should preferably not have a fill color.
8. High contrast between text and the background is helpful for better visibility.
9. Avoid highlighting text on the document.
10. Consistent font size within a line is helpful for better data processing.
11. The minimum font size is 12 pt in Calibri. Any font can be used; Calibri serves only as a reference.
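The two numeric thresholds in the list above (250 DPI and 12 pt) can be checked with simple arithmetic. The following is a minimal sketch, not part of ReportMiner: the function names are hypothetical, and it assumes you already know the pixel dimensions of the scan and the physical size of the page in inches.

```python
# Thresholds taken from the list above; all names here are illustrative.
MIN_DPI = 250
MIN_FONT_PT = 12
POINTS_PER_INCH = 72  # standard typographic point


def scan_dpi(width_px: int, height_px: int,
             page_width_in: float, page_height_in: float) -> float:
    """Return the lower of the horizontal and vertical DPI of a scan."""
    return min(width_px / page_width_in, height_px / page_height_in)


def meets_minimum_dpi(width_px: int, height_px: int,
                      page_width_in: float, page_height_in: float) -> bool:
    """True when the scan reaches the recommended 250 DPI."""
    return scan_dpi(width_px, height_px, page_width_in, page_height_in) >= MIN_DPI


def font_height_px(font_pt: float, dpi: float) -> float:
    """Nominal height in pixels of a font of the given point size."""
    return font_pt / POINTS_PER_INCH * dpi


# A US Letter page (8.5 x 11 in) captured at 2550 x 3300 pixels.
print(scan_dpi(2550, 3300, 8.5, 11.0))           # 300.0
print(meets_minimum_dpi(2550, 3300, 8.5, 11.0))  # True

# Nominal pixel height of the 12 pt minimum font at the 250 DPI minimum.
print(f"{font_height_px(MIN_FONT_PT, MIN_DPI):.1f}")  # 41.7
```

Taking the lower of the two axes is a deliberate choice in this sketch, so that a scan stretched along one dimension is not reported as meeting the threshold.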
Resolution Option in ReportMiner
In its OCR implementation, ReportMiner provides the OCR Resolution option. The three resolutions, Low, Medium, and High, are based on the zoom factor, that is, how much the OCR zooms in on the image when converting its data.
Just as a phone camera captures more pixels at a high resolution setting and fewer at a low one, ReportMiner’s OCR converts images based on the resolution selected.
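To make the zoom-factor idea concrete, the sketch below shows how the number of pixels handed to an OCR step grows with the zoom. The zoom values assigned to Low, Medium, and High are assumptions chosen for illustration only; the factors ReportMiner actually uses are not stated here, and the function names are hypothetical.

```python
# Hypothetical zoom factors, for illustration only.
ZOOM_FACTORS = {"Low": 1.0, "Medium": 2.0, "High": 3.0}


def rendered_size(width_px: int, height_px: int, resolution: str) -> tuple:
    """Pixel dimensions of the page image after applying the zoom factor."""
    zoom = ZOOM_FACTORS[resolution]
    return round(width_px * zoom), round(height_px * zoom)


def pixel_count(width_px: int, height_px: int, resolution: str) -> int:
    """Total number of pixels in the zoomed page image."""
    width, height = rendered_size(width_px, height_px, resolution)
    return width * height


# Compare the three settings for a page that is 850 x 1100 pixels at zoom 1.
for name in ZOOM_FACTORS:
    print(name, rendered_size(850, 1100, name), pixel_count(850, 1100, name))
```

Because both dimensions scale with the zoom, the pixel count grows with the square of the zoom factor, which is consistent with higher resolutions taking longer to process.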
It is recommended to use Low resolution when the scanned PDFs contain little text; this setting also takes the least time. For PDFs with more data or text, switch to Medium resolution, and for even more data, to High. High resolution takes comparatively more time than Low and Medium.
Medium and High resolutions can often be used interchangeably. Sometimes, however, Medium resolution does not convert all the text and the result has some missing values. In that case, using High resolution can be beneficial, as it can bring in all the data in a more structured manner.
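The guidance above amounts to starting at a lower resolution and moving up only when the output has missing values. The following is a minimal sketch of that escalation logic. It does not call ReportMiner: `run_ocr` is a hypothetical callable standing in for whatever performs the extraction, assumed to return a list of field values with `None` marking a missing value, and the sample data is made up.

```python
from typing import Callable, List, Optional, Tuple

RESOLUTIONS = ("Low", "Medium", "High")


def extract_with_escalation(
    run_ocr: Callable[[str, str], List[Optional[str]]],
    pdf_path: str,
    start: str = "Low",
) -> Tuple[str, List[Optional[str]]]:
    """Try resolutions in ascending order, stopping at the first result
    that has no missing values. Returns the resolution used and its result."""
    used = start
    result: List[Optional[str]] = []
    for resolution in RESOLUTIONS[RESOLUTIONS.index(start):]:
        used = resolution
        result = list(run_ocr(pdf_path, resolution))
        if all(value is not None for value in result):
            break  # everything was converted, no need to go higher
    return used, result


def fake_ocr(pdf_path: str, resolution: str) -> List[Optional[str]]:
    """Stand-in for a real OCR call: drops one field below High resolution."""
    if resolution == "High":
        return ["INV-001", "2024-01-15", "150.00"]
    return ["INV-001", None, "150.00"]


print(extract_with_escalation(fake_ocr, "invoice.pdf", start="Medium"))
# ('High', ['INV-001', '2024-01-15', '150.00'])
```

If even High resolution leaves gaps, the function returns the High result as is, so the caller can decide whether the scan itself needs to be improved.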
This concludes our discussion on best practices for OCR in Astera ReportMiner.