Using ReportMiner to Extract Business Information from Printed Documents

In this tutorial, we will explore Astera’s ReportMiner features.

To mine a report, you need to create a report model containing the definition of the report’s structure, and then use your report source object in a dataflow just as you would with any other hierarchical source object.

Let’s demonstrate how this can be accomplished.

We will start by creating a report model.

NOTE: A report model normally has several regions, and fields belonging to those regions. Examples of regions are the Header, the Footer, the Data region, and any additional ‘append’ regions. Examples of fields within a region are CompanyName, AccountNo and Quantity. A region may have child regions located within it. A field can belong to only one region at a time, and fields cannot overlap.
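The structure described in the note above can be pictured as a small tree of regions and fields. The following Python sketch is only an illustration of that idea, not Centerprise's internal format; the class names, the `line`/`start`/`length` attributes and the sample positions are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    line: int    # 0-based line offset inside the region (assumed representation)
    start: int   # 0-based character position on that line
    length: int  # number of characters the field covers


@dataclass
class Region:
    name: str
    fields: List[Field] = field(default_factory=list)
    children: List["Region"] = field(default_factory=list)

    def add_field(self, new: Field) -> None:
        """Add a field, rejecting one that overlaps an existing field."""
        for old in self.fields:
            same_line = old.line == new.line
            overlaps = (new.start < old.start + old.length
                        and old.start < new.start + new.length)
            if same_line and overlaps:
                raise ValueError(f"{new.name} overlaps {old.name}")
        self.fields.append(new)


# Hypothetical positions, chosen only to illustrate the rules in the note.
header = Region("Header")
header.add_field(Field("OrderDate", line=0, start=0, length=10))
header.add_field(Field("CompanyName", line=0, start=20, length=30))

# A region may contain child regions.
customer = Region("CustomerData")
customer.children.append(Region("OrderData"))
```

Adding a third field at `start=5, length=10` on line 0 would raise `ValueError` here, mirroring the rule that fields cannot overlap.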

To create a new report model, go to File -> New and select Report Model.

../_images/e72c29a4403f56c9fdf361228fc4cd1681408de7adb97617588b891235c59dbf.png

Select a sample report file in the Open dialog box. We will use this sample report to create our report model. Using a sample of an actual report allows us to visualize the report, showing the regions and fields that make it up as well as their actual values from the sample.

../_images/4f3644a8f20fad2006da57de1d4aa022874f9664f747782248a0e2945d7a55ec.png

NOTE: Centerprise supports reading flat text reports, PRN reports and PDF reports.

In the screenshot above, we selected a sample report file for Orders. The selected sample is loaded to the Report Definition Editor:

NOTE: You can also load a different sample file in the report definition editor at a later time. Click the ../_images/8ea243d9211c94b342102dbb93209aa77cbfb5ae574e65513594b5350e6c9deb.png icon on the toolbar and navigate to the file you want to load.

Let’s take a look at this report. At the top of our sample is general Order information, such as Company Name, Order Date and Time, Customer Name, Account Number and others. It is followed by the detailed Order information, such as the items making up the order.

Our sample report has two logical regions: the Header and the Data region. Unlike some other common reports, this report has no Footer.

The header is at the very top of the report, spanning three lines starting at the line with the order date.

../_images/24d3a855ec674ed32487477f8a3a9ad5bc8f885122fa8b9335d07b9a11150e3b.png

So the first step in creating our report model will be to define the header for our report.

In the Report Definition Editor, select the top three lines. This is the area that covers the Header. Right-click your selection to open the context menu, which offers the following options:

../_images/1c3cfc06e1196597ea9cca5d0dd1b5406e9262536fdb8ea24fd757eba94d2fb3.png

Since we are creating the Header, select Add Page Header Region.

The Report Browser on the left-hand side of Centerprise now shows a new node, Header.

../_images/6bc094011a83fce4b13bffd783c9f22a3604f6415eae65f91a2797e5da6d4da5.png

Now, let’s take a closer look at the header. The header in our sample report always starts with a date, shown at the very first line and in the very first character position of the header. We can use the date as an identifying pattern for the header. Any time the ../_images/e128222dbcac2d83a7b5485a12b6301a6d12b5ee65c6f504383f7808382f6600.png pattern occurs in the report file, Centerprise will treat it as the beginning of the Header.

Let’s enter the https://astera.zendesk.com/hc/en-us/article_attachments/360014885593/e128222dbcac2d83a7b5485a12b6301a6d12b5ee65c6f504383f7808382f6600.png wildcard characters denoting digits as shown below:

../_images/5d74925e21b71bb44d5774f5ada7fd1b052bd32c6adfb5f9decbeab66c8d22ee.png

Any time this pattern occurs inside the report, Centerprise will treat it as the starting point of the Header.

Notice that the Report Definition Editor now highlights the header in purple. The header spans 3 lines, as shown by the purple block in the editor. The height of the header, or of any other region (i.e. the number of lines that the region spans), is controlled by the Line Count input below the Report toolbar.
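To make the idea of a start pattern plus a Line Count concrete, here is a minimal Python sketch. It is not how Centerprise is implemented. The sample text is invented, and the date shape (two digits, slash, two digits, slash, four digits) is an assumption, since the actual pattern is shown only in the screenshots.

```python
import re
from typing import List

# Hypothetical sample text; the real Orders report is not reproduced here.
SAMPLE = """\
12/15/2011  ACME MEDIA SUPPLY
ORDER REPORT
PAGE 1
CUSTOMER: Jane Doe
01/03/2012  ACME MEDIA SUPPLY
ORDER REPORT
PAGE 2
"""

# Assumed date shape: dd/dd/dddd, where d is any digit.
HEADER_START = re.compile(r"\d\d/\d\d/\d\d\d\d")


def find_regions(lines: List[str], pattern: "re.Pattern", line_count: int) -> List[List[str]]:
    """Return every block of `line_count` lines whose first line starts with `pattern`."""
    regions = []
    for i, line in enumerate(lines):
        # match() anchors at position 0, i.e. the first character of the line.
        if pattern.match(line):
            regions.append(lines[i:i + line_count])
    return regions


headers = find_regions(SAMPLE.splitlines(), HEADER_START, line_count=3)
print(len(headers))         # number of Header occurrences found in the sample
print(headers[0][0][:10])   # the date that triggered the first match
```

Changing `line_count` in this sketch plays the same role as adjusting the Line Count input: the start pattern stays the same, and only the height of each captured block changes.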

The next step is to create fields making up the header.

There are two ways to create fields.

1. Highlight a field, right-click and select Add Field.

../_images/0e593a61c3e38180b211a898839179c92995aa3fd2eaab1b3f52a3f9d714471f.png

2. Right-click within the Header area, and select Auto Create Fields.

Centerprise will scan our sample report and identify any values that change between occurrences of the Header. These changing values will be marked as fields.

In our example, the Auto Create Fields feature added five fields. They are now displayed in the Report Browser under the Header node. Notice that our new fields are also highlighted in darker purple in the Report Definition Editor.

../_images/1969756c0714162266fb426f06e25049f45673246abe8c42b623c0b510c7f91c.png

The fields created this way are assigned unique names, such as FIELD_0, FIELD_1 and so on.
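One way to picture what "identifying changing values" could mean is to compare the same line across several occurrences of a region and collect the columns whose characters differ. The sketch below is a guess at the general idea, not Centerprise's actual algorithm. The sample lines are invented, and the rule for merging neighboring spans that are not separated by a blank is an assumption added so that a date is treated as a single value.

```python
from typing import List, Tuple


def changing_spans(occurrences: List[str]) -> List[Tuple[int, int]]:
    """Return (start, length) spans of columns that differ between occurrences."""
    width = max(len(s) for s in occurrences)
    padded = [s.ljust(width) for s in occurrences]
    # True for each column whose character is not identical in every occurrence.
    changed = [len({s[col] for s in padded}) > 1 for col in range(width)]

    spans: List[Tuple[int, int]] = []
    col = 0
    while col < width:
        if not changed[col]:
            col += 1
            continue
        start = col
        while col < width and changed[col]:
            col += 1
        end = col  # exclusive
        if spans:
            prev_start, prev_len = spans[-1]
            gap = slice(prev_start + prev_len, start)
            if not any(" " in s[gap] for s in padded):
                # No blank between the spans (e.g. the "/" in a date): merge them.
                spans[-1] = (prev_start, end - prev_start)
                continue
        spans.append((start, end - start))
    return spans


# First header line taken from three hypothetical occurrences of the Header.
first_lines = [
    "12/15/2011  ORDER 1001",
    "01/03/2012  ORDER 1002",
    "07/22/2012  ORDER 2417",
]
fields = {f"FIELD_{i}": span for i, span in enumerate(changing_spans(first_lines))}
print(fields)  # {'FIELD_0': (0, 10), 'FIELD_1': (18, 4)}
```

The constant text ("ORDER") never changes between occurrences, so it is not marked as a field, while the date and the number are.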

You can rename a field if needed. Let’s rename our newly created fields to make them more descriptive.

1. Select a field in the Report Browser, double-click it and enter the new name.

Or:

2. Select a field in the Report Browser, right-click it and select “Rename”.

Or:

3. Select a field in the Report Definition Editor (the selected field is highlighted in yellow), right-click it and select Rename from the context menu.

NOTE: The selected field is always highlighted in yellow in the report definition editor.

We can also change a field’s data type if needed. In our example, Centerprise was able to correctly assign data types to the fields based on our sample report:

../_images/816760418ae9b55590cbf84a65d7208a4325ad2391f4eef078376c1885476722.png

Now that we have created the definition of the Header, let’s look into the main region of the report. As we saw earlier, the main region starts with the Customer Name and then includes Account Number, Contact Name, and finally specific order details.

Let’s select the main region in the report definition editor, then right-click it and select “Add Data Region” from the context menu.

../_images/b5fba164a0d75755044f27c33319ea6db92b0f996a9475bf291e291caeab3982.png

This will add a new node “Data” in the Report Browser. This new node has no fields at this point.

../_images/0846b75c04fe5052bd2defe55ec4148e2e1de538fa1b83367c811b9dd16fcfb5.png

Notice that Centerprise assigned the default vertical size of this region as 23 lines based on our selection. We can adjust this number as needed by using the Line Count input under the toolbar.

Now we will identify the starting point of the region. Place the cursor at the position where the text ‘CUSTOMER:’ begins as shown in the screenshot, and enter CUSTOMER: in the pattern text input.

../_images/d54e82cf8e65d5ab67c23933ad4038653ced95b7b61560a539dabda79aa276f1.png

The Report Definition Editor highlights all occurrences of the Data region in the report. Remember that we can easily adjust the height of the region by using the Line Count input.

Let’s now rename our region CustomerData. Our report now has two regions: Header and CustomerData.

Now, let’s identify the fields making up the CustomerData region.

You can either manually assign fields, or you can use the Auto Create Fields feature.

To manually add a field, highlight a field with the mouse cursor, right-click it and select Add Field. A new field is added to the Report Browser. The Report Definition Editor shows all the occurrences of this field in the report.

NOTE: To automatically add fields, right-click within the region area, and select Auto Create Fields. You can then modify, rename, add or delete fields as necessary.

Next, let’s take a closer look at CustomerData. Notice that each customer can have one or more orders, and each order may have several items in it. In Centerprise terms, we say that the region has a collection of items, or to put it simply, is a Collection. Also note that order data is located within the CustomerData region we defined above. In other words, the CustomerData region is also a container for order details.

Select the CustomerData node in the Report Browser. Right-click it and select Add Collection Data Region. This will add a new region under the CustomerData node. The default name here is Data, which we will rename OrderData to make it more descriptive.

Now, let’s define the starting point of our new region.

Type ORDER NUMBER: in the text pattern input.

The Report Definition Editor highlights all instances of the OrderData region.

../_images/6acf6f827446a7f33ec9cd32473d392e7f4a084ee2990c18d20b5efcdacf0989.png

Right-click anywhere within our region, and select Auto Create Fields. This creates the Order Number and Ship Date fields, named Field_0 and Field_1 respectively. Let’s give these fields more user-friendly names.

As we saw earlier, a customer can have more than one order (in Centerprise parlance, a Collection of items). Whenever a node holds a collection of items, we need to turn on its “Is Collection” property as shown below. Notice that the icon for the ORDER node in the Report Browser changes to help identify this node as a collection. Note that when we add a Collection Data Region via the context menu, the “Is Collection” property is enabled automatically.

Now, let’s create the definition of Order Items. Select the CustomerData node in the Report Browser. Add a new Collection Data Region in the Report Definition Editor in the same manner as we did earlier.

Specify the text pattern that will identify our order items. In our example, we will use part of the Quantity data followed by a space character to identify a line with the order item. To that end, enter “Match any digit” and then “Match any blank character” ../_images/ae82a01ff6551e878448e6798346e3cfb2aa75b90f490daa052d4bb23540a17b.png as shown below.

../_images/08c5c4e3fc871c82410150c366cce6e605afc1f86d32e3e0bac6a427d1376cdd.png

Next, we will rename our new region OrderDetails, and auto create fields using the procedure we demonstrated earlier.

This action adds six fields: media type, quantity, description, label/no, unit price and amount. By default the fields are named Field_0, Field_1 and so on, so we will need to rename them as desired.

Our sample report does not have a footer. If necessary, a footer can be added in the same way we added the Header and Data regions above.

We have now created the report definition.
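The regions defined so far form a small hierarchy: each CustomerData occurrence holds a collection of OrderData occurrences and a collection of OrderDetails occurrences. The Python sketch below shows one way such a hierarchy could be read from text. It is a simplified illustration and not the Centerprise parser. The sample text, the output keys and the single-digit quantities are assumptions; only the start patterns (`CUSTOMER:`, `ORDER NUMBER:`, and a digit followed by a blank) come from the steps above.

```python
import re
from typing import Any, Dict, List

# Hypothetical sample text, reduced to the lines that start each region.
SAMPLE = """\
CUSTOMER: Jane Doe
ORDER NUMBER: 1001
2 CD Blank discs
1 DVD Training video
ORDER NUMBER: 1002
5 CD Blank discs
CUSTOMER: John Roe
ORDER NUMBER: 2001
3 DVD Demo reel
"""

# "Match any digit" followed by "Match any blank character".
ITEM_START = re.compile(r"\d ")


def parse(lines: List[str]) -> List[Dict[str, Any]]:
    """Group order and item lines under the customer line that precedes them."""
    customers: List[Dict[str, Any]] = []
    for line in lines:
        if line.startswith("CUSTOMER:"):
            customers.append({
                "CustomerName": line[len("CUSTOMER:"):].strip(),
                "OrderData": [],      # collection region
                "OrderDetails": [],   # collection region
            })
        elif not customers:
            continue  # lines before the first customer (e.g. the Header)
        elif line.startswith("ORDER NUMBER:"):
            number = line[len("ORDER NUMBER:"):].strip()
            customers[-1]["OrderData"].append({"OrderNumber": number})
        elif ITEM_START.match(line):
            quantity, description = line.split(" ", 1)
            customers[-1]["OrderDetails"].append(
                {"Quantity": int(quantity), "Description": description.strip()}
            )
    return customers


customers = parse(SAMPLE.splitlines())
for customer in customers:
    print(customer["CustomerName"],
          len(customer["OrderData"]), "orders,",
          len(customer["OrderDetails"]), "items")
```

Both collections hang off the customer here because both regions were added under the CustomerData node in the steps above.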

NOTE: Report definitions are used by Centerprise to correctly parse, interpret and assign data as it is fed from the report source. Report definitions are saved with the *.rmd extension.

Let’s save our report model by clicking the Save icon on the main toolbar. Now we can preview our data to see how it is parsed by Centerprise.

Click the ../_images/e17939a676dda4c9b92726eb5db31c82aef6c60cc5d18b002e15d8d66b015fd5.png icon on the top toolbar. This opens the Data Preview window showing the entire report structure with the actual values for all the fields we have defined above.

Let’s now take a look at some additional functionality that Centerprise offers to help you customize your report.

Selecting Fields and Regions

To select a field, left-click on it in the Report Browser’s tree. The field is highlighted in yellow in the Report Definition Editor. Some of the more common field properties are displayed in the top pane of the editor:

../_images/56e855c464ca95d6346f92c22a6624ce5edafdd820aee65b1f191cac76cc528a.png

To select a region, click on it in the Report Browser’s tree. The region is highlighted in light purple in the Report Definition Editor, and the fields in the selected region are also highlighted in darker purple. The top pane shows the properties that are applicable for the region.

Managing Field and Region Properties

To view and update all other properties of a field or a region, right-click on a field (or region) inside Report Browser, and select Edit Field (or Edit Region) from the context menu.

The same functionality is also available on the top toolbar, by pressing the ../_images/50520f76c0529d45c82ba648ba551a8de06281d89486a9bed05543d1b81e7f6a.png icon.

You can also access field properties by right-clicking the field in the Report Definition Editor and selecting ‘Field Properties…’ from the context menu.

Renaming Fields and Regions

To rename a field, double click it on the tree in the Report Browser and enter a new name.

To rename a region, double click it on the tree in the Report Browser and enter a new name.

You can also rename a field or a region by entering the new name in the Name input on the top pane.

Deleting Fields and Regions

To delete a field, right-click it in Report Browser or Report Definition Editor and select Delete Field.

To delete a region, right-click on a region (or a field inside the region), and select Delete Region from the context menu. Note that this action will also delete any fields in that region.

Customizing Fields

After your field has been created, you can change its start position by moving it a number of characters to the left or to the right. Right-click on a field and select “Move Field Marker Right One Character” or “Move Field Marker Left One Character” from the context menu. Repeat as needed to move the field the desired number of characters. The same functionality is also accessible from the top toolbar via the ../_images/8005263e3dddb060decaa395e4c8bb8ca5168d28e7d200e854d6d0d4f5652216.png icons.

You can also change the field length by selecting “Decrease Field Length by one character” or “Increase Field Length by one character” from the context menu. Repeat as many times as needed to change the field length by the desired number of characters.

The same functionality is also accessible from the top toolbar via the ../_images/37eb345da4114a470509c12df19eff41d3895fd9c8ef5be3969efb7b2f1271b7.png icons.
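The effect of moving a field marker or changing a field's length is easy to see if the field is treated as a start position and a length on a fixed-width line. The sketch below assumes exactly that representation. The `FieldMarker` class and the sample line are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FieldMarker:
    start: int   # 0-based character position where the field begins
    length: int  # number of characters the field captures

    def extract(self, line: str) -> str:
        return line[self.start:self.start + self.length]


# Hypothetical report line.
line = "CUSTOMER: Jane Doe        ACCOUNT: 40017"

marker = FieldMarker(start=9, length=9)
print(repr(marker.extract(line)))    # ' Jane Doe' (starts one character too early)

# Equivalent of "Move Field Marker Right One Character".
moved = replace(marker, start=marker.start + 1)
print(repr(moved.extract(line)))     # 'Jane Doe ' (now one character too long)

# Equivalent of "Decrease Field Length by one character".
trimmed = replace(moved, length=moved.length - 1)
print(repr(trimmed.extract(line)))   # 'Jane Doe'
```

Moving the marker shifts the window without resizing it, which is why a move is often followed by a length adjustment.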

To auto determine field length based on the available sample data, right-click a field and select “Auto determine field length” from the context menu, or click the ../_images/0d21a4ae7ccaea2c1f85b5211f006c425ec89f515687d17cb7ecd6f1534588aa.png icon on the top toolbar.
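As a rough illustration of determining a length from sample data, the sketch below takes the longest value found at the field's position across the sample lines. This is an assumption about what such a feature might do, not a description of Centerprise's logic. The function name `auto_length`, the `limit` parameter (the column where the next label begins) and the sample lines are all hypothetical.

```python
from typing import List


def auto_length(lines: List[str], start: int, limit: int) -> int:
    """Longest value found between `start` and `limit`, ignoring trailing blanks."""
    return max(len(line[start:limit].rstrip()) for line in lines)


# Hypothetical sample lines; the customer name starts at column 10
# and the next label (ACCOUNT:) starts at column 26.
sample_lines = [
    "CUSTOMER: Jane Doe        ACCOUNT: 40017",
    "CUSTOMER: Bartholomew Roe ACCOUNT: 40018",
]

length = auto_length(sample_lines, start=10, limit=26)
print(length)  # 15, the width of the longest name in the sample
```

A length derived this way is only as good as the sample: a report with a longer value than any in the sample file would still be truncated, so it is worth checking the result against realistic data.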

You can also move all fields within the same region left or right at once. To do so, right-click on a region or field, and select “Move All Field Markers Left One Character” or “Move All Field Markers Right One Character”. You can also use the https://astera.zendesk.com/hc/en-us/article_attachments/360014885893/8005263e3dddb060decaa395e4c8bb8ca5168d28e7d200e854d6d0d4f5652216.png icons on the top toolbar.

NOTE: To undo any action in the editor, use the Undo dropdown menu on the toolbar or press CTRL + Z.

Identifying Text Patterns for Fields and Regions

The following options are available to help you create a text pattern that will identify the starting point of a field, or a region.

../_images/6bbbc1c8502f2cac46f3874aca9e421a5051348de9a3f9813a8be5f627f59ffd.png Match any alphabet

../_images/049412d563a26531f8e97ea7f5a29b8291af230b40ec6043f37b7a8bbe792097.png Match any digit

../_images/4ce935ab88827d4dc7ff27002e532a897c668d48aa490e31569b86e96ae1db38.png Match any alphabet or digit

../_images/d2b076382d1706e9d00e80401286d509e5f35290828331e71bdee891c72eac98.png Match any non-blank character

../_images/f2cd6203620ab26bfa2bec9c0b0f18965519bbf3174c4deb183b941ab754e43d.png Match any blank character

For example, to match a date such as 12/15/2011, you can use a pattern made of two digit wildcards, a slash, two digit wildcards, a slash, and four digit wildcards, where https://astera.zendesk.com/hc/en-us/article_attachments/360014886013/049412d563a26531f8e97ea7f5a29b8291af230b40ec6043f37b7a8bbe792097.png is “Match any digit”.
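For readers who think in regular expressions, the five wildcards can be approximated by character classes. The mapping below is an assumption made for illustration: Centerprise's exact definitions of "alphabet" and "blank" are not documented here. The `Wildcard` enum and the `to_regex` helper are hypothetical names.

```python
import re
from enum import Enum
from typing import List, Union


class Wildcard(Enum):
    # Assumed regular-expression equivalents of the five pattern options.
    ALPHABET = "[A-Za-z]"
    DIGIT = "[0-9]"
    ALPHABET_OR_DIGIT = "[A-Za-z0-9]"
    NON_BLANK = r"[^ \t]"
    BLANK = r"[ \t]"


def to_regex(pattern: List[Union[Wildcard, str]]) -> "re.Pattern":
    """Build a regex from a mix of wildcards and literal text."""
    parts = [p.value if isinstance(p, Wildcard) else re.escape(p) for p in pattern]
    return re.compile("".join(parts))


D = Wildcard.DIGIT

# Date such as 12/15/2011: digit digit / digit digit / four digits.
date_pattern = to_regex([D, D, "/", D, D, "/", D, D, D, D])
print(bool(date_pattern.match("12/15/2011")))   # True
print(bool(date_pattern.match("1a/15/2011")))   # False

# Order item line: a digit followed by a blank character.
item_pattern = to_regex([D, Wildcard.BLANK])
print(bool(item_pattern.match("2 CD Blank discs")))  # True
```

Literal text such as `CUSTOMER:` passes through `re.escape`, so it is matched character for character, just as typed text is in the pattern input.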

Report Options

To change report options, click the ../_images/33a98722dca898a399f16e41253e3fca48e62e20137d84c2ca9b190656218222.png icon on the report toolbar.

The following options are available:

Sample File Path – provides the path to the sample report file that you want to use for creating your report model.

Line Count – controls how many lines are loaded from the sample file.

Other useful options are Tab Size (default value of 8), Font and Numeric Format.
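Tab Size matters because fields are located by character position, and the column at which text lands after a tab depends on the tab width. The short Python sketch below demonstrates this with the standard `str.expandtabs` method; the sample line and the helper name `column_of` are made up for the example.

```python
def column_of(line: str, text: str, tab_size: int) -> int:
    """Column where `text` starts once tabs are expanded to `tab_size` stops."""
    return line.expandtabs(tab_size).index(text)


# Hypothetical tab-separated report line.
line = "QTY\tDESCRIPTION\tAMOUNT"

for tab_size in (4, 8):
    print(tab_size,
          column_of(line, "DESCRIPTION", tab_size),
          column_of(line, "AMOUNT", tab_size))
# 4 4 16
# 8 8 24
```

The same line yields different field positions under different tab sizes, so a model built with one setting may not line up with a report rendered using another.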

Let’s now preview our sample report based on the report model we have created. To preview it, click the ../_images/5f8775774b66d851d33c5ec02391f0f12008e8fc2581a89e791cd90dee49bbe3.png icon on the Report toolbar. Our report is displayed in the Data Preview pane as shown in the screenshot below.

../_images/6e73b240ef48cebe21c012a8c514849ea738ceb36d107015a9ac7b42bc8534ec.png

Now that we have created and previewed our report, let’s add it to a dataflow so we can read the entire source report and feed it to a destination object.

Go to File -> New -> Dataflow. This creates a new dataflow.

Using the Toolbox pane, expand the Sources category and select Report Source.

Drag and drop Report Source onto the Designer.

Double-click the ReportModel1 object that we just added (or right-click it and select Properties) to open the Properties dialog.

Using the properties dialog, enter the path to the report source file and the report model. The report model location should point to the report model we created and saved earlier.

../_images/d088a3b749b563b0d035da5b9b65b3607ef057cdf6c36b94e2eb3043ef6dce29.png

Click OK to close the dialog. The ReportModel1 object shows the report structure according to the report model we created:

../_images/8b3e9c68d49bbd982adc1bc59e80809904c202244c463de0459c55a0f7f69a4a.png

NOTE: You may need to expand the tree nodes to see all the child nodes under the root node.

Our new report source is ready to feed data to the downstream objects in our dataflow.