Parquet File Source (Beta)

What is Apache Parquet File Format?

Apache Parquet is a columnar storage file format used by Hadoop ecosystem tools such as Pig, Spark, and Hive. The format is language-independent and binary. Parquet is used to store large data sets efficiently, and its files use the .parquet extension.

The key features of Parquet with respect to Centerprise are:

  • It supports compression, which reduces the size of the stored file.
  • It encodes the data, for example with dictionary or run-length encoding, so that repeated values take less space.
  • It stores data in a column layout.
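To make the column layout and encoding points concrete, here is a minimal sketch in plain Python. It is illustrative only: the function names `rows_to_columns` and `run_length_encode` and the sample data are hypothetical, and Parquet's actual on-disk encodings are more elaborate than this toy version.

```python
from itertools import groupby


def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists (column layout)."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical sample data, stored row by row.
rows = [
    {"country": "US", "amount": 10},
    {"country": "US", "amount": 12},
    {"country": "PK", "amount": 7},
]

# The same data stored column by column.
columns = rows_to_columns(rows)

# Values of one column sit next to each other, so repeats are easy to encode.
encoded_country = run_length_encode(columns["country"])
```

Because all values of a column are stored together, runs of identical values become visible to the encoder, which is one reason a column layout tends to compress well.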

Using Parquet File Source in Astera Centerprise

In Centerprise, you can use a Parquet file in which the cardinality of the data is maintained, i.e., every column must contain the same number of values.

Note: There should only be one row for each data field.
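The cardinality requirement can be checked before a file is produced. The sketch below assumes the data is held in memory as a dictionary of column lists; `check_cardinality` is a hypothetical helper, not part of Centerprise.

```python
def check_cardinality(columns):
    """Return the shared value count, or raise if the columns differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns have differing value counts: {lengths}")
    # An empty mapping has no columns, so report zero values.
    return next(iter(lengths.values()), 0)


# Hypothetical column data in which both columns hold two values.
value_count = check_cardinality({"id": [1, 2], "name": ["a", "b"]})
```

A file that fails this kind of check would have at least one column with more or fewer values than the others.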

1. Drag and drop the Parquet File Source from the Sources section of the Toolbox onto the dataflow designer.

01-Drag-Drop-Parquet

2. Right-click on the Parquet File Source object and select Properties from the context menu.

02-Right-Click-Parquet

This will open a new window.

03-Parquet-Properties-Source

Let’s have a look at the options present here.

04-Parquet-File-Path

File Location

File Path: This is where you will provide the path to the .parquet file.

05-01-Data-Load-Options-Parquet

Data Load Option

If you wish to control memory consumption and improve read time, use the Data Load option.

Batch Size: This is where the size of each batch is defined.
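The general idea behind batched reading can be sketched as follows. This is not how Centerprise implements the option; `read_in_batches` is a hypothetical generator that shows why a batch size bounds memory use: only one batch is held in memory at a time.

```python
from itertools import islice


def read_in_batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Five hypothetical records read two at a time.
batches = list(read_in_batches(range(5), 2))
```

A smaller batch size lowers peak memory at the cost of more read operations; a larger one does the opposite.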

05-Checkboxes-Parquet

Advanced File Processing: String Processing

Treat empty string as null value: Select this option to read every empty string as a null value.

Trim strings: Select this option to remove leading and trailing whitespace from string values.
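The effect of the two string options can be sketched in Python. The function `process_string` and its parameters are hypothetical, and the order shown (trim first, then convert empty strings to null) is an assumption; the product's actual order of operations is not stated here.

```python
def process_string(value, trim=True, empty_as_null=True):
    """Apply the two string options to a single value."""
    if value is None:
        return None
    if trim:
        # Remove leading and trailing whitespace.
        value = value.strip()
    if empty_as_null and value == "":
        # Represent an empty string as a null value.
        return None
    return value


trimmed = process_string("  abc ")
blank = process_string("")
```

Under the assumed order, a string that contains only spaces also becomes null when both options are enabled.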

3. Once done, click Next and you will be led to the Layout Builder screen.

06-Layout-Builder-Parquet

The layout is built automatically. If it is not, you can build it using the Build Layout from layout spec option at the top of the screen.

07-Build-Layout-Spec-Parquet

4. Once done, click Next and you will be taken to the Config Parameters screen.

08-Config-Parameters-Parquet

This allows you to further configure and define dynamic parameters for the Parquet source file.

Note: Parameters left blank will use their default values assigned on the properties page.
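The fallback behavior described in the note can be pictured with a short sketch. The helper `resolve_parameters` and the parameter names `FilePath` and `BatchSize` are hypothetical and chosen for illustration only.

```python
def resolve_parameters(defaults, overrides):
    """Use an override when it is filled in; otherwise keep the default."""
    resolved = dict(defaults)
    for name, value in overrides.items():
        if value is not None and str(value).strip() != "":
            resolved[name] = value
    return resolved


# Hypothetical defaults taken from the properties page.
defaults = {"FilePath": "C:/data/orders.parquet", "BatchSize": 1000}

# A blank FilePath parameter falls back to the default; BatchSize is overridden.
resolved = resolve_parameters(defaults, {"FilePath": "", "BatchSize": 500})
```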

5. Click Next and you will be taken to the General Options screen.

09-General-Options-Parquet

Here, you can add any comments about the object.

6. Click OK and the Parquet File Source object will be configured.

10-Parquet-Source-Configured

You can now map these fields to other objects as part of the dataflow.

Data Types Supported in Parquet Centerprise:

  • Integer
  • Time/Timestamp
  • Date
  • String
  • Float
  • Real
  • Decimal
  • Double
  • Byte Array
  • Guid

Data Types not Supported in Parquet Centerprise:

  • Base64
  • Integer96
  • Image

Limitations:

  • Hierarchy is not supported.
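Because nested data is not supported, it can help to check whether records are flat before writing them to a Parquet file for this source. The sketch below assumes records are available as Python dictionaries; `find_nested_fields` is a hypothetical helper.

```python
def find_nested_fields(record):
    """Return the names of fields whose values are nested structures."""
    return [
        name
        for name, value in record.items()
        if isinstance(value, (dict, list, tuple))
    ]


# Hypothetical record with one flat field and one nested field.
nested = find_nested_fields({"id": 1, "address": {"city": "Lahore"}})
```

A record for which this returns an empty list contains only flat fields.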

This concludes our discussion on the definition and configuration of the Parquet File Source object in Astera Centerprise.