Wednesday, November 22, 2023

Parquet File Format in Hadoop

Apache Parquet is a columnar storage file format available to any project in the Hadoop ecosystem (Hive, HBase, MapReduce, Pig, Spark).

What is a columnar storage format

In order to understand the Parquet file format in Hadoop better, let's first see what a columnar format is. In a column oriented format, the values of each column across the records are stored together.

For example, if a record comprises the fields ID, Name and Department, then all the values for the ID column are stored together, all the values for the Name column together, and so on. Take the same record schema having three fields: ID (int), Name (varchar) and Department (varchar).

ID   Name   Department
1    emp1   d1
2    emp2   d2
3    emp3   d3

For this table, in a row-wise storage format the data will be stored as follows-

1 emp1 d1 2 emp2 d2 3 emp3 d3

Whereas the same data will be stored as follows in a column oriented storage format-

1 2 3 emp1 emp2 emp3 d1 d2 d3
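
To make the difference concrete, here is a small illustrative Java sketch (plain Java, not Parquet code) that lays the same three records out both ways. Reading just the names touches only one array in the columnar layout-

public class StorageLayoutDemo {
  public static void main(String[] args) {
    // Row oriented layout: all fields of record 1, then record 2, then record 3
    String[] rowWise = {"1", "emp1", "d1", "2", "emp2", "d2", "3", "emp3", "d3"};

    // Column oriented layout: all IDs, then all names, then all departments
    int[] ids = {1, 2, 3};
    String[] names = {"emp1", "emp2", "emp3"};
    String[] departments = {"d1", "d2", "d3"};

    // Fetching only the Name column reads just the names array; with the
    // row-wise layout every record would have to be scanned and parsed.
    for (String name : names) {
      System.out.println(name);
    }
  }
}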

How columnar storage format helps

As you can see from the two layouts, if you need to query only a few columns from a table then the columnar storage format is more efficient; it reads just the required columns and, since the values of a column are stored adjacently, IO is minimized.

For example, let's say you want only the Name column. In a row storage format each record in the dataset has to be loaded and parsed into fields before the data for Name can be extracted. With a column oriented format the reader can go directly to the Name column, as all the values for that column are stored together, and get those values; there is no need to go through whole records.

So a column oriented format increases query performance: less seek time is needed to get to the required columns, and less IO is needed since only the columns whose data is required are read.
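
As a sketch of how this looks in code, the parquet-avro Java library lets you request a projection so that only the needed column chunks are read from the file. The file path and the projection schema below are just assumptions for illustration-

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ProjectionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ask Parquet to materialize only the name column; the column chunks
    // for the other fields are skipped entirely.
    Schema projection = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"employee\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");
    AvroReadSupport.setRequestedProjection(conf, projection);

    Path path = new Path("/tmp/employee.parquet"); // example path
    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(path)
        .withConf(conf)
        .build()) {
      GenericRecord rec;
      while ((rec = reader.read()) != null) {
        System.out.println(rec.get("name"));
      }
    }
  }
}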

Seen in a Big Data context, where data is generally denormalized before being loaded into Hadoop and tables therefore tend to have a large number of columns, using a columnar file format like Parquet brings a big improvement in performance.

Another benefit you get is reduced storage. Compression works better when data is of the same type, and with a column oriented format values of the same type are stored together, resulting in better compression.
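
For instance, with the parquet-avro library a compression codec can be set when writing a file. This is a minimal sketch; the Avro schema and the output path are made up for the example-

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetCompressionDemo {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"employee\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"int\"},"
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"department\",\"type\":\"string\"}]}");

    Path path = new Path("/tmp/employee.parquet"); // example output path

    // Write a Snappy-compressed Parquet file; each column chunk is
    // compressed separately
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(path)
        .withSchema(schema)
        .withConf(new Configuration())
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("id", 1);
      rec.put("name", "emp1");
      rec.put("department", "d1");
      writer.write(rec);
    }
  }
}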

Parquet format

Coming back to the Parquet file format: since it is a column oriented format, it brings the same benefits of improved performance and better compression.

One of the unique features of Parquet is that it can store data with nested structures in columnar fashion too. Other columnar file formats flatten nested structures and store only the top level in columnar format, which means that in the Parquet file format even nested fields can be read individually, without the need to read all the fields in the nested structure.
Note that Parquet uses the record shredding and assembly algorithm described in Google's Dremel paper for storing nested structures in columnar fashion.
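
As an illustration, a hypothetical schema with a nested address group could be declared as follows; in Parquet the nested street and city fields each get their own column and can be read without touching the rest of the record-

message employee {
  required int32 id;
  required binary name (UTF8);
  optional group address {
    optional binary street (UTF8);
    optional binary city (UTF8);
  }
}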

Primitive data types in Parquet format

Primitive data types supported by the Parquet file format are as follows

  • BOOLEAN: 1 bit boolean
  • INT32: 32 bit signed ints
  • INT64: 64 bit signed ints
  • INT96: 96 bit signed ints
  • FLOAT: IEEE 32-bit floating point values
  • DOUBLE: IEEE 64-bit floating point values
  • BYTE_ARRAY: arbitrarily long byte arrays
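
As an illustration, a hypothetical schema exercising these primitive types could look like this-

message employee {
  required boolean active;
  required int32 id;
  required int64 salary;
  required float rating;
  required double score;
  required binary name;
}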

Logical types in Parquet format

Parquet format also defines logical types that can be used to store data, by specifying how the primitive types should be interpreted. This keeps the set of primitive types to a minimum and reuses Parquet's efficient encodings. For example, strings are stored as byte arrays (binary) with a UTF8 annotation, and DATE must annotate an int32. These annotations define how to further decode and interpret the data.

For example, defining a String in Parquet-

message p {
  required binary s (UTF8);
}

Defining a date field in Parquet-

message p {
  required int32 d (DATE);
}

You can get the full list of Parquet logical types here - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
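
If you are using the parquet-mr Java library, schema strings like the ones above can be parsed programmatically with MessageTypeParser; a minimal sketch-

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class SchemaParseDemo {
  public static void main(String[] args) {
    // Parse a textual Parquet schema into a MessageType object
    MessageType schema = MessageTypeParser.parseMessageType(
        "message p {\n"
      + "  required binary s (UTF8);\n"
      + "  required int32 d (DATE);\n"
      + "}");
    System.out.println(schema);
  }
}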

Parquet file format

To understand the Parquet file format in Hadoop you should be aware of the following three terms-

  • Row group: A logical horizontal partitioning of the data into rows. A row group consists of a column chunk for each column in the dataset.
  • Column chunk: A chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file.
  • Page: Column chunks are divided up into pages written back to back. The pages share a common header and readers can skip over pages they are not interested in.

The Parquet file format structure has a header, one or more row groups and a footer, so a Parquet file can be illustrated as follows.

Parquet File Format

Here the header just contains a 4-byte magic number "PAR1" that identifies the file as a Parquet format file.

Footer contains the following-

  • File metadata- The file metadata contains the start locations of all the column metadata. Readers are expected to first read the file metadata to find all the column chunks they are interested in, as shown in the sketch after this list; the column chunks should then be read sequentially. It also includes the format version, the schema, and any extra key-value pairs.
  • length of file metadata (4-byte)
  • magic number "PAR1" (4-byte)
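
You can see this layout for yourself by reading the footer of an existing file with the parquet-hadoop library. A sketch, assuming a Parquet file already exists at the example path-

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterDemo {
  public static void main(String[] args) throws Exception {
    Path path = new Path("/tmp/employee.parquet"); // example path
    Configuration conf = new Configuration();

    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
      // The footer holds the file metadata: schema, row groups and
      // the start locations of the column chunks
      ParquetMetadata footer = reader.getFooter();
      System.out.println("Schema: " + footer.getFileMetaData().getSchema());

      for (BlockMetaData rowGroup : footer.getBlocks()) {
        System.out.println("Row group, rows = " + rowGroup.getRowCount());
        for (ColumnChunkMetaData column : rowGroup.getColumns()) {
          System.out.println("  column " + column.getPath()
              + " starts at byte " + column.getStartingPos());
        }
      }
    }
  }
}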

That's all for this topic Parquet File Format in Hadoop. If you have any doubts or any suggestions to make please drop a comment. Thanks!
