Parquet File Format in Hadoop

Apache Parquet is a columnar storage file format available to any project in the Hadoop ecosystem (Hive, HBase, MapReduce, Pig, Spark).

What is a columnar storage format

To understand the Parquet file format in Hadoop better, let’s first see what a columnar format is. In a column-oriented format, the values of each column across the records are stored together.

For example, if a record comprises ID, Name and Department, then all the values for the ID column are stored together, all the values for the Name column are stored together, and so on. Take a table with this record schema, having three fields ID (int), NAME (varchar) and Department (varchar)-

ID	Name	Department
1	emp1	d1
2	emp2	d2
3	emp3	d3

For this table, in a row-wise storage format the data will be stored as follows-

1	emp1	d1	2	emp2	d2	3	emp3	d3

Whereas the same data will be stored as follows in a column-oriented storage format-

1	2	3	emp1	emp2	emp3	d1	d2	d3
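The two layouts can be sketched in a few lines of Python. This is only an in-memory illustration using the three records from the table above; the function names row_layout and column_layout are made up for this example, and a real file format stores bytes rather than Python lists.

```python
# The three records from the example table: (ID, Name, Department).
records = [
    (1, "emp1", "d1"),
    (2, "emp2", "d2"),
    (3, "emp3", "d3"),
]


def row_layout(rows):
    """Row-wise: all fields of record 1, then all fields of record 2, ..."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Column-wise: all IDs, then all names, then all departments."""
    return [value for column in zip(*rows) for value in column]


print(row_layout(records))
print(column_layout(records))
```

In the second list the values of each column sit next to each other, which is the property the rest of this post relies on.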

How columnar storage format helps

As the two layouts show, if a query needs only a few columns from a table, a columnar storage format is more efficient: the values of each column are stored adjacently, so only the required columns are read, which minimizes IO.

For example, let’s say you want only the NAME column. In a row storage format each record in the dataset has to be loaded and parsed into fields, and only then can the data for Name be extracted. With a column-oriented format the reader can go directly to the Name column, as all the values for that column are stored together, and get those values. There is no need to go through the whole record.

So a column-oriented format improves query performance: less seek time is needed to reach the required columns, and less IO is needed because only the columns whose data is required are read.
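To make the difference concrete, here is a rough sketch that counts how many stored values have to be touched to fetch the Name column in each layout. It assumes the small three-record table used earlier and fixed-size columns; the helper names are hypothetical, and real readers work with byte offsets and column chunks rather than list positions.

```python
COLUMNS = ["ID", "Name", "Department"]
NUM_ROWS = 3

# The same nine values in the two layouts discussed above.
row_store = [1, "emp1", "d1", 2, "emp2", "d2", 3, "emp3", "d3"]
column_store = [1, 2, 3, "emp1", "emp2", "emp3", "d1", "d2", "d3"]


def read_column_from_rows(store, column):
    """Row layout: scan every value and keep only the wanted field."""
    index = COLUMNS.index(column)
    width = len(COLUMNS)
    values = [v for pos, v in enumerate(store) if pos % width == index]
    return values, len(store)  # every stored value was visited


def read_column_from_columns(store, column):
    """Column layout: jump to the column start and read one contiguous run."""
    start = COLUMNS.index(column) * NUM_ROWS
    values = store[start:start + NUM_ROWS]
    return values, len(values)  # only this column's values were visited


print(read_column_from_rows(row_store, "Name"))
print(read_column_from_columns(column_store, "Name"))
```

Both calls return the same names, but the row layout visits all nine values while the column layout visits three.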

In the big data context, data is generally denormalized before it is loaded into Hadoop, so tables tend to have a large number of columns. In that setting, using a columnar file format like Parquet can bring a significant improvement in performance.

Another benefit is reduced storage. Compression works better when the data is of the same type. In a column-oriented format, values of the same type are stored together, resulting in better compression.
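One way to see why grouping similar values helps is run-length encoding, a simple scheme picked here purely for illustration. The sketch below uses a made-up five-record table in which the department value repeats; it does not show how Parquet itself encodes or compresses pages.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical records: (ID, Name, Department), with repeating departments.
rows = [
    (1, "emp1", "d1"),
    (2, "emp2", "d1"),
    (3, "emp3", "d1"),
    (4, "emp4", "d2"),
    (5, "emp5", "d2"),
]

# Row layout interleaves IDs, names and departments.
row_values = [value for row in rows for value in row]
# Column layout keeps all department values next to each other.
department_column = [row[2] for row in rows]

print(run_length_encode(department_column))
print(len(run_length_encode(row_values)))
```

The department column collapses into two runs, while the interleaved row layout has no two equal neighbours, so run-length encoding saves nothing there.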

Parquet format

Coming back to the Parquet file format: since it is a column-oriented format, it brings the same benefits of improved performance and better compression.

One of the distinctive features of Parquet is that it can also store data with nested structures in columnar fashion. Some other columnar file formats flatten the nested structures and store only the top level in columnar format. This means that in the Parquet file format even nested fields can be read individually, without the need to read all the fields in the nested structure.
Note that the Parquet format uses the record shredding and assembly algorithm described in the Dremel paper to store nested structures in columnar fashion.
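The idea of giving every nested leaf field its own column can be sketched as below. This is a deliberately simplified illustration and not the Dremel algorithm: it omits the repetition and definition levels that the real algorithm uses for repeated and optional fields, and it assumes every record has every field. The record contents and the names shred and to_columns are made up for the example.

```python
def shred(record, prefix=""):
    """Flatten one nested dict into {dotted.path: value} leaf entries."""
    leaves = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            leaves.update(shred(value, path))
        else:
            leaves[path] = value
    return leaves


def to_columns(records):
    """Collect the values of each leaf path into its own column."""
    columns = {}
    for record in records:
        for path, value in shred(record).items():
            columns.setdefault(path, []).append(value)
    return columns


# Hypothetical nested records.
records = [
    {"id": 1, "name": "emp1", "address": {"city": "c1", "zip": "z1"}},
    {"id": 2, "name": "emp2", "address": {"city": "c2", "zip": "z2"}},
]

columns = to_columns(records)
print(sorted(columns))
print(columns["address.city"])
```

Because address.city ends up as a column of its own, it can be read without touching address.zip or any other field.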

Refer to Converting Text File to Parquet File Using Hadoop MapReduce to see how to convert an existing file to a Parquet file using MapReduce.

Primitive data types in Parquet format

Primitive data types supported by the Parquet file format are as follows-

BOOLEAN: 1 bit boolean
INT32: 32 bit signed ints
INT64: 64 bit signed ints
INT96: 96 bit signed ints (deprecated, kept for legacy timestamp data)
FLOAT: IEEE 32-bit floating point values
DOUBLE: IEEE 64-bit floating point values
BYTE_ARRAY: arbitrarily long byte arrays
FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
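To give a feel for these sizes, the sketch below packs a few values with Python's struct module. It follows the format's plain encoding as I understand it: little-endian fixed-width numbers, and a byte array written as a 4-byte little-endian length followed by the bytes. The helper name encode_byte_array is made up, and a real Parquet writer adds pages, headers and possibly other encodings on top of this.

```python
import struct

# Fixed-width primitives, little-endian.
int32_bytes = struct.pack("<i", 1)      # INT32  -> 4 bytes
int64_bytes = struct.pack("<q", 1)      # INT64  -> 8 bytes
float_bytes = struct.pack("<f", 1.5)    # FLOAT  -> 4 bytes
double_bytes = struct.pack("<d", 1.5)   # DOUBLE -> 8 bytes


def encode_byte_array(data: bytes) -> bytes:
    """BYTE_ARRAY sketch: 4-byte little-endian length, then the raw bytes."""
    return struct.pack("<I", len(data)) + data


print(len(int32_bytes), len(int64_bytes), len(float_bytes), len(double_bytes))
print(encode_byte_array(b"emp1"))
```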

Logical types in Parquet format

The Parquet format also defines logical types that can be used to store data, by specifying how the primitive types should be interpreted. This keeps the set of primitive types to a minimum and reuses Parquet’s efficient encodings. For example, strings are stored as byte arrays (binary) with a UTF8 annotation, and a DATE must annotate an int32. These annotations define how to further decode and interpret the data.

For example, defining a string field in Parquet-

message p {
  required binary s (UTF8);
}

Defining a date field in Parquet.

message p {
  required int32 d (DATE);
}
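The two annotations above can be mimicked in Python to show what "interpreting a primitive type" means. A DATE value is the number of days since the Unix epoch (1 January 1970) held in an int32, and a UTF8 string is just its UTF-8 bytes. The function names below are made up for this sketch; they are not part of any Parquet library.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_to_int32(d: date) -> int:
    """DATE -> int32: days elapsed since the Unix epoch."""
    return (d - EPOCH).days


def int32_to_date(days: int) -> date:
    """int32 -> DATE: add the stored day count back to the epoch."""
    return EPOCH + timedelta(days=days)


def string_to_binary(s: str) -> bytes:
    """String -> binary: the UTF-8 bytes of the text."""
    return s.encode("utf-8")


def binary_to_string(b: bytes) -> str:
    """binary -> String: decode the bytes as UTF-8."""
    return b.decode("utf-8")


stored = date_to_int32(date(2020, 1, 1))
print(stored, int32_to_date(stored))
print(string_to_binary("emp1"), binary_to_string(b"emp1"))
```

Without the annotation a reader would see only an integer or raw bytes; the logical type tells it which of these conversions to apply.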

You can get the full list of Parquet logical types here - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md