After a whole week studying the inner workings of Parquet, I created this blog post to document everything I learned and how the format became the basis for supporting Big Data.

TABLE OF CONTENTS

· INTRODUCTION
· CONCEPT
· BENEFIT
· INTERNALS
  · FILE
  · ROW GROUP
  · COLUMN CHUNK
  · PAGES (DATA PAGES)
  · INDEX PAGE AND DICT PAGE OFFSET: BRIEF EXPLANATION
· WRITING: HOW DATA BECOMES PARQUET FILES
· READING: HOW DATA SEARCH WORKS
· CONCLUSION
· REFERENCES

INTRODUCTION

Created in Twitter's labs and donated to the ASF (Apache Software Foundation), becoming open source, Parquet soon became one of the main ways to store data in Data Lakes.

Offering excellent performance for writing and, especially, reading, the columnar format became the foundation for the growth of Big data.

Safe and reliable, it ended up being chosen to serve as the underlying structure for open table formats such as Hudi, Iceberg, and Delta.

Knowing the importance of Parquet, I decided to delve deeper into my studies and learn about its entire internal structure.

In this first part, we will see how it is organized and break down the reading and writing process.

I hope you enjoy it. Let's get to the article!

CONCEPT

Parquet is a storage format used in Big Data.

Its premise is the columnar format, which makes it optimized for large volumes and analytics.

The columnar format, unlike the row-based ("linear") one, compresses and groups rows into "row groups".

A quick addendum: do not confuse Parquet's columnar format with the columnar index (Columnstore). Columnstore creates its rowgroups differently, grouping a maximum of roughly 1.5 million rows, which changes when we move to Parquet.

Parquet does not limit row groups by number of rows; instead, it recommends a size, typically between 128 MB and 1 GB. Values far outside this range can cause read and write latency.

BENEFIT

Why choose the columnar format over the traditional one?

Think about the following reading scenario: we want to read a dataset with 100 columns and 100 million rows.

When we are working with a row-based ("linear") format, even if we request just 3 columns and a few thousand rows, the entire set must be read, even when indexed. In the columnar format, only the desired columns need to be read to form the queried dataset.

Since each column contains its own set of rows (row group), it does not depend on others to form the data set.

Now, we have the necessary agility for processing and Analytics.

INTERNALS

Parquet has some fundamental components in its structure, which are:

  1. Page
  2. Row Groups
  3. Column Chunks
  4. Magic Number
  5. File Metadata
  6. Page Header

The structure is organized as follows:

[Image: Parquet structure (big picture)]

Parquet has some key concepts, such as:

  1. Block (HDFS Block) → a block created in HDFS when we use the Hadoop File System.
  2. File → a file created in the Parquet format at the storage layer; it contains metadata but does not necessarily contain data.
  3. Row Group → the logical unit in which data is stored in the format, consisting of a group of column chunks.
  4. Column Chunk → the chunk of a single column's data within a row group; the columns that make up the dataset are written as these chunks.
  5. Pages (Data Pages) → the pages inside each column chunk that hold the column data itself.

That is the base structure, a more macro view of a created Parquet file.

Let's start with the basic components of the file.

FILE

At the top of the hierarchy is the physical file written to disk, the file_name.parquet.

The file is divided into row groups and data pages, whose sizes are controlled by:

  • Row Group Size;
  • Data Page Size.

At the beginning, when we create the file and persist data, a signature is written to the file, which we call the Magic Number: the 4-byte sequence "PAR1".

It allows file identification, quickly enabling systems to understand that it is a parquet file.

It also guarantees consistency, because if the signature is incomplete or incorrect, it indicates corruption; and, finally, it adds a degree of safety.

This signature can be inspected with any program capable of reading the file in hexadecimal.
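As a quick illustration, here is a minimal sketch in plain Python (the file name is hypothetical) that checks the signature at both ends of a file:

# Minimal sketch: check the "PAR1" magic number at both ends of a Parquet file.
with open("example.parquet", "rb") as f:     # hypothetical file name
    header = f.read(4)                       # first 4 bytes
    f.seek(-4, 2)                            # jump to the last 4 bytes of the file
    footer = f.read(4)

print(header, footer)                        # both should be b"PAR1" for a valid file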

ROW GROUP

Because it is a columnar format, each row group works like a data unit per column, and the dataset is divided into these blocks of rows.

It is interesting to note that the row group size is customizable, and can hold batches of 10k or 100k rows, for example, depending on the configuration, environment and dataset.

[Image: Row Group. By: Vu Trinh]

So, assuming data processing with Spark and writing to the Parquet storage layer, we have an example of what it would look like:

[Image: code example to customize row groups]
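Since the original snippet is an image, below is a minimal sketch of what such a configuration could look like in PySpark. These are Hadoop/Parquet properties passed as write options (the same pattern used later in this article); a row-count based setting such as ROW_GROUP_SIZE depends on the writer library (for instance, PyArrow exposes a row_group_size parameter).

# A minimal sketch (assumed, not the article's original image): tuning row group,
# data page and dictionary page sizes when writing a DataFrame with Spark.
(df.write.format("parquet")
    .option("parquet.block.size", 10 * 1024 * 1024)          # row group size in bytes (~10 MB)
    .option("parquet.page.size", 1024 * 1024)                 # data page size in bytes (~1 MB)
    .option("parquet.dictionary.page.size", 1024 * 1024)      # dictionary page size in bytes (~1 MB)
    .save("path/to/output"))                                  # hypothetical output path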

Let's break down the options:

  • ROW_GROUP_SIZE → determines the number of rows for each row group. In the example, it will be 10k.
  • PARQUET.BLOCK.SIZE → here, we set the row group size in bytes, indicating that each group will be about 10 MB.
  • PARQUET.PAGE.SIZE → each data page inside a column chunk will be about 1 MB in size.
  • PARQUET.DICTIONARY.PAGE.SIZE → defines the maximum size of the dictionary page, set to 1 MB.

This size can be adjusted according to the needs of your dataset, taking into account factors such as:

  • Read/write performance
  • Available memory size
  • Line access frequency

Each row group has, for each column, metadata that is paired with the data.

So, let's assume that row group 0 contains the data from column A. As shown in the figure, column A has its metadata group that is part of row group 0.

In the first set of metadata, we have:

  • Type → informs the data type of the columns (INT32, FLOAT..)
  • Path → full path of the column in the schema. Useful for complex data.
  • Encoding → informs which encoding method was used.
  • Codecs → Compression algorithm used such as: SNAPPY, GZIP.. Compression algorithms improve reading and storage efficiency.
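A quick way to see these fields for yourself is PyArrow's metadata API. A sketch, assuming a hypothetical file named example.parquet:

import pyarrow.parquet as pq

# Sketch: inspect the metadata of the first column chunk of the first row group.
md = pq.ParquetFile("example.parquet").metadata
col = md.row_group(0).column(0)

print(col.physical_type)     # Type, e.g. INT32, FLOAT
print(col.path_in_schema)    # Path, the full column path in the schema
print(col.encodings)         # Encoding(s) used for this chunk
print(col.compression)       # Codec, e.g. SNAPPY, GZIP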

Below is a representation of the column and row metadata schema:

[Image: column and row group metadata schema]

The ROW GROUP has a simpler metadata structure, with just its number of rows and the minimum and maximum values.

ATTENTION: there are other aggregated statistics that can complement these, but the ones above are the main ones.

COLUMN CHUNK

The column chunk has a slightly more complex metadata structure, requiring more information, since the rows are grouped there.

[Image: the column-wise format. By: Vu Trinh]

In addition to what was shown previously, we have some more important data.

  1. Num Values → total number of values within the column chunk, including NULLs and non-NULLs.
  2. Total Compressed Size → total size with compression applied.
  3. Total Uncompressed Size → total size without compression.
  4. Data Page Offset → the starting position of the column's data pages.
  5. Index Page Offset → the starting position of the index page created for the column.
  6. Dict Page Offset → where the column's dictionary page begins; the dictionary is used to encode and decode columns with many repeated values.
[Image: column chunk metadata, in detail]

Thus, the column chunk groups data pages so that they can be written per row group, holding the data of the dataframe itself.

Here, each column will have its set of data pages which will store the data itself.
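The same ColumnChunkMetaData object from the PyArrow sketch above also exposes the fields listed in this section (the file name remains hypothetical):

import pyarrow.parquet as pq

# Sketch: sizes and offsets stored in a column chunk's metadata.
col = pq.ParquetFile("example.parquet").metadata.row_group(0).column(0)

print(col.num_values)                # total values in the chunk, including NULLs
print(col.total_compressed_size)     # size on disk, with compression applied
print(col.total_uncompressed_size)   # size without compression
print(col.data_page_offset)          # where the column's data pages start
print(col.dictionary_page_offset)    # where the dictionary page starts (None if absent)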

PAGES (DATA PAGES)

Pages are the smallest units within the file and they store written data.

[Image: data page metadata, in detail]

The data page header works as a guide so that the application can read the data of the desired columns correctly.

At the end of each page (footer) there is also an index that indicates whether that data page can be skipped or not, based on the statistics it contains.

This structure serves as read optimization, preventing all data pages from being loaded into memory, consuming unnecessary resources.

There are three types of pages in Parquet:

  • Data Page;
  • Index Page; and
  • Dict Page.

INDEX PAGE AND DICT PAGE OFFSET: BRIEF EXPLANATION

Something that raised doubts in my mind about these structures was their usefulness within Parquet.

At the beginning of the article, I noted the difference between the columnar data storage format and the columnar index.

During the creation of the Parquet file, the writer can create an Index Page, which is a structure that points to the location of the columns' pages in the file.

So, if a user only wants columns 1 and 3, the index page will point to the pages that store those columns' data. As there are several sequences of pages for the same column, we have an offset indicating where the corresponding pages begin and end.

In the image below, let's assume that in a query on Parquet files, the user only wants columns 1 and 3.

The Index Page would then guide the engine's search to the page sequences where those columns live, i.e. where they start and end.

This way, the result is returned faster and with only the necessary data, without having to pull everything into memory and then filter.

[Image: how Index Pages work]

The Index page starts working, pointing out the location of the data.

ATTENTION: the creation of the columnar index in the file is not standard; it depends on the library used, on whether it was specified, and so on.
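Whether or not an index page is present, from the user's side the request for only the needed columns looks something like the sketch below (file path and column names are hypothetical); the reader then uses the metadata, and any index, to skip everything else:

import pyarrow.parquet as pq

# Sketch: reading only two of the columns instead of the whole file.
subset = pq.read_table("example.parquet", columns=["column_1", "column_3"])

# The same idea with Spark: only the selected columns are scanned.
# subset_df = spark.read.parquet("path/to/output").select("column_1", "column_3")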

The Dict Page, on the other hand, works differently: unique values are mapped and stored in a dictionary.

Each value receives a number, of type INT, which corresponds to the stored data.

Now, we have a key-value relationship created for the column data.

With the relationship created, what gets written in Parquet is no longer the value behind each key, but the key itself, that is, the integers.

Imagine the column with country records.

[Image: how Dictionary Pages work]
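Conceptually, and only as a simplification (this is not Parquet's actual on-disk layout), the idea looks like this:

# Conceptual illustration of dictionary encoding (not the real on-disk format).
dictionary = {0: "Brazil", 1: "Argentina", 2: "Chile"}

# What gets written is the list of small integer keys, not the repeated strings.
encoded_column = [0, 0, 1, 2, 0, 0, 1]

# On read, the dictionary is decoded to rebuild the original values.
decoded_column = [dictionary[key] for key in encoded_column]
print(decoded_column)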

When the file is accessed for reading, the dictionary is decoded and the data is reconstructed.

This encoding model is very advantageous for columns whose distinct values repeat many times.

An example of code using the resource would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
   .appName("Parquet Dictionary Example") \
   .getOrCreate()

# dummy data
data = [("C001", "P101", 100.0), ("C002", "P102", 200.0), ("C001", "P103", 150.0), ("C003", "P101", 300.0)]
columns = ["CustomerID", "ProductID", "Amount"]

df = spark.createDataFrame(data, columns)

# creating dict encoding
df.write \
   .option("parquet.enable.dictionary", "true") \
   .parquet("output_with_dictionary")

But there is a catch: if the dictionary grows too much, in size or in number of distinct values, the Dict encoding is abandoned and the writer falls back to plain encoding.

This happens to keep reads optimized: in a column with high cardinality, the benefit of Dict encoding is lost. For example, a CustomerID column in a Customer table.
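One way to check which encodings a column chunk actually ended up with is the PyArrow metadata API again (a sketch; the file name is hypothetical, and dictionary-encoded chunks typically report PLAIN_DICTIONARY or RLE_DICTIONARY):

import pyarrow.parquet as pq

# Sketch: list the encodings used by each column chunk of the first row group.
rg = pq.ParquetFile("example.parquet").metadata.row_group(0)

for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.encodings)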

WRITING: HOW DATA BECOMES PARQUET FILES

The writing process in Parquet begins when the user finishes processing data.

In general, the process works as follows:

  1. The user application makes a request:

1a — With the data that will be written;

1b — Compression type for each column schema;

1c — The encoding for each column in the schema;

1d — Whether it will be one or more files and;

1e — Custom metadata.
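Most of these choices can be expressed directly in the write call. Below is a rough sketch with PyArrow, under assumed column names and values; it is only meant to show where items 1a to 1e would appear, not the author's original code:

import pyarrow as pa
import pyarrow.parquet as pq

# 1a - the data that will be written
table = pa.table({"country": ["Brazil", "Brazil", "Chile"], "amount": [10.0, 20.5, 7.3]})

# 1e - custom metadata travels with the schema
table = table.replace_schema_metadata({"owner": "data-team"})

pq.write_table(
    table,
    "example.parquet",                                    # 1d - a single output file
    compression={"country": "snappy", "amount": "gzip"},  # 1b - compression per column
    use_dictionary=["country"],                           # 1c - dictionary encoding per column
)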

2. With the metadata, encoding, compression and other information above, the writer records all of it in the FileMetaData structure.

[Image: user sends a write request to ParquetWriter()]

3. The file is then signed with the Magic Number at the beginning.

[Image: logical schema for writing data to a Parquet file]

At this point, the writing of row groups begins:

4. It starts by calculating the number of row groups and which file each of them belongs to. Afterwards, the physical writing of the pages begins.

The amount of data in each row group is configurable. With this information, the writing of column chunks into row groups begins.

See the code below:

# create dataframe
df = spark.createDataFrame([
    (1, "Alice", "alice@example.com", "2022-01-01"),
    (2, "Bob", "bob@example.com", "2022-02-01"),
    (3, "Charlie", "charlie@example.com", "2022-03-01")
], ["id", "name", "email", "registration_date"])

# target row group size in megabytes (parquet.block.size expects bytes)
row_group_size_mb = 128

# Write the DataFrame in Parquet format
(df.write.format("parquet")
    .option("parquet.block.size", row_group_size_mb * 1024 * 1024)   # row group size in bytes
    .option("parquet.page.size", 1024 * 1024)                        # data page size in bytes
    .option("parquet.dictionary.page.size", 1024 * 1024)             # dictionary page size in bytes
    .save("path/to/file.parquet"))

5. For each row group, the writer takes the column chunks that belong to it and:

5a — Checks and applies the encoding; and

5b — The compression.

6. Column writing begins; for each row group of that column, it will compute:

6a — The number of rows per page (configurable);

6b — Minimum and maximum values (when possible, e.g. for numeric types);

6c — The maximum page size;

6d — The maximum chunk size.

7. Then the column chunk with the data is written sequentially to the pages, one after the other, and also:

7a — The page header is written on each page;

7b — The encoding;

7c — The definition and repetition levels;

7d — If the page uses dict encoding, the dictionary is created with its associated header.

8. With all pages written, the PageWriter writes the column chunk metadata with the following information:

8a — Minimum and maximum values;

8b — Total compressed and uncompressed sizes;

8c — The first page offset;

8d — The dictionary page offset.

Then the next column chunk is written.

9. All row groups are written, column chunk by column chunk, until everything is persisted to disk sequentially, with all column metadata written to the row group metadata.

10. Then all the metadata of the generated row groups is written to the FileMetaData.

11. FileMetadata is written in the footer of the Parquet file.

12. The process ends with the Magic number also written at the end of the file.
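Once the footer is in place, it can be read back. A small sketch with PyArrow (hypothetical file name) showing the FileMetaData written in steps 10 to 12:

import pyarrow.parquet as pq

# Sketch: inspect the FileMetaData stored in the Parquet footer.
md = pq.ParquetFile("example.parquet").metadata

print(md.num_rows)          # total rows across all row groups
print(md.num_row_groups)    # how many row groups were written
print(md.created_by)        # which writer produced the file
print(md.format_version)    # Parquet format version recorded in the footer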

In the end, what we have in the Data Lake, or any other storage location, are files partitioned with their data in column chunks, optimized for reading and data analysis.

[Image: completed process of how data is written to Parquet]

Having understood the internal workings and the writing process, let's move on to the reading process.

READING: HOW DATA SEARCH WORKS

To read any file in the format, an API is necessary, which we can call the Parquet reader.

The reader then searches for the files and begins the process:

Here, it validates the file, checking if it is parquet, and queries the metadata.

  1. It starts by checking whether the file is valid by checking the magic number assigned in the creation and writing process.
  2. Query the metadata of the row groups to then filter the necessary files.
  3. Filters the necessary row groups through metadata, applying them to return only the desired data.
  4. The application reads the column metadata, held by each row group, and begins to validate whether that group of columns has the data needed to serve the application.

4.a — Each validated row group is appended to a list of valid row groups for later reading, with the desired columns.

Up to this point, we realize that it is just validating and separating (pre-selecting) the necessary data.

5. With the list of required row groups, the reader iterates through it, reading the column metadata and separating out the chunks that meet the query.

With the column chunks separated, the Parquet reader validates whether there is any type of index or dict encoding to speed up reading.

6. If it is the first read, it checks the position of the first data page (or at least its page header) or the dict page (if dictionary encoding is used). Then, it reads every page until the end.

There is a comparison between the total number of rows written and the pages already read, so the reader knows where it is and how far it is from the end, if it has not already arrived there.

7. The page header contains all the definitions and the data values that were encoded. Once the pages have been selected, the Parquet reader checks:

7a — The encoding;

7b — The repetition level;

7c — The definition level.

8. The process is repeated for each valid row group, which receives a mark indicating that it has been read.

9. When all row groups have been read, the reading process is finished and the data is presented.
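This per-row-group iteration is also visible from PyArrow's API. A sketch (hypothetical file and column names) that reads only the needed columns, one row group at a time:

import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")

# Sketch: iterate over the row groups, reading only the desired columns,
# mirroring the reader's walk through the list of valid row groups.
for i in range(pf.num_row_groups):
    batch = pf.read_row_group(i, columns=["column_1", "column_3"])
    print(i, batch.num_rows)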

So that is how the Parquet writer and reader work to persist data in Parquet and return it to any application that queries it.

The entire process is done quickly and functionally, based on the metadata of the structures that make it up.

These two processes reinforce the idea that Parquet is a great reader of its own metadata, delivering large volumes of data with speed and validity.

CONCLUSION

This was an article where I tried to explain in the best way how the Parquet format works and its composition.

What motivated me to undertake this in-depth study is that the format serves as a basis for others, such as the open table formats Delta and Iceberg.

As they are the support base for Big Data today, I decided to do a deep dive in search of the understanding needed for more technical content on this topic, and even for topics involving Iceberg and performance tests.

In the future I intend to bring articles about Iceberg, Parquet encoding and parallelism. Understanding Parquet will give me more clarity about the integration and relationship between Parquet and these formats.

I hope you like it and it helps you, thank you!

God bless you!

REFERENCES

I spent 8 hours learning Parquet.

Building Parquet | Parquet

Wes McKinney — Extreme IO performance with parallel Apache Parquet in Python