Index in Parquet

Update Dec/2018:

Parquet Format version 2.5 added column indexes.

https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-250

See https://issues.apache.org/jira/browse/PARQUET-1201 for list of sub-tasks for that new feature.

Note that this feature has only just been merged into the Parquet format itself; it will take some time for the various backends (Spark, Hive, Impala, etc.) to start supporting it.

This new feature is called Column Indexes. Basically, Parquet has added two new structures to the file layout – the Column Index and the Offset Index.

Below is a more detailed technical explanation of what it solves and how.

Problem Statement

In the current format, Statistics are stored for ColumnChunks in ColumnMetaData and for individual pages inside DataPageHeader structs. When reading pages, a reader has to process the page header in order to determine whether the page can be skipped based on the statistics. This means the reader has to access all pages in a column, thus likely reading most of the column data from disk.
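To illustrate the cost, here is a rough sketch (hypothetical reader structures, not actual parquet-mr code) of what a reader has to do in the current format: even a page that the statistics let us skip still costs a header read, because the statistics live inside that header.

```python
# Hypothetical sketch of pre-column-index page skipping.
# Even when a page's statistics let us skip it, we still have to
# seek to the page and parse its header to see those statistics.

class PageHeader:
    def __init__(self, min_value, max_value, compressed_size):
        self.min_value = min_value
        self.max_value = max_value
        self.compressed_size = compressed_size

def scan_column_chunk(pages, predicate_min, predicate_max):
    """Return values from pages whose [min, max] overlaps the predicate."""
    matches = []
    headers_read = 0
    for header, values in pages:
        headers_read += 1  # I/O: every page header must be read
        if header.max_value < predicate_min or header.min_value > predicate_max:
            continue       # page skipped, but only AFTER reading its header
        matches.extend(v for v in values if predicate_min <= v <= predicate_max)
    return matches, headers_read

pages = [
    (PageHeader(0, 9, 100), list(range(0, 10))),
    (PageHeader(10, 19, 100), list(range(10, 20))),
    (PageHeader(20, 29, 100), list(range(20, 30))),
]
result, headers = scan_column_chunk(pages, 12, 14)
# Only one page holds matching values, yet all three headers were read.
```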

Goals

Make both range scans and point lookups I/O efficient by allowing direct access to pages based on their min and max values. In particular:

  1. A single-row lookup in a row group based on the sort column of that
     row group will read only one data page per retrieved column. Range
     scans on the sort column will need to read only the exact data pages
     that contain relevant data.
  2. Make other selective scans I/O efficient: if we have a very
     selective predicate on a non-sorting column, then for the other
     retrieved columns we should only need to access the data pages that
     contain matching rows.
  3. No additional decoding effort for scans without selective
     predicates, e.g. full row-group scans. If a reader determines that
     it does not need to read the index data, it incurs no overhead.
  4. Index pages for sorted columns use minimal storage by storing only
     the boundary elements between pages.

Non-Goals

Support for the equivalent of secondary indices, ie, an index structure sorted on the key values over non-sorted data.

Technical Approach

We add two new per-column structures to the row group metadata:

  - ColumnIndex: allows navigation to the pages of a column based on column values; it is used to locate data pages that contain matching values for a scan predicate.
  - OffsetIndex: allows navigation by row index; it is used to retrieve the values of rows identified as matches via the ColumnIndex. Once rows of a column are skipped, the corresponding rows in the other columns have to be skipped too, so the OffsetIndexes of all columns in a RowGroup are stored together.
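How the two structures work together can be sketched as follows (simplified, hypothetical field names – the real Thrift structures carry more detail): the ColumnIndex narrows the scan down to candidate pages by value, and the OffsetIndex translates those page numbers into the file offsets to fetch.

```python
# Simplified sketch of ColumnIndex + OffsetIndex cooperation.

class ColumnIndex:
    def __init__(self, min_values, max_values):
        self.min_values = min_values  # per-page minimum value
        self.max_values = max_values  # per-page maximum value

class OffsetIndex:
    def __init__(self, page_locations):
        # (file_offset, compressed_size, first_row_index) per page
        self.page_locations = page_locations

def candidate_pages(col_index, lo, hi):
    """Pages whose [min, max] overlaps the predicate range [lo, hi]."""
    return [i for i, (mn, mx) in
            enumerate(zip(col_index.min_values, col_index.max_values))
            if not (mx < lo or mn > hi)]

def page_reads(offset_index, pages):
    """(offset, size) pairs to fetch -- the only data-page I/O the scan needs."""
    return [(offset_index.page_locations[i][0],
             offset_index.page_locations[i][1]) for i in pages]

ci = ColumnIndex(min_values=[0, 10, 20], max_values=[9, 19, 29])
oi = OffsetIndex(page_locations=[(4, 100, 0), (104, 100, 10), (204, 100, 20)])

pages = candidate_pages(ci, 12, 14)  # only the middle page can match
reads = page_reads(oi, pages)        # seek directly to it, skip the rest
```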

The new index structures are stored separately from the RowGroup, near the footer, so that a reader does not have to pay the I/O and deserialization cost of reading them if it is not doing selective scans. The index structures’ location and length are stored in ColumnChunk and RowGroup.

Cloudera’s Impala team has run some tests of this new feature (not yet available as part of the Apache Impala core product). Here are their performance improvements:

[Chart: HDFS I/O in bytes]

[Chart: Scanner CPU time in ms]

As you can see, some of the queries showed a huge improvement in both CPU time and the amount of data read from disk.

Original answer back from 2016:

struct IndexPageHeader {
  /** TODO: **/
}

https://github.com/apache/parquet-format/blob/6e5b78d6d23b9730e19b78dceb9aac6166d528b8/src/main/thrift/parquet.thrift#L505

The Index Page Header is not implemented yet.

See the source code of the Parquet format linked above. I don’t see it even in Parquet 2.0 currently.

But yes – the excellent answer from Ryan Blue above is right that Parquet has pseudo-indexing capabilities (bloom filters).
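The "pseudo-index" idea behind a bloom filter is that it can answer "definitely not present" (skip the row group) or "maybe present" (must read it), with no false negatives. A minimal toy implementation, purely for illustration and unrelated to Parquet’s actual bloom filter encoding:

```python
import hashlib

# Toy Bloom filter: never a false negative, only a small
# false-positive chance that grows as the filter fills up.

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive k bit positions from salted hashes of the value.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = BloomFilter()
for v in ("alice", "bob"):
    bf.add(v)
# bf.might_contain("alice") -> True (guaranteed: no false negatives)
# bf.might_contain("zoe")   -> almost certainly False
```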

If you’re interested in more details, I recommend this great, more technical and implementation-specific document on how Parquet bloom filters and predicate push-down work:
https://homepages.cwi.nl/~boncz/msc/2018-BoudewijnBraams.pdf
