Parquet File Performance In Power BI/Power Query

There has been a lot of excitement around the newly-added support for reading from Parquet files in Power BI. However I have to admit that I was disappointed not to see any big improvements in performance when reading data from Parquet compared to reading data from CSV (for example, see here) when I first started testing it. So, is Power Query able to take advantage of Parquet’s columnar storage when reading data?

The answer is yes, but you may need to make some changes to your Power Query queries to ensure you get the best possible performance. Using the same data that I have been using in my recent series of posts on importing data from ADLSgen2, I took a single 10.1MB Parquet file and downloaded it to my PC. Here’s what the data looked like:

I then created a query to count the number of rows in the table stored in this Parquet file where the TransDate column was 1/1/2015:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Filtered Rows" = Table.SelectRows(
    Source,
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

Here’s the output:

I then used SQL Server Profiler to find out how long this query took to execute (as detailed here): on average it took 3 seconds.

Here’s what I saw in Power BI Desktop while loading the data just before refresh finished:

As you can see, Power Query is scanning all the data in the file.

I then added an extra step to the query to remove all columns except the TransDate column:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Removed Other Columns"
    = Table.SelectColumns(
    Source,
    {"TransDate"}
  ),
  #"Filtered Rows" = Table.SelectRows(
    #"Removed Other Columns",
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

This version of the query only took an average of 0.7 seconds to run – a substantial improvement. This time the maximum amount of data read by Power Query was only 2.44MB:

As you can see, in this case removing unnecessary columns improved the performance of reading data from Parquet files a lot. This is not always true though – I tested a Group By transformation and in that case the Power Query engine was clever enough to only read the required columns, and manually removing columns made no difference to performance.

This demonstrates that Power Query is able to take advantage of Parquet’s columnar storage to only read data from certain columns. However, this is the only performance optimisation available to Power Query on Parquet – it doesn’t do predicate pushdown or anything like that. What’s more, when reading data from the ADLSgen2 connector, the nature of Parquet storage stops Power Query from making parallel requests for data (I guess the same behaviour that is controlled by the ConcurrentRequests option) which puts it at a disadvantage compared to reading data from CSV files.

I think a lot more testing is needed to understand how to get the best performance when reading data from Parquet, so look out for more posts on this subject in the future…

[Thanks once again to Eric Gorelik from the Power Query development team for providing the information about how the Parquet connector works, and to Ben Watt and Gerhard Brueckl for asking the questions in the first place]

Bonus fact: in case you’re wondering, the following compression types are supported by the Parquet connector: GZip, Snappy, Brotli, LZ4, and ZStd.

5 responses

  1. Pingback: Parquet File Performance in Power Query – Curated SQL

  2. Hi, I was wondering if you had previously compared the performances of opening a Parquet file through ADLSgen2 or using the native Power BI Parquet connector.

    Is there any advantages of using ADLSgen2 here? Is less data downloaded and processed on the user’s workstation when using ADLSgen2?

  3. Hi Chris,

    When I connect to an Azure Data Lake gen2 to read a Parquet file I don’t see any improvement if I remove columns. Looks like all the content of the file is read before removing the columns.

    But if I read the same file from my computer, I see the improvement when I remove columns. Power Query only read data from 1 column.

    The compression type of the Parquet file is Snappy.
    I’m using the latest version of Power BI Desktop.

    Is it possible to filter the columns at source when using Parquet.Document with Azure Data Lake?

    These are the queries:

    let
    Source = Parquet.Document(AzureStorage.DataLakeContents(“https://xxx.dfs.core.windows.net/taxi/taxi_2019_04.parquet”)),
    #”Removed Other Columns” = Table.SelectColumns(Source,{“vendor_id”})
    in
    #”Removed Other Columns”

    let
    Source = Parquet.Document(File.Contents(“C:\taxi\taxi_2019_04.parquet”)),
    #”Removed Other Columns” = Table.SelectColumns(Source,{“vendor_id”})
    in
    #”Removed Other Columns”

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: