Chris Webb's BI Blog: Parquet File Performance In Power BI/Power Query

There has been a lot of excitement around the newly-added support for reading from Parquet files in Power BI. However I have to admit that I was disappointed not to see any big improvements in performance when reading data from Parquet compared to reading data from CSV (for example, see here) when I first started testing it. So, is Power Query able to take advantage of Parquet’s columnar storage when reading data?

The answer is yes, but you may need to make some changes to your Power Query queries to ensure you get the best possible performance. Using the same data that I have been using in my recent series of posts on importing data from ADLSgen2, I took a single 10.1MB Parquet file and downloaded it to my PC. Here’s what the data looked like:

I then created a query to count the number of rows in the table stored in this Parquet file where the TransDate column was 1/1/2015:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Filtered Rows" = Table.SelectRows(
    Source,
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

Here’s the output:

I then used SQL Server Profiler to find out how long this query took to execute (as detailed here): on average it took 3 seconds.

Here’s what I saw in Power BI Desktop while loading the data just before refresh finished:

As you can see, Power Query is scanning all the data in the file.

I then added an extra step to the query to remove all columns except the TransDate column:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Removed Other Columns"
    = Table.SelectColumns(
    Source,
    {"TransDate"}
  ),
  #"Filtered Rows" = Table.SelectRows(
    #"Removed Other Columns",
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

This version of the query only took an average of 0.7 seconds to run – a substantial improvement. This time the maximum amount of data read by Power Query was only 2.44MB:

As you can see, in this case removing unnecessary columns improved the performance of reading data from Parquet files a lot. This is not always true though – I tested a Group By transformation and in that case the Power Query engine was clever enough to only read the required columns, and manually removing columns made no difference to performance.

This demonstrates that Power Query is able to take advantage of Parquet’s columnar storage to only read data from certain columns. However, this is the only performance optimisation available to Power Query on Parquet – it doesn’t do predicate pushdown or anything like that. What’s more, when reading data from the ADLSgen2 connector, the nature of Parquet storage stops Power Query from making parallel requests for data (I guess the same behaviour that is controlled by the ConcurrentRequests option) which puts it at a disadvantage compared to reading data from CSV files.

I think a lot more testing is needed to understand how to get the best performance when reading data from Parquet, so look out for more posts on this subject in the future…

[Thanks once again to Eric Gorelik from the Power Query development team for providing the information about how the Parquet connector works, and to Ben Watt and Gerhard Brueckl for asking the questions in the first place]

Bonus fact: in case you’re wondering, the following compression types are supported by the Parquet connector: GZip, Snappy, Brotli, LZ4, and ZStd.

10 thoughts on “Parquet File Performance In Power BI/Power Query”

Pingback: Parquet File Performance in Power Query – Curated SQL

Hi, I was wondering if you had previously compared the performances of opening a Parquet file through ADLSgen2 or using the native Power BI Parquet connector.

Is there any advantages of using ADLSgen2 here? Is less data downloaded and processed on the user’s workstation when using ADLSgen2?

Hi Chris,

When I connect to an Azure Data Lake gen2 to read a Parquet file I don’t see any improvement if I remove columns. Looks like all the content of the file is read before removing the columns.

But if I read the same file from my computer, I see the improvement when I remove columns. Power Query only read data from 1 column.

The compression type of the Parquet file is Snappy.
I’m using the latest version of Power BI Desktop.

Is it possible to filter the columns at source when using Parquet.Document with Azure Data Lake?

These are the queries:

let
Source = Parquet.Document(AzureStorage.DataLakeContents(“https://xxx.dfs.core.windows.net/taxi/taxi_2019_04.parquet”)),
#”Removed Other Columns” = Table.SelectColumns(Source,{“vendor_id”})
in
#”Removed Other Columns”

let
Source = Parquet.Document(File.Contents(“C:\taxi\taxi_2019_04.parquet”)),
#”Removed Other Columns” = Table.SelectColumns(Source,{“vendor_id”})
in
#”Removed Other Columns”

Hi Chris, Does PowerBI supports dynamic query with parquet files?

Chris Webb says:

June 10, 2022 at 4:20 pm

Do you mean DirectQuery? No, it doesn’t support DirectQuery on Parquet.

Loading...

Reply

Pingback: Power BI y Synapse: Archivos Parquet - dataXbi

Hi Chris, can we select the columns to load in m query like we do in REST API or Odata feed?
I have a parquet file with 40+ columns and need to load only 10 of them. Looking for the ways to filter the columns before loading them in Power BI.

Chris Webb says:

February 8, 2024 at 3:46 pm

Yes, just select the columns in the Power Query Editor as shown in the post.

Loading...

Reply

Hi Chris, our PBI report is using Parquet file as datasource. The files are stored in Azure blob container. The report works fine on desktop. But when it is published to PBI service, the auto dataset refresh sometimes will not work. To fix it we have to refresh it on desktop, then re-publish again. Then the dataset can refresh will work for a few days and then go wrong again. Do you know what would be the problem?
When dataset refresh not work, the error are usually:
The ‘null’ column does not exist in the rowset
[ValidateMarkupTags][ccon]Expression.Error: We cannot convert the value [ccon]null[/ccon] to type Table.

Chris Webb says:

February 21, 2024 at 3:42 pm

Is it an authentication issue maybe?

Loading...

Reply

Pingback: Parquet File Performance in Power Query – Curated SQL
Christian says:

March 18, 2022 at 7:10 pm

Hi, I was wondering if you had previously compared the performances of opening a Parquet file through ADLSgen2 or using the native Power BI Parquet connector.

Is there any advantages of using ADLSgen2 here? Is less data downloaded and processed on the user’s workstation when using ADLSgen2?

Loading...

Nelson says:

May 20, 2022 at 11:34 am

Hi Chris,

When I connect to an Azure Data Lake gen2 to read a Parquet file I don’t see any improvement if I remove columns. Looks like all the content of the file is read before removing the columns.

But if I read the same file from my computer, I see the improvement when I remove columns. Power Query only read data from 1 column.

The compression type of the Parquet file is Snappy.
I’m using the latest version of Power BI Desktop.

Is it possible to filter the columns at source when using Parquet.Document with Azure Data Lake?

These are the queries:

let
Source = Parquet.Document(AzureStorage.DataLakeContents(“https://xxx.dfs.core.windows.net/taxi/taxi_2019_04.parquet”)),
#”Removed Other Columns” = Table.SelectColumns(Source,{“vendor_id”})
in
#”Removed Other Columns”

let
Source = Parquet.Document(File.Contents(“C:\taxi\taxi_2019_04.parquet”)),
#”Removed Other Columns” = Table.SelectColumns(Source,{“vendor_id”})
in
#”Removed Other Columns”

Loading...

Debashis M says:

June 10, 2022 at 3:18 pm

Hi Chris, Does PowerBI supports dynamic query with parquet files?

Loading...

1. Chris Webb says:
  
  June 10, 2022 at 4:20 pm
  
  Do you mean DirectQuery? No, it doesn’t support DirectQuery on Parquet.
  
  Loading...
  
Pingback: Power BI y Synapse: Archivos Parquet - dataXbi
Pooja S says:

February 8, 2024 at 12:32 pm

Hi Chris, can we select the columns to load in m query like we do in REST API or Odata feed?
I have a parquet file with 40+ columns and need to load only 10 of them. Looking for the ways to filter the columns before loading them in Power BI.

Loading...

1. Chris Webb says:
  
  February 8, 2024 at 3:46 pm
  
  Yes, just select the columns in the Power Query Editor as shown in the post.
  
  Loading...
  
Create Site Collection says:

February 21, 2024 at 1:49 pm

Hi Chris, our PBI report is using Parquet file as datasource. The files are stored in Azure blob container. The report works fine on desktop. But when it is published to PBI service, the auto dataset refresh sometimes will not work. To fix it we have to refresh it on desktop, then re-publish again. Then the dataset can refresh will work for a few days and then go wrong again. Do you know what would be the problem?
When dataset refresh not work, the error are usually:
The ‘null’ column does not exist in the rowset
[ValidateMarkupTags][ccon]Expression.Error: We cannot convert the value [ccon]null[/ccon] to type Table.

Loading...

1. Chris Webb says:
  
  February 21, 2024 at 3:42 pm
  
  Is it an authentication issue maybe?
  
  Loading...

Parquet File Performance In Power BI/Power Query

Like this:

Published by Chris Webb

10 thoughts on “Parquet File Performance In Power BI/Power Query”

Leave a ReplyCancel reply

Share this:

Like this:

Published by Chris Webb

10 thoughts on “Parquet File Performance In Power BI/Power Query”

Leave a ReplyCancel reply

Discover more from Chris Webb's BI Blog