Partitioned Tables, Power BI And Parquet Files In ADLSgen2

Earlier in this series on importing data from ADLSgen2 into Power BI I showed how partitioning a table in your dataset can improve refresh performance. In that post I used CSV files in ADLSgen2 as my source and created one partition per CSV file, but after my recent discovery that importing data from multiple Parquet files can be tuned to be a lot faster than importing data from CSV files, I decided to try creating partitions linked to Parquet files instead.

As a reminder, here’s what Phil Seamark’s Power BI refresh report showed for the version of my dataset with one partition per CSV file:

As you can see, refresh in the Power BI Service took 40 seconds and throughput was just under 200,000 rows per second.

Here’s the same report for a partitioned dataset based on five Parquet files (with one partition per Parquet file) containing exactly the same data:

Using Parquet files as a source refresh only took 33 seconds and throughput was almost 250,000 rows per second.

But what does this actually tell me? It shows that it is possible to improve on the previous 40-second refresh time quite substantially. It certainly looks like reading data from individual Parquet files is faster than reading data from individual CSV files too; however I’m reluctant to draw too many other conclusions. This is because:

  • I only had five Parquet files in this case, compared to nine CSV files. I suspect that some extra parallelism might be possible if had six Parquet files instead of just five, because there are six refresh slots available in the PPU workspace I’m using.
  • This then leads onto the question of whether the number and size of the files affects refresh performance in the same way for CSV files as for Parquet files (any difference could be something to do with the format itself or the Power BI connectors used to read the data from the files).
  • This also leads onto the question of partitioning strategies – after all, you probably also want to only refresh partitions where data has changed, which will also have a bit impact on refresh times – and the distribution of data over partitions. That’s out of scope for this series, but if you have a copy of Marco Russo and Alberto Ferrari’s book “Tabular Modeling in Microsoft SQL Server Analysis Services” then check out chapter 11 and the section in chapter 12 on “Segmentation and partitioning”: what’s good for Analysis Services Tabular is mostly good for Power BI, although bear in mind that Power BI uses segment sizes of 1 million rows.

Optimising The Performance Of Combining Data From Multiple Parquet Files In Power Query/Power BI

In all the testing I’ve done recently with importing data from Parquet files into Power BI I noticed something strange: loading data from a folder containing multiple Parquet files seemed a lot slower than I would expect, based on the time taken to load data from a single file. So I wondered – is there something that can be optimised? It turns out there is and in this blog post I’ll show you what I did.

If you import data from a folder containing Parquet files – whether it’s a local folder or a folder in ADLSgen2 storage – you’ll see a series of queries created for you in the Power Query Editor window that looks like this:

The query called Query1 shown in the screenshot iterates over all the files in the folder you’ve chosen and calls a function that reads the data from each Parquet file. It returns a table that contains a column with the name of the original source file in (which isn’t all that interesting for Parquet files) and all the columns from the Parquet files you’re containing.

Using the Parquet files from my series of posts on importing data from ADLSgen2 as a source, here’s the M code Power Quey generates for this query which I have modified to remove the column with the source file name in:

let
  Source = Folder.Files("C:\MyFolder"),
  #"Filtered Hidden Files1"
    = Table.SelectRows(
    Source,
    each [Attributes]?[Hidden]? <> true
  ),
  #"Invoke Custom Function1"
    = Table.AddColumn(
    #"Filtered Hidden Files1",
    "Transform File (3)",
    each #"Transform File (3)"(
      [Content]
    )
  ),
  #"Renamed Columns1"
    = Table.RenameColumns(
    #"Invoke Custom Function1",
    {"Name", "Source.Name"}
  ),
  #"Removed Other Columns1"
    = Table.SelectColumns(
    #"Renamed Columns1",
    {"Transform File (3)"}
  ),
  #"Expanded Table Column1"
    = Table.ExpandTableColumn(
    #"Removed Other Columns1",
    "Transform File (3)",
    Table.ColumnNames(
      #"Transform File (3)"(
        #"Sample File (3)"
      )
    )
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Expanded Table Column1",
    {
      {"TransDate", type date},
      {"GuestId", type text},
      {"ProductId", type text},
      {"NetAmount", type number}
    }
  )
in
  #"Changed Type"

Here’s the output:

On my PC this query took an average of 102 seconds to refresh.

Apart from this query being slower than I expected, I also noticed that there is a “Changed Type” step at the end – which I thought was unnecessary because unlike CSV files, Parquet has typed columns. If you connect to a single Parquet file in Power Query then it recognises the column types, so why not here? Well, it’s because of the way it’s combining files by expanding table columns, and there is a way to work around this that I blogged about here:
https://blog.crossjoin.co.uk/2017/09/25/setting-data-types-on-nested-tables-in-m/

Setting a type on the table column before expanding it did indeed improve performance, but this led me to another optimisation.

I know that using the Table.Combine M function can perform differently to the Table.ExpandTableColumn function used in the original version of the query (although it does not always perform better). Therefore I made the following change to the query above: using Table.Combine to return a single table with all the data in (note that setting a type on the table column is not necessary for this optimisation). Here’s the new version:

let
  Source = Folder.Files("C:\Myfolder"),
  #"Filtered Hidden Files1"
    = Table.SelectRows(
    Source,
    each [Attributes]?[Hidden]? <> true
  ),
  #"Invoke Custom Function1"
    = Table.AddColumn(
    #"Filtered Hidden Files1",
    "Transform File",
    each #"Transform File"([Content])
  ),
  #"Renamed Columns1"
    = Table.RenameColumns(
    #"Invoke Custom Function1",
    {"Name", "Source.Name"}
  ),
  #"Removed Other Columns1"
    = Table.SelectColumns(
    #"Renamed Columns1",
    {"Source.Name", "Transform File"}
  ),
  Combine = Table.Combine(
    #"Removed Other Columns1"[
      Transform File
    ]
  )
in
  Combine

This version of the query took, on average 43 seconds to refresh – a massive improvement.

If you’ve been following my series on ADLSgen2 refresh you may remember that I blogged about importing from a folder of Parquet files there too: in this post I noted that it took on average 72 seconds to load the same data from an ADLSgen2 folder in the Power BI Service using the original code; that was with the Source File column in and removing that column made no different to performance. This new version of the query took on average 49 seconds.

The conclusion is obvious: if you need to load data from a folder of Parquet files then you should use this new approach because the performance benefits are substantial. I know what you’re thinking: does this technique work for other file types apart from Parquet like CSV? The answer is no, because these file types don’t have typed columns like Parquet so it won’t work unfortunately.

Parquet File Performance In Power BI/Power Query

There has been a lot of excitement around the newly-added support for reading from Parquet files in Power BI. However I have to admit that I was disappointed not to see any big improvements in performance when reading data from Parquet compared to reading data from CSV (for example, see here) when I first started testing it. So, is Power Query able to take advantage of Parquet’s columnar storage when reading data?

The answer is yes, but you may need to make some changes to your Power Query queries to ensure you get the best possible performance. Using the same data that I have been using in my recent series of posts on importing data from ADLSgen2, I took a single 10.1MB Parquet file and downloaded it to my PC. Here’s what the data looked like:

I then created a query to count the number of rows in the table stored in this Parquet file where the TransDate column was 1/1/2015:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Filtered Rows" = Table.SelectRows(
    Source,
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

Here’s the output:

I then used SQL Server Profiler to find out how long this query took to execute (as detailed here): on average it took 3 seconds.

Here’s what I saw in Power BI Desktop while loading the data just before refresh finished:

As you can see, Power Query is scanning all the data in the file.

I then added an extra step to the query to remove all columns except the TransDate column:

let
  Source = Parquet.Document(
    File.Contents(
      "C:\myfile.snappy.parquet"
    )
  ),
  #"Removed Other Columns"
    = Table.SelectColumns(
    Source,
    {"TransDate"}
  ),
  #"Filtered Rows" = Table.SelectRows(
    #"Removed Other Columns",
    each [TransDate] = #date(2015, 1, 1)
  ),
  #"Counted Rows" = Table.RowCount(
    #"Filtered Rows"
  )
in
  #"Counted Rows"

This version of the query only took an average of 0.7 seconds to run – a substantial improvement. This time the maximum amount of data read by Power Query was only 2.44MB:

As you can see, in this case removing unnecessary columns improved the performance of reading data from Parquet files a lot. This is not always true though – I tested a Group By transformation and in that case the Power Query engine was clever enough to only read the required columns, and manually removing columns made no difference to performance.

This demonstrates that Power Query is able to take advantage of Parquet’s columnar storage to only read data from certain columns. However, this is the only performance optimisation available to Power Query on Parquet – it doesn’t do predicate pushdown or anything like that. What’s more, when reading data from the ADLSgen2 connector, the nature of Parquet storage stops Power Query from making parallel requests for data (I guess the same behaviour that is controlled by the ConcurrentRequests option) which puts it at a disadvantage compared to reading data from CSV files.

I think a lot more testing is needed to understand how to get the best performance when reading data from Parquet, so look out for more posts on this subject in the future…

[Thanks once again to Eric Gorelik from the Power Query development team for providing the information about how the Parquet connector works, and to Ben Watt and Gerhard Brueckl for asking the questions in the first place]

Bonus fact: in case you’re wondering, the following compression types are supported by the Parquet connector: GZip, Snappy, Brotli, LZ4, and ZStd.