Partitioned Tables, Power BI And Parquet Files In ADLSgen2

Earlier in this series on importing data from ADLSgen2 into Power BI, I showed how partitioning a table in your dataset can improve refresh performance. In that post I used CSV files in ADLSgen2 as my source and created one partition per CSV file. However, after my recent discovery that importing data from multiple Parquet files can be tuned to be a lot faster than importing data from CSV files, I decided to try creating partitions linked to Parquet files instead.

As a reminder, here’s what Phil Seamark’s Power BI refresh report showed for the version of my dataset with one partition per CSV file:

As you can see, refresh in the Power BI Service took 40 seconds and throughput was just under 200,000 rows per second.

Here’s the same report for a partitioned dataset based on five Parquet files (with one partition per Parquet file) containing exactly the same data:

Using Parquet files as a source, refresh took only 33 seconds and throughput was almost 250,000 rows per second.
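As a rough sanity check on those two reports, the sketch below works out the relative reduction in refresh time and the row count that each pair of figures implies. It assumes the report's throughput is simply rows loaded divided by refresh duration, which is an assumption rather than something stated here. The inputs are the rounded figures quoted above ("just under" 200,000 and "almost" 250,000), so the implied row counts are upper bounds, and the function names are made up for illustration.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Percentage by which refresh time fell, relative to the earlier time."""
    return (before_s - after_s) * 100 / before_s


def implied_rows(duration_s: float, rows_per_second: float) -> float:
    """Rows loaded, ASSUMING throughput = rows / refresh duration."""
    return duration_s * rows_per_second


csv_s, parquet_s = 40, 33  # refresh times quoted above, in seconds
print(f"Refresh time reduction: {percent_reduction(csv_s, parquet_s):.1f}%")
print(f"CSV implied rows (upper bound): {implied_rows(csv_s, 200_000):,.0f}")
print(f"Parquet implied rows (upper bound): {implied_rows(parquet_s, 250_000):,.0f}")
```

Both estimates land at roughly eight million rows, which is at least consistent with the two datasets containing the same data.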

But what does this actually tell me? It shows that it is possible to improve on the previous 40-second refresh time quite substantially. It certainly looks like reading data from individual Parquet files is faster than reading data from individual CSV files too; however, I’m reluctant to draw too many other conclusions. This is because:

  • I only had five Parquet files in this case, compared to nine CSV files. I suspect that some extra parallelism might be possible if I had six Parquet files instead of just five, because there are six refresh slots available in the PPU workspace I’m using.
  • This then leads on to the question of whether the number and size of the files affects refresh performance in the same way for CSV files as for Parquet files (any difference could be something to do with the format itself or the Power BI connectors used to read the data from the files).
  • This also leads on to the question of partitioning strategies (after all, you probably also want to refresh only those partitions where data has changed, which will also have a big impact on refresh times) and the distribution of data over partitions. That’s out of scope for this series, but if you have a copy of Marco Russo and Alberto Ferrari’s book “Tabular Modeling in Microsoft SQL Server Analysis Services” then check out chapter 11 and the section in chapter 12 on “Segmentation and partitioning”: what’s good for Analysis Services Tabular is mostly good for Power BI, although bear in mind that Power BI uses segment sizes of 1 million rows.
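
To make the first point above more concrete, here is a deliberately simplistic toy model of why the partition count might matter when there are six refresh slots. It is not a description of how the Power BI Service actually schedules work. It assumes (all assumptions, none measured) that the data is split evenly across partitions, that each partition occupies exactly one slot while it loads, that every slot loads rows at the same constant speed, and that nothing else competes for resources. The function name is hypothetical.

```python
import math


def relative_refresh_time(partitions: int, slots: int = 6) -> float:
    """Refresh time as a fraction of loading all the data through one slot.

    Toy model: equally sized partitions are processed in "waves" of at most
    `slots` partitions, and each wave takes as long as one partition does.
    """
    waves = math.ceil(partitions / slots)  # rounds of parallel loading
    return waves / partitions              # each wave costs 1/partitions of the data


for n in (5, 6, 9):
    print(f"{n} partitions: {relative_refresh_time(n):.3f} of the single-slot time")
```

Under these assumptions five partitions leave one slot idle, six partitions use every slot in a single wave, and nine partitions need a second wave in which only three slots are busy. Real refreshes will not behave this neatly, which is exactly why more testing would be needed before drawing conclusions.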
