Earlier in this series on importing data from ADLSgen2 into Power BI I showed how partitioning a table in your dataset can improve refresh performance. In that post I used CSV files in ADLSgen2 as my source and created one partition per CSV file, but after my recent discovery that importing data from multiple Parquet files can be tuned to be a lot faster than importing data from CSV files, I decided to try creating partitions linked to Parquet files instead.
As a reminder, here’s what Phil Seamark’s Power BI refresh report showed for the version of my dataset with one partition per CSV file:
Here’s the same report for a partitioned dataset based on five Parquet files (with one partition per Parquet file) containing exactly the same data:
But what does this actually tell me? It shows that it is possible to improve on the previous 40-second refresh time quite substantially. It certainly looks like reading data from individual Parquet files is faster than reading data from individual CSV files too; however I’m reluctant to draw too many other conclusions. This is because:
- I only had five Parquet files in this case, compared to nine CSV files. I suspect that some extra parallelism might be possible if had six Parquet files instead of just five, because there are six refresh slots available in the PPU workspace I’m using.
- This then leads onto the question of whether the number and size of the files affects refresh performance in the same way for CSV files as for Parquet files (any difference could be something to do with the format itself or the Power BI connectors used to read the data from the files).
- This also leads onto the question of partitioning strategies – after all, you probably also want to only refresh partitions where data has changed, which will also have a bit impact on refresh times – and the distribution of data over partitions. That’s out of scope for this series, but if you have a copy of Marco Russo and Alberto Ferrari’s book “Tabular Modeling in Microsoft SQL Server Analysis Services” then check out chapter 11 and the section in chapter 12 on “Segmentation and partitioning”: what’s good for Analysis Services Tabular is mostly good for Power BI, although bear in mind that Power BI uses segment sizes of 1 million rows.