So far in this series of posts looking at the performance of importing data from files stored in ADLSgen2 into Power BI I have looked at trying to tune refresh by changing various options in the Power Query engine or by changing the file format used. However there is one very important optimisation in the Analysis Services engine in Power BI that can make a significant difference to refresh performance: partitioning.
Using the same set of nine CSV files stored in ADLSgen2 storage described in the first post in this series I created a baseline dataset using the standard “From Folder” to load all the data into a single table in Power BI:
I published the dataset to a PPU workspace in the Power BI Service and I ran a Profiler trace while refreshing it. In line with my previous tests it took 64 seconds to do a full refresh. However, this time I did something extra: I ran a Profiler trace and captured the Job Graph data for the refresh command. This is something I have blogged about before, but recently my colleague Phil Seamark has created a great Power BI file for visualising this data much more easily – you can read about it here:
Power BI is able to read almost 125,000 rows per second but there isn’t any parallelism here – and that’s what partitioning a table can offer.
It isn’t currently possible to create partitions in a table using Power BI Desktop, so instead I created a new .pbix file with a single Power Query query that loaded the data from just one CSV file into a single table in the dataset. Every table in Power BI has a single partition; I then used Tabular Editor to duplicate the existing partition eight times and changed the M query bound to each partition to point to a different CSV file:
I then refreshed in in the Service while again running a Profiler trace. Here’s what Phil’s report showed for this refresh:
As you can see up to six partitions can be refreshed in parallel on a PPU capacity (the amount of parallelism varies depending on the size of the capacity) and this makes a big difference to refresh performance: this dataset refreshed in 40 seconds, 24 seconds faster than the version with a single partition. While the number of rows read per second for any single partition is lower than before overall the number of rows read per second was much higher at almost 200,000 rows per second.
Conclusion: partitioning tables in Power BI can lead to a significant improvement in refresh times when loading data from ADLSgen2 (and indeed any data source that can support a reasonable number of parallel queries).