Improving The Performance Of Importing Data From ADLSgen2 In Power BI By Partitioning Tables

So far in this series of posts on the performance of importing data from files stored in ADLSgen2 into Power BI, I have looked at trying to tune refresh by changing various options in the Power Query engine or by changing the file format used. However, there is one very important optimisation in the Analysis Services engine in Power BI that can make a significant difference to refresh performance: partitioning.

Using the same set of nine CSV files stored in ADLSgen2 described in the first post in this series, I created a baseline dataset that used the standard “From Folder” approach to load all the data into a single table in Power BI.
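
As a point of reference, here’s a minimal sketch of the kind of M query involved – the storage account and container names below are hypothetical placeholders, and the code that the combine-files experience actually generates uses Sample File/Transform File helper queries rather than a single query like this:

```
let
    // Hypothetical ADLSgen2 container URL - replace with the real account and container
    Source = AzureStorage.DataLake("https://myaccount.dfs.core.windows.net/csvfiles"),
    // Keep only the CSV files in the container
    CsvFiles = Table.SelectRows(Source, each Text.EndsWith([Name], ".csv")),
    // Parse each file and promote its header row
    Parsed = Table.AddColumn(CsvFiles, "Data",
        each Table.PromoteHeaders(Csv.Document([Content]), [PromoteAllScalars = true])),
    // Append all nine files into one table
    Combined = Table.Combine(Parsed[Data])
in
    Combined
```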

I published the dataset to a PPU workspace in the Power BI Service and refreshed it; in line with my previous tests, a full refresh took 64 seconds. This time, however, I did something extra: I ran a Profiler trace while refreshing and captured the Job Graph data for the refresh command. This is something I have blogged about before, but recently my colleague Phil Seamark has created a great Power BI file for visualising this data much more easily – you can read about it here.

Here’s what Phil’s report showed about the refresh:

Power BI is able to read almost 125,000 rows per second, but there isn’t any parallelism here – and parallelism is exactly what partitioning a table can offer.

It isn’t currently possible to create partitions in a table using Power BI Desktop, so instead I created a new .pbix file with a single Power Query query that loaded the data from just one CSV file into a single table in the dataset. Every table in Power BI has a single partition; I then used Tabular Editor to duplicate the existing partition eight times and changed the M query bound to each partition to point to a different CSV file:

I then refreshed it in the Service while again running a Profiler trace. Here’s what Phil’s report showed for this refresh:

As you can see, up to six partitions can be refreshed in parallel on a PPU capacity (the amount of parallelism varies depending on the size of the capacity), and this makes a big difference to refresh performance: this dataset refreshed in 40 seconds, 24 seconds faster than the version with a single partition. While the number of rows read per second for any single partition is lower than before, the overall number of rows read per second was much higher, at almost 200,000.

Conclusion: partitioning tables in Power BI can lead to a significant improvement in refresh times when loading data from ADLSgen2 (and indeed any data source that can support a reasonable number of parallel queries).

6 responses

  1. How would I apply this approach when the number of files in the folder is not static? For example, when a new file is added to the folder and some files are modified (making incremental refresh tricky), how can I dynamically define the partitions?

  2. Pingback: Chris Webb's BI Blog: Partitioned Tables, Power BI And Parquet Files In ADLSgen2 – Chris Webb's BI Blog

  3. Chris – are there any limitations/issues you see after using Tabular Editor for partitioning? (Like the dataset becoming incompatible with further editing, etc.) Is Tabular Editor or SSDT recommended for partitioning?
    Please let me know. Thanks in advance.

  4. Pingback: Chris Webb's BI Blog: Importing Data From ADLSgen2 Into Power BI: Summary Of Findings – Chris Webb's BI Blog

  5. These edits with Tabular Editor still feel like living dangerously; however, it seems to work for you and I think I can use part of this method.

    Here’s the process/problem:

    – There’s data from 2/3/2019 to 1/24/2021, and today is in April 2021
    – A history of data needs to be rebuilt in a Power BI dataset, keeping 104 weeks and only refreshing the last 4 weeks
    – Note that the earliest week with historic data is further back than the 104-week keep range.
    – All 104 weeks amount to about 5 billion rows coming from a dedicated Synapse SQL pool at DWUc 3000
    – Even with parallel reads by month it takes more than 5 hours; not sure how to speed that up, perhaps by using Serverless Synapse.

    – In order to stay within the 5-hour timeout, batches of 4 weeks must be used for each refresh, each time loading the next 4 weeks until all 104 weeks have been imported.

    – An initial refresh completes, creates the 26 month partitions and reads the first 4 weeks from the source table
    — It seems that the Power BI service stamps Last Processed on all partitions even if they didn’t load data.
    — As a consequence, subsequent refreshes don’t read any more months, even if we add more data in the keep range, because it seems that these have already been processed.

    If we add month partitions one at a time, as in your method, will we be able to control the refresh duration and rebuild the whole history?

    Can Tabular Editor set the last processed date on a partition?

    Is there another process for achieving this batch-wise history import that I am overlooking?
