Improving The Performance Of Importing Data From ADLSgen2 In Power BI By Partitioning Tables

So far in this series of posts looking at the performance of importing data from files stored in ADLSgen2 into Power BI, I have looked at trying to tune refresh by changing various options in the Power Query engine or by changing the file format used. However, there is one very important optimisation in the Analysis Services engine in Power BI that can make a significant difference to refresh performance: partitioning.

Using the same set of nine CSV files stored in ADLSgen2 storage described in the first post in this series, I created a baseline dataset using the standard “From Folder” approach to load all the data into a single table in Power BI:
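For reference, the baseline query looked something like this. This is a simplified sketch: the storage account, container and file details are placeholders, and the real “From Folder” experience generates extra helper queries that are omitted here.

```m
let
    // Connect to the ADLSgen2 container holding the nine CSV files
    Source = AzureStorage.DataLake("https://myaccount.dfs.core.windows.net/csvfiles"),
    // Parse each file's binary content as CSV and promote the header row
    ParsedCsv = Table.AddColumn(Source, "Csv", each Table.PromoteHeaders(Csv.Document([Content]))),
    // Combine all nine files into a single table
    Combined = Table.Combine(ParsedCsv[Csv])
in
    Combined
```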

I published the dataset to a PPU workspace in the Power BI Service and refreshed it. In line with my previous tests it took 64 seconds to do a full refresh. However, this time I did something extra: I ran a Profiler trace and captured the Job Graph data for the refresh command. This is something I have blogged about before, but recently my colleague Phil Seamark has created a great Power BI file for visualising this data much more easily – you can read about it here:

Here’s what Phil’s report showed about the refresh:

Power BI is able to read almost 125,000 rows per second, but there isn’t any parallelism here – and parallelism is exactly what partitioning a table can offer.

It isn’t currently possible to create partitions in a table using Power BI Desktop, so instead I created a new .pbix file with a single Power Query query that loaded the data from just one CSV file into a single table in the dataset. Every table in Power BI has at least one partition, so I then used Tabular Editor to duplicate the existing partition eight times and changed the M query bound to each new partition so it pointed to a different CSV file:
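Each partition’s M expression differed only in the name of the CSV file it read. A sketch of what a single partition’s query might look like – the account, container and file names are placeholders, not the exact queries used here:

```m
let
    // Connect to the ADLSgen2 container
    Source = AzureStorage.DataLake("https://myaccount.dfs.core.windows.net/csvfiles"),
    // Select the one CSV file this partition is responsible for
    File = Source{[Name = "sales1.csv"]}[Content],
    // Parse it and promote the header row
    Parsed = Table.PromoteHeaders(Csv.Document(File))
in
    Parsed
```

Duplicating this partition eight times and changing only the `Name` value gives nine partitions, one per file.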

I then refreshed it in the Service while again running a Profiler trace. Here’s what Phil’s report showed for this refresh:

As you can see, up to six partitions can be refreshed in parallel on a PPU capacity (the amount of parallelism varies depending on the size of the capacity), and this makes a big difference to refresh performance: this dataset refreshed in 40 seconds, 24 seconds faster than the version with a single partition. While the number of rows read per second for any single partition was lower than before, the overall number of rows read per second was much higher, at almost 200,000.

Conclusion: partitioning tables in Power BI can lead to a significant improvement in refresh times when loading data from ADLSgen2 (and indeed any data source that can support a reasonable number of parallel queries).
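If you prefer scripting over clicking through Tabular Editor, partitions on a dataset published to a Premium or PPU workspace can also be created by running a TMSL command through the XMLA endpoint (for example from SSMS). A minimal sketch – the database, table, partition and file names below are placeholders:

```json
{
  "createOrReplace": {
    "object": {
      "database": "MyDataset",
      "table": "Sales",
      "partition": "sales2"
    },
    "partition": {
      "name": "sales2",
      "source": {
        "type": "m",
        "expression": "let Source = AzureStorage.DataLake(\"https://myaccount.dfs.core.windows.net/csvfiles\"), File = Source{[Name = \"sales2.csv\"]}[Content], Parsed = Table.PromoteHeaders(Csv.Document(File)) in Parsed"
      }
    }
  }
}
```

Running one such command per file would build the same nine-partition table described above, and the approach lends itself to automation when the set of files changes over time.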

20 thoughts on “Improving The Performance Of Importing Data From ADLSgen2 In Power BI By Partitioning Tables”

  1. How would I apply this approach when the number of files in the folder is not static? For example when a new file is added to the folder, and some files are modified (making incremental refresh tricky) how can I dynamically define the partitions?

  2. Chris – any limitations/issues you see after using Tabular Editor for partitioning? (Like the dataset becoming incompatible with further editing etc.) Is Tabular Editor or SSDT recommended for partitioning?
    Please let me know. Thanks in advance

  3. These edits with the Tabular Editor still feel like living dangerously; however, it seems to work for you and I think I can use part of this method.

    Here’s the process/problem:

    – There’s data from 2/3/2019 to 1/24/2021 and today is in April 2021
    – A history of data needs to be rebuilt in a Power BI dataset, keeping 104 weeks and only refreshing the last 4 weeks
    – Note that the earliest week with historic data is further back than the 104-week keep range.
    – All 104 weeks are about 5 billion rows coming from a dedicated Synapse SQL pool at DWUc 3000
    – Even with parallel reads by month it takes more than 5 hours; not sure how to speed that up, perhaps using Serverless Synapse.

    – In order to stay within the timeout of 5 hours, batches of 4 weeks must be used for each refresh, loading the next 4 weeks each time until all 104 weeks have been imported.

    – An initial refresh completes, creates the 26 month partitions and reads the first 4 weeks from the source table
    — It seems that the Power BI Service stamps the Last Processed date on all partitions even if it didn’t load data.
    — As a consequence, subsequent refreshes don’t read any more months even if we add more data in the keep range, because it seems that these have already been processed.

    If we add one month partition at a time as in your method, will we be able to control the refresh duration and rebuild the whole history?

    Can the tabular editor set the last processed date on a partition?

    Is there another process to achieve this, batch wise history import that I am overlooking?

  4. I have a Power BI report connected in import mode to a SQL Server database. When I open Tabular Editor from External Tools and try to create a partition it says “Adding a M partition to a Power BI data model is not allowed”. Is there any way I can create partitions in a Power BI data model?
    I created partitions using SSMS (connected to the Power BI Premium dataset), but when I published the report or the model with some changes the partitions were deleted.

  6. Triparna Ray says:

    Hi Chris, thanks for sharing such useful insights. For Synapse Analytics as a source, which supports massive parallelism, how should we define these partitions in the M query? Do we need to define any logic, for example one month per partition, with the help of Tabular Editor? How do we make it dynamic?

    1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
      Chris Webb says:

      You can’t define partitions in an M query – you have to create partitions using tools like Tabular Editor. You can’t make partition creation dynamic either unless you write some code to create partitions automatically; there are plenty of examples of how to do this for Analysis Services, and these will be applicable to Power BI too.

    1. Chris Webb says:

      It could either be a bug on our side or a bug in Tabular Editor. Have you asked Tabular Editor support if this is a known issue?

      1. Hi Chris, this is the reply from support:

        Unfortunately, modifying partitions using the External Tools integration of Power BI Desktop is not supported by Microsoft. By default, Tabular Editor should block such operations, unless you enable the Allow unsupported modeling operations option under Tools > Preferences.

      2. Chris Webb says:

        To be clear: are you trying to build partitions in the dataset inside Desktop (which is not supported) or in a dataset when it’s published to the Service?

  7. Hi Chris,
    Do you know why the first partition query appears to run on its own first, followed only then by the remaining partitions? Is there any way around this that you know of to allow true concurrency between all partitions?
