When Can Partitioned Compute Help Improve Fabric Dataflow Performance?

Partitioned Compute is a new feature in Fabric Dataflows that allows you to run certain operations inside a Dataflow query in parallel and therefore improve performance. While UI support is limited at the moment it can be used in any Dataflow by adding a single line of fairly simple M code and checking a box in the Options dialog. But as with a lot of performance optimisation features (and this is particularly true of Dataflows) it can sometimes result in worse performance rather than better performance – you need to know how and when to use it. And so, in order to understand when this feature should and shouldn’t be used, I decided to do some tests and share the results here.

For my tests I created two queries within a single Dataflow Gen2 CICD. First, an M function called SlowFunction that takes a numeric value and returns that value with 1 added to it after a two second delay:

(input as number) as number =>
Function.InvokeAfter(()=>input+1, #duration(0,0,0,2))

Then the main query which returns a table of ten rows and calls the SlowFunction M function once per row:

let
Rows = List.Transform({1 .. 10}, each {_}),
MyTable = #table(type table [RowNumber = number], Rows),
#"Added custom" = Table.TransformColumnTypes(
Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])),
{{"FunctionOutput", Int64.Type}}
)
in
#"Added custom"

Here’s what the output of the query looks like:

Now, the first important question. How long does this Dataflow take to run? 10 rows calling a function that takes 2 seconds to run, so 10*2=20 seconds maybe? The answer is yes if you look at how long the preview takes to populate in the Dataflow Editor:

That’s just the preview though. If you’re refreshing a Dataflow there are other things that happen that affect how long that refresh takes, such as Staging and loading the data to a destination. There’s no way you can split the performance of your M code from these operations when looking at the duration of a Dataflow refresh in Recent Runs, which explains why some of the timings you will see later in this post seem strange. Don’t worry, though, it doesn’t stop you from seeing the important trends. I’m told that setting a CSV file in a Lakehouse as your data destination is the best way of minimising the impact of loading data on overall refresh durations but at the time of writing the CSV destination can’t be used with Partitioned Compute so all my tests used a Fabric Warehouse as a destination.

Here’s what Recent Runs showed when this Dataflow was refreshed:

The overall refresh time was 59 seconds; the query (called NonPartitioned here) that returns the table of ten rows and which was staged took 29 seconds.

Could this Dataflow benefit from Partitioned Compute? With Partitioned Compute enabled in the Options dialog, I added the necessary M code to the query:

let
Rows = List.Transform({1 .. 10}, each {_}),
MyTable = #table(type table [RowNumber = number], Rows),
#"Added custom" = Table.TransformColumnTypes(
Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])),
{{"FunctionOutput", Int64.Type}}
),
ReplacePartitionKey = Table.ReplacePartitionKey(#"Added custom", {"RowNumber"})
in
ReplacePartitionKey

…and then refreshed. Here’s what Recent Runs showed:

The overall refresh duration went up to 1 minute 25 seconds; the query that does all the work (called Partitioned in this case) took 40 seconds. Note the screenshot immediately above shows the engine used is “PartitionedCompute” and that there are now ten activities listed instead of one: my M code used the RowNumber column in the table as the partition key so the Dataflow attempted to run each row of the table as a separate operation in parallel. And as you can see, this made performance worse. This is because using Partitioned Compute introduces yet another overhead and that overhead is much greater than any benefit gained from parallelism in this case.

So I wondered: what if the delay in the query is increased from 2 second to 100 seconds then? Does this increase in the delay mean that parallelism results in faster overall performance?

Here’s what Recent Runs showed for a version of my Dataflow with a 100 second delay for each row and which didn’t use Partitioned Compute:

10 rows * 100 seconds = 1000 seconds = 16 minutes 40 seconds, so it’s not surprising that the overall duration of this version of the Dataflow was slow at 17 minutes 29 seconds.

Here’s what Recent Runs shows for the version of this Dataflow that did use Partitioned Compute:

The overall duration was 4 minutes 41 seconds and the main query took 3 minutes 14 seconds. The important takeaway is that this is a lot faster than the version that didn’t use Partitioned Compute, so clearly Partitioned Compute made a big difference to performance here. As you might expect, it looks like parallelising operations that only take a few seconds results in worse performance while parallelising operations that take longer, say a minute or more, is probably a good idea. As always, you’ll need to test to see what benefits you get for your Dataflows.

These results raise a lot of questions too. 100 seconds = 1 minute 40 seconds, which is a lot less than 3 minutes 14 seconds. Does this mean that not every row in the table was evaluated in parallel? Is partitioning on the RowNumber column counter-productive and would it be better to partition in some other way to try to reduce the amount of attempted parallelism? Is there something else that is limiting the amount of parallelism? While this version of the Dataflow always performs better than the non-partitioned version, performance did vary a lot between refreshes. While these tests show how useful Partitioned Compute can be for slow Dataflows, there’s a lot more research to do and a lot more blog posts to write.

Leave a Reply