Improve performance of staging in Fabric Dataflows Gen2 by disabling V-Order

Quite a few new Dataflows Gen2 features were released recently without much fanfare, but that doesn’t mean they aren’t important. I will take a look at them all in my next few posts; in this first post I’ll look at the ability to disable V-Order on staged data.

As the (very detailed) documentation for this new feature describes, V-Order is a write-time optimisation for the Parquet files that underpin the Delta tables that OneLake uses to store data. It slows down writing data to the tables but makes reading data from them, for example in Power BI Direct Lake mode, much faster. It used to be the case that when you staged data inside a dataflow, that data always had V-Order applied; now you have the option to disable it. Disabling V-Order makes staging faster, and because staged data is rarely queried more than a few times, the slower reads usually cost less than the faster writes save, so overall refresh performance usually improves.

To test this I created a simple dataflow that connected to a large (5.58GB) CSV file that contained 17.6 million rows of data, staged the data in a query called StageData, then did a group by on that data in a second query called GroupBy.

I turned off Fast Copy and left the “Enable V-Order compression” setting on:

[At the time of writing this post the ability to disable V-Order only works when Fast Copy is not used – I expect this to change in the future]

I refreshed the dataflow and it took 1 minute 59 seconds. The StageData query (where the staging takes place) took 1 minute 31 seconds; the GroupBy query took 12 seconds.

I then disabled V-Order compression for staging:

…and refreshed again. This time the overall refresh took 1 minute 32 seconds, the StageData query took 1 minute 13 seconds and the GroupBy query took 7 seconds. While there is always a certain amount of variation in dataflow refresh timings, it's clear that disabling V-Order resulted in staging being about 20 seconds faster (18 seconds in this run) with no reduction in performance of the group by transformation on the staged data. So, in this case at least, disabling V-Order was a good thing for refresh performance.
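To make the comparison easier to read, here is a small Python sketch that converts the timings quoted above into seconds and works out the difference for each step. The numbers are the ones from my two refreshes; the helper names (`to_seconds`, `saving`) are just illustrative, and since each configuration was only refreshed once the percentages should be treated as indicative rather than precise.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'minutes and seconds' duration into whole seconds."""
    return minutes * 60 + seconds


# (V-Order enabled, V-Order disabled), taken from the two refreshes above
timings = {
    "overall": (to_seconds(1, 59), to_seconds(1, 32)),
    "StageData": (to_seconds(1, 31), to_seconds(1, 13)),
    "GroupBy": (to_seconds(0, 12), to_seconds(0, 7)),
}


def saving(before: int, after: int) -> tuple[int, float]:
    """Return the seconds saved and the saving as a percentage of 'before'."""
    saved = before - after
    return saved, round(100 * saved / before, 1)


for name, (with_vorder, without_vorder) in timings.items():
    saved, pct = saving(with_vorder, without_vorder)
    print(f"{name}: {with_vorder}s -> {without_vorder}s, saved {saved}s ({pct}%)")
```

Note that the per-query durations don't add up to the overall refresh time, so the overall figure includes work that isn't attributed to either query.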

When you decide whether to use staging in a dataflow you have to test to see whether the extra time needed to stage the data is worth it compared to the performance improvements you get by doing transformations on the staged data (which mostly come from those transformations having the opportunity to be folded). Since turning off V-Order makes staging faster, it lowers the cost side of that trade-off, which means staging will result in better overall dataflow refresh performance more often.
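The trade-off described above can be sketched as a simple break-even check in Python. This is only a rough model: it assumes the query durations can be added together, which ignores any overhead that isn't attributed to a query. I didn't measure a version of this dataflow without staging, so the function takes that duration as an input you would have to measure yourself; the function names are hypothetical.

```python
def staged_cost(stage_seconds: int, transform_on_staged_seconds: int) -> int:
    """Total time spent staging the data and then transforming the staged copy."""
    return stage_seconds + transform_on_staged_seconds


def staging_is_worthwhile(
    stage_seconds: int,
    transform_on_staged_seconds: int,
    transform_unstaged_seconds: int,
) -> bool:
    """Staging pays off only if stage + transform beats transforming the source directly."""
    return staged_cost(stage_seconds, transform_on_staged_seconds) < transform_unstaged_seconds


# Break-even points using the StageData and GroupBy timings from this post
with_vorder = staged_cost(91, 12)      # V-Order enabled
without_vorder = staged_cost(73, 7)    # V-Order disabled

print(f"Unstaged run must take longer than {with_vorder}s for staging to win with V-Order on")
print(f"Unstaged run must take longer than {without_vorder}s for staging to win with V-Order off")
```

In other words, disabling V-Order lowers the bar that the unstaged version has to clear, so more dataflows end up on the side of the line where staging is the faster option.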

Published by Chris Webb
