Three New Power BI/Power Query Books

I decided to stop writing book reviews here on my blog a long time ago: it’s a lot of work to read a book and write a proper, detailed review and what’s more I don’t like the idea of writing a bad review and upsetting someone who has gone to all the effort of writing a book. That said, from time to time I get given free copies of books (which I’m always happy to receive – I like to see how other people go about explaining Power BI concepts and functionality) and in return I give the authors some free publicity here. Recently I received two copies of new books from people that I know:

Expert data modeling with Power BI, by Soheil Bakhshi (Buy it here on Amazon UK)

Soheil is an MVP whose blog I have read and admired for some time so I’m pleased to see he has written a book. It’s an important subject too: good data modelling is key to success with Power BI, and the problems of many customers I work with stem from not taking the time to learn how data should be modelled for Power BI. This book introduces you to concepts like dimensional modelling and star schemas and shows you how to build datasets that follow best practices. It also covers topics such as calculation groups and object-level security that won’t be in older books.

Power Query cookbook, by Andrea Janicijevic (Buy it here on Amazon UK)

Andrea is a colleague of mine at Microsoft and of course Power Query is a technology close to my heart. This book follows the cookbook format which teaches through a series of worked examples and easy-to-follow steps; anyone learning Power Query will find it useful to follow these recipes to get practice creating queries. I liked the inclusion of Power BI Dataflows as well as Power Query in Power BI Desktop, and again this book has the advantage of being new – it covers recently-added features such as Schema View and Diagram View in Dataflows and Query Diagnostics in Power BI Desktop that won’t be covered in other books.

There’s another book I was curious about and was lucky enough to be able to read via Microsoft’s online library for employees:

Pro Power BI theme creation, by Adam Aspin (Buy it here on Amazon UK)

When I hear someone had written a book about Power BI theme files I couldn’t believe it, but Adam is an experienced writer and has pulled it off. As you might expect it’s everything you ever wanted to learn about Power BI themes and as such, if themes are something you’re interested in you should read this book. It explains how theme files are structured, how to edit them and how the various attributes are applied to different visuals.

How Query Folding And The New Power BI Dataflows Connector Can Help Dataset Refresh Performance

You may have noticed that a new dataflows connector was announced in the August 2021 release of Power BI Desktop, and that it now supports query folding between a dataset and a dataflow – which you may be surprised to learn was not possible before. In this post I thought I’d take a look at how much of an improvement in performance this can make to dataset refresh performance.

For my tests I created a new PPU workspace and a dataflow, and made sure the Enhanced Compute Engine was turned on for the dataflow on the Settings page:

Query folding will only happen if the Enhanced Compute Engine is set to “On”, and won’t happen with the “Optimized” setting. The Enhanced Compute Engine is only available with PPU and Premium.

For my data source I used a CSV file with a million rows in and seven integer columns. I then created two tables in my dataflow like so:

The Source table simply connects to the CSV file, uses the first row as the headers, then sets the data type on each column. The second table called Output – which contains no tranformations at all – is needed for the data to be stored in the Enhanced Compute Engine, and the lightning icon in the top-left corner of the table in the diagram shows this is the case.

Next, in Power BI Desktop, I created a Power Query query that used the old Power BI dataflows connector:

If you have any existing datasets that connect to dataflows, this is the connector you will have used – it is based on the PowerBI.Dataflows function. My query connected to the Output table and filtered the rows to where column A is less than 100. Here’s the M code, slightly edited to remove all the ugly GUIDs:

let
    Source = PowerBI.Dataflows(null),
    ws = Source{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
    Output1 = df{[entity="Output"]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output1, each [A] < 100)
in
    #"Filtered Rows"

Remember, this connector does not support query folding. Using this technique to measure how long the query ran when the results from the query were loaded into the dataset, I could see it took almost 12.5 seconds to get the data for this query:

In fact the performance in Desktop is worse: when refresh was taking place, I could see Power BI downloading 108MB of data even though the original source file is only 54MB.

Why is the data downloaded twice? I strongly suspect it’s because of this issue – because, of course, no query folding is happening. So the performance in Desktop is really even worse.

I then created the same query with the new dataflows connector:

This connector uses the PowerPlatform.Dataflows function; it’s not new, but what is new is that you can now access Power BI dataflows using it.

Here’s the M code, again cleaned up to remove GUIDS:

let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id="Workspaces"]}[Data],
    ws = Workspaces{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
    Output_ = df{[entity="Output",version=""]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output_, each [A] < 100)
in
    #"Filtered Rows"

When this query was loaded into the dataset, it only took 4 seconds:

This is a lot faster, and Power BI Desktop was a lot more responsive during development too.

It’s reasonable to assume that query folding is happening in this query and the filter on [A]<100 is now taking place inside the Enhanced Compute Engine rather than in Power BI Desktop. But how can you be sure query folding is happening? The “View Native Query” option is greyed out, but of course this does not mean that query folding is not happening. However, if you use Query Diagnostics, hidden away in the Data Source Query column of the detailed diagnostics query, you can see a SQL query with the WHERE clause you would expect:

In conclusion, you can see that the new dataflows connector can give you some big improvements for dataset refresh performance and a much better development experience in Power BI Desktop. Query folding support also means that you can now use dataset incremental refresh when using a dataflow as a source. However, you will need to use Premium or PPU, you may also need to make some changes to your dataflow to make sure it can take advantage of the Enhanced Compute Engine, and you will also need to update any existing Power Query queries to use the new connector. I think the potential performance gains are worth making these changes though. If you do make these changes in your dataflows and find that it helps, please leave a comment!

Power BI Dataflow Performance, Premium Per User And The Enhanced Compute Engine

Over the years I have written a lot about Power BI/Power Query performance but it has always been in the context of loading data direct into datasets, not dataflows. A lot of cool things have been happening in dataflows recently, though, and now that Premium Per User has made Premium features to a much wider audience I thought it would be interesting to look at an example of how PPU can help dataflow performance and specifically how and when the Enhanced Compute Engine can make dataflow refresh faster.

Using the same CSV file that I used in my posts from last year on optimising the performance of merges in Power Query, a file with one million rows and seven numeric columns named A, B, C, D, E, F and G, I created the following dataflow in a shared capacity (ie non-Premium) workspace:

The queries called First and Second are identical and just load all the data from the (same) source CSV file; they also have their load disabled. The query called Merge does an inner join between these two queries on the column called A:

The Merge query has its load enabled so it’s the only output of the dataflow; after it has joined the data it expands the nested columns returned and sets the data types on all the output columns.

Refreshing this dataflow in shared capacity took on average 150 seconds.

I then moved the workspace to Premium Per User capacity and without making any changes, I refreshed again.

Refreshing the same dataflow in PPU took on average 73 seconds.

So the first finding is that moving the dataflow to PPU more than halved the refresh time, which is a pretty good result.

However, at this point the Enhanced Compute Engine is not being used – so, to enable it, I enabled loading for the First and Second queries which in turn made the Merge query a Computed Table (what used to be a Computed Entity before the recent terminology changes, indicated by the lightning bolt icon):

For a full explanation of when the Enhanced Compute Engine can and can’t be used see Matthew Roche’s blog post here; basically it loads data into a SQL-based cache which Computed Tables can then leverage which means that data access is faster and the Power Query engine can push transformations back to it via query folding. The only other change I made was to set data types on the columns in the output of First and Second.

Refreshing this new version of the dataflow in PPU took on average 90 seconds

So performance was worse – but why? Enabling loading on First and Second means that more work is done at refresh time because their output needs to be ingested twice (once into ADLSgen2 and once into the SQL cache used by the Enhanced Compute Engine) before the Enhanced Compute Engine can access it. In this case the extra work needed to load First and Second outweighs the performance gains from using the Enhanced Compute Engine. The new metrics available from the dataflow’s Refresh History provide some insight into this (I strongly recommend you read the docs on these metrics here); here’s some of the data from one of the refresh history CSV files loaded into Excel:

In this particular case the overall refresh time of the dataflow was 88 seconds. First and Second refreshed in parallel – First taking 48 seconds and Second taking 51 seconds – and once they had both finished, Merge could refresh and only took 36 seconds to join the output of First and Second. So in this case Merge is indeed faster (36 seconds compared to 73 seconds before) as a result of using the Enhanced Compute Engine but that improvement isn’t enough to cancel out the additional time needed to load the data returned by First and Second into it.

What about a scenario where the Enhanced Compute Engine does make a positive difference? Take a look at the following dataflow, a slight variation on the dataflow above:

There are now three new tables: Ouput Table 1, Output Table 2 and Output Table 3. Each of these tables gets the maximum value from a different column in the table returned by Merge. Note that there are no Computed Tables in this dataflow so the Enhanced Compute Engine is not used, and that First, Second and Merge have load disabled.

Refreshing this dataflow on PPU took on average 95 seconds

Here are the refresh metrics for one of the refreshes:

As you can see, the three tables were refreshed in parallel and took between 84 and 93 seconds. It’s important to remember that for each of these tables the source data was loaded and the Merge query evaluated independently, which explains why they each take so long. The fact that Merge is evaluated three times when this dataflow refreshes is counter-intuitive but really important here – for more details see this post on how queries are evaluated in Power Query.

Now consider this version of the same dataflow where First, Second and Merge have their load enabled, making Merge, Output Table 1, Output Table 2 and Output Table 3 all Computed Tables.

Refreshing this dataflow on PPU took on average 88 seconds

Not a massive improvement, but an improvement. Now look at how different the refresh metrics are:

In this case Output Table 1, Output Table 2 and Output Table 3 only take 1 second to evaluate, but that’s because they are working from data cached in the Enhanced Compute Engine – the table returned by Merge – and the transformations in them fold. The Merge table also uses data cached in the Enhanced Compute Engine: the tables returned by First and Second. What’s more, because Merge is a Computed Table it is only evaluated once in this dataflow. Loading the data for First and Second takes 52 seconds and 50 seconds respectively and Merge takes 35 seconds. In this case the hit of loading the data into the Enhanced Compute Engine is worth taking.

In conclusion, there are two things that these tests have shown:

  • Moving your dataflow to PPU can make a big difference to refresh performance.
  • The Enhanced Compute Engine can make dataflow refresh faster but not in all cases: you need to understand how it works, and in some cases the overhead of loading the data into it outweighs the performance advantages it gives you for any transformations later on. Use the information in Refresh History to work out what’s happening for your dataflow.

Power Query Memory Usage, Dataflow Container Size And Refresh Performance

When Power BI dataflows were officially released a few weeks ago there was a new setting for Premium capacities mentioned briefly in the announcement blog post: Container Size.

The blog post only says the following:

We’re introducing a new dataflow workload on premium capacity settings, enabling you to optimize dataflow workload performance for processing more complex, compute-heavy dataflows. This setting is available in the Capacity Admin portal, Dataflow workload settings.

…which does not tell you much at all. Pedro Fernandes contacted me to see if I knew more and because I didn’t, I started investigating. This resulted in me learning lots of new information around how the Power Query engine in Power BI Desktop and Excel uses memory, how things are different in the Power BI service, and how all of this can have an impact on query refresh performance.

If you’ve read this blog post, and this related thread on the Power Query forum, you’ll know that when a Power Query query is evaluated the work is done by a Microsoft.Mashup.Container process, visible in tools such as Task Manager and Resource Monitor. A single refresh operation in Excel or Power BI Desktop might result in multiple evaluations of multiple queries for different reasons, so it’s not uncommon to see multiple Microsoft.Mashup.Container processes.

image

In another thread from the Power Query forum, Curt Hagenlocher of the dev team explains there is a 256MB limit on the amount of physical RAM that each Microsoft.Mashup.Container process can use, although there is is no limit on the amount of virtual memory that can be used. The thread is about how using Table.Buffer can be bad for refresh performance but the details are more widely applicable. Here are the highlights:

Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.

If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won’t help performance.

Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB — which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.

Currently, “table at a time” operations like joins, sort, many groupings, pivot, unpivot, etc., all happen in RAM (unless folded). For large tables, these will consume a lot of memory.

 

The 256MB limit is also mentioned briefly on this thread.

Here’s a screenshot from Resource Monitor showing this: the Microsoft.Mashup.Container process in this case is evaluating a query that reads data from a large (900,000ish rows) csv file and does a Pivot, and while the Working Set value has peaked at around the 256MB limit you can see the Commit value is much higher, and so paging must be taking place.

image

Currently there is no way to change this 256MB in Power BI Desktop or Excel although someone has already posted a suggestion on the Ideas site to allow us to change it. How much of an impact does this actually have on refresh performance though? Without the ability to change this setting it’s hard to say, but I suspect it could be significant and that a lot of Power Query performance problems could be explained by this behaviour.

The situation is different in the Power BI service, where I understand there is a limit on the overall amount of memory that a single Power Query query evaluation can use. This makes a lot of sense in the context of Power BI Pro and shared capacity because Microsoft could not allow one user to run lots of complex, expensive Power Query queries that might affect other users inside or outside the same tenant. With Power BI Premium, which gives you your own dedicated capacity in the Power BI service, there is no chance that anything you do will affect other tenants and so Microsoft gives you more control over how resources are used. As a result, the new Container Size setting for a dataflow in a Premium capacity lets you configure the amount of memory that can be used for a single entity refresh within a dataflow – and refreshing a single entity in a dataflow is, as far as I understand it, the equivalent of what a Microsoft.Mashup.Container process does on the desktop.

I did some (not very scientific) testing and it looks like increasing the Container Size setting can have a noticeable impact on the performance of memory-intensive queries. Using the technique I blogged about here I measured the execution time of the query mentioned above that does a Pivot on data from a large, local csv file in Power BI Desktop: it took 98 seconds on my laptop. I then used a dataflow to load the source data from the csv file into an entity in a dataflow on an A4 capacity, without making any changes. This took 132 seconds; you can get the time taken for a dataflow refresh by clicking on the Refresh History option, as Matthew Roche shows here. I then created a computed entity that used this new entity as its source and which did the same Pivot operation as the original query on the desktop. The following table shows the time taken to refresh this computed entity in the A4 capacity with different Container Size settings:

Container Size (MB) Refresh Time (seconds)
700 61
2000 57
5000 52

700MB is the default setting for Container Size and the smallest value that you can use; I guess the maximum value you can set will depend on the size of the capacity you’re using. There are two conclusions that I think you can draw from these results:

  • Even with the default setting for Container Size, it was faster for the computed entity to read the data from the source entity (which, remember, stores its data in Azure Data Lake Gen2 storage) and do the Pivot in the Power BI service than it was for Power Query in Power BI Desktop to read the data from the csv file and do the same Pivot operation on my laptop.
  • Increasing the Container Size setting reduced refresh time quite significantly.

So, as the blog post I referenced at the very beginning of this post states, if you are doing memory-intensive operations such as group bys, sorts, pivots, unpivots and joins against non-foldable data sources in a dataflow, and if that dataflow is on a Premium capacity, then increasing the Container Size property is probably a good idea because it may reduce refresh times. If you can reproduce this on your own Premium capacities please let me know by leaving a comment – I would be very interested to hear about your experiences.

[Thanks to Curt Hagenlocher and Anton Fritz for providing information for this blog post]

Defining Relationships Between Entities In The Common Data Model To Automatically Create Relationships In Power BI

Following on from my last post on attaching manually-created Common Data Model folders in Power BI dataflows, I was asked whether defining relationships between entities in the model.json file of the CDM folder results in relationships being created between tables in a Power BI dataset. I’ve just tested it, and I can confirm it does.

Consider a CDM folder that contains two entities, Sales and Fruit. The Sales entity contains monthly sales data:

image

…and the Fruit entity contains a list of fruit sold:

image

Now –and this is important – let’s say you have defined the entities in the model.json so that the Sales entity  has attributes Month, Product and Sales:

image

…and the Fruit entity has a single attribute called Fruit:

image

Because the Sales entity and the Fruit entity use different names for the attributes that contain the names of the fruit sold, when the two entities are loaded into Power BI no relationships are created between the two tables. There’s no way that Power BI can guess that there should be a relationship between the Product and Fruit columns based on these names.

However, if you add a relationship definition (as documented here) to the model.json file like so:

image

Then, when you load the two entities as tables in the same Power BI dataset, you get a relationship created automatically:

image

As far as I can see it does this by adding a key definition to the Sales table in a similar way to the way the Table.AddKeyM M function/Remove Duplicates transformation does, as I blogged here.

Now if we could add DAX calculations (calculated columns and especially measures) to the definition of the model.json file, so they were automatically created when entities were imported from a dataflow, that would be really cool. I don’t see why this would not be possible because you could store all kinds of useful information – such as the DAX for the calculations – in the metadata record of an M query (you can access this yourself using the Value.Metadata M function) loaded into a dataset, and the Power BI engine could read it when the table is loaded and create the calculations from this.

Attaching Manually-Created Common Data Model Folders In Power BI Dataflows

Most of the focus on Power BI dataflows so far has been on the use of Power Query Online to load data from external data sources. However, when you create a dataflow you also have the option to attach an existing Common Data Model (CDM) folder stored in Azure Data Lake Storage Gen2:

image

The documentation has some very detailed information on this subject: you can read how to configure Power BI to use your own Azure Data Lake Storage Gen2 account here, and you can read about how to attach an existing CDM folder here.

The interesting thing for me is that this means that any external service – whether it’s a Microsoft service, a third party service or something you build yourself – can output data into a folder in Azure Data Lake Storage Gen2, and so long as it’s in the right format you can expose this data as a dataflow in Power BI. So, I thought, why not create some data manually, put it in the cloud and see if I can get Power BI to attach it as a dataflow?

It turns out it isn’t too difficult to do, and what’s more when you look at the documentation for the Common Data Model and specifically the documentation for the format of the model.json file there are all kinds of interesting features to check out.

Consider the following folder containing three files, all created manually and uploaded to my Azure Data Lake Storage Gen2 account using Azure Storage Explorer:

image

The two csv files are pretty simple:

image

image

The model.json file, which contains the metadata for the dataflow, is also not that complicated:

image

This model.json file describes a single entity called Sales with three columns: Month, Product and Sales. It also specifies that the data for this entity is stored in two partitions and these partitions are linked to the two csv files above (I’m pretty sure this is how incremental refresh is handled in dataflows too). I like the idea of another application being able to save new csv files to a folder and, after updating the model.json file to add a new partition, this data being appended onto the table of data returned by the entity.

It took a few goes to attach the folder to a dataflow, mostly because when it fails the error messages (at least at the time of writing) are completely unhelpful. Two things I ran into are:

  • Any spaces in folder or file paths need to be url encoded, ie replaced with %20
  • You really do need to grant permissions on the folder to the Power BI users who are going to access the dataflow. RTFM.

After that, the dataflow is visible in the Power BI service:

image

…and the data from the Sales entity can be loaded into your dataset in Power BI Desktop:

image

image

As I said, there’s more for me to research here. For example, looking in the documentation, you can create two types of entities – Local and Referenced (are the latter the same as Linked Entities?) – and you can also define relationships between entities (do these become relationships between tables in Power BI?). I think this opens up a lot of new possibilities for close integration between Power BI and other services!

%d bloggers like this: