Extracting All The M Code From A Power BI Dataset Using The DISCOVER_M_EXPRESSIONS DMV

DMVs (Dynamic Management Views) are, as the Analysis Services documentation states, “queries that return information about model objects, server operations, and server health”. They’re also available in Azure Analysis Services, Power BI and Power Pivot and are useful for a variety of reasons, for example for generating documentation.

Several as-yet undocumented DMVs have appeared in Power BI recently and one that caught my eye was DISCOVER_M_EXPRESSIONS. Unfortunately, when I tried to run it in DAX Studio against an open Power BI file I got an error saying it was only available in the Power BI Service:

[screenshot]

Luckily, now that XMLA Endpoints are in preview and SQL Server Management Studio 18 has been released (which supports connections to Power BI via XMLA Endpoints), we can test it against a published dataset stored in a Premium capacity. The following query, run from a DAX query window in SQL Server Management Studio:

select * from
$system.discover_m_expressions

…returns a list of all the Power Query queries in the selected dataset and their M code:

[screenshot]

If you don’t have Premium you can run the same query from an Excel table against any published dataset using the technique I blogged about here:

[screenshot]

I know there are other methods for doing this (for example using copy/paste), but it’s useful to be able to do this via a DMV because it means you can automate the process of extracting all your M code easily.
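For example, here’s a minimal sketch of how you could query the DMV from M itself, going through the XMLA Endpoint with AnalysisServices.Database – the workspace and dataset names below are just placeholders, and of course this assumes your dataset is on a Premium capacity with XMLA Endpoints enabled:

let
    // hypothetical workspace and dataset names - replace with your own
    Source = AnalysisServices.Database(
        "powerbi://api.powerbi.com/v1.0/myorg/MyWorkspace",
        "MyDataset",
        [Query = "select * from $system.discover_m_expressions"]
        )
in
    Source

You could then land the output of a query like this in an Excel table or a csv file as part of an automated documentation process.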

Some of the other new DMVs look like they are worthy of a blog post too – I can guess what most of them do from their names, but others are more mysterious and perhaps hint at features that have not been announced yet.

DAX Machine Learning Functionality Used By The Key Influencers Visual In Power BI

I’m one of those people who can’t resist peeking behind the scenes, and so when the Key Influencers visual appeared in Power BI I couldn’t help wondering how it worked its machine learning magic. Using DAX Studio to look at the DAX queries generated by the visual proved to be very revealing: it turns out that it uses a number of new DAX functions that are undocumented and probably not meant to be used outside Microsoft. For example, the following screenshot shows a DAX query generated by the Key Influencers visual that uses functions called AI.SampleStratified, AI.Train, AI.KeyDrivers and AI.ExtractProfileFilters:

[screenshot]

Using Profiler (in a similar way to what I describe in this post) to go into even more detail about what happens when these queries run shows that they raise the DAX Extension events that I’ve been wondering about for a long time now:

[screenshots]

So Power BI can train and query machine learning models inside its own database engine – which, when you think about it, is pretty darned cool. And then I thought: hold on, other visuals have had machine learning features for a long time. For example, the Line Chart visual can create forecasts, but although DAX Studio shows yet another undocumented function called SampleAxisWithLocalMinMax(), this does not actually seem to perform the forecasting, which I assume must be done inside the code of the visual itself:

[screenshot]

My guess is that the functionality used by the Key Influencers visual is new functionality in the engine.

This is a fascinating insight into how Power BI works, but is it of any practical use to us? Let me be clear: I don’t think you should be using any of these functions yourself in a real-world report. I’m sure all this would be documented and publicised if Microsoft wanted us to use it ourselves! Another consideration is that these new functions return tables, which makes them awkward to use in regular .pbix Power BI reports – I guess we could create calculated tables, although that’s not as flexible as returning a table from a query as shown above. That said, even though we can’t write our own DAX queries in regular Power BI reports, we can write our own DAX queries in Paginated Reports, and we can now create Paginated Reports that use a Power BI dataset as a data source. I tested putting one of the queries generated by the Key Influencers visual into a Paginated Report connected to the same dataset and it worked OK (even after publishing). You can also embed DAX queries connected to a published dataset in Excel, as I show here. Hmm, plenty to think about then…

Using The “Skip Test Connection” Option For Power BI Data Sources To Avoid Scheduled Refresh Failures

One of the most popular blog posts I have written in recent years is this post on how the RelativePath and Query options for the Web.Contents() M function can be used to allow datasets that use web services as data sources to refresh successfully in the Power BI Service. It’s very frustrating when you write some complex M code in Power BI Desktop and find that your dataset can’t be refreshed after it has been published! While, at the time of writing, this problem is still present, yesterday’s announcement of the new “Skip Test Connection” option for data sources in the Power BI Service does mean there are more scenarios where your datasets can be made to refresh.

As I mentioned in that original post, the trick of using the RelativePath and Query options with Web.Contents() only works if the first parameter passed to Web.Contents() is a url that itself can be called on its own without an error. For example, the following expression:

Web.Contents(
	"http://www.myservice.com",
	[RelativePath="hello",
	 Query=[q="SearchTerm"]
	])

…will only refresh successfully in the Power BI Service if http://www.myservice.com can be reached as a url on its own without returning an error. I believe this is because this url is used when the Power BI Service is preparing to run the query, to check whether the credentials it has stored are valid. As the blog post says, there is a workaround where dummy query parameter values can be used in that first parameter, but this won’t work in all cases. Instead, now you can turn on the “Skip Test Connection” option on the data source in the Power BI Service and the dataset will refresh even if http://www.myservice.com on its own returns an error.

Similarly this new option helps when you are using API keys passed in through query parameters for authentication. In Power BI Desktop and Excel Get&Transform/Power Query, you have the option to use the Web API authentication type to store sensitive API keys that need to be passed through url parameters securely rather than embed them in your M code. Unfortunately this is not supported in the Power BI Service yet and (as I found out on a recent consultancy engagement) there are other complications with using Anonymous authentication. For example, say you need to call a web service with an M expression something like this:

Web.Contents("http://www.myservice.com?key=password123")

…where the key query parameter contains an API key used for authentication. If you set this data source to use Anonymous authentication in Power BI Desktop or Excel it will refresh with no problems. However, after you publish to the Power BI Service you will not be able to refresh your dataset unless you turn on the “Skip Test Connection” option, because the Power BI Service apparently strips out any query parameters from the url when it is testing the connection. In this case, if it strips out the key parameter, then obviously the call will fail because it has removed the means of authentication.

This new option does not mean every dataset can now be refreshed. As far as I can see, if you dynamically generate a url for use with Web.Contents() and do not use the RelativePath and/or Query options, you will not be able to refresh your dataset because the Power BI Service will not be able to analyse your code to discover what urls are present in it. I suspect that if you see the “Some data sources may not be listed because of hand-authored queries” warning in the Data Source Settings dialog in Power BI Desktop (which I blogged about here) then you will not be able to refresh your dataset in the Power BI Service, but I need to do more testing to be sure.
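To illustrate what I mean, here’s a hedged sketch (the url and parameter names are made up) showing a dynamically generated url alongside the equivalent call rewritten to use a static base url with the RelativePath and Query options:

let
    SearchTerm = "hello",
    // a url built up dynamically like this cannot be statically analysed,
    // so a dataset using it is likely to fail to refresh in the service
    DynamicVersion = Web.Contents(
        "http://www.myservice.com/search?q=" & SearchTerm
        ),
    // the same call with a fixed base url plus RelativePath and Query,
    // which the service can analyse
    StaticVersion = Web.Contents(
        "http://www.myservice.com",
        [RelativePath = "search", Query = [q = SearchTerm]]
        )
in
    StaticVersion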

As always, if you test this yourself please let me know about any interesting things you find by leaving a comment!

Power Query Memory Usage, Dataflow Container Size And Refresh Performance

When Power BI dataflows were officially released a few weeks ago there was a new setting for Premium capacities mentioned briefly in the announcement blog post: Container Size.

The blog post only says the following:

We’re introducing a new dataflow workload on premium capacity settings, enabling you to optimize dataflow workload performance for processing more complex, compute-heavy dataflows. This setting is available in the Capacity Admin portal, Dataflow workload settings.

…which does not tell you much at all. Pedro Fernandes contacted me to see if I knew more and, because I didn’t, I started investigating. This resulted in me learning lots of new information around how the Power Query engine in Power BI Desktop and Excel uses memory, how things are different in the Power BI service, and how all of this can have an impact on query refresh performance.

If you’ve read this blog post, and this related thread on the Power Query forum, you’ll know that when a Power Query query is evaluated the work is done by a Microsoft.Mashup.Container process, visible in tools such as Task Manager and Resource Monitor. A single refresh operation in Excel or Power BI Desktop might result in multiple evaluations of multiple queries for different reasons, so it’s not uncommon to see multiple Microsoft.Mashup.Container processes.

[screenshot]

In another thread from the Power Query forum, Curt Hagenlocher of the dev team explains that there is a 256MB limit on the amount of physical RAM that each Microsoft.Mashup.Container process can use, although there is no limit on the amount of virtual memory that can be used. The thread is about how using Table.Buffer can be bad for refresh performance but the details are more widely applicable. Here are the highlights:

Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.

If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won’t help performance.

Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB — which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.

Currently, “table at a time” operations like joins, sort, many groupings, pivot, unpivot, etc., all happen in RAM (unless folded). For large tables, these will consume a lot of memory.
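To make the buffering scenario described above concrete, here’s a minimal sketch (the file path and column name are hypothetical) of a query where the same table is enumerated twice – the situation where Table.Buffer might help, subject of course to the 256MB cap:

let
    // hypothetical csv file - any expensive-to-enumerate source would do
    Source = Csv.Document(
        File.Contents("c:\temp\bigfile.csv"),
        [Delimiter = ",", QuoteStyle = QuoteStyle.Csv]
        ),
    // buffer the rows in memory so the two enumerations below do not
    // each go back to the file; the buffered table has to fit in RAM
    Buffered = Table.Buffer(Source),
    Output = #table(
        {"Row Count", "Distinct Values In Column1"},
        {{
            Table.RowCount(Buffered),
            List.Count(List.Distinct(Table.Column(Buffered, "Column1")))
        }}
        )
in
    Output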


The 256MB limit is also mentioned briefly on this thread.

Here’s a screenshot from Resource Monitor showing this: the Microsoft.Mashup.Container process in this case is evaluating a query that reads data from a large (900,000-ish rows) csv file and does a Pivot. While the Working Set value has peaked at around the 256MB limit, you can see the Commit value is much higher, so paging must be taking place.

[screenshot]
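For reference, a query of the kind used in this test might look something like the following sketch – the file path, column names and aggregation are hypothetical stand-ins rather than the actual query:

let
    // hypothetical file path and column names - not the actual test query
    Source = Csv.Document(
        File.Contents("c:\temp\bigfile.csv"),
        [Delimiter = ",", QuoteStyle = QuoteStyle.Csv]
        ),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    ChangedTypes = Table.TransformColumnTypes(
        PromotedHeaders,
        {{"Sales", type number}}
        ),
    // Pivot is a "table at a time" operation, so the whole table has to be
    // held in memory by the Microsoft.Mashup.Container process doing the work
    Pivoted = Table.Pivot(
        ChangedTypes,
        List.Distinct(ChangedTypes[Category]),
        "Category",
        "Sales",
        List.Sum
        )
in
    Pivoted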

Currently there is no way to change this 256MB limit in Power BI Desktop or Excel, although someone has already posted a suggestion on the Ideas site to allow us to change it. How much of an impact does this actually have on refresh performance though? Without the ability to change this setting it’s hard to say, but I suspect it could be significant and that a lot of Power Query performance problems could be explained by this behaviour.

The situation is different in the Power BI service, where I understand there is a limit on the overall amount of memory that a single Power Query query evaluation can use. This makes a lot of sense in the context of Power BI Pro and shared capacity because Microsoft could not allow one user to run lots of complex, expensive Power Query queries that might affect other users inside or outside the same tenant. With Power BI Premium, which gives you your own dedicated capacity in the Power BI service, there is no chance that anything you do will affect other tenants and so Microsoft gives you more control over how resources are used. As a result, the new Container Size setting for a dataflow in a Premium capacity lets you configure the amount of memory that can be used for a single entity refresh within a dataflow – and refreshing a single entity in a dataflow is, as far as I understand it, the equivalent of what a Microsoft.Mashup.Container process does on the desktop.

I did some (not very scientific) testing and it looks like increasing the Container Size setting can have a noticeable impact on the performance of memory-intensive queries. Using the technique I blogged about here I measured the execution time of the query mentioned above that does a Pivot on data from a large, local csv file in Power BI Desktop: it took 98 seconds on my laptop. I then loaded the source data from the csv file, without making any changes, into an entity in a dataflow on an A4 capacity. This took 132 seconds; you can get the time taken for a dataflow refresh by clicking on the Refresh History option, as Matthew Roche shows here. I then created a computed entity that used this new entity as its source and which did the same Pivot operation as the original query on the desktop. The following table shows the time taken to refresh this computed entity in the A4 capacity with different Container Size settings:

Container Size (MB)    Refresh Time (seconds)
700                    61
2000                   57
5000                   52

700MB is the default setting for Container Size and the smallest value that you can use; I guess the maximum value you can set will depend on the size of the capacity you’re using. There are two conclusions that I think you can draw from these results:

  • Even with the default setting for Container Size, it was faster for the computed entity to read the data from the source entity (which, remember, stores its data in Azure Data Lake Gen2 storage) and do the Pivot in the Power BI service than it was for Power Query in Power BI Desktop to read the data from the csv file and do the same Pivot operation on my laptop.
  • Increasing the Container Size setting reduced refresh time quite significantly.

So, as the blog post I referenced at the very beginning of this post states, if you are doing memory-intensive operations such as group bys, sorts, pivots, unpivots and joins against non-foldable data sources in a dataflow, and if that dataflow is on a Premium capacity, then increasing the Container Size property is probably a good idea because it may reduce refresh times. If you can reproduce this on your own Premium capacities please let me know by leaving a comment – I would be very interested to hear about your experiences.

[Thanks to Curt Hagenlocher and Anton Fritz for providing information for this blog post]

Table.Buffer() Does Not Buffer Nested Values

Here’s yet another entry in the list of useful things I learned from Ehren von Lehe on the Power Query MSDN forum: Table.Buffer() does not buffer nested table, record or list values inside the cells of a table. From this thread:

Buffering is shallow. It will force the evaluation of any scalar cell values, but will leave non-scalar values (records, lists, tables, etc.) as-is.

It’s actually really difficult to come up with a simple demo query to prove this though (the Power Query engine is too clever about not evaluating things it doesn’t need for the final output of a query), but it’s fairly easy to understand the principle. Whenever you have an expression that returns a table something like this:

[screenshot]

…if you use Table.Buffer() on this table it will only buffer the scalar values (in this case the text values in the Name and Signature columns). The nested table values, as in the highlighted cell, will not be buffered in memory and if you try to access the contents it may result in another call back to the underlying data source.
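Here’s a contrived sketch (the values are made up, and because everything here is a literal it won’t show any real performance difference) just to illustrate what shallow buffering means for a table shaped like the one above:

let
    // a table with a scalar column and a nested table column
    SourceTable = #table(
        {"Name", "Details"},
        {
            {"Fred", #table({"City"}, {{"London"}})},
            {"Wilma", #table({"City"}, {{"Paris"}})}
        }
        ),
    // Table.Buffer forces evaluation of the scalar Name values, but the
    // nested tables in the Details column are not buffered - accessing
    // their contents later may still go back to the underlying source
    Buffered = Table.Buffer(SourceTable)
in
    Buffered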

Comparing The Performance Of Reading Data From Files With File.Contents And Web.Contents In Power Query And Power BI

In my last post I mentioned the Power Query engine’s persistent cache, which in some scenarios caches the data read from a data source when a query is refreshed. Another important nugget of information that Ehren von Lehe of the Power Query dev team mentioned in a post on the Power Query MSDN forum recently is the fact that if you use File.Contents to get data from a file then the persistent cache is not used, but if you use Web.Contents to get data from the same file then the persistent cache is used. I guess the thinking here is that there is no point creating an on-disk cache containing the contents of a file that is already on disk.

Using Process Monitor (see here and here for more information on how to do this) to view how much data is read from disk when a query is run, it is possible to see this in action. Take a slow query that reads data from a 150MB csv file: this particular query results in five reads of the csv file – it’s more or less the same scenario from my second Process Monitor blog post here, with the query itself described in a lot of detail here. Here’s a graph of the data captured by Process Monitor, showing time in seconds on the X axis and the amount of data read in MB on the Y axis:

[screenshot]

Since this query uses File.Contents to get the data from the csv file, the persistent cache is not used; as you can see, the graph shows clearly that the full contents of the csv file are read five times.

The same query altered to use Web.Contents shows just two full reads:

[screenshot]

I have no idea why the file is read twice rather than once, but it’s definitely different behaviour to the version that uses File.Contents.

As far as I can see it is possible to replace File.Contents with Web.Contents in every case. So, if you have the following expression:

File.Contents("c:\users\myuser\Desktop\file.txt")

You can just replace it with:

Web.Contents("c:\users\myuser\Desktop\file.txt")
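For example, a complete query reading a csv file (the path and options here are just illustrative) could be switched over like this:

let
    // hypothetical file path; the options shown are typical for a csv file
    Source = Csv.Document(
        Web.Contents("c:\users\myuser\Desktop\file.csv"),
        [Delimiter = ",", QuoteStyle = QuoteStyle.Csv]
        ),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    PromotedHeaders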

Which one is faster though? Just because a query reads data from disk more often does not necessarily mean that it will be slower. In the above scenario, with the csv file stored on my local hard drive, the Web.Contents version of the query refreshes in 18 seconds while the File.Contents version refreshes in 14 seconds. Replacing the csv file with an Excel file that contains the same data (remember that Excel files are a lot slower than csv files to read data from, as I showed here) results in the version of the query that uses File.Contents running in 205 seconds, while the version that uses Web.Contents runs in 297 seconds. So it looks like, in most cases, File.Contents is the right choice when reading data from a file (as you would hope).

However, when using the same csv file stored on a network file share, the Web.Contents version takes 23 seconds while the File.Contents version takes 25 seconds. So maybe if you are dealing with files that are stored remotely over a slow connection it might be worth replacing File.Contents with Web.Contents to see if you get any performance benefits. There may be other situations where Web.Contents is the faster choice too. If you test this and see a difference, let me know by leaving a comment!

UPDATE: please also read Curt Hagenlocher’s comment below – Web.Contents may be changed in the future so that it only works with http/https urls.

Power BI, Caching, Parallelism And Power Query Refresh Performance

Some time ago a customer of mine (thank you, Robert Lochner) showed me a very interesting scenario where a set of Power Query queries in Power BI Desktop refreshed a lot faster with the “Enable Parallel Loading Of Tables” option turned off. This seemed a bit strange, but I have recently been reading lots of posts by Ehren von Lehe of the Power Query dev team on caching and query evaluation on the Power Query MSDN forum and two posts in particular, here and here, offered an explanation of what was going on. In this post I will walk through a very simple scenario that illustrates the information in that post and sheds some light on the internals of the Power Query engine.

Let’s say you have a very simple web service built in Flow that, when it receives a GET request, waits 5 seconds and then returns the text “Hello”:

[screenshot]

Here’s the M code for a Power Query query called SourceQuery that calls this web service and returns a table with one row and one column containing the text “Hello”:

let
    Source = Web.Contents("https://entermyurlhere.com"),
    ToLines = Lines.FromBinary(Source, null, null, 65001),
    ToTable = Table.FromColumns({ToLines})
in
    ToTable

[screenshot]

Let’s also say you have four other identical queries called R1, R2, R3 and R4 that reference this query and do nothing but return the same table. Here’s the M code for these queries:

let
    Source = SourceQuery
in
    Source

Now, here’s the big question! If the output of SourceQuery is not loaded into Power BI but the output of R1, R2, R3 and R4 is loaded, as follows:

[screenshot]

How many times would you expect the web service to be called when you refresh your dataset? One or four?

With the “Enable Parallel Loading Of Tables” setting turned on (which is the default), the “Enable Data Preview To Download In The Background” setting turned off and Data Privacy checks turned off, Fiddler (see here for how to use it) shows that the web service is called four times. SQL Server Profiler (see here for how to use it with Power BI) shows that the four queries are run in parallel and each query takes just over 5 seconds.

This seems wrong, because your natural inclination is to think that the Power Query engine evaluates SourceQuery first, and after that has happened the data passes from there to R1, R2, R3 and R4. In fact what happens is the reverse: R1, R2, R3 and R4 are evaluated in parallel and because each of them reference SourceQuery, then each of them causes SourceQuery to be evaluated independently. Yet another post by Ehren on the forum here explains this in detail.

The Power Query engine does do some caching in a ‘persistent cache’ that can be shared between queries and which stores the data requested from some types of data source on disk – it will store the data returned by Web.Contents, for example, but not File.Contents – and even then only in some situations. The key statements about the persistent cache from this thread relevant to this particular problem are:

Power Query’s persistent cache (which stores data on disk during a particular refresh) is updated via a background thread. And separate evaluations (i.e. separate Microsoft.Mashup.Container.*.exe processes) running at the same time are not coordinated; when evaluation A is accessing the persistent cache (including updating it), this doesn’t block evaluation B from accessing the cache. This means that even when using a shared persistent cache, PQ can potentially end up requesting the same data twice. It depends on the timing of the various requests.

and

…you can reduce the number of evaluations happening at any given time (and thus the likelihood of timing-related cache misses, or additional unwanted requests) by doing the following:

  • Disabling background analysis (which pre-calculates the PQ Editor previews behind the scenes)
  • Disabling the Data Privacy Firewall (which does its own evaluations during refresh that could potentially cause duplicate requests, and also only consume a subset of the data in certain cases)

In PBIDesktop, you can also reduce the likelihood of timing-related cache misses by disabling parallel loading. (In Excel, the loading of multiple queries is always sequential)

As a result, with the “Enable Parallel Loading Of Tables” setting turned off, so the queries above are evaluated in series, Fiddler shows the web service is called only once. What’s more, Profiler shows that only one of the queries out of R1, R2, R3 and R4 takes just over five seconds while the other three, executed after it, are almost instant.

This must be because, when the queries are executed in series, the first query refreshes and data gets loaded into Power Query’s persistent cache. After that, even though the three subsequent queries also cause SourceQuery to be evaluated, each time SourceQuery is evaluated it can reuse the data stored in the persistent cache.

In contrast, with the four queries refreshed in parallel, each evaluation of SourceQuery takes place before the persistent cache has been populated and therefore the web service is called four times. This is slower, but in this case not much slower, than the scenario with the four queries executed in series – each query evaluation takes just over five seconds but remember that they are being executed in parallel so the overall duration is not much more than five seconds. In other cases the difference between parallel and sequential query execution could be a lot larger.

Now for the bad news: the “Enable Parallel Loading Of Tables” setting only works in Power BI Desktop, and there is no equivalent setting in the Power BI service. When a dataset is refreshed in the service, as far as I can tell the queries in the dataset are always evaluated in parallel. In Excel, as Ehren says, all queries are executed sequentially.

Is it possible to make sure the web service is called only once with all the queries executed in parallel? Yes, but not in a completely reliable way. The first thing to say here is that Table.Buffer and related buffering functions are not, as far as I understand it, useful here: they buffer data in memory within a single chain of execution, and in this case we have four separate chains of execution for R1, R2, R3 and R4. Instead, what I have found works is inserting a delay that gives the persistent cache time to be populated. If you keep the SourceQuery and R1 queries the same, and then alter the M code for R2, R3 and R4 so they wait for ten seconds (using Function.InvokeAfter) before returning, as follows:

let
    Source = Function.InvokeAfter(
        ()=>SourceQuery, 
        #duration(0,0,0,10)
        )
in
    Source


…then the web service is only called once. In this particular case, though, this approach results in the slowest refresh time of all, and it only works because I can be sure that the call to the web service takes five seconds.

There’s one last thing to say here: wouldn’t it be nice if we could ensure that the web service was only called once, and all subsequent queries got their data from a persisted copy of the data? As Matthew Roche points out here, this is exactly what dataflows allow you to do. Taking SourceQuery and turning it into an entity in a dataflow would result in a single call to the web service when the entity is refreshed, with the data from the web service being persisted in the dataflow. You would then just need to have the queries R1, R2, R3 and R4 in your dataset, change them so they get their data from the entity in the dataflow, and as a result they would use the persisted copy of the data in the dataflow and would not call the web service. Another reason to use dataflows!

[I hope all the information in this post is correct – it is based on Ehren’s forum posts and my own observations alone. If anyone from the Power Query dev team is reading this and spots an error, please let me know]
