Power BI, Caching, Parallelism And Power Query Refresh Performance

Some time ago a customer of mine (thank you, Robert Lochner) showed me a very interesting scenario where a set of Power Query queries in Power BI Desktop refreshed a lot faster with the “Enable Parallel Loading Of Tables” option turned off. This seemed a bit strange, but I have recently been reading lots of posts by Ehren von Lehe of the Power Query dev team on caching and query evaluation on the Power Query MSDN forum and two posts in particular, here and here, offered an explanation of what was going on. In this post I will walk through a very simple scenario that illustrates the information in that post and sheds some light on the internals of the Power Query engine.

Let’s say you have very simple web service built in Flow that, when it receives a GET request, waits 5 seconds and then returns the text “Hello”:

image

Here’s the M code for a Power Query query called SourceQuery that calls this web service and returns a table with one row and one column containing the text “Hello”:

let
Source = Web.Contents("https://entermyurlhere.com"),
ToLines = Lines.FromBinary(Source, null, null, 65001),
ToTable = Table.FromColumns({ToLines})
in
ToTable
image

Let’s also say you have four other identical queries called R1, R2, R3 and R4 that reference this query and do nothing but return the same table. Here’s the M code for these queries:

let
Source = SourceQuery
in
Source

Now, here’s the big question! If the output of SourceQuery is not loaded into Power BI but the output of R1, R2, R3 and R4 is loaded, as follows:

image

How many times would you expect the web service to be called when you refresh your dataset? One or four?

With the setting of “Enable Parallel Loading Of Tables” turned on, which is the default, the “Enable Data Preview To Download In The Background” setting turned off and Data Privacy checks turned off, Fiddler (see here for how to use it) shows the web service is called four times. SQL Server Profiler (see here for how to use it with Power BI) shows that the four queries are run in parallel and each query takes just over 5 seconds.

This seems wrong, because your natural inclination is to think that the Power Query engine evaluates SourceQuery first, and after that has happened the data passes from there to R1, R2, R3 and R4. In fact what happens is the reverse: R1, R2, R3 and R4 are evaluated in parallel and because each of them reference SourceQuery, then each of them causes SourceQuery to be evaluated independently. Yet another post by Ehren on the forum here explains this in detail.

The Power Query engine does do some caching in a ‘persistent cache’ that can be shared between queries and which stores the data requested from some types of data source on disk – it will store the data returned by Web.Contents for example, but not File.Contents – and only in some situations. The key statements about the persistent cache from this thread relevant to this particular problem are:

Power Query’s persistent cache (which stores data on disk during a particular refresh) is updated via a background thread. And separate evaluations (i.e. separate Microsoft.Mashup.Container.*.exe processes) running at the same time are not coordinated; when evaluation A is accessing the persistent cache (including updating it), this doesn’t block evaluation B from accessing the cache. This means that even when using a shared persistent cache, PQ can potentially end up requesting the same data twice. It depends on the timing of the various requests.

and

…you can reduce the number of evaluations happening at any given time (and thus the likelihood of timing-related cache misses, or additional unwanted requests) by doing the following:

  • Disabling background analysis (which pre-calculates the PQ Editor previews behind the scenes)
  • Disabling the Data Privacy Firewall (which does its own evaluations during refresh that could potentially cause duplicate requests, and also only consume a subset of the data in certain cases)

In PBIDesktop, you can also reduce the likelihood of timing-related cache misses by disabling parallel loading. (In Excel, the loading of multiple queries is always sequential)

As a result, with the “Enable Parallel Loading Of Tables” setting turned off, so the queries above are evaluated in series, Fiddler shows the web service is called only once. What’s more, Profiler shows that only one of the queries out of R1, R2, R3 and R4 takes just over five seconds while the other three, executed after it, are almost instant.

This must be because, when the queries are executed in series, the first query refreshes and data gets loaded into Power Query’s persistent cache. After that, even though the three subsequent queries also cause SourceQuery to be evaluated, each time SourceQuery is evaluated it can reuse the data stored in the persistent cache.

In contrast, with the four queries refreshed in parallel, each evaluation of SourceQuery takes place before the persistent cache has been populated and therefore the web service is called four times. This is slower, but in this case not much slower, than the scenario with the four queries executed in series – each query evaluation takes just over five seconds but remember that they are being executed in parallel so the overall duration is not much more than five seconds. In other cases the difference between parallel and sequential query execution could be a lot larger.

Now for the bad news: the “Enable Parallel Loading Of Tables” setting only works in Power BI Desktop, and there is no equivalent setting in the Power BI service. When a dataset is refreshed in the service, as far as I can tell the queries in the dataset are always evaluated in parallel. In Excel, as Ehren says, all queries are executed sequentially.

Is it possible to make sure the web service is called only once with all the queries executed in parallel? Yes, but not in a completely reliable way. The first thing to say here is that Table.Buffer and related buffering functions are not, as far as I understand it, useful here: they buffer data in memory within a single chain of execution, and in this case we have four separate chains of execution for R1, R2, R3 and R4. Instead what I have found that works is inserting a delay that gives the persistent cache time to be populated. If you keep the SourceQuery and R1 queries the same, and then alter the M code for R2, R3 and R4 so they wait for ten seconds (using Function.InvokeAfter) before returning as follows:z

let
Source = Function.InvokeAfter(
()=>SourceQuery,
#duration(0,0,0,10)
)
in
Source

In this particular case this results in the slowest refresh times and it only works because I can be sure that the call to the web service takes five seconds.

There’s one last thing to say here: wouldn’t it be nice if we could ensure that the web service was only called once, and all subsequent queries got their data from the a persisted copy of the data? As Matthew Roche points out here, this is exactly what dataflows allow you to do. Taking SourceQuery and turning it into an entity in a dataflow would result in a single call to the web service when the entity is refreshed, the data from the web service being persisted in the dataflow. You would then just need to have the queries R1, R2, R3 and R4 in your dataset, change them so they get their data from the entity in the dataflow, and as a result they would use the persisted copy of the data in the dataflow and would not call the web service. Another reason to use dataflows!

[I hope all the information in this post is correct  – it is based on Ehren’s forum posts and my own observations alone. If anyone from the Power Query dev team is reading this and spots an error, please let me know]

Power BI Sentinel: Backup, Documentation, Change Tracking And Lineage Tracking For Power BI

A few weeks ago at SQLBits I had a demo of a very interesting new tool for Power BI users called Power BI Sentinel. The website, with all the details, is here:

https://www.powerbisentinel.com

It helps solve several problems that everyone managing a Power BI deployment has to deal with. It can:

  • Backup reports and datasets (as .pbix files) direct from your App Workspaces to Azure Blob storage on a schedule, so you are able to access earlier versions and roll back if you need to.
  • Generate documentation on your datasets, including the DAX calculations used.
  • Identify what visuals and filters have changed in different versions of a report, and when those changes took place.
  • Track which data sources are used by which datasets and, in turn, which reports use those datasets.

image

It’s still a very new tool and adding functionality all the time; there is also a lot of functionality that you’d want from a tool like this that Microsoft hasn’t built the APIs to support yet. It’s definitely worth checking out though and (in my opinion) quite reasonably priced. I dare say Microsoft will build some of this functionality into Power BI at some point, but I don’t know what it will deliver exactly or when that will happen.

Full disclosure: I’ve known the owners of the company for a long time through their involvement with the UK SQL Server and Power BI community, and I was given a free licence for Power BI Sentinel by them.

Defining Relationships Between Entities In The Common Data Model To Automatically Create Relationships In Power BI

Following on from my last post on attaching manually-created Common Data Model folders in Power BI dataflows, I was asked whether defining relationships between entities in the model.json file of the CDM folder results in relationships being created between tables in a Power BI dataset. I’ve just tested it, and I can confirm it does.

Consider a CDM folder that contains two entities, Sales and Fruit. The Sales entity contains monthly sales data:

image

…and the Fruit entity contains a list of fruit sold:

image

Now –and this is important – let’s say you have defined the entities in the model.json so that the Sales entity  has attributes Month, Product and Sales:

image

…and the Fruit entity has a single attribute called Fruit:

image

Because the Sales entity and the Fruit entity use different names for the attributes that contain the names of the fruit sold, when the two entities are loaded into Power BI no relationships are created between the two tables. There’s no way that Power BI can guess that there should be a relationship between the Product and Fruit columns based on these names.

However, if you add a relationship definition (as documented here) to the model.json file like so:

image

Then, when you load the two entities as tables in the same Power BI dataset, you get a relationship created automatically:

image

As far as I can see it does this by adding a key definition to the Sales table in a similar way to the way the Table.AddKeyM M function/Remove Duplicates transformation does, as I blogged here.

Now if we could add DAX calculations (calculated columns and especially measures) to the definition of the model.json file, so they were automatically created when entities were imported from a dataflow, that would be really cool. I don’t see why this would not be possible because you could store all kinds of useful information – such as the DAX for the calculations – in the metadata record of an M query (you can access this yourself using the Value.Metadata M function) loaded into a dataset, and the Power BI engine could read it when the table is loaded and create the calculations from this.

Attaching Manually-Created Common Data Model Folders In Power BI Dataflows

Most of the focus on Power BI dataflows so far has been on the use of Power Query Online to load data from external data sources. However, when you create a dataflow you also have the option to attach an existing Common Data Model (CDM) folder stored in Azure Data Lake Storage Gen2:

image

The documentation has some very detailed information on this subject: you can read how to configure Power BI to use your own Azure Data Lake Storage Gen2 account here, and you can read about how to attach an existing CDM folder here.

The interesting thing for me is that this means that any external service – whether it’s a Microsoft service, a third party service or something you build yourself – can output data into a folder in Azure Data Lake Storage Gen2, and so long as it’s in the right format you can expose this data as a dataflow in Power BI. So, I thought, why not create some data manually, put it in the cloud and see if I can get Power BI to attach it as a dataflow?

It turns out it isn’t too difficult to do, and what’s more when you look at the documentation for the Common Data Model and specifically the documentation for the format of the model.json file there are all kinds of interesting features to check out.

Consider the following folder containing three files, all created manually and uploaded to my Azure Data Lake Storage Gen2 account using Azure Storage Explorer:

image

The two csv files are pretty simple:

image

image

The model.json file, which contains the metadata for the dataflow, is also not that complicated:

image

This model.json file describes a single entity called Sales with three columns: Month, Product and Sales. It also specifies that the data for this entity is stored in two partitions and these partitions are linked to the two csv files above (I’m pretty sure this is how incremental refresh is handled in dataflows too). I like the idea of another application being able to save new csv files to a folder and, after updating the model.json file to add a new partition, this data being appended onto the table of data returned by the entity.

It took a few goes to attach the folder to a dataflow, mostly because when it fails the error messages (at least at the time of writing) are completely unhelpful. Two things I ran into are:

  • Any spaces in folder or file paths need to be url encoded, ie replaced with %20
  • You really do need to grant permissions on the folder to the Power BI users who are going to access the dataflow. RTFM.

After that, the dataflow is visible in the Power BI service:

image

…and the data from the Sales entity can be loaded into your dataset in Power BI Desktop:

image

image

As I said, there’s more for me to research here. For example, looking in the documentation, you can create two types of entities – Local and Referenced (are the latter the same as Linked Entities?) – and you can also define relationships between entities (do these become relationships between tables in Power BI?). I think this opens up a lot of new possibilities for close integration between Power BI and other services!

Book Review: “Collect, Combine And Transform Data Using Power Query In Excel And Power BI” by Gil Raviv

I generally try to avoid writing book reviews here, but the fact that there are so few books available on Power Query and M means that I’m making an exception for “Collect, Combine and Transform Data using Power Query in Excel and Power BI” by Gil Raviv.

The first thing to say about this book is that it takes the approach of teaching through exercises and worked examples, rather than by explaining abstract concepts. If this suits your style of learning (and I know that it does for a lot of people) then you’re in luck; if you’re looking for a book that will explain what all the different join types for Merge operations do, for example, then you’ll be disappointed. This isn’t a criticism though – I don’t think it’s possible to write a book that will satisfy everyone – and Gil has done a good job of covering a lot of common data preparation scenarios. One important exception to this is the chapter on M which provides a very clear introduction to the language and the way it works. I suspect a lot of people will want to buy the book for this chapter alone.

The second thing to say about this book is that while it covers both Power Query in Excel and Power BI, in my opinion it’s aimed slightly more at Excel users. Again, this is not a criticism: although advanced Excel users and Power BI report designers have to solve many of the same problems, they also have some very different concerns too. What’s more, if you can assume your readers have good Excel skills and can explain Power Query concepts in Excel terms then you’ll serve that particular set of readers well, and probably do a better job for them than if you assume they are completely new to the area of data transformation and preparation have no existing skills in this area.

All in all it’s a good book that I can recommend to anyone who wants to learn Power Query and M, and also for intermediate users who want to deepen their knowledge. I still think there’s a need for a book completely devoted to M and covering topics such as custom connectors and dataflows; hopefully someone writes one soon!

Full disclosure: I received a free copy of the book from the author. I’m also the author of a Power Query book myself, but to be honest it’s several years old now and a bit out-of-date, so it’s hard to recommend it any more.

You can buy a copy of this book from Amazon UK here.

SQLBits Power BI And Analysis Services Videos Now Free To View Online

SQLBits is one of the best Microsoft data platform conferences around, and last week’s event in Manchester was particularly good. As usual, videos of almost all of the sessions are available for everyone to view for free online (no registration required) here:

https://sqlbits.com/content/Event18

There were lots of Power BI and Analysis Services related sessions, so I thought I’d call out a few:

  • If you’re interested in the new calculation groups feature in SSAS 2019 that I blogged about last week, you should definitely watch Christian Wade’s two-part session here and here (part two has all the juicy details in), which also gives some details about other upcoming features such as XMLA endpoints. Kasper’s session here covers a lot of the same topics.
  • There’s more insight into Microsoft’s Power BI roadmap and thinking in the Q&A session with Christian, Kasper and Adam here
  • Marco and Alberto always do great sessions, and Alberto’s session on Aggregations here and Marco’s session on many-to-many relationships here are up to their usual high standards.
  • My session on Power BI Dataflows here sums up my current thoughts about them.

Of course there’s lots more there (more than I have had a chance to watch) so let me know if there are other sessions that are good!

SSAS Tabular 2019, Calculation Groups And Migration From SSAS Multidimensional

With the release of CTP 2.3 of SQL Server 2019 today there was big news for Analysis Services Tabular developers: Calculation Groups. You can read all about them in detail in this blog post:

https://blogs.msdn.microsoft.com/analysisservices/2019/03/01/whats-new-for-sql-server-2019-analysis-services-ctp-2-3/

In my opinion this is the most important new feature in DAX since… well, forever. It allows you to create a new type of calculation – which in most cases will be a time intelligence like a year-to-date or a previous period growth – that can be applied to multiple measures; basically the same thing that we have been doing in SSAS Multidimensional for years with the time utility/shell/date tool dimension technique. It’s certainly going to solve a lot of problems for a lot of SSAS Tabular implementations, many of which have hundreds or even thousands of measures for every combination of base measure and calculation type needed.

I’m not going to repeat any of the detailed technical information in the blog post here, though. Instead the point I want to make is that this is very big news for SSAS Multidimensional users too. In the past couple of years many people who have existing SSAS Multidimensional implementations have thought about migrating to SSAS Tabular so they can take advantage of its new features or move to the cloud, and indeed many of them have already migrated successfully. However, up to now, the biggest blocker for those wanting to migrate from Multidimensional to Tabular has been the fact that some complex calculations that can be expressed in MDX cannot be recreated (or recreated easily and efficiently) in DAX, because DAX has not had an equivalent of calculated members not on the Measures dimension or the MDX SCOPE statement.

Calculation groups do not remove this roadblock completely, but they do remove the roadblock for a large group of existing SSAS Multidimensional users whose only complex calculation requirement is a time utility/shell/date tool dimension. As a result these SSAS Multidimensional users will now be able to migrate to SSAS Tabular 2019, Azure Analysis Services or Power BI if they want to. Only those people who have more exotic uses for calculated members not on the Measures dimension (which are not very common at all) and those who use SCOPE statements (a larger group – many people working with financial data use SCOPE statements heavily) will find that Multidimensional is still the correct platform for them.

Handling Different Granularities In Power BI Using Slicer Groups

Reading Prathy Kamasani’s excellent blog post on Sync Slicers the other week I had a revelation: in the post Prathy shows how it is possible to sync selections between two slicers that are based on different columns from different tables in your dataset, something I had not known was possible. As soon as I read this, I realised this technique could be used to solve a number of common problems including that of handling data at different granularities.

This problem has been written about extensively over the years by Marco Russo and Alberto Ferrari in the following articles, which are all worth reading for background:

https://www.daxpatterns.com/handling-different-granularities/

https://www.sqlbi.com/articles/physical-and-virtual-relationships-in-dax/

https://www.sqlbi.com/articles/propagate-filters-using-treatas-in-dax/

Let me be clear: the solution I’m showing in this post is not the “best”, fastest or most elegant solution to this problem. It does, however, have the advantage of not requiring any changes to relationships or the DAX of your measures – and this could be important in scenarios when making these changes would be time-consuming or have other knock-on effects.

Now, consider the following dataset:

image

There is a Date dimension table and a relationship has been built between it and the Daily Sales table, which contains sales information at the date granularity. There is also a Monthly Budget table that contains data at the Month granularity, and no relationship between it and the Date dimension table. The requirement is, however, to be able to select a month on the Date dimension table and for this selection to filter the Daily Sales table and the Monthly Budget table.

Now look at the following report. With two slicers built from the Month columns of the Date and Monthly Budget tables, selecting a month from the slicer built on the Month column of the Date table filters the sales value but not the Budget value, as you would expect:

SynchSlicersNoSynch

The key thing is that the two Month columns contain month names in exactly the same format, “MonthName Year”, which means that a selection on one can be transferred over to the other. Next, for both slicers on the report, go to the Sync Slicers pane, expand the Advanced options section and enter the same group name for both. Make sure the “Sync field changes to other slicers” option is selected too.

image

Once you’ve done this you’ll find that the selection on the slicer built on the Date table is transferred to the slicer built on the Monthly Budget table, even though there is no relationship between them and no clever DAX used, and this in turn means that both the Budget and the Sales values shown are affected:

SynchSlicersWithSynch

The last step is to hide the slicer for the Month column on the Monthly Budget table so the user only sees the slicer for the Month column on the Date table. You can do this either on the Selection pane or by deselecting the “eye” option on the Sync Slicers pane:

image

The end result is this:

SynchSlicersWithSynchHidden

One last thing to mention is that, if you look at the DAX queries generated behind the scenes, it’s all done using the TreatAs function. Here’s an example of a query that shows this:

[sourcecode language=’text’ padlinenumbers=’true’]
DEFINE
VAR __DS0FilterTable =
TREATAS({“December 2019”}, ‘Monthly Budget'[Month])

VAR __DS0FilterTable2 =
TREATAS({“December 2019”}, ‘Date'[Month])

EVALUATE
SUMMARIZECOLUMNS(
__DS0FilterTable,
__DS0FilterTable2,
“SumBudget”, IGNORE(CALCULATE(SUM(‘Monthly Budget'[Budget])))
)
[/sourcecode]

You can download the sample pbix file for this post here.

Thoughts On The New Power Query Source In SSIS

Yet more evidence that Power Query is taking over the world: there’s a new Power Query data source for SQL Server Integration Services in the latest version of SSDT. Here’s the blog post announcing it (the actual announcement is halfway through):

https://blogs.msdn.microsoft.com/ssis/2019/02/15/deeper-integration-and-new-connector-for-ssis-in-adf/

And here’s the documentation:

https://docs.microsoft.com/en-us/sql/integration-services/data-flow/power-query-source?view=sql-server-2017

It’s a preview and as such, limited in scope. The main restriction, as the blog post says, is:

At present, to facilitate quick/frequent feedback-gathering and improvement cycles, it can only be used with SQL Server Data Tools (SSDT) and SSIS IR.

There’s also no user interface for generating queries: right now you can only paste in an M query or use an M query stored in an SSIS variable. I don’t think that’s much of a problem though – if you don’t want to write your own M code, you can always generate a query in Excel Power Query or Power BI Desktop and cut and paste from there. You don’t even have to open the Advanced Editor to do this, as I show here.

Beyond the normal excitement about cool new stuff, what are the benefits of having Power Query integrated into SSIS? I’m not an SSIS person so I’m probably not the right person to comment on the benefits for SSIS developers, but I would imagine that it will make it possible to connect to a wider range of data sources and also make it easier to work with certain others, such as Excel workbooks and web services. I’d be interested in hearing your thoughts on this, so please leave a comment.

From the point of view of the Excel, Power BI or SSAS developer, though, it’s clear this will make it much easier and quicker to take user-developed Power BI and Excel Power Query solutions and turn them into a managed, IT-department owned solutions – and this will be incredibly useful.

Splitting Text By Character Transition In Power BI And Excel Power Query

The February 2019 release of Power BI Desktop includes some new functionality not mentioned in the summary blog post, functionality that is already in the latest Excel 365 builds (thanks Ed Hansberry for pointing this out) and will be extremely useful. It takes the form of four new options under the Split Column button in the Query Editor:

image

The four options allow you to split text by:

  • Lowercase to uppercase
  • Uppercase to lowercase
  • Digit to non-digit
  • Non-digit to digit

They are intended to solve problems like the one Gil Raviv blogged about here where he shows how to split camel case text in M.

Here’s an example of the lowercase to uppercase split. With the following input table:

image

The output is:

image

Here’s the M code behind:

[sourcecode language=’text’ padlinenumbers=’true’]
let
Source =
#table(
{“MyData”},
{
{“CamelCase”},
{“DimDate”}
}
),
#”Split Column by Character Transition” =
Table.SplitColumn(
Source,
“MyData”,
Splitter.SplitTextByCharacterTransition(
{“a”..”z”},
{“A”..”Z”}),
{“MyData.1”, “MyData.2″}
)
in
#”Split Column by Character Transition”
[/sourcecode]

It turns out that the new Splitter function that’s behind this, Splitter.SplitTextByCharacterTransition, has been in Power BI for at least a month already (if you want some background on Splitter functions, see my blog post here), but I don’t think anyone noticed. This new function is very flexible, and I can see it will be useful for a lot more scenarios than just these new four options in the Query Editor: it allows you to split text when there is a transition from any one character or characters to any other character or characters.