Data Privacy Settings In Power BI/Power Query, Part 1: Performance Implications

One of the most confusing, under-documented and widely-misunderstood features of Power BI and Power Query (or Excel “Get & Transform” or whatever you want to call it) is the data privacy settings. I get caught out by them all the time, so I thought it would be a good idea to write a series of blog posts demonstrating how they work and what effect they have in different scenarios, using example M queries.

Before carrying on, I suggest you read the official Microsoft documentation on the subject here:
https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-privacy-levels/
It gives you a good grounding in what the different data privacy levels are and where you can set them in the Power BI UI. The same options are available in Excel with Power Query/Get & Transform.

In this first post I’m going to look at what the performance implications of different privacy levels can be. Let’s say you have two data sources. The first is an Excel workbook with a single table in it that contains the name of a day of the week:

The second is the DimDate table in the Adventure Works DW SQL Server sample database:


Here’s an M query called FilterDay that returns the day name from the table in the Excel workbook:

let
    Source = Excel.Workbook(
        File.Contents("C:\Filter.xlsx"),
        null, true),
    FilterDay_Table = Source{[Item="FilterDay",Kind="Table"]}[Data],
    ChangedType = Table.TransformColumnTypes(
        FilterDay_Table,
        {{"Parameter", type text}}),
    Output = ChangedType{0}[#"Parameter"]
in
    Output

Here’s an M query called DimDate that filters data from the DimDate table in the Adventure Works DW database, returning only the rows where the EnglishDayNameOfWeek column matches the value returned by the FilterDay query above.

let
    Source = Sql.Databases("localhost"),
    DB = Source{[Name="Adventure Works DW"]}[Data],
    dbo_DimDate = DB{[Schema="dbo",Item="DimDate"]}[Data],
    RemovedColumns = Table.SelectColumns(
        dbo_DimDate,
        {"DateKey", "EnglishDayNameOfWeek"}),
    FilteredRows = Table.SelectRows(
        RemovedColumns,
        each ([EnglishDayNameOfWeek] = FilterDay))
in
    FilteredRows

The first time you run this second query you’ll be prompted to enter credentials to connect to SQL Server, and then (assuming you haven’t set any of the options that get Power BI to ignore privacy levels) you’ll see the “Information is required about data privacy” prompt:


Clicking “Continue” allows you to set a privacy level for the SQL Server database and the root of the drive that the Excel workbook is on:


The dropdown boxes in the left-hand column allow you to be more specific about what you set privacy levels on: in the case of the Excel file you can choose to set them on any of the folders in the path down to the workbook, or on the workbook itself; in the case of SQL Server you have the option of setting privacy levels on the database rather than the server.

You can choose any of the three privacy levels (Public, Organizational or Private) for the SQL Server instance and the Excel file and the query will still run:


This is where things start to get confusing. At this point the workbook has a privacy level set to None, which is the default for newly-created data sources:


However, this “None” setting means that the workbook inherits its data privacy level from the one set for the c:\ drive earlier, even though that setting isn’t displayed on the “Data sources in current file” tab. Hopefully this behaviour will be addressed in a future release: I think the current UI is misleading.

For the purposes of this blog post the data privacy settings for SQL Server are irrelevant; it’s the settings on the Excel workbook that are important. So, to make things clearer, let’s say you explicitly set the data privacy level for SQL Server to Private and the data privacy level for the Excel workbook itself to Public, like so:


If you right-click on the FilteredRows step of the query that returns data from the DimDate table in SQL Server and select “View Native Query” (see here for more details on this feature), you can see the SQL query generated in the background for this step.

You’ll see that query folding is taking place for this step and that the filter on the EnglishDayNameOfWeek column is pushed down into the SQL query:
select [_].[DateKey],
[_].[EnglishDayNameOfWeek]
from
(
select [DateKey],
[EnglishDayNameOfWeek]
from [dbo].[DimDate] as [$Table]
) as [_]
where [_].[EnglishDayNameOfWeek] = 'Friday'

Query folding is almost always a good thing for the performance of a query. For more details on what query folding is, see here.

If, however, you set the privacy level for the Excel workbook to Private like so:


…even though the DimDate query still works, query folding does not take place for the FilteredRows step. The View Native Query right-click option is greyed out, and Profiler shows that the following SQL is executed when the query is refreshed:

select [$Ordered].[DateKey],
[$Ordered].[EnglishDayNameOfWeek]
from
(
select [DateKey],
[EnglishDayNameOfWeek]
from [dbo].[DimDate] as [$Table]
) as [$Ordered]
order by [$Ordered].[DateKey]

Note that there is no WHERE clause in this query and that the whole of the DimDate table is returned from SQL Server.

Data from a data source with a privacy level of Private can never be sent to another data source. That is exactly what needs to happen for query folding to take place here, though: a value from the Excel workbook – the text “Friday” – needs to be embedded in the WHERE clause of the SQL query sent to SQL Server in order for filtering to happen inside the database. The risk with query folding is that a DBA could monitor the queries being run on SQL Server, look at the WHERE clauses, and see data from your Excel workbook. That’s maybe not a problem with day names, but it is potentially an issue if you were working with more sensitive data like customer names or addresses.

Therefore, with the Excel workbook’s privacy level set to Private, the whole of the DimDate table is downloaded into the M engine and the filtering has to take place there to maintain the privacy of the data in Excel. The query still runs but it will probably be a lot slower than it would have been had query folding taken place. With the privacy level of the Excel workbook set to Public, on the other hand, it is OK to send data from Excel to SQL Server, so query folding does take place.

To sum up, in this post I have shown how different data privacy settings can affect the performance of a query by determining whether query folding takes place or not. In part 2 of this series I will show how different data privacy settings can determine whether a query executes at all.

[This post was updated on June 19th 2017 with details on how the “None” privacy level behaves and how permissions can be set at different levels and inherited – thanks to Curt Hagenlocher for providing these details]

Click here for part 2 of this series.

Power BI Custom Data Connectors

Amid all the other announcements at Microsoft Build today came one I’ve been waiting for a long time: we can now build our own custom data connectors for Power BI. Here’s the announcement on the Power BI blog:

https://powerbi.microsoft.com/en-us/blog/data-connectors-developer-preview/

…and here are the docs:

https://github.com/Microsoft/DataConnectors

Although it might not sound that exciting, I think this is one of the biggest things that’s happened to Power BI in a long time. Microsoft’s most successful products are always the ones that are platforms for its partners to build their own solutions on top of, and Power BI is no exception. Custom visuals have made an important contribution to the success of Power BI as a whole, allowing third parties to build visuals that Microsoft either didn’t have the resources to build itself or didn’t realise people wanted. Similarly, custom data connectors will only increase the already impressive number of data sources that Power BI can connect to. My guess is that a lot of SaaS and data providers will want to build their own custom data connectors, and that the number of custom data connectors available is going to explode. Why go to all the trouble of trying to build BI and reporting features into your product when it’s quicker and easier to send the data to Power BI and your users get a much better experience as a result?

This doesn’t just affect Power BI though. Since the new Get & Transform functionality for loading data into Excel (see here) is based on the same Power Query technology that is used by Power BI, these custom data connectors will at some point work with Excel too. Although it will be a while before most Excel users upgrade to a version that supports custom data connectors, when that happens the sheer size of the Excel market means that there will be even more demand for custom connectors and this will in turn strengthen the Power BI ecosystem. Don’t forget that Analysis Services Tabular 2017 also now uses Power Query for loading data, so this will benefit corporate BI users too.

From a purely technical point of view today’s announcement included some welcome new capabilities for M coders like me. These include Visual Studio integration (intellisense for M!) and the ability to connect to web services and use OAuth for authentication, something that wasn’t really possible before. It’s also interesting that you can create custom data connectors that enable DirectQuery connections from Power BI, and that you aren’t limited to importing data.
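To give an idea of what development looks like, here’s a minimal sketch of a custom data connector’s M section document, based on the HelloWorld sample in the GitHub repo linked above (the names here are illustrative, not a definitive template):

[sourcecode language=”text”]
section HelloWorld;

//A connector is a set of "shared" functions in an M section document,
//tagged with a data source kind and publishing information
[DataSource.Kind="HelloWorld", Publish="HelloWorld.Publish"]
shared HelloWorld.Contents = () as text =>
    "Hello from a custom data connector!";

//Data source kind definition: Implicit means anonymous authentication
HelloWorld = [
    Authentication = [ Implicit = [] ]
];

//Publishing definition: how the connector appears in the Get Data dialog
HelloWorld.Publish = [
    Beta = true,
    ButtonText = { "Hello World", "Returns a greeting" }
];
[/sourcecode]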

I can’t wait to start using all these features and I’m sure they will be the subject of many future blog posts. Also, if you’re looking to hire someone with a good understanding of M to build a custom data connector, I’m your guy. You can find my contact details at http://www.crossjoin.co.uk/

Using Text.BetweenDelimiters() To Extract URLs From A Web Page In Power BI/Power Query M

The Add Column By Example functionality that appeared in the April 2017 release of Power BI Desktop is, I have to say, very cool: I’ve used it quite a bit in the last few weeks and it really does work well. One thing I noticed while using it is that three new functions have been added to the M language to support this feature:

  • Text.BetweenDelimiters() which takes a text value and extracts the part that appears between two given delimiter values
  • Text.BeforeDelimiter() which takes a text value and extracts the part that appears before a given delimiter value
  • Text.AfterDelimiter() which takes a text value and extracts the part that appears after a given delimiter value

The functions themselves are quite straightforward and the online documentation has some good examples of how to use them. To save you the click here’s an extra example – the expression:

[sourcecode language=”text” padlinenumbers=”true”]
Text.BetweenDelimiters("Hello *world!??", "*", "!")
[/sourcecode]

…returns the text “world”:


As it happens, last week I received an email from a reader who wanted to know if it was possible to extract all the links from the href attributes in the source of a web page using M, and I realised that Text.BetweenDelimiters() would be very useful for doing this. I wrote the following M function to demonstrate:

[sourcecode language=”text”]
(SourceURL as text, AttributeDelimiter as text) =>
let
    //Get HTML source
    Source = Text.FromBinary(Web.Contents(SourceURL)),
    //Function to find each link
    GetLink = (Counter as number) =>
        let
            CurrentLink = Text.BetweenDelimiters(
                Source,
                "href=" & AttributeDelimiter,
                AttributeDelimiter,
                Counter)
        in
            if CurrentLink = "" then
                {}
            else
                List.Combine({
                    {CurrentLink},
                    @GetLink(Counter + 1)}),
    //Call function
    Output = GetLink(0)
in
    Output
[/sourcecode]

A few things to note:

  • I’m using a combination of Text.FromBinary() and Web.Contents() to get the HTML source for the web page whose links we’re extracting
  • Since HTML allows the use of single and double quotes for attributes, I’ve added a parameter to my function called AttributeDelimiter to allow either to be passed in
  • Text.BetweenDelimiters() only extracts one piece of text at a time, but you can specify which occurrence of the start delimiter it uses. I therefore used recursion to extract the contents of every href attribute in the HTML: I declare a function called GetLink, and from within that function I make a recursive call by putting an @ before the function name. It would probably be better to use List.Generate() instead of recursion here though, as in the sketch after this list.
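Here’s a rough sketch of what a List.Generate() version of the same logic might look like (treat it as a starting point rather than production code):

[sourcecode language=”text”]
(SourceURL as text, AttributeDelimiter as text) =>
let
    //Get HTML source
    Source = Text.FromBinary(Web.Contents(SourceURL)),
    //Generate one list item per occurrence of the start delimiter,
    //stopping when Text.BetweenDelimiters() returns an empty string
    Output = List.Generate(
        () => [Counter = 0,
               Link = Text.BetweenDelimiters(
                   Source, "href=" & AttributeDelimiter,
                   AttributeDelimiter, 0)],
        each [Link] <> "",
        each [Counter = [Counter] + 1,
              Link = Text.BetweenDelimiters(
                  Source, "href=" & AttributeDelimiter,
                  AttributeDelimiter, [Counter] + 1)],
        each [Link])
in
    Output
[/sourcecode]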

Assuming the query that returns this function is called GetAllLinks


…then it can be called in a new query like so:

[sourcecode language=”text”]
GetAllLinks(
    "https://msdn.microsoft.com/en-us/library/mt798303.aspx",
    """"
)
[/sourcecode]

One other thing to point out is that, in order to pass a double quote character to the function as text, I need to use four double quotes: """". This is because the text value has itself to be enclosed in double quotes, and any double quote inside an M text literal has to be escaped by doubling it.
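A quick illustration of the escaping rule:

[sourcecode language=”text”]
let
    //"""" evaluates to a text value containing just one " character
    SingleQuote = """",
    //Doubling works anywhere inside a literal: this is the text a"b
    Escaped = "a""b"
in
    Escaped
[/sourcecode]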

The output of this query is a list containing all of the links from the href attributes on the page that are enclosed in double quotes:


I guess this could be taken even further to create a function that crawls a series of web pages and returns the links in all of them, then uses the Force Directed Graph custom visual or better still NodeXL in Excel to show which pages link to each other. I’ll leave that to someone else to do though…!

You can download a pbix file with all of the examples in this post here.

Daylight Saving Time And Time Zones In M

Quite a few people (Ken Puls, Reza Rad and today Chris Koester) have blogged about how the M date/time zone functions work in Power Query/Power BI, and the related problem of turning a UTC time into the actual time in any given location taking Daylight Saving Time changes into account. I don’t have much to add to what they have written, but I did learn something new about this subject from a customer last week: it is possible to do the conversion in pure M if the time zone you’re converting to is your PC’s own local time zone.

Here’s an example. In 2017 the clocks in the UK went forward at 1am on Sunday March 26th. Given the following table with two times in it, one of which is 1pm on Saturday March 25th and one of which is 1pm on Sunday March 26th:


…if you assume that these times are both UTC, here’s a query that returns the correct equivalent local time for me (because my PC is set to UK time) for both of them:

[sourcecode language=”text” padlinenumbers=”true”]
let
    //Read data from Excel table
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    //Change the type of the column to datetime
    ChangeType1 = Table.TransformColumnTypes(
        Source,
        {{"Time", type datetime}}),
    //Turn this into a UTC time
    ConvertToUTC = Table.AddColumn(
        ChangeType1,
        "UTC",
        each DateTime.AddZone([Time], 0)),
    //Change the data type to datetimezone
    ChangeType2 = Table.TransformColumnTypes(
        ConvertToUTC,
        {{"UTC", type datetimezone}}),
    //Convert to local PC time
    ConvertToLocal = Table.AddColumn(
        ChangeType2,
        "Local",
        each DateTimeZone.ToLocal([UTC])),
    //Change type to datetimezone
    ChangeType3 = Table.TransformColumnTypes(
        ConvertToLocal,
        {{"Local", type datetimezone}})
in
    ChangeType3
[/sourcecode]


Here’s a brief explanation of what the query does:

  • First it reads the times from the Excel table and sets the Time column to be datetime data type
  • It then creates a new column called UTC and then takes the values in the Time column and converts them to datetimezone values, using the DateTime.AddZone() function to add a time zone offset of 0 hours, making them UTC times
  • Finally it creates a column called Local and converts the UTC times to my PC’s local time zone using the DateTimeZone.ToLocal() function

As you can see from the output table, the DateTimeZone.ToLocal() function has correctly handled the UK Daylight Saving Time change by converting the UTC time 13:00 on March 25th 2017 to 13:00 UK time, and converting the UTC time 13:00 on March 26th 2017 to 14:00 UK time.

Frustratingly there doesn’t seem to be a way to convert a UTC time to the correct local time in any given time zone in M – the DateTimeZone.SwitchZone() function just adds/removes hours without taking Daylight Saving Time into account, so it’s not really very useful. Apart from all the blog posts linked to at the start of this post, when I posted this question on the Power Query forum I got a very helpful answer from Marcel Beug with a completely different approach to solving the problem here, although it’s still not a straightforward one. If anyone from Microsoft is reading, it would be great to have a new M function to handle this requirement!
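To see the difference, compare the following two expressions in a quick sketch. On a PC set to UK time the first returns 25/03/2017 13:00 +00:00, because March 25th 2017 was before the UK clocks went forward, while the second returns 25/03/2017 14:00 +01:00 because the one-hour offset is applied blindly whatever the date:

[sourcecode language=”text”]
//Convert a UTC datetimezone value to the local PC time zone,
//taking Daylight Saving Time into account
DateTimeZone.ToLocal(#datetimezone(2017, 3, 25, 13, 0, 0, 0, 0))

//Add a fixed offset of one hour, ignoring Daylight Saving Time
DateTimeZone.SwitchZone(#datetimezone(2017, 3, 25, 13, 0, 0, 0, 0), 1)
[/sourcecode]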

Exporting Your Queries’ M Code From Power Query And Power BI Using Copy/Paste

Two years ago I blogged about a method to export all the M code for all of your queries in Power Query using the Send A Frown button – useful if you need the code for documentation purposes. This trick doesn’t work with Power BI Desktop, unfortunately, but the good news is that there’s a better way to do this now in Power Query and Power BI Desktop using copy/paste. It’s pretty simple really: when you copy a query from the Power Query or Power BI Desktop Query Editor you can not only paste the query to another Query Editor (pasting from Power Query to Power BI and vice versa works too) but you can also paste the query to a text editor like Notepad and get the M code for the query. What’s more, you can also select more than one query in the Query Editor and when you paste you get all the code for all of the selected queries:


Remember that because the properties of each step in a query become comments in your M code, they get copied too.
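If you’ve never tried it, the pasted code looks something like this (a mocked-up example: the description entered in a step’s Properties dialog appears as a comment above that step):

[sourcecode language=”text”]
// Query1
let
    Source = Csv.Document(File.Contents("C:\Data.csv")),
    // Promoted the first row of the file to be the column headers
    PromotedHeaders = Table.PromoteHeaders(Source),
    // Removed all columns except Date and Sales
    RemovedColumns = Table.SelectColumns(PromotedHeaders, {"Date", "Sales"})
in
    RemovedColumns
[/sourcecode]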

Thanks to VossF for telling me about this on this thread.

The List.* M Functions And The equationCriteria Argument

Last week a reader left a comment on my blog asking what the third argument of the List.Contains() M function (somewhat cryptically called equationCriteria) does. I didn’t know, so I did some research and found out that lots of the List.* functions have the same argument. The documentation for List.Distinct() has a few examples but no real explanation of how they work. It also says:

For more information about equationCriteria, see Parameter Values.

…but there’s no link or indication where the documentation on ‘Parameter Values’ can be found. A bit more digging led me to the bottom of this page:

Equation criteria

Equation criteria for list values can be specified as either a

  • A function value that is either
    • A key selector that determines the value in the list to apply the equality criteria, or
    • A comparer function that is used to specify the kind of comparison to apply. Built in comparer functions can be specified, see section for Comparer functions.
  • A list value which has
    • Exactly two items
    • The first element is the key selector as specified above
    • The second element is a comparer as specified above.

Still not exactly helpful, is it? After a bit of time testing, though, I think I’ve worked out what’s possible with the equationCriteria argument, and this blog post will, I hope, help any future M coders who are struggling with the same question. Let’s see some examples…

The basics

First of all, the basics. The following expression using List.Contains() returns TRUE because the text value “apples” appears in the list {“apples”, “oranges”, “pears”}:

[sourcecode language=”text” padlinenumbers=”true”]
List.Contains({"apples", "oranges", "pears"}, "apples")
[/sourcecode]


The following returns FALSE because the text value “grapes” does not appear in the list {“apples”, “oranges”, “pears”}:

[sourcecode language=”text”]
List.Contains({"apples", "oranges", "pears"}, "grapes")
[/sourcecode]


However there are lots of different ways that text values can be compared and the equationCriteria argument allows you to specify which rules to follow.

Case sensitivity and culture

If you’ve written any M code you’ll know that it is case sensitive. As a result, the following returns FALSE:

[sourcecode language=”text”]
List.Contains({"apples", "oranges", "pears"}, "Apples")
[/sourcecode]

What happens if you want to do a case-insensitive comparison though? This is where the Comparer functions come in. The Comparer.FromCulture() function returns a function that compares two values according to the rules of a given culture or locale, optionally ignoring case, and that returned function can be used in the equationCriteria argument. The following example returns TRUE:

[sourcecode language=”text”]
List.Contains(
    {"apples", "oranges", "pears"},
    "Apples",
    Comparer.FromCulture("en-GB", true)
)
[/sourcecode]

In this case Comparer.FromCulture(“en-GB”, true) returns a function that compares two values for the English – Great Britain culture (for a full list of culture values, see the Language Tag column of the table on this page); the second, optional argument here makes the function ignore case when making the comparison. The function that Comparer.FromCulture() returns is then used by List.Contains() to make the comparison.

Rather than specify a culture you can also use the Culture.Current function to return the current system culture. For me, Culture.Current returns the value “en-GB” because I live in Great Britain and have my PC configured to use a British English locale:


The following example shows how Culture.Current can be used with Comparer.FromCulture and also returns TRUE, at least for me:

[sourcecode language=”text”]
List.Contains(
    {"apples", "oranges", "pears"},
    "Apples",
    Comparer.FromCulture(
        Culture.Current,
        true
    )
)
[/sourcecode]

If you’re curious to see an example where different cultures produce different results, here’s one I stole from this article on string comparisons and sorting in .NET. Apparently in English the character æ is treated the same as the combination of the two characters ae, but this is not the case in Danish. As a result the following returns TRUE:

[sourcecode language=”text”]
List.Contains(
    {"aepples", "oranges", "pears"},
    "æpples",
    Comparer.FromCulture(
        "en-GB",
        true
    )
)
[/sourcecode]

Whereas this returns FALSE:

[sourcecode language=”text”]
List.Contains(
    {"aepples", "oranges", "pears"},
    "æpples",
    Comparer.FromCulture(
        "da-DK",
        true
    )
)
[/sourcecode]

Ordinal comparisons

If you don’t want all the uncertainty of cultures and case sensitivity you can just make an ordinal comparison, which compares two strings by finding the Unicode character value for each character in each string and comparing those values. To do this you can use the Comparer.Ordinal() function. The following returns FALSE:

[sourcecode language=”text”]
List.Contains(
    {"apples", "oranges", "pears"},
    "Apples",
    Comparer.Ordinal
)
[/sourcecode]

…because “a” is not the same Unicode character as “A”, and so “apples” and “Apples” are not treated as equal.

Custom comparer functions

As the documentation hints you can also write your own function to do the comparison. A comparer function is just – as far as I can see – a function that has two arguments and returns a logical value. Here’s an example of a custom function that takes two text values, x and y, and returns true if the first three characters of x are the same as y:

[sourcecode language=”text”]
(x as text, y as text)=>Text.Start(x,3)=y
[/sourcecode]

It can be used with List.Contains() as in the following example, which returns TRUE:

[sourcecode language=”text”]
List.Contains(
    {"apples", "oranges", "pears"},
    "app",
    (x as text, y as text) => Text.Start(x, 3) = y
)
[/sourcecode]

What must be happening here is that the function is called three times, with each value in the list {“apples”, “oranges”, “pears”} being passed to the x argument and “app” being passed to y on each call; because the first three characters of “apples” are “app”, the function returns true in that case, and so List.Contains() returns TRUE.

Key selectors

If you’re working with a list of records you might only want to do the comparison on one field in the record, and this is what key selectors allow you to do. The following example, which returns TRUE:

[sourcecode language=”text”]
List.Contains(
    {[Fruit="apples", Colour="Red"],
     [Fruit="oranges", Colour="Orange"],
     [Fruit="pears", Colour="Green"]},
    [Fruit="apples", Colour="Russet"],
    each [Fruit]
)
[/sourcecode]

…does so because it only compares the Fruit field in each record, and the Fruit fields in [Fruit=”apples”, Colour=”Red”] and [Fruit=”apples”, Colour=”Russet”] are indeed the same. However the following example returns FALSE:

[sourcecode language=”text”]
List.Contains(
    {[Fruit="apples", Colour="Red"],
     [Fruit="oranges", Colour="Orange"],
     [Fruit="pears", Colour="Green"]},
    [Fruit="apples", Colour="Russet"],
    each [Colour]
)
[/sourcecode]

…because the Colour “Russet” does not appear anywhere in the Colour field of any of the records in the first parameter.

Combining key selectors and comparison functions

Finally, as the documentation suggests, you can combine the above methods of comparison by passing a list containing two items to equationCriteria: the first item in the list must be a key selector, the second must be a comparer function. For example, the following returns TRUE:

[sourcecode language=”text”]
List.Contains(
    {[Fruit="apples", Colour="Red"],
     [Fruit="oranges", Colour="Orange"],
     [Fruit="pears", Colour="Green"]},
    [Fruit="Apples", Colour="Russet"],
    {each [Fruit], Comparer.FromCulture("en-GB", true)}
)
[/sourcecode]

…because it only looks at the Fruit field of each record, and it does a case-insensitive comparison using the en-GB culture, so “apples” and “Apples” are equal.

Record.AddField(), Functions And The Delayed Option In M

Today I was looking at the Record.AddField() M function and saw it had a mysterious – and badly documented – optional fourth argument called delayed. Of course I had to work out what it did, and having done so I thought I’d blog about it in case someone found it useful (and however obscure the topic, there’s always someone out there who finds this stuff useful).

Imagine you have an M function called ReturnANumber that has no arguments at all and returns the number 999:

[sourcecode language=”text” padlinenumbers=”true”]
() as number => 999
[/sourcecode]


Now imagine that you want to return this function in a record (which is exactly what #shared does). The following expression:

[sourcecode language=”text”]
[MyFunction = ReturnANumber]
[/sourcecode]

…returns a record with one field whose value is of type function:


But what if you wanted the field to contain the number that the function returns, not a reference to the function itself? The delayed option of Record.AddField() allows you to do this: if you set it to true, you get the value the function returns.

[sourcecode language=”text”]
Record.AddField([], "MyFunction", ReturnANumber, true)
[/sourcecode]


Setting delayed to false does the same as the first example above:

[sourcecode language=”text”]
Record.AddField([], "MyFunction", ReturnANumber, false)
[/sourcecode]


Another way to get the same result as setting delayed to true is to use Function.Invoke():

[sourcecode language=”text”]
[MyFunction = Function.Invoke(ReturnANumber,{})]
[/sourcecode]


Now I need to think of a real-world use for this…

Web.Contents(), Caching And The ExcludedFromCacheKey Option In Power BI And Power Query

When you’re using the Web.Contents() M function to call a web service from Power Query or Power BI, you don’t necessarily get one HTTP request each time you call the function: some caching takes place, so that if you make the same request multiple times your query won’t waste time asking for the same data over and over. In this post I’m going to share the results of some tests I made to show how caching works with Web.Contents() and what factors influence it.

For my tests I built a simple web service in Microsoft Flow, similar to the one I blogged about here, that accepts an HTTP POST request and calls a stored procedure in an Azure SQL Database. The stored procedure then updates a table in the database and this in turn allows me to count the number of times the web service is called. Finally, the web service returns the value 0 if the stored procedure has executed successfully.

This web service can then be called from either Power Query or Power BI using the Web.Contents() function, something like this (because the URL for the web service is very long I stored it in a parameter called WebServiceURL):

[sourcecode language=”text” padlinenumbers=”true”]
let
    Source = Web.Contents(
        WebServiceURL,
        [Content=Text.ToBinary("Hello")]),
    #"Imported JSON" = Json.Document(Source, 1252)
in
    #"Imported JSON"
[/sourcecode]

The output of the query when run in Power Query and loaded to an Excel table is this:


The first important thing to point out is that the above query, when refreshed in the latest versions of Power Query (I’m running Excel 2016 build 7571.2109) and Power BI (build 2.41.4581.361, the November 2016 release), results in a single call to the web service. It might seem like I’m stating the obvious but in the past I’ve seen plenty of cases where a data source has been queried multiple times by Power Query/Power BI even if I was only expecting it to be queried once.

Now, let’s look at a query that calls this web service several times. Here’s the query above converted to a function called fnCallWebService:

[sourcecode language=”text”]
() =>
let
    Source = Web.Contents(
        WebServiceURL,
        [Content=Text.ToBinary("Hello")]),
    #"Imported JSON" = Json.Document(Source, 1252)
in
    #"Imported JSON"
[/sourcecode]

Here’s a query that calls this function once for each row of the following table:


[sourcecode language=”text”]
let
    Source = Excel.CurrentWorkbook(){[Name="MyTable"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(
        Source,
        {{"Row", Int64.Type}}),
    #"Invoked Custom Function" = Table.AddColumn(
        #"Changed Type",
        "fnCallWebService",
        each fnCallWebService())
in
    #"Invoked Custom Function"
[/sourcecode]

In the query above I used the Invoke Custom Function button to call the function for each row in the source table and put the value returned by the function in a new column. The output is this:


Even though the function is called four times, once for each row in the original table, that does not mean that the web service is called four times – it isn’t, it’s only called once. In this case Power BI/Power Query knows that each of the four calls to the function is making an identical request to the web service and so it only goes to the web service once, and thereafter uses a cached result the other three times.

One way to stop this caching from taking place is to add an HTTP header to the request to the web service and pass a different value to that header for each call. Here’s another version of my function, now called fnCallWebServiceWithHeaders, which this time takes a number as a parameter and then passes that number to the web service via a header called MyHeader:

[sourcecode language=”text” highlight=”5″]
(RowNum as number) =>
let
    Source = Web.Contents(
        WebServiceURL,
        [Content=Text.ToBinary("Hello"),
         Headers=[MyHeader=Text.From(RowNum)]]),
    #"Imported JSON" = Json.Document(Source, 1252)
in
    #"Imported JSON"
[/sourcecode]

Now if I call this function for each row of the table, and for each call pass the value in the [Row] column through to the function like so:


[sourcecode language=”text”]
let
    Source = Excel.CurrentWorkbook(){[Name="MyTable"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(
        Source,
        {{"Row", Int64.Type}}),
    #"Invoked Custom Function" = Table.AddColumn(
        #"Changed Type",
        "fnCallWebServiceWithHeaders",
        each fnCallWebServiceWithHeaders([Row]))
in
    #"Invoked Custom Function"
[/sourcecode]

…the web service gets hit four times. The presence of a different value for the MyHeader header in each request is enough to prevent any caching from taking place.

It is possible, however, to get Power BI/Power Query to ignore one or more headers when working out whether caching should take place using the ExcludedFromCacheKey option in Web.Contents(). Here’s one more version of my function, now called fnCallWebServiceWithHeadersExlCache, which uses this option:

[sourcecode language=”text” highlight=”7″]
(RowNum as number) =>
let
    Source = Web.Contents(
        WebServiceURL,
        [Content=Text.ToBinary("Hello"),
         Headers=[MyHeader=Text.From(RowNum)],
         ExcludedFromCacheKey={"MyHeader"}]),
    #"Imported JSON" = Json.Document(Source, 1252)
in
    #"Imported JSON"
[/sourcecode]

The ExcludedFromCacheKey option takes a list of text values which represent the names of headers that are to be ignored when considering which requests can be cached. In the example above my list contains just the one header, MyHeader, and when this is used in a query like so:

[sourcecode language=”text”]
let
    Source = Excel.CurrentWorkbook(){[Name="MyTable"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(
        Source,
        {{"Row", Int64.Type}}),
    #"Invoked Custom Function" = Table.AddColumn(
        #"Changed Type",
        "fnCallWebServiceWithHeadersExlCache",
        each fnCallWebServiceWithHeadersExlCache([Row]))
in
    #"Invoked Custom Function"
[/sourcecode]

…even though the function is called four times, once for each row in the table, and even though each time Web.Contents() is called it is with a different value passed through to the MyHeader header, only one request is made to the web service and the three subsequent requests are answered from cache again.

In summary, if you’re calling a web service multiple times in a query and especially if you’re using the Headers option in Web.Contents(), this is important stuff to understand because caching can make a big difference to the performance of your queries.

First Thoughts On The Integration Of Power Query/M Into Analysis Services Tabular

Last Friday’s big news was the release of the first CTP for Analysis Services v.next. Among several major new pieces of functionality (Ragged hierarchies! Drillthrough that works properly, even for calculations! Table-level security!) probably the biggest is the integration of Power Query/M into Analysis Services. As you can probably guess, I’m incredibly pleased that my two favourite technologies have got together. The technical details are given in this blog post, which I suggest you read if you haven’t done so already, but what I think is missing is an explanation of why this is so important and what kind of opportunities it opens up – hence this post. Of course this is just my take on the subject and not what Microsoft may actually be thinking; it’s also very early days, so as the functionality develops and I have more chance to think about this my opinions may change. If you have any ideas on this subject I would be interested to hear them, so please leave a comment!

Why this had to happen: Power BI

There is an obvious reason why Microsoft decided to integrate Power Query/M into SSAS, and that is because it needs to support the conversion of Power BI models into Analysis Services Tabular models. There are two scenarios where this will be necessary.

The first is the ability to convert a Power BI model into an Azure Analysis Services Tabular model (listed as ‘planned’ here), something that will be a key selling point for Azure Analysis Services when it releases. The engine behind Power BI is essentially the same as the one used in Analysis Services so migrating the data model should be straightforward, but since Power BI uses Power Query/M to load data then a migrated Azure Analysis Services model will also have to use Power Query/M.

The second scenario is similar to the first. We now know that on-premises Power BI will be delivered through Reporting Services v.next, and it’s reasonable to assume Reporting Services will need a database engine to store the data for published Power BI reports. That engine will have to be an Analysis Services instance of some kind (either standalone or running in-process inside Reporting Services) and again for that to work Analysis Services will have to support the same data access mechanisms as Power BI.

Better support for a larger number of data sources

I’ve just argued why Microsoft was obliged to include this functionality in SSAS v.next but in fact there are many positive reasons for doing this too. The most obvious one is to do with support for more data sources. At the moment SSAS Tabular supports a pretty good range of data sources, but the world of BI is getting more and more diverse and in order to stay relevant SSAS needs to support far more than it does today. By using Power Query/M as its data access mechanism, SSAS v.next will immediately support a much larger number of data sources and this number is going to keep on growing: any investment that Microsoft or third parties make for Power BI in this area will also benefit SSAS. Also, because Power Query/M can query and fold to more than just relational databases, I suspect that in the future this will allow for DirectQuery connections to many of these non-relational data sources too.

Different data sources for partitions in the same table

Another benefit of this change is that we’ll have a lot more flexibility with partitioning tables in an SSAS Tabular model. As the blog post says:

As long as a partition’s M query adheres to the column mappings of the table, you are free to perform any transformations and pull in data from any data source defined in the model.

In SSAS 2016 the partitions in a table all have to get data from the same data source, whereas in v.next we’ll be able to get data from different data sources in different partitions, and this opens up some interesting new possibilities. For example, I can imagine a simple budgeting application where the partitions in a table get data from different Excel workbooks stored in OneDrive for Business, and where each partition gets processed automatically when changes are saved to one of these workbooks.
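To make that concrete, the M queries for two partitions of the same table might look something like this (just a sketch with made-up file names, using local paths rather than OneDrive for simplicity; per the quote above, the only requirement is that both queries return the same columns):

[sourcecode language=”text”]
//Partition 1: North region budget workbook
let
    Source = Excel.Workbook(
        File.Contents("C:\Budgets\North.xlsx"), null, true),
    Budget = Source{[Item="Budget",Kind="Table"]}[Data]
in
    Budget

//Partition 2: South region budget workbook, same column structure
let
    Source = Excel.Workbook(
        File.Contents("C:\Budgets\South.xlsx"), null, true),
    Budget = Source{[Item="Budget",Kind="Table"]}[Data]
in
    Budget
[/sourcecode]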

Does this replace SSIS and my data warehouse? 

The short answer is no. Power Query/M is not a full-featured ETL tool and I don’t think it ever will be; it certainly does not have the kind of functionality needed to perform enterprise-level ETL right now. My view is that Microsoft have built Power Query/M into SSAS for the reasons above and not to encourage enterprise SSAS users to do their own quick-and-dirty ETL when loading data (although there is a risk that that will happen anyway). That said, I think the dividing line between corporate and self-service BI will become increasingly blurred over the next few years as the Microsoft BI stack develops, and we’ll see Analysis Services being used in self-service scenarios as well as the more traditional corporate ones.

Centralised data source objects

One last thing to point out is that the way SSAS v.next makes a distinction between data sources and other queries is very interesting. In Power BI and Power Query it’s easy to end up with data source connection information duplicated across multiple queries unless you know what you’re doing, and this can cause no end of problems later on in a project. As far as I can see, in SSAS v.next a “data source object” is an M query that only contains the connection to external data, while all other queries have to reference a data source to be able to access external data. This means, as the blog post says:

Referring to data source objects helps to centralize data source settings for multiple queries and simplifies deployments and maintenance if data source definitions must be updated later on. When updating a data source definition, all M queries that refer to it automatically use the new settings.
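In current versions of Power BI and Power Query you can approximate this pattern by hand, splitting the connection out into a query of its own and referencing it everywhere else. Here’s a minimal sketch, using the Adventure Works database from earlier posts as the example source:

[sourcecode language=”text”]
//A "data source" query, called AdventureWorksDW, containing only the connection
let
    Source = Sql.Databases("localhost"){[Name="Adventure Works DW"]}[Data]
in
    Source

//Every other query then references AdventureWorksDW
//instead of calling Sql.Databases() itself
let
    dbo_DimDate = AdventureWorksDW{[Schema="dbo",Item="DimDate"]}[Data]
in
    dbo_DimDate
[/sourcecode]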

I wonder whether this concept is coming to Power BI and Power Query at some point? I hope so – it makes a lot of sense.

Referenced Queries And Caching In Power BI And Power Query

Last week, Maxim Zelensky (whose blog is well worth checking out) tweeted about a very interesting answer he had received to a question he posted on the Power Query MSDN forum, on the subject of caching and referenced queries in Power Query. You can read the thread here:

https://social.technet.microsoft.com/Forums/en-US/8d5ee632-fdff-4ba2-b150-bb3591f955fb/queries-evaluation-chain?forum=powerquery

…but since this is such important information – it’s certainly something I’ve wondered about myself – I thought I would share Maxim’s question and the response from Ehren of the Power Query dev team here in full, so it gets the wider visibility it deserves. I’m very grateful to Maxim for letting me share this and to Ehren for writing such a detailed response.

First, Maxim’s original question:

There are two different scenarios I am working with:

1) Query1 connects to the data source (flat file) and make basic cleaning and transformations.

Then Query2 and Query3 reference to Query1, performing other transformations needed.

Query3 also take some data from Query2.

Query2 and Query3 then exported to sheet, Query1 – connection only.

As far as I can understand, PQ can define refresh chain like this: evaluate Query1, then evaluate Query2, then evaluate Query3 (as it need the results from Query2). The question is: When PQ performs calculation of Query3, will it recalculate Query1? Or, as it was evaluated in the chain before, Query3 will use cached results of previous Query1 calculation (performed when Query2 was evaluated)?

2) I have a set of flat files, and I take data from them with one Query1. Query1 also performs some transformations. Then I have two independent Query2 and Query3, both connected to Query1 performing different transformations of source data. Results of Query2 and Query3 evaluations then used in Query4, which exports its results to the sheets, Query1, Query2 and Query3 – connection only

The second question is: performing "Refresh" on Query4, how much times will be Query1 evaluated – one, two or three ? Or there also will be chain: calculate Q1, caching, then Q2 or Q3, calculate next using cached results of Q1 evaluation, and then – Q4?

3) Is there is a difference with connection to database?

4) Is there any rules of evaluation chain (like each expression/query will be calculated once in the evaluation chain)?

And here’s Ehren’s reply:

There’s a lot involved in answering your question, so let me back up and explain a few things first.

Caching
Power Query (in both Excel and Power BI Desktop) utilizes a "persistent cache", stored on disk, when refreshing queries. But what exactly does that mean?
First, let’s look at what gets cached. The persistent cache does not store the results of your M Queries (Q1, Q2, etc. in your example). Instead, it stores the results of the behind-the-scenes requests sent to data sources.

So if Q1 queries a SQL Server database called "MyServer/MyDatabase" and returns a single unfiltered table called "MyTable", the query sent to the server might be "select [Col1] from [MyTable]". In this case, the persistent cache will now know the result of sending "select [Col1] from [MyTable]" to "MyServer/MyDatabase". If another M query (whether through referencing Q1, or by querying the same table directly) needs the same result, the persistent cache can provide it, and the result won’t have to be fetched a second time from the SQL Server.
"Great," you might say. "So if I’m pulling from a flat file in Q1, and in a few places in Q2 I need to do Table.RowCount(Q1), the file should only be read from disk once, right?" And the answer would be…no. This is because not all data sources are cached. Specifically, the results of calls to File.Contents are not stored in the persistent cache. Why not? Well, the cache is stored on disk, and caching local files (which are already on disk) elsewhere on disk doesn’t really make sense. (Using Table.Buffer in this context may help…see more on Table.Buffer below.)
"Okay", you might say. "But if Q1 is pulling from a SQL table, and in a few places in Q2 I reference Q1, that should hit the persistent cache, right?" Maybe. It depends on how Q2 is using Q1, since doing additional operations on Q1 (such as filtering or merging) might cause the M engine to compute a different SQL query, resulting in the server being hit again.
Next, let’s look at the scope of caching. The scope of caching differs depending on what you’re doing, as well as what tool you’re using.

Previewing
If you’ve opened the Power Query editor in Excel or Power BI Desktop, you might have seen warnings like "This preview may be up to 3 days old". This is because there is a persistent cache used for interactive previewing of query results. As you can imagine from the fact that we have warnings about preview results being days old, this cache is long-lived and is intended to make the experience of working in the editor faster.

Loading to Excel
If you load/refresh three queries in Excel, each of them gets their own persistent cache. So the fact that a SQL result is cached during the load of Q2 won’t benefit the loading of Q3, even if it needs the same result.

Loading to Power BI Desktop
If you load/refresh three queries in PBI Desktop, they all share a single persistent cache. When you refresh multiple times, each refresh operation gets its own cache (shared by all the queries being refreshed at that particular time). This means that if SQL result is cached during the load of Q2, it will still be cached during the loading of Q3 (assuming they’re both being loaded at the same time).

What about Table.Buffer?
Table.Buffer can be useful if you want to store an intermediate result in memory and avoid pulling content from disk, a remote file share, a SQL Server, a website, or any other data source multiple times during an evaluation.  Think of Table.Buffer as, "load this table into memory, and stop folding subsequent operations back to the data source".

However, because buffering happens in memory and is not persisted on disk, buffering during the load of one query does not affect the load of another query. If Q1 is buffered when Q2 is loaded, Q1 will be re-buffered when Q3 is loaded.

And now to answer your question…
Now let’s take a look at your example (Q4 references Q2 and Q3, and Q2 and Q3 both reference Q1).
Since you’re pulling from a flat file, and File.Contents results aren’t cached, the flat file will be read each time Q1 is referenced (twice in Q4, once in Q3, Q2, and Q1). If you buffered the result of Q1, then Q4 would only read the file once. But when Q1, Q2, and Q3 are loaded (even in PBI Desktop), they will still each also read the file.

What about immutability?
You asked about the fact that M values are supposed to be immutable. This is true for the "pure" parts of the language, but breaks down when you introduce external data sources and folding. (In fact, you could think of Table.Buffer as transferring a table from the fuzzy unpredictable world of folding to the immutable world of pure M values.) You can see this in action by doing the following test, using a query called "MyTextFileQuery" that pulls from a local file on disk.

Reads the file five times

[sourcecode language=”text”]
= Table.RowCount(MyTextFileQuery)
+ Table.RowCount(MyTextFileQuery)
+ Table.RowCount(MyTextFileQuery)
+ Table.RowCount(MyTextFileQuery)
+ Table.RowCount(MyTextFileQuery)
[/sourcecode]

Reads the file once

[sourcecode language=”text”]
= let rowCount = Table.RowCount(MyTextFileQuery)
  in rowCount + rowCount + rowCount + rowCount + rowCount
[/sourcecode]

I’ve read this response several times and it’s still sinking in, but clearly there are some important implications here for anyone doing more advanced data loading work in Power Query and Power BI. I’m sure it will be the inspiration for many future blog posts on tuning Power Query query performance.
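For example, putting Ehren’s point about Table.Buffer together with Maxim’s scenario, here’s a sketch (with a made-up file path and transformations) of how Q1 could buffer its result so that, as Ehren says, a single refresh of Q4 only reads the flat file once:

[sourcecode language=”text”]
//Maxim's Q1, with its output buffered
let
    Source = Csv.Document(File.Contents("C:\MyData.csv")),
    PromotedHeaders = Table.PromoteHeaders(Source),
    //Load the cleaned table into memory for the rest of this evaluation
    Output = Table.Buffer(PromotedHeaders)
in
    Output
[/sourcecode]

Remember, though, that buffering is per-evaluation: when Q1, Q2 and Q3 are loaded themselves, each load will still read the file again.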