Data Privacy Settings In Power BI/Power Query, Part 2: Preventing Query Execution

In part 1 of this series I showed how the data privacy settings in Excel Power Query/Get & Transform and Power BI could impact the performance of your queries. In this post I’m going to show you how they can stop a query from running at all.

Let’s say you have the Excel workbook from part 1 of this series, but now instead of using the day name to filter data from a SQL Server table you want to pass that value to a web service. The web service I’m going to use for my examples is one that allows you to search for open data published by the UK government on https://data.gov.uk. It’s very simple: you give it a search term and it returns a JSON document with the search results in, no authentication or anything else required. For example:

https://data.gov.uk/api/3/action/package_search?q=Friday

In fact it doesn’t really matter what it does, just know that it is a web service that I can pass a text parameter to and get a result from.

Here’s a query that reads a single piece of text from the FilterDay table in my Excel workbook:

image_thumb2

…and then passes that value to the web service:

let
    ExcelSource = 
	Excel.Workbook(
		File.Contents("C:\FilterParameter.xlsx")
	, null, true),
    FilterDay_Table = 
	ExcelSource{[Item="FilterDay",Kind="Table"]}[Data],
    ChangedType = 
	Table.TransformColumnTypes(
		FilterDay_Table,
		{{"Parameter", type text}}
		),
    Day = ChangedType{0}[#"Parameter"],
    Output = 
	Web.Contents(
		"https://data.gov.uk/api/3/action/package_search", 
		[Query=[q=Day]]
	),
    ImportedJSON = Json.Document(Output,65001)
in
    ImportedJSON 

This query succeeds if any of the following conditions are true:

  • The data privacy levels of both the Excel workbook and the web service are set to Public
  • The data privacy levels of both the Excel workbook and the web service are set to Organizational
  • The data privacy level of the Excel workbook is set to None and the data privacy level of the web service is set to Public

[See here to find out how to set privacy levels for a data source. Interestingly the data privacy level of the web service cannot be set to None for this query – the UI always prompts for it to be set before the query will run]

Here’s the output of a successful run:

image

Any other combinations of data privacy settings, for example if both the Excel workbook and the web service are set to Private, result in the following error message:

Formula.Firewall: Query ‘WebFunctionSucceeds’ (step ‘ImportedJSON’) is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.

image

While it’s ok to send data from one Public data source to another Public data source, or from one Organizational data source to another Organizational data source, you cannot send data from one Private data source to any other data source, or even send data from a Public data source to a Private data source.

In the example in my previous post when the engine found it wasn’t allowed to send data from one source to another because of the data privacy rules used it was still able to run the query, but had to do so in a less efficient way. In this example there is no way to run this query without sending data from the Excel workbook to the web service – you can’t call this web service without sending a search term to it. As a result, if incompatible data privacy levels are set then the query returns the error shown.

Notice that in the query above I’m reading data from Excel and sending it to the web service in a single M query. This is deliberate! In the next post in this series I’ll be looking at examples where the engine can’t work out what it’s supposed to do and errors, even if the privacy levels used suggest the query should run.

Data Privacy Settings In Power BI/Power Query, Part 1: Performance Implications

One of the most confusing, under-documented and widely-misunderstood features of Power BI and Power Query (or Excel “Get & Transform” or whatever you want to call it) are the data privacy settings. I get caught out by them all the time, so I thought it would be a good idea to write a series of blog posts demonstrating how they work and what effect they have in different scenarios using example M queries.

Before carrying on, I suggest you read the official Microsoft documentation on the subject here:
https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-privacy-levels/
It gives you a good grounding in what the different data privacy levels are and where you can set them in the Power BI UI. The same options are available in Excel with Power Query/Get & Transform.

In this first post I’m going to look at what the performance implications of different privacy levels can be. Let’s say you have two data sources. First, an Excel workbook with a single table in that contains the name of a day of the week:

image

The second is the DimDate table in the Adventure Works DW SQL Server sample database:

image

Here’s an M query called FilterDay that returns the day name from the table in the Excel workbook:

let
    Source = Excel.Workbook(
		File.Contents("C:\Filter.xlsx")
		, null, true),
    FilterDay_Table = Source{[Item="FilterDay",Kind="Table"]}[Data],
    ChangedType = 
	Table.TransformColumnTypes(
		FilterDay_Table,
		{{"Parameter", type text}}),
    Output = ChangedType{0}[#"Parameter"]
in
    Output

image

Here’s an M query called DimDate that filters data from the DimDate table in the Adventure Works DW database, returning only the rows where the EnglishDayNameOfWeek column matches the value returned by the FilterDay query above.

let
    Source = Sql.Databases("localhost"),
    DB = Source{[Name="Adventure Works DW"]}[Data],
    dbo_DimDate = DB{[Schema="dbo",Item="DimDate"]}[Data],
    RemovedColumns = Table.SelectColumns(dbo_DimDate,
		{"DateKey", "EnglishDayNameOfWeek"}),
    FilteredRows = Table.SelectRows(RemovedColumns, 
		each ([EnglishDayNameOfWeek] = FilterDay))
in
    FilteredRows

The first time you run this second query you’ll be prompted to enter credentials to connect to SQL Server, and then (assuming you haven’t set any of the options that get Power BI to ignore privacy levels) you’ll see the “Information is required about data privacy” prompt:

image

Clicking “Continue” allows you to set a privacy level for the SQL Server database and the root of the drive that the Excel workbook is on:

image

The dropdown boxes in the left-hand column allow you to be more specific in what you set privacy levels on; in the case of the Excel file you can choose to set permissions at all folders in the path down to the workbook and the workbook itself; in the case of SQL Server there is the option of setting privacy levels on the database rather than the server.

You can choose any of the three privacy levels (Public, Organizational or Private) for the SQL Server instance and the Excel file and the query will still run:

image

This is where things start to get confusing. At this point the workbook has a privacy level set to None, which is the default for newly-created data sources:

image

However, this “None” setting means it inherits its data privacy settings from the level set for the c:\ drive set earlier, even though this setting isn’t displayed on the “Data sources in current file” tab. Hopefully the fact this is happening will be addressed in a future release: I think the current UI is misleading.

For the purposes of this blog post the data privacy settings for SQL Server are irrelevant and it’s the settings on the Excel workbook that are important. So, to make things clearer, let’s say you explicitly set the data privacy setting for SQL Server to private and the data privacy setting for the Excel workbook itself to Public like so:

image

If you right-click on the FilteredRows step of the query that returns data from the DimDate table in SQL Server and select “View Native Query” (see here for more details on this feature) to see the SQL query generated in the background for this step like so:

image

You’ll see that query folding is taking place for this step and the filter on the EnglishDayNameOfWeek column is taking place in the SQL query:

image

select [_].[DateKey],
    [_].[EnglishDayNameOfWeek]
from 
(
    select [DateKey],
        [EnglishDayNameOfWeek]
    from [dbo].[DimDate] as [$Table]
) as [_]
where [_].[EnglishDayNameOfWeek] = 'Friday'

Query folding is almost always a good thing for the performance of a query. For more details on what query folding is, see here.

If, however, you set the privacy level for the Excel workbook to Private like so:

image

…even though the DimDate query still works, query folding does not take place for the Filtered Rows step. The View Native Query right-click option is greyed out, and Profiler shows that the following SQL is executed when the query is refreshed:

select [$Ordered].[DateKey],
    [$Ordered].[EnglishDayNameOfWeek]
from 
(
    select [DateKey],
        [EnglishDayNameOfWeek]
    from [dbo].[DimDate] as [$Table]
) as [$Ordered]
order by [$Ordered].[DateKey]

Note that there is no WHERE clause in this query and that the whole of the DimDate table is returned from SQL Server.

Data from a data source that has a privacy level of Private can never be sent to another data source. That is exactly what needs to happen for query folding to take place though: a value from the Excel workbook – the text “Friday” – needs to be embedded in the WHERE clause of the SQL query sent to SQL Server in order for filtering to happen inside the database.  The risk with query folding is that a DBA could monitor the queries that are being run on SQL Server, look at the WHERE clauses, and see data from your Excel workbook. That’s maybe not a problem with day names, but potentially an issue if you were working with more sensitive data like customer names or addresses. Therefore, with the Excel workbook’s privacy level set to Private, the whole of the DimDate table is downloaded into the M engine and the filtering has to take place there to maintain the privacy of the data in Excel. The query still runs but it will probably be a lot slower than it would have been had query folding taken place. With the privacy level of the Excel workbook set to Public, on the other hand, it is ok to send data from Excel to SQL Server so query folding does take place.

To sum up, in this post I have shown how different data privacy settings can affect the performance of a query by determining whether query folding takes place or not. In part 2 of this series I will show how different data privacy settings can determine whether a query executes at all.

[This post was updated on June 19th 2017 with details on how the “None” privacy level behaves and how permissions can be set at different levels and inherited – thanks to Curt Hagenlocher for providing these details]

SSAS Multidimensional: Are Your Indexes Processed?

If you are using SSAS Multidimensional and you use Process Update to process your dimensions, here’s something for you to try: run a Process Default on your cube. Does it finish in a few seconds? Then you’re ok. If it doesn’t, and it takes minutes or even longer then read on – you might have a problem that’s causing slow query performance.

One of the most common sources of query performance problems I see with my SSAS Multidimensional customers is unprocessed aggregations and indexes. If you run a Process Update on a dimension it may result in indexes and aggregations being dropped from partitions in your cubes; for more details on this, and why it happens, see this post:

https://blog.crossjoin.co.uk/2010/05/12/what-happens-when-you-do-a-process-update-on-a-dimension/

This classic post by Darren Gosbell explains how you can check if you have unprocessed aggregations on a partition:

http://geekswithblogs.net/darrengosbell/archive/2008/12/02/ssas-are-my-aggregations-processed.aspx

However, unprocessed indexes can also be a problem for query performance too. You can tell if the indexes on a partition are built by using the discover_partition_dimension_stat DMV. Here’s an example of how to use it for a partition in the Adventure Works database:

SELECT 
DIMENSION_NAME, ATTRIBUTE_NAME, ATTRIBUTE_INDEXED, 
ATTRIBUTE_COUNT_MIN, ATTRIBUTE_COUNT_MAX 
FROM SystemRestrictSchema($system.discover_partition_dimension_stat
        ,DATABASE_NAME = 'Adventure Works DW 2008'
        ,CUBE_NAME = 'Adventure Works'
        ,MEASURE_GROUP_NAME = 'Internet Sales'
        ,PARTITION_NAME = 'Internet_Sales_2003')

 

[For some background on running SSAS DMV queries, see here]

Here’s what the above query returns, a list of dimensions and attributes that are related to the partition:

image

If the ATTRIBUTE_INDEXED column shows false then indexes are not built for the attribute on the dimension. In this example no indexes are built at all on the partition; if I do a Process Index or Process Default on this partition, here’s what the DMV returns:

image

Now you can see the ATTRIBUTE_INDEXED property is set to true for most attributes. Note that there is an (All) attribute that is never indexed, and if you have set the AttributeHierarchyEnabled property to false or the AttributeHierarchyOptimizedState property to NotOptimized on an attribute, it will not have indexes built for it either (this is typically done to improve processing performance – see here for a few more details).

In a real-world cube it is likely that only a few indexes will be dropped on partitions as a result of a Process Update on a dimension, and even then this will depend on whether any changes take place in the dimension’s data, so you will have to look down the list of attributes returned by this DMV very carefully to see if ATTRIBUTE_INDEXED returns false when it should be returning true.

The solution to this problem, as several of the posts I’ve linked to above suggest, is to always run a Process Default on your cube as the last step in your processing schedule. A Process Default will process any object that is in an unprocessed state, so it will automatically rebuild any aggregations or indexes that are dropped as a result of Process Updates on dimensions.

Power BI Custom Data Connectors

Amid all the other announcements at Microsoft Build today came one I’ve been waiting for a long time: we can now build our own custom data connectors for Power BI. Here’s the announcement on the Power BI blog:

https://powerbi.microsoft.com/en-us/blog/data-connectors-developer-preview/

…and here are the docs:

https://github.com/Microsoft/DataConnectors

Although it might not sound that exciting, I think this is one of the biggest things that’s happened to Power BI in a long time. Microsoft’s most successful products are always the ones that are platforms for its partners to build their own solutions on top of, and Power BI is no exception. Custom visuals have made an important contribution to the success of Power BI as a whole, allowing third parties to build visuals that Microsoft either didn’t have the resources to build itself or didn’t realise people wanted. Similarly, custom data connectors will only increase the already impressive number of data sources that Power BI can connect to. My guess is that a lot of SaaS and data providers will want to build their own custom data connectors, and that the number of custom data connectors available is going to explode. Why go to all the trouble of trying to build BI and reporting features into your product when it’s quicker and easier to send the data to Power BI and your users get a much better experience as a result?

This doesn’t just affect Power BI though. Since the new Get & Transform functionality for loading data into Excel (see here) is based on the same Power Query technology that is used by Power BI, these custom data connectors will at some point work with Excel too. Although it will be a while before most Excel users upgrade to a version that supports custom data connectors, when that happens the sheer size of the Excel market means that there will be even more demand for custom connectors and this will in turn strengthen the Power BI ecosystem. Don’t forget that Analysis Services Tabular 2017 also now uses Power Query for loading data, so this will benefit corporate BI users too.

From a purely technical point of view today’s announcement included some welcome new capabilities for M coders like me. These include Visual Studio integration (intellisense for M!) and the ability to connect to web services and use OAuth for authentication, something that wasn’t really possible before. It’s also interesting that you can create custom data connectors that enable DirectQuery connections from Power BI, and that you aren’t limited to importing data.

I can’t wait to start using all these features and I’m sure they will be the subject of many future blog posts. Also, if you’re looking to hire someone with a good understanding of M to build a custom data connector, I’m your guy. You can find my contact details at http://www.crossjoin.co.uk/

DAX Functions, DirectQuery And Unrestricted Measures

If you use Power BI to connect to a data source such as SQL Server in DirectQuery mode, you’ll find that by default you cannot use the complete range of DAX functions inside your calculations. For example, here’s a screenshot from Power BI Desktop in DirectQuery mode with a measure where I’m trying to use the RankX() function:

image

Whenever you get an error message like “Function ‘RANKX’ is not supported in DirectQuery mode” you know you’re trying to use a DAX function that isn’t allowed by default in DirectQuery mode.

Why does this limitation exist? Well, as the documentation points out here:

https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-use-directquery/#limitations-of-directquery

…this is done for performance reasons. In Power BI, every time you interact with a report (by opening it, clicking on a slicer etc) a series of DAX queries is sent back to the Data Model to get the data needed by the report. In DirectQuery mode, all of these DAX queries – including any calculations – are translated into queries that are sent back to the data source. If that data source is SQL Server, then Power BI generates SQL queries. Not everything that you can do in a DAX calculation can be translated back into efficient SQL, so to prevent you from accidentally building calculations that will perform badly in DirectQuery mode Power BI tries to prevent you from using any DAX functions that it thinks it can’t generate efficient SQL for.

You can remove this restriction by clicking File/Options and Settings/Options to open the Options dialog and then going to the DirectQuery pane and selecting “Allow unrestricted measures in DirectQuery mode”:

image

There is no documented list of functions that work in DirectQuery mode (probably because it might change with each release) but there is a way to get one. Using SQL Server Management Studio it is possible to connect to Power BI Desktop and run DMV queries against the Power BI Data Model. If you query the MDSCHEMA_FUNCTIONS DMV like so:

select * from $SYSTEM.MDSCHEMA_FUNCTIONS

You’ll see that it returns a list of all of the DAX and MDX functions understood by Power BI; the last column, called DIRECTQUERY_PUSHABLE contains information used by Power BI to decide whether a calculation is available in DirectQuery mode or not:

image

[At the time of writing this query doesn’t work in DAX Studio for some reason, I think because of the data returned in the PARAMETERINFO column, but if you try to restrict the rows or columns returned by the DMV in any way the DIRECTQUERY_PUSHABLE column only returns empty values. This is why I say you have to use SQL Server Management Studio]

For the DAX functions in the list, if DIRECTQUERY_PUSHABLE is 0 then it is not available in DirectQuery mode by default. If it returns 1 then it is available in measures but not calculated columns; if it returns 3 it is available in both measures and calculated columns (I guess 2 means that it is only available in calculated columns, but there are no functions with this property value right now). However this is not a definitive guide to whether a function is actually ‘pushable’ or not: some functions could be pushable in some scenarios, or with certain combinations of parameters, and not others and the UI may perform additional checks to decide whether a function can be used.

Thanks to Jeffrey Wang and Marius Dumitru for answering my questions about this.

By the way, if you’re reading this, you might also be interested in this post I wrote a while back on the functionality available in the Query Editor in DirectQuery mode:

https://blog.crossjoin.co.uk/2015/11/11/power-bi-directquery-mode-not-just-ssas-directquery-v2-0/

UPDATE – as Marco points out in his comments below, there *is* documentation on which functions are supported in DirectQuery mode now. I just didn’t know about it…

New Book: Tabular Modelling In SQL Server Analysis Services

My good friends Marco Russo and Alberto Ferrari have published a second edition of the SSAS Tabular book that we wrote together a few years ago. It’s called “Tabular Modelling In SQL Server Analysis Services” and it covers pretty much everything you need to know about building models in SSAS 2016 Tabular. Although I didn’t have anything to do with preparing the second edition, it does include a few things I wrote for the first edition so I can’t pretend that this is an unbiased review – that said, I think it’s fair to say that everyone working with SSAS Tabular 2016 should have a copy of this book on their shelf. No-one knows SSAS Tabular better than Marco and Alberto and the chapters on engine internals and performance tuning are worth the price of the book alone. The only topic it doesn’t cover in detail is DAX, and of course if you want to learn DAX you should get a copy of Marco and Alberto’s equally brilliant book “The Definitive Guide To DAX”.

More details on the new book can be found in Marco’s blog post here and you can buy the book from Amazon UK here.

Using Text.BetweenDelimiters() To Extract URLs From A Web Page In Power BI/Power Query M

The Add Column By Example functionality that appeared in the April 2017 release of Power BI Desktop is, I have to say, very cool: I’ve used it quite a bit in the last few weeks and it really does work well. One thing I noticed while using it is that three new functions have been added to the M language to support this feature:

  • Text.BetweenDelimiters() which takes a text value and extracts the part that appears between two given delimiter values
  • Text.BeforeDelimiter() which takes a text value and extracts the part that appears before a given delimiter value
  • Text.AfterDelimiter() which takes a text value and extracts the part that appears after a given delimiter value

The functions themselves are quite straightforward and the online documentation has some good examples of how to use them. To save you the click here’s an extra example – the expression:

Text.BetweenDelimiters("Hello *world!??", "*", "!")

…returns the text “world”:

image

As it happens last week I received an email from a reader who wanted to know if it was possible to extract all the links from the href attributes in the source of a web page using M, and I realised that Text.BetweenDelimiters() would be very useful for doing this. I wrote the following M function to demonstrate:

(SourceURL as text, AttributeDelimiter as text) =>
let
	//Get HTML source
    Source = Text.FromBinary(Web.Contents(SourceURL)),
	//Function to find each link
    GetLink = (Counter as number) =>
                    let
                        CurrentLink = 
		Text.BetweenDelimiters(
			Source, 
			"href=" & AttributeDelimiter, 
			AttributeDelimiter,
			Counter
		)
                    in
                        if CurrentLink="" 
		then 
			{} 
		else 
			List.Combine({
				{CurrentLink}, 
				@GetLink(Counter+1)}
			),
	//Call function
    Output = GetLink(0)
in
    Output

A few things to note:

  • I’m using a combination of Text.FromBinary() and Web.Contents() to get the HTML source for the web page whose links we’re extracting
  • Since HTML allows the use of single and double quotes for attributes, I’ve added a parameter to my function called AttributeDelimiter to allow either to be passed in
  • Text.BetweenDelimiters only extracts one piece of text at a time, but you can specify which occurrence of the start delimiter it uses. I therefore used recursion to extract the contents of every href attribute in the HTML: I declare a function called GetLink, and from within that function I can make a recursive call by putting an @ before the function name as in line 22 above. It would probably be better to use List.Generate() instead of recursion here though.

Assuming the query that returns this function is called GetAllLinks

image

…then it can be called in a new query like so:

GetAllLinks(
	"https://msdn.microsoft.com/en-us/library/mt798303.aspx", 
	""""
)

One other thing to point out is how, in order to pass a double quote character to the function as text, since text has itself to be enclosed in double quotes I need to use four double quotes: “”””

The output of this query is a list containing all of the links from the href attributes on the page that are enclosed in double quotes:

image

I guess this could be taken even further to create a function that crawls a series of web pages and returns the links in all of them, then uses the Force Directed Graph custom visual or better still NodeXL in Excel to show which pages link to each other. I’ll leave that to someone else to do though…!

You can download a pbix file with all of the examples in this post here.