Power BI Data Privacy Levels And Cloud /Web-Based Data Sources Or Dataflows

I have already blogged in great detail many times about Power BI/Power Query data privacy settings (see this series for example) but there’s always something new to learn. Recently I was asked a question by Ian Eckert about how Power BI handles data privacy for cloud or web-based data sources after a dataset has been published, and it prompted yet more revelations…

Consider the following M query:

let
    Source = Xml.Tables(
        Web.Contents("https://blog.crossjoin.co.uk/feed/")
        ),
    channel = Source{0}[channel],
    language = channel{0}[language],
    out = Json.Document(
        Web.Contents(
            "https://data.gov.uk/api", 
            [
                RelativePath="3/action/package_search", 
                Query=[q=language]
            ]
            )
            ),
    result = out[result],
    results = result[results],
    #"Converted to Table" = Table.FromList(
        results, 
        Splitter.SplitByNothing(), 
        null, 
        null, 
        ExtraValues.Error
        ),
    Column1 = #"Converted to Table"{0}[Column1],
    #"Converted to Table1" = Record.ToTable(Column1)
in
    #"Converted to Table1"

It doesn’t do anything particularly interesting, but it does take data from one web-based data source (the RSS feed for this blog) and sends it to another (the UK government’s open data metadata search web service). As a result, in Power BI Desktop, if you set the data privacy settings for both data sources to Public then the query runs, but if you set the data privacy settings for both data sources to Private:

image

image

…As expected, you get the following error:

image

Formula.Firewall: Query ‘Test’ (step ‘Converted to Table1’) is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.

Now the strange thing is that, when you publish the dataset that contains this M query, refresh always works. Why? What’s more, other datasets that do something similar will always fail when refreshed.

It turns out that when you publish a dataset that uses cloud or web-based data sources like the two used here, the Power BI service does not use the data privacy settings you have set in Power BI Desktop but instead it automatically assigns data privacy levels as follows:

  • Data sources, like the ones used here, that use Anonymous authentication are automatically given the privacy level Public
  • All other data sources are given the privacy level Private.

Interestingly, Power BI dataflows also count as cloud-based data sources and because they do not use Anonymous authentication they default to Private too, so if you are combining data from a dataflow with another data source in your dataset then you need to be careful of this.

What’s more there is at the time of writing no way to change these data privacy levels in the Power BI web-based portal. Hopefully this will change soon.

There are some workarounds though!

First of all, you can force refresh to take place through a gateway. This might sound strange because in theory, if you’re only using cloud or web-based data sources, a gateway should not be necessary. However there are already similar scenarios where a gateway is needed, for example if you are scraping data from a web page you need to use a gateway, and if you are combining data from a cloud-based data source with an on-premises data source you also need to use a gateway. If you add your cloud/web-based data sources as data sources in your gateway (unfortunately it does not seem to be possible to add a dataflow as a data source in a gateway, though) you can set their data privacy levels in the Advanced Settings section in the Manage Gateways screen:

image

You will also need to set the “Use a data gateway” option to On in the Settings dialog for your dataset after it has been published:

image

The other workaround is to copy the M code for your query and paste it into a new blank M query in an entity in a dataflow, as Matthew Roche shows here. While it does not seem to be possible to set data privacy levels for individual data sources when creating an entity, it is possible to turn off data privacy checks for an entity completely. If you create a query that sends data from one data source to another (regardless, as far as I can see, of the authentication mechanism used), you will see the following message in the Power Query Online query editor:

image

The evaluation was cancelled because combining data from multiple sources may reveal data from one source to another. Click Continue if the possibility of revealing data is okay.

If you click Continue, data privacy checks are turned off and the query runs; you can also click the Options button on the ribbon and check the “Allow combining data from multiple sources” option:

image

If one of your data sources is already itself a dataflow you may need to do some editing of the M query to make things work, but as Matthew Roche shows here it is possible to have an entity in a dataflow refer to another entity without using a computed entity (which is a Premium-only feature).

[Thanks to Arthi Ramasubramanian Iyer from Microsoft for providing background information for this post]

Power BI/Power Query/M Pre-cons At SQLKonferenz, SQLBits and SQL Saturday Iceland

If you’re learning Power BI and you want to deepen your understanding of its data transformation and loading functionality, the Power Query engine and the M language, I’m doing one-day pre-conference seminars on this subject at three conferences over the next few months:

You’ll learn about all the functionality in the Power Query Editor window plus concepts such as merging, appending, parameters, functions, data privacy and M. I hope to see you at one of them!

Updated Power BI Video Training Course Now Available

I’m pleased to announce that the updated version of my Power BI video training course is now available via my friends at Project Botticelli. You can find out more and purchase it here:

https://projectbotticelli.com/power-bi-desktop-and-cloud-course-video-tutorial?pk_campaign=cwblog2019&code=crossjoin

If you book using the discount code CROSSJOIN you’ll get 10% off!

I’ve recorded a lot of new content to cover all the new features that were added to Power BI in 2018, and there are also new segments on Azure Analysis Services and Power BI Premium. You can watch a few of the videos for free to get a taste of the course.

Don’t forget that my older MDX and SSAS Multidimensional courses are also still available on the same site.

ODBC Data Sources, The SqlCapabilities Option And Power Query/Power BI Data Refresh Performance

A few weeks ago I received a great tip from Simon Nuss about a Power Query feature relating to ODBC data sources that I hadn’t seen before (although it’s in Power BI Desktop and Excel 365, so it can’t be that new) and which is potentially important for the performance of data refresh and of the Power Query Editor’s preview pane. It’s only visible if you click “Advanced options” when creating your ODBC data source:

image

It’s the “Supported row reduction clauses” option:

image

What does it do? Well, first of all let’s see what happens if you don’t set it. Let’s say you have an ODBC data source that uses the Microsoft ODBC Driver for SQL Server to connect to the Adventure Works DW database. If you create a new query that uses this data source to get the contents of the DimDate table it works pretty much as you’d expect; here’s the M code:

let
    Source =
        Odbc.DataSource(
            "dsn=Test SQL",
            [HierarchicalNavigation=true]
        ),
    #"Adventure Works DW_Database" =
        Source{
            [Name="Adventure Works DW",Kind="Database"]
            }[Data],
    dbo_Schema =
        #"Adventure Works DW_Database"{
            [Name="dbo",Kind="Schema"]
            }[Data],
    DimDate_Table =
        dbo_Schema{[Name="DimDate",Kind="Table"]}[Data]
in
    DimDate_Table

image

However, if you look in Profiler when the contents of the Preview window is refreshed, you’ll see that the Power Query engine is running a query that gets the entire contents of the table:

image

Obviously, if you are connecting to a big table this could make the Power Query Editor slow. What’s more, as this blog post shows, the Power Query Editor only actually wants to display 1000 rows and the fact that the query is requesting all the rows from the table suggests that even basic query folding is not happening on this ODBC data source.

Now let’s see what happens if you do select something from the “Supported row reduction clauses” dropdown. As the screenshot above shows, there are four values in the dropdown that relate to the form of SQL supported by your ODBC source to only get the first N rows from a table. If you don’t know which one to choose you can click the Detect button and it will select one for you. In this case I selected TOP:

image

Having done this, when you connect to the same DimDate table, you’ll see that now the Power Query Editor is only requesting the top 1000 rows from the table:

image

Also, if you look at the M code for the query, the Odbc.DataSource function has the as-yet undocumented SqlCapabilities option set in its second parameter,

Odbc.DataSource(
 "dsn=Test SQL",
 [HierarchicalNavigation=true,
 SqlCapabilities=[LimitClauseKind=LimitClauseKind.Top]
 ]
)

Actually it’s not quite undocumented because it is mentioned here as something that is only available if you’re using Odbc.DataSource in a custom connector; obviously things have changed.

Setting this option can make a big difference to the responsiveness of the Power Query Editor when you’re working with large tables and/or slow data sources – I saw this myself this week when working with one of my customers. It’s also going to be important if you use the Keep Rows/Keep Top Rows button on the ribbon or the Table.FirstN() M function; I suspect there are other, more obscure, reasons why it might speed refresh up even if you aren’t filtering the rows of the table but I’m not 100% sure.

One last thing to mention is that there is another undocumented function which is what, I think, the Detect button in the screenshots above uses to find out how to set the SqlCapabilities option. It’s called Odbc.InferOptions and it can be used like so:

Odbc.InferOptions("dsn=Test SQL")

This function returns a record containing a field called LimitClauseKind that tells you the type of limit clause that is supported:

image

image

Here’s a list of the possible LimitClauseKind values I found using #shared:

image

%d bloggers like this: