Data Privacy Settings In Power BI/Power Query, Part 3: The Formula.Firewall Error

In the first two parts of this series (see here and here) I showed how Power BI/Power Query/Excel Get & Transform’s data privacy settings can influence whether query folding takes place or even whether a query is able to run or not. In this post I’m going to talk about the situations where, whatever data privacy level you use, the query will not run at all and you get the infamous Formula.Firewall error.

I’ll admit I don’t understand this particular topic perfectly (I’m not sure anyone outside the Power Query dev team does) so what I will do is explain what I do know, demonstrate a few scenarios where the error occurs and show how to work around it.

Assume you have the two data sources described in my previous posts: an Excel workbook that contains just a single day name, and the DimDate table in SQL Server that can be filtered by the day name from Excel. Let’s also assume that both data sources have their data privacy levels set to Public. The following query, called FilterDay, loads the data from Excel and returns a text value containing the day name:

let
    Source = Excel.Workbook(
        File.Contents("C:\FilterParameter.xlsx"),
        null, true),
    FilterDay_Table = Source{[Item="FilterDay",Kind="Table"]}[Data],
    ChangedType = Table.TransformColumnTypes(
        FilterDay_Table,
        {{"Parameter", type text}}),
    Output = ChangedType{0}[#"Parameter"]
in
    Output


Now, look at the following query:

let
    Source = Sql.Database(
        "localhost",
        "adventure works dw",
        [Query="select DateKey, EnglishDayNameOfWeek from DimDate"]),
    FilteredRows = Table.SelectRows(
        Source,
        each ([EnglishDayNameOfWeek] = FilterDay))
in
    FilteredRows

It filters the contents of the DimDate table and only returns the rows where the EnglishDayNameOfWeek column matches the day name returned by the FilterDay query. Notice that there are two steps in this query: Source (which runs a SQL query) and FilteredRows (which does the filtering). Here’s the output:


As you can see from the screenshot, the query runs. In fact it runs whatever data privacy settings you have set on both the data sources, although it’s worth pointing out that if you use your own SQL in an M query (as I do in this case) this stops query folding in all subsequent steps, as described here.
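As an aside, if preserving query folding matters more than writing your own SQL, a folding-friendly version of the same two-step pattern would navigate to the DimDate table and let Power Query generate the WHERE clause itself. Here’s a minimal sketch, assuming DimDate lives in the dbo schema; whether the filter actually folds still depends on your data privacy levels, as discussed in the previous posts in this series:

let
    // connect to the database without a native SQL query, so folding remains possible
    Source = Sql.Database("localhost", "adventure works dw"),
    // navigate to the DimDate table (assumes the dbo schema)
    DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    // this filter can be folded back to SQL Server as a WHERE clause;
    // FilterDay is referenced in a separate step from the one that accesses the data source
    FilteredRows = Table.SelectRows(DimDate, each [EnglishDayNameOfWeek] = FilterDay)
in
    FilteredRows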

Now take a look at the following version of the query:

let
    Source = Table.SelectRows(
        Sql.Database(
            "localhost",
            "adventure works dw",
            [Query="select DateKey, EnglishDayNameOfWeek from DimDate"]),
        each ([EnglishDayNameOfWeek] = FilterDay))
in
    Source

The important difference here is that there is now only one step in this query instead of two: the SQL query and the filtering both take place in the same step. More importantly, regardless of the data privacy settings, the query now fails with the following error:

Formula.Firewall: Query ‘DimDate With Native Query Single Step Fails’ (step ‘Source’) references other queries or steps, so it may not directly access a data source. Please rebuild this data combination.


The problem here is that the Power Query engine is not allowed to access two different data sources originating from different queries in the same step – as far as I understand it this is because it makes it too hard for the engine to work out whether a step connects to a data source or not, and so which data privacy rules should be applied.

At this point you might think that it’s straightforward to break your logic up into separate steps, as in the first example above. However there are some situations where it’s not so easy to work around the problem. For example, consider the following query:

let
    Source = Sql.Database(
        "localhost",
        "adventure works dw",
        [Query="select DateKey, EnglishDayNameOfWeek
                from DimDate
                where EnglishDayNameOfWeek='" & FilterDay & "'"])
in
    Source

In this example I’m dynamically generating the SQL query that is being run and passing the name of the day to filter by into the WHERE clause. In the two previous examples the query that was run had no WHERE clause and the filtering on day name took place inside Power BI; here the filtering happens inside the SQL query itself, so in order to generate the WHERE clause I have to refer to the value returned by the FilterDay query in the same step that accesses the data source. As a result, this query also gives the same Formula.Firewall error seen above.

How can you work around this? Well, the following version of the query that attempts to reference FilterDay in a separate step doesn’t work either:

let
    DayAsStep = FilterDay,
    Source = Sql.Database(
        "localhost",
        "adventure works dw",
        [Query="select DateKey, EnglishDayNameOfWeek
                from DimDate
                where EnglishDayNameOfWeek='" & DayAsStep & "'"])
in
    Source

 

Luckily, it turns out that if you use the Value.NativeQuery() function to run your query instead you can avoid the error. As I showed here, you can use this function to pass parameters to SQL queries. If you generate the record containing the parameters for the query as a separate step (called ParamRecord here), like so:

let
    Source = Sql.Database("localhost", "adventure works dw"),
    ParamRecord = [FilterParameter=FilterDay],
    Query = Value.NativeQuery(
        Source,
        "select DateKey, EnglishDayNameOfWeek
         from DimDate
         where EnglishDayNameOfWeek=@FilterParameter",
        ParamRecord)
in
    Query

Then the query runs successfully.

There is another way to avoid the error. In all the examples above I have two queries: one to get data from Excel, one to get filtered data from SQL Server. If these two queries are combined into a single query, it doesn’t matter if data from different data sources is accessed in the same step. So, unlike all of the queries above, the following query does not reference any other queries: it gets the day name from the Excel workbook in the ExcelSource step, then runs the dynamic SQL query in the SQLSource step, and it runs successfully:

let
    ExcelSource = Excel.Workbook(
        File.Contents("C:\FilterParameter.xlsx"),
        null, true),
    FilterDay_Table = ExcelSource{[Item="FilterDay",Kind="Table"]}[Data],
    ChangedType = Table.TransformColumnTypes(
        FilterDay_Table,
        {{"Parameter", type text}}),
    FilterDayStep = ChangedType{0}[#"Parameter"],
    SQLSource = Sql.Database(
        "localhost",
        "adventure works dw",
        [Query="select DateKey, EnglishDayNameOfWeek
                from DimDate
                where EnglishDayNameOfWeek='" & FilterDayStep & "'"])
in
    SQLSource

Clearly the M engine doesn’t get confused about accessing data from different sources in the same step if those data sources are created in the same query.

Of course you can avoid the Formula.Firewall error and make query folding happen as often as possible by turning off data privacy checks completely in the Options dialog. This will be the subject of the next post in this series.

Using Parameter Tables To Control Data Refresh In Power BI

The use of parameter tables is a well-known technique in Power Query. It involves using a table in the current Excel workbook to hold various parameter values that are read by one Power Query query; those parameter values are then passed to another Power Query query to control how it behaves. Ken Puls has a great post describing all this here, and similarly a lot of the demos in my video on building a reporting solution in Power Query involve reading values from a table which are then used to filter data in a SQL query.
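For reference, here’s a minimal sketch of the Excel version of the technique; the table name MyParameters and its Value column are hypothetical names used purely for illustration:

let
    // read the Excel table called "MyParameters" from the current workbook (hypothetical name)
    ParameterTable = Excel.CurrentWorkbook(){[Name="MyParameters"]}[Content],
    // return the value in the first row of the (hypothetical) "Value" column
    ParameterValue = ParameterTable{0}[Value]
in
    ParameterValue

A query like this can then be referenced by name from any other query in the workbook.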

The problem with Power BI Desktop and PowerBI.com is that without a current Excel workbook there’s nowhere to store parameter values (at least at the moment) so you can’t implement the technique in exactly the same way. However, with a bit of creativity, it is possible to do something very similar.

Consider the following M query that gets all the rows from the DimDate table in the Adventure Works DW SQL Server database where the day name is Tuesday.

let
    Source = Sql.Database("localhost", "adventure works dw"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    #"Filtered Rows" = Table.SelectRows(
     dbo_DimDate, 
     each ([EnglishDayNameOfWeek] = "Tuesday"))
in
    #"Filtered Rows"

From this you can build a report in Power BI Desktop that looks something like this:


The question is, in this case, how can you make this dynamic and allow your end users to enter a different day name to filter by, and how can you ensure that this dynamic filtering works even when the report has been deployed to PowerBI.com?

There may not be a current workbook to store parameters, but in PowerBI.com you can of course display Excel workbooks stored in OneDrive for Business as reports (full details on how to do this are here). Rather than display an Excel report, though, you can instead display a workbook containing an Excel table that holds a day name:


This is going to be your parameter table. First minor irritation: although it looks as though you can change the value in the table inside PowerBI.com, the value doesn’t actually get saved; you have to use the Edit option to open the workbook in Excel Online before any changes you make are saved.


The next problem is this: how can you read the day name parameter value from a table stored in an Excel workbook in OneDrive for Business? It’s actually possible using the Excel Services OData API, and I blogged about how to call this API in Power Pivot a few years ago here. With a workbook called FilterParameter.xlsx and an Excel table called FilterDay, here’s an example M query that reads the day name parameter value:

let
    Source = OData.Feed("https://enteryourdomainhere-my.sharepoint.com/_vti_bin/ExcelRest.aspx/personal/chris_enteryourdomainhere_onmicrosoft_com/Documents/FilterParameter.xlsx/OData/FilterDay"),
    #"0" = Source{[excelRowID=0]}[Enterparametervalue]
in
    #"0"

If this M query is called DayFilter, then the value it returns can be used in the original query to make the filter dynamic (you will also need to set your data privacy levels correctly or enable Fast Combine):

let
    Source = Sql.Database("chriszbook", "adventure works dw"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    #"Filtered Rows" = Table.SelectRows(
     dbo_DimDate, 
     each ([EnglishDayNameOfWeek] = DayFilter))
in
    #"Filtered Rows"


At this point you’ll have something that works in Power BI Desktop, but there’s still more to do to make this work after the report has been published to PowerBI.com. For a start, you’ll need to make sure you’re using the latest version of the Power BI Personal Gateway and that you have enabled Fast Combine in it by following the instructions in this blog post. You also need to make sure that in PowerBI.com, when you set up the credentials for the OData connection to the Excel table, you use the OAuth2 Authentication Method option:


And with that, you’re done. You can edit the Excel workbook to change the day name that you want to filter by, and to refresh your report in PowerBI.com you can use the Refresh Now option on the dataset:


While dashboards reflect the updated data automatically, if you’re looking at a report you may also need to use the Refresh button at the top of the report to see the updated data:


And there you have it: a parameter table in PowerBI.com. To be honest, I think there are slightly too many fiddly steps for users to follow in this technique for me to be happy recommending its use unconditionally, but it should be useful in some scenarios. Hopefully there will be an easier way of accomplishing the same thing in Power BI in future…

Sentiment Analysis In Excel With Azure Machine Learning And Power Query

You may have seen Joseph Sirosh’s blog post last week about the ability to publish Azure Machine Learning models to the Azure Marketplace, and the fact that Microsoft have already published a number of APIs there. There’s a new Excel add-in that can be used with these APIs, but I noticed that at least one of them, the Sentiment Analysis API, can be used directly from Power Query too.

The first thing you need to do is go to the Azure Marketplace, sign in with your Microsoft account, and subscribe to the Lexicon Based Sentiment Analysis API. The docs say you get 25,000 transactions free per month, although there doesn’t appear to be a way to pay for more; that said, the number of transactions remaining shown on my account kept resetting, so maybe there is no limit. The API itself is straightforward: pass it a sentence to evaluate and it will return a score between –1 and 1, where 1 represents a positive sentiment and –1 a negative sentiment. For example, the sentence “I had a good day” returns the value 1:


…whereas the sentence “I had a bad day” returns –1:


You can now go to Power Query and click From Azure Marketplace (you will need to enter your credentials at this point if this is the first time you’ve used the Azure Marketplace from Power Query):


…and then, when the Navigator pane appears, double-click on Score:


The API is exposed as a function in Power Query (I guess because it’s an OData service operation, but I’m not sure) and double-clicking on Score invokes the function. You can enter a sentence here and the Query Editor will open to show the score:


However, to do anything useful you don’t want to invoke the function just yet – you want a query that returns the function itself. To do this you need to edit the query. Go to the Advanced Editor and you’ll see that the M code for the query looks something like this:

let
    Source = Marketplace.Subscriptions(),
    #"https://api.datamarket.azure.com/data.ashx/aml_labs/lexicon_based_sentiment_analysis/v1/" =
        Source{[ServiceUrl="https://api.datamarket.azure.com/data.ashx/aml_labs/lexicon_based_sentiment_analysis/v1/"]}[Feeds],
    Score1 =
        #"https://api.datamarket.azure.com/data.ashx/aml_labs/lexicon_based_sentiment_analysis/v1/"{[Name="Score"]}[Data],
    #"Invoked FunctionScore1" = Score1("I had a good day")
in
    #"Invoked FunctionScore1"

 

You need to remove the last step (called #"Invoked FunctionScore1"), which invokes the function, leaving:

let
    Source = Marketplace.Subscriptions(),
    #"https://api.datamarket.azure.com/data.ashx/aml_labs/lexicon_based_sentiment_analysis/v1/" =
        Source{[ServiceUrl="https://api.datamarket.azure.com/data.ashx/aml_labs/lexicon_based_sentiment_analysis/v1/"]}[Feeds],
    Score1 =
        #"https://api.datamarket.azure.com/data.ashx/aml_labs/lexicon_based_sentiment_analysis/v1/"{[Name="Score"]}[Data]
in
    Score1

 

You can now click the Close and Load button to close the Query Editor window; you now have a function called Score that you can call in other queries. For example, take the following Excel table (called Sentences):


Click on the From Table button to load this table into Power Query, then in the Query Editor click the Add Custom Column button and add a new custom column called SentimentScore with the expression:

Score([Sentence])


You’ll then be prompted to set a data privacy level for the data you’re using, because calling this function involves sending data from your worksheet up to the API where someone could potentially see it.


Click the Continue button and set the privacy level for the workbook to Public, so that this data can be sent to the API:


Click Save and you’ll see that the sentiment score has been added as a column containing a Record value. Click on the Expand icon in the SentimentScore column, then click OK:


And at last you’ll see the scores themselves:


Here’s the code:

let
    Source = Excel.CurrentWorkbook(){[Name="Sentences"]}[Content],
    #"Added Custom" = Table.AddColumn(Source, "SentimentScore", each Score([Sentence])),
    #"Expand SentimentScore" = 
       Table.ExpandRecordColumn(#"Added Custom", "SentimentScore", 
       {"result"}, {"SentimentScore.result"})
in
    #"Expand SentimentScore"

 

You can download the sample workbook here.

Of course, I couldn’t resist repeating this exercise with all of my Facebook status updates, which can be accessed from Power Query very easily. Here’s my monthly average sentiment score from June 2009 to now:
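I haven’t shown the query that gets the status updates themselves, but once you have a table of dated updates with a sentiment score column, the monthly average can be calculated with a grouping step along these lines; the StatusUpdates table and its UpdateDate and SentimentScore columns are hypothetical names used for illustration:

let
    // "StatusUpdates" is a hypothetical table with a date column and a numeric sentiment score column
    Source = Excel.CurrentWorkbook(){[Name="StatusUpdates"]}[Content],
    ChangedType = Table.TransformColumnTypes(Source,
        {{"UpdateDate", type date}, {"SentimentScore", type number}}),
    // bucket each update by the first day of its month
    AddedMonth = Table.AddColumn(ChangedType, "Month",
        each Date.StartOfMonth([UpdateDate]), type date),
    // average the sentiment scores for each month
    Grouped = Table.Group(AddedMonth, {"Month"},
        {{"AverageSentiment", each List.Average([SentimentScore]), type number}}),
    Sorted = Table.Sort(Grouped, {{"Month", Order.Ascending}})
in
    Sorted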


As you can see, I was in a particularly good mood this August – probably because I was on holiday for almost the whole month.

The Ethics Of Big Data

Some time ago I received a review copy of a book called “Ethics Of Big Data” from O’Reilly; I didn’t get round to writing a review of it here for a number of reasons but, despite its flaws (for example its brevity and limited scope), it’s worth reading. It deals with the ethics of data collection and data analysis from a purely corporate point of view: if organisations do not think carefully about what they are doing then

“Damage to your brand and customer relationships, privacy violations, running afoul of emerging legislation, and the possibility of unintentionally damaging reputations are all potential risks”

All of which is true, although I think what irked me about the book when I read it was that it did not tackle the wider and (to my mind) more important question of the social impact of new data technologies and their application. After all, this is what you and I do for a living – and I know that I haven’t spent nearly enough time thinking these issues through.

What prompted me to think about this again was a post by Adam Curtis which argues that the way that governments and corporations are using data is stifling us on a number of levels from the personal to the political:

“What Amazon and many other companies began to do in the late 1990s was build up a giant world of the past on their computer servers. A historical universe that is constantly mined to find new ways of giving back to you today what you liked yesterday – with variations.

Interestingly, one of the first people to criticise these kind of “recommender systems” for their unintended effect on society was Patti Maes who had invented RINGO. She said that the inevitable effect is to narrow and simplify your experience – leading people to get stuck in a static, ever-narrowing version of themselves.

Stuck in the endless you-loop.”

Once our tastes and opinions have been reduced to those of the cluster the k-means algorithm has placed us in, we have become homogenised and easier to sell to, slaves to our past behaviour. Worse, the things we have in common with the people in other clusters become harder to see. Maybe all of this is inevitable, but if there is going to be an informed debate on this then shouldn’t we, as the people who actually implement these systems, take part in it?

Analysis Services and Solid State Disks

There’s a lot of discussion going on in the SQL Server relational world about solid state disks and their benefits, but for some reason very little has been said about how useful they could be for Analysis Services. And yet, with their impressive performance for random reads, SSAS and SSDs would seem to be the perfect combination. I think the reason for the silence is that there’s less general understanding of SSAS performance issues out there in the blogosphere, and probably also that, because there are fewer really large installations of SSAS, not many people have had the chance to test SSDs with SSAS properly.

Anyway, the other week I was lucky enough to spend some time testing SSDs with SSAS for a customer of mine. For reasons of privacy I can’t share any significant details about what I found, and before I go on I need to point out that what I was doing was comparing the performance of a particular SSAS cube on a particular existing hard disk configuration with the same SSAS cube on a particular SSD configuration. So while it’s certainly possible to say that, in lab tests, SSDs can perform extremely well for certain tasks, in the real world you have to ask the following question: what benefit will an SSD give me for my cube, over my existing disk configuration? So in fact even if I could share the details of my tests they wouldn’t be all that useful.

That said, here are a few high-level findings:

  • Backup on the SSD was approximately 50% faster than on their existing, very respectable, hard disk configuration; Restore was 65% faster. These were the clearest examples of the SSD showing massive performance improvements.
  • For various reasons I couldn’t do full tests on processing performance, but Process Index operations were about 10-15% faster, as was Synchronisation.
  • Performance of individual queries was not significantly faster, maybe 1-2%; there were some cases where, with large numbers of concurrent users, I saw a bigger improvement of around 10%. This didn’t surprise me at all because I knew that any query performance issues the customer has are related to the formula engine and not the storage engine, so IO wasn’t much of a problem in the first place.

These tests also led to some interesting behind-the-scenes discussions about how you can actually measure the impact of IO on SSAS storage engine operations, which Teo has already blogged about here; his posts are well worth reading:
http://prologika.com/CS/blogs/blog/archive/2010/04/13/analysis-services-i-o-time.aspx
http://prologika.com/CS/blogs/blog/archive/2010/04/16/using-xperf-to-test-analysis-services.aspx

For this particular customer, my recommendation was that it wasn’t worth the time, money and effort to move to SSDs, at least from a query performance point of view. In fact, based on several years’ experience of tuning SSAS implementations, I find that in most cases IO is not one of the most important factors in determining query performance – it’s more often the case that the problem lies in the formula engine, and even when it doesn’t, an appropriate partitioning strategy and aggregation design can work wonders.

However, let me be clear: I am not making some general statement that SSDs are not useful for SSAS. What I am doing is giving the usual ‘your mileage may vary’ and ‘it depends’ answer. There are definitely going to be SSAS implementations where SSDs will make a massive difference to query performance, and where no other tuning technique can have the same impact. For example, I’ve heard that with distinct count measures (especially when there are a large number of concurrent users) SSDs can lead to very impressive improvements in query performance; I would assume that where there are large many-to-many relationships you would also benefit accordingly. What you need to do before you spend money on SSDs is to understand the causes of any performance issues you currently have and make sure you tune your cube to the best of your abilities.

Has anyone else had any experience with SSAS and SSDs they’d like to share?

UPDATE: the SQLCat team have published this technical note on the subject: http://sqlcat.com/technicalnotes/archive/2010/09/20/analysis-services-distinct-count-optimization-using-solid-state-devices.aspx
