The Power Query Branding Problem

A few years ago I started blogging about Power Query. Back then life was simple: I put “Power Query” in the title of a post and everyone knew what I was writing about, because Power Query was an Excel add-in you could download and install. Now, however, the technology has been renamed “Get & Transform” in Excel 2016 and is a native feature of Excel; the name “Power Query” only applies to the add-in for Excel 2010 and 2013. What’s more, the same technology is used in Power BI’s Query Editor and it’s also now in Azure Analysis Services, Analysis Services 2017 Tabular and the Common Data Service. This is obviously a good thing – I think Power Query is one of the best things to come out of Microsoft in the last decade – but it also presents me with a problem. How can I write about this technology if it doesn’t have a single, official, easily identifiable name?

In more recent times I’ve written posts with unwieldy names like “Introduction to Insert Topic Name Here in Power Query/Power BI/Excel 2016 Get & Transform” and in the future I suppose this will have to grow to “Introduction to Insert Topic Name Here in Power Query/Power BI/Excel 2016 Get & Transform/Analysis Services Data Loading/Common Data Service”. Tagging and categorising blog posts can help here, I know, but it’s the title of a blog post that’s the main determining factor as to whether it gets read or not when someone is looking at a list of search results. It’s getting ridiculous, but how else can I ensure that someone searching for the solution to a data loading problem in Excel 2016 Get & Transform will find a post I’ve written that contains the answer but shown in Power BI?

Inside Microsoft I understand that the team that builds this technology is known as the Power Query team. I certainly think about this technology as being called Power Query, as do a lot of other people in the community. However, my argument is that I can’t just use the name “Power Query” when I’m writing or speaking about this technology because most of its users – especially those who are new to it and who need the most help – don’t think of it as “Power Query”. They think of it as Excel 2016 Get & Transform, the Query Editor in Power BI Desktop and so on, the specific instances of it.

Maybe I’m making too big a deal of this, but in my opinion this is a problem not just for me but for Microsoft too. We all know how much developers rely on internet searches to find solutions to problems, and not having a single name for this technology makes it much harder to search successfully. This in turn makes it less likely that when a developer runs into a problem they will be able to solve it, which in turn means they are less likely to want to use this technology in future.

What’s the answer? It has to be to make the “Power Query” brand visible somewhere in the UI of all the products that use Power Query technology. I know there’s a risk of confusing users instead of helping them here (am I using Power Query or Power BI?), but it could be as simple as making a few small changes like renaming the “Query Editor” window to be the “Power Query Editor”:

image

I think that would be enough to let people know that “Power Query” is a technology in its own right and that content referring to “Power Query” is relevant to Excel, Power BI, SSAS and everywhere else that Power Query is used. It would also be nice if, now that M is the official name of the M language (and not Power Query Formula Language), the Advanced Editor window and the Custom Column dialog let users know that the code they were writing in them was in a language called M and not some mysterious, nameless scripting language.

What do you think? I’m interested to hear your comments and opinions…

UPDATE: victory is ours! See this comment from Faisal Mohamood of the Power Query team below
Hey there Chris – what you are saying makes complete sense. Power Query is the name of this capability and we will highlight the name of this capability as such in experiences where you are working with Power Query (and M).

Obscure MDX Month: Deselecting Members In An Excel PivotTable Leads To Missing Rows

Here’s some interesting (and borderline buggy) Excel PivotTable behaviour I learned about today from Charles-Henri Sauget, as well as the workaround for it courtesy of the great Greg Galloway.

Say you have a large dimension attribute hierarchy with 200,000 members on it in SSAS MD (or the equivalent in Tabular or Power Pivot) and drop it onto the rows of an Excel PivotTable. As you would expect, you get a PivotTable with 200,000 rows in it:

image

However if you then deselect just one member on rows like so:

image

…you’ll find that the PivotTable does not have 199,999 rows – in Excel 2016 it only has 32,000 rows:

image

(different versions of Excel may return different numbers of rows here, but still not the full number).

If you look at the MDX generated by Excel it consists of all of the member unique names that are still selected, and unsurprisingly it’s a gigantic query:

image

However, it turns out you can make Excel do the sensible thing and use the Except() function to return everything apart from the deselected member by going to the Field Settings dialog and selecting “Include new items in manual filter”:

image

image

This then gives you the expected number of rows in the PivotTable:

image

I suspect the reason Excel is generating the crazy-long MDX statement by default is that it’s the only way to prevent new members being added to the PivotTable if they are added to the attribute hierarchy in future. On a really large attribute hierarchy, though, the risk is that the resulting MDX query might exceed the maximum length of a query, so Excel has to truncate the number of members returned to make the query shorter. With “Include new items in manual filter” selected, though, it’s ok if new members do get added to the PivotTable in the future so it’s ok to use the Except() function in the query.

Obscure MDX Month: Optimising MDX That Uses The RGB() Function

In the first blog post in this series a few weeks ago I mentioned that calling Excel and VBA functions from MDX came with a query performance penalty. In this post I’ll give you an illustration of this using the VBA function that I suspect is most frequently called in MDX: the RGB() function.

Take the following MDX query as a baseline:

WITH
MEMBER MEASURES.TEST AS 
[Measures].[Internet Sales Amount]
SELECT {[Customer].[Country].[Country].MEMBERS} ON 0,
NON EMPTY
[Date].[Date].[Date].MEMBERS
*
[Product].[Product].[Product].MEMBERS
 ON 1
FROM
[Adventure Works]
WHERE(MEASURES.TEST)
CELL PROPERTIES 
VALUE, 
FORMATTED_VALUE, 
BACK_COLOR

 

It returns Countries on columns and all non empty combinations of Date and Product on rows, and the calculated measure returns the value of the Internet Sales Amount measure:

image 

On a SE engine cache it runs in 2.5 seconds on my laptop. With a BACK_COLOR property added to the calculated measure that uses the RGB() function to return the code for red if the measure value is greater than $5000, query performance is a lot worse: it goes up to 6.5 seconds on a warm SE cache.

WITH
MEMBER MEASURES.TEST AS 
[Measures].[Internet Sales Amount]
,BACK_COLOR=
IIF([Measures].[Internet Sales Amount]>5000, 
RGB(255,0,0), 
RGB(255,255,255))
SELECT {[Customer].[Country].[Country].MEMBERS} ON 0,
NON EMPTY
[Date].[Date].[Date].MEMBERS
*
[Product].[Product].[Product].MEMBERS
 ON 1
FROM
[Adventure Works]
WHERE(MEASURES.TEST)
CELL PROPERTIES VALUE, FORMATTED_VALUE, BACK_COLOR

 

image

That’s a big increase just to do some cell highlighting! However in this case the RGB() function can only return two possible integer values, so if you replace the RGB() function with the integers it returns, like so:

WITH
MEMBER MEASURES.TEST AS 
[Measures].[Internet Sales Amount]
,BACK_COLOR=
IIF([Measures].[Internet Sales Amount]>5000, 
255, 
16777215)
SELECT {[Customer].[Country].[Country].MEMBERS} ON 0,
NON EMPTY
[Date].[Date].[Date].MEMBERS
*
[Product].[Product].[Product].MEMBERS
 ON 1
FROM
[Adventure Works]
WHERE(MEASURES.TEST)
CELL PROPERTIES VALUE, FORMATTED_VALUE, BACK_COLOR

…then the query returns in around 3.5 seconds. The last thing to remember is that IIF() statements can perform better if one branch returns null, and in this case we can replace the integer value 16777215 that gives the white background with a null and get the same result:

WITH
MEMBER MEASURES.TEST AS 
[Measures].[Internet Sales Amount]
,BACK_COLOR=
IIF([Measures].[Internet Sales Amount]>5000, 
255, 
NULL)
SELECT {[Customer].[Country].[Country].MEMBERS} ON 0,
NON EMPTY
[Date].[Date].[Date].MEMBERS
*
[Product].[Product].[Product].MEMBERS
 ON 1
FROM
[Adventure Works]
WHERE(MEASURES.TEST)
CELL PROPERTIES VALUE, FORMATTED_VALUE, BACK_COLOR

Now the query returns in around 3 seconds, only 0.5 seconds slower than the original with no colour coding.

Obscure MDX Month: Optimising The Performance Of Total-To-Date Calculations In SSAS Multidimensional

Here’s a SSAS Multidimensional MDX tip that I picked up at the PASS Summit back in 2008 at Mosha’s excellent “MDX Deep Dive” precon (incidentally the slides and supporting material are still available here, although a lot of the material is out of date). It’s regarding total-to-date calculations, ie calculations where you are doing a running total from the very first date you have data for up to the current date. The standard way of writing these calculations is something like this:

WITH
MEMBER MEASURES.[TTD Sales] AS
SUM(
NULL:[Date].[Calendar].CURRENTMEMBER,
[Measures].[Internet Sales Amount])

SELECT
[Customer].[Country].[Country].MEMBERS 
ON 0,
NON EMPTY
[Date].[Calendar].[Date].MEMBERS
*
[Product].[Product].[Product].MEMBERS
ON 1
FROM
[Adventure Works]
WHERE(MEASURES.[TTD Sales])

This query runs in around 19.2 seconds on my laptop on a cold cache. However if you rewrite it like this:

WITH
MEMBER MEASURES.[PTTD SALES] AS
SUM(
NULL:[Date].[Calendar].CURRENTMEMBER.PARENT.PREVMEMBER,
[Measures].[Internet Sales Amount])

MEMBER MEASURES.[TTD Sales] AS
MEASURES.[PTTD SALES]
+
SUM(
[Date].[Calendar].CURRENTMEMBER.FIRSTSIBLING:
[Date].[Calendar].CURRENTMEMBER,
[Measures].[Internet Sales Amount])

SELECT
[Customer].[Country].[Country].MEMBERS 
ON 0,
NON EMPTY
[Date].[Calendar].[Date].MEMBERS
*
[Product].[Product].[Product].MEMBERS
ON 1
FROM
[Adventure Works]
WHERE(MEASURES.[TTD Sales])

…it runs slightly faster: around 16.1 seconds on a cold cache on my laptop. Of course this is a very big query, and on most normal queries the difference in performance would be much less significant, but it could still be useful. In fact it’s very similar to the kind of tricks people used to optimise the performance of YTD calculations back in the days of SSAS 2000 – the subject of my second-ever blog post from December 2004! The idea here is that instead of summing up a large set of dates, the calculation sums up all the dates in the current month and then all the months from the beginning of time up to and including the previous full month. For YTD and most other something-to-date calculations trick like this are no longer needed, and indeed are counter-productive and will make your calculations slower. However it seems that for total-to-date calculations they can still help performance.

Obscure MDX Month: Current and CurrentOrdinal

When you are writing an MDX expression, everywhere you use a set you can give that set a name and then reference the name later on. This is known as creating an inline named set, something I have blogged about a few times (see here and here) over the years. When you are iterating over a set using a function like Generate() or Filter(), if you give that set a name you can then use the Current and CurrentOrdinal functions to find out more about the item in the set returned at the current iteration.

Consider the following MDX query on the Adventure Works cube:

SELECT 
{[Measures].[Internet Sales Amount]} 
ON 0,
{[Customer].[Gender].[Gender].MEMBERS
*
[Customer].[Marital Status].[Marital Status].MEMBERS}
ON 1
FROM
[Adventure Works]

It returns a set of four tuples on rows: every combination of Gender and Marital Status:

image

If you pass the set on rows to the Filter() function and give it a name (for example MySet) you can then use the CurrentOrdinal function to find the 1-based ordinal of the current iteration. This query uses the CurrentOrdinal function to filter the set shown above so only the first and third items in the set are returned:

SELECT {[Measures].[Internet Sales Amount]} ON 0,
FILTER(
{[Customer].[Gender].[Gender].MEMBERS
*
[Customer].[Marital Status].[Marital Status].MEMBERS}
AS MYSET,
MYSET.CURRENTORDINAL=1 OR
MYSET.CURRENTORDINAL=3)
ON 1
FROM
[Adventure Works]

 

image

With an inline named set you can also use the Current function to return the tuple at the current iteration. Here’s another query that uses the Current function to remove the tuple (Female, Single) from the set:

SELECT 
{[Measures].[Internet Sales Amount]} ON 0,
FILTER(
{[Customer].[Gender].[Gender].MEMBERS
*
[Customer].[Marital Status].[Marital Status].MEMBERS}
AS MYSET,
NOT(
MYSET.CURRENT IS 
([Customer].[Gender].&[F],[Customer].[Marital Status].&[S])
)
)
ON 1
FROM
[Adventure Works]

image

I won’t pretend that these functions are massively useful, but fans of super-complex MDX will enjoy this vintage post where I used them.

Obscure MDX Month: Recreating The Star Ratings Measure In MDX Using Excel Functions

I still love MDX, but I’m aware that I blog about it less and less – which is a shame, I know. Therefore I’ve decided that for the next four weeks I’m going to write about some obscure MDX topics that hopefully will make all you SSAS MD diehards out there feel less neglected… even if they don’t have much practical use.

Let’s start off with recreating my ever-popular DAX star-ratings measure in MDX. Well, not exactly pure MDX, but did you know that in MDX you can call some Excel functions (in the same way you can call some VBA functions)? It’s a really, really bad thing to do from a query performance point of view, but it does allow you to do some useful calculations that might otherwise be impossible. Here’s a query on the Adventure Works cube that uses the Excel Rept() and Unichar() functions (functions that do not exist in MDX proper) to recreate my start-ratings measure:

WITH
MEMBER MEASURES.STARS AS
REPT(
UNICHAR(9733),
CINT([Measures].[Internet Sales Amount]/10000))
+
REPT(
UNICHAR(9734),
10-CINT([Measures].[Internet Sales Amount]/10000))

SELECT {[Measures].[Internet Sales Amount],MEASURES.STARS} ON 0,
ORDER(
[Date].[Date].[Date].MEMBERS,
[Measures].[Internet Sales Amount],
BDESC)
ON 1
FROM
[Adventure Works]

 

image

Here’s the same measure used in a PivotTable:

image

SSAS Multidimensional: Are Your Indexes Processed?

If you are using SSAS Multidimensional and you use Process Update to process your dimensions, here’s something for you to try: run a Process Default on your cube. Does it finish in a few seconds? Then you’re ok. If it doesn’t, and it takes minutes or even longer then read on – you might have a problem that’s causing slow query performance.

One of the most common sources of query performance problems I see with my SSAS Multidimensional customers is unprocessed aggregations and indexes. If you run a Process Update on a dimension it may result in indexes and aggregations being dropped from partitions in your cubes; for more details on this, and why it happens, see this post:

https://blog.crossjoin.co.uk/2010/05/12/what-happens-when-you-do-a-process-update-on-a-dimension/

This classic post by Darren Gosbell explains how you can check if you have unprocessed aggregations on a partition:

http://geekswithblogs.net/darrengosbell/archive/2008/12/02/ssas-are-my-aggregations-processed.aspx

However, unprocessed indexes can also be a problem for query performance too. You can tell if the indexes on a partition are built by using the discover_partition_dimension_stat DMV. Here’s an example of how to use it for a partition in the Adventure Works database:

SELECT 
DIMENSION_NAME, ATTRIBUTE_NAME, ATTRIBUTE_INDEXED, 
ATTRIBUTE_COUNT_MIN, ATTRIBUTE_COUNT_MAX 
FROM SystemRestrictSchema($system.discover_partition_dimension_stat
        ,DATABASE_NAME = 'Adventure Works DW 2008'
        ,CUBE_NAME = 'Adventure Works'
        ,MEASURE_GROUP_NAME = 'Internet Sales'
        ,PARTITION_NAME = 'Internet_Sales_2003')

 

[For some background on running SSAS DMV queries, see here]

Here’s what the above query returns, a list of dimensions and attributes that are related to the partition:

image

If the ATTRIBUTE_INDEXED column shows false then indexes are not built for the attribute on the dimension. In this example no indexes are built at all on the partition; if I do a Process Index or Process Default on this partition, here’s what the DMV returns:

image

Now you can see the ATTRIBUTE_INDEXED property is set to true for most attributes. Note that there is an (All) attribute that is never indexed, and if you have set the AttributeHierarchyEnabled property to false or the AttributeHierarchyOptimizedState property to NotOptimized on an attribute, it will not have indexes built for it either (this is typically done to improve processing performance – see here for a few more details).

In a real-world cube it is likely that only a few indexes will be dropped on partitions as a result of a Process Update on a dimension, and even then this will depend on whether any changes take place in the dimension’s data, so you will have to look down the list of attributes returned by this DMV very carefully to see if ATTRIBUTE_INDEXED returns false when it should be returning true.

The solution to this problem, as several of the posts I’ve linked to above suggest, is to always run a Process Default on your cube as the last step in your processing schedule. A Process Default will process any object that is in an unprocessed state, so it will automatically rebuild any aggregations or indexes that are dropped as a result of Process Updates on dimensions.