Improving Power Query Calculation Performance With List.Buffer()

I saw an interesting post the other day on the Power Query Technet forum which showed how the List.Buffer() function could be used to improve calculation performance. This is something I’d seen hinted at in other places so I thought it was worth a bit of investigation.

Consider the following query:

let
//Connect to SQL Server
Source = Sql.Database("localhost", "adventure works dw"),
//Get first 2000 rows from FactInternetSales
dbo_FactInternetSales = Table.FirstN(
Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
2000),
//Remove unwanted columns
RemoveColumns = Table.SelectColumns(
dbo_FactInternetSales,
{"SalesOrderLineNumber", "SalesOrderNumber","SalesAmount"}),
//Get sorted list of values from SalesAmount column
RankValues = List.Sort(RemoveColumns[SalesAmount], Order.Descending),
//Calculate ranks
AddRankColumn = Table.AddColumn(RemoveColumns , "Rank",
each List.PositionOf(RankValues,[SalesAmount])+1)
in
AddRankColumn

It gets the first 2000 rows from the FactInternetSales table in the Adventure Works DW database, removes most of the columns, and adds a custom column that shows the rank of the current row based on its Sales Amount.

image

On my laptop it takes around 35 seconds to run this query – pretty slow, in my opinion, given the amount of data in this table.

However, using the List.Buffer() function in the RankValues step like so:

let
//Connect to SQL Server
Source = Sql.Database("localhost", "adventure works dw"),
//Get first 2000 rows from FactInternetSales
dbo_FactInternetSales = Table.FirstN(
Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
2000),
//Remove unwanted columns
RemoveColumns = Table.SelectColumns(
dbo_FactInternetSales,
{"SalesOrderLineNumber", "SalesOrderNumber","SalesAmount"}),
//Get sorted list of values from SalesAmount column
//And buffer them!
RankValues = List.Buffer(List.Sort(RemoveColumns[SalesAmount], Order.Descending)),
//Calculate ranks
AddRankColumn = Table.AddColumn(RemoveColumns , "Rank",
each List.PositionOf(RankValues,[SalesAmount])+1)
in
AddRankColumn

…makes the query run in just 2 seconds. List.Buffer() holds the sorted list of values used to calculate the rank in memory, which means the list is only evaluated once; in the original query it seems as though this step, and those before it, are being evaluated multiple times. Curt Hagenlocher’s comment (on this thread) on what List.Buffer() does for a similar calculation is telling:

The reason for this is that M is both functional and lazy, so unless we buffer the output of List.Select, we’re really just building a query that needs to be evaluated over and over. This is similar to the Enumerable functions in LINQ, if you’re familiar with those.

Table.Buffer() and Binary.Buffer() functions also exist, and do similar things.
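
To experiment with this pattern without a SQL Server instance, here is a minimal, self-contained sketch. The sample rows and the Sales and BufferedSales step names are made up for illustration, and on a table this small buffering will make no measurable difference; the point is only to show where List.Buffer() and Table.Buffer() sit in a query.

[sourcecode language="text"]
let
    // Made-up sample data standing in for FactInternetSales
    Sales = #table(
        type table [SalesOrderNumber = text, SalesAmount = number],
        {{"SO1", 100}, {"SO2", 250}, {"SO3", 100}, {"SO4", 75}}),
    // Hold the table in memory so later steps read the buffered copy
    BufferedSales = Table.Buffer(Sales),
    // Sort once and hold the sorted list in memory as well
    RankValues = List.Buffer(
        List.Sort(BufferedSales[SalesAmount], Order.Descending)),
    // Zero-based position in the sorted list, plus one, gives the rank
    AddRankColumn = Table.AddColumn(BufferedSales, "Rank",
        each List.PositionOf(RankValues, [SalesAmount]) + 1, Int64.Type)
in
    AddRankColumn
[/sourcecode]

With these values the sorted list is {250, 100, 100, 75}, so both rows with a SalesAmount of 100 get rank 2 and the row with 75 gets rank 4.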

A few other points to make:

  • This is not necessarily the optimal way to calculate ranks in Power Query – it’s just an example of how List.Buffer() can be used.
  • In the first query above, query folding is not taking place. If it had been, performance would probably have been better. Because List.Buffer() explicitly prevents query folding, in many cases using it could make performance worse rather than better.
  • I’m 100% certain you’ll get much better performance for a rank calculation by loading the table to the Excel Data Model/Power Pivot and writing the calculation in DAX. You should only really do calculations like this in Power Query if they are needed for other transformations in your query.
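
As one alternative to try, more recent versions of Power Query include a built-in Table.AddRankColumn() function, which was not available when this post was written. The following is a minimal sketch on made-up sample data; its performance has not been compared here with the List.Buffer() approach, so treat it as something to test rather than a recommendation.

[sourcecode language="text"]
let
    // Made-up sample data standing in for FactInternetSales
    Sales = #table(
        type table [SalesOrderNumber = text, SalesAmount = number],
        {{"SO1", 100}, {"SO2", 250}, {"SO3", 100}, {"SO4", 75}}),
    // Competition ranking: tied values share a rank and the next rank is skipped
    AddRankColumn = Table.AddRankColumn(
        Sales, "Rank",
        {"SalesAmount", Order.Descending},
        [RankKind = RankKind.Competition])
in
    AddRankColumn
[/sourcecode]

Competition ranking handles ties in the same way as the List.PositionOf() calculation: tied rows share the rank of the first position at which their value appears.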

Using Function.InvokeAfter() In Power Query

Last year I blogged about how to introduce a delay between web service requests in M code. Since then a new function has been added to Power Query which makes this a lot easier: Function.InvokeAfter(). This function doesn’t seem to be documented anywhere apart from the Power Query language reference (downloadable from here); the signature is as follows:

Function.InvokeAfter(function as function, delay as duration) as any

It invokes a function after waiting for a given amount of time. Here’s a simple example of how it can be used that declares a function which returns the current date/time as text, then calls it twice with a five second delay in between:

let
GetTimeAsText = ()=> DateTime.ToText(DateTime.LocalNow()),
Output = GetTimeAsText() & " "
& Function.InvokeAfter(GetTimeAsText, #duration(0,0,0,5))
in
Output

The output of this query (at the time of writing) is:

28/04/2015 23:06:38 28/04/2015 23:06:43

One thing that did confuse me a bit was that Function.InvokeAfter() doesn’t allow you to pass a list of arguments for the function you’re invoking, as Function.Invoke() does. The nice people at Microsoft helped me out with this though: the trick is to wrap the call in a function that takes no arguments. Here’s a slightly more complicated example showing how to use Function.InvokeAfter() with a function that prepends "Hello " to a person’s name:

let
SayHello = (personname as text) as text => "Hello " & personname,
Output = Function.InvokeAfter(()=>SayHello("Chris"), #duration(0,0,0,5))
in
Output
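
The same wrapping trick can be applied to the original problem of spacing out web service requests. The sketch below is only an illustration: the URLs are hypothetical, the two-second delay is arbitrary, and GetWithDelay is a name invented for the example.

[sourcecode language="text"]
let
    // Hypothetical URLs, used purely for illustration
    Urls = {
        "https://example.com/api/items?page=1",
        "https://example.com/api/items?page=2"},
    // Wait two seconds, then make the request.
    // Binary.Buffer reads the whole response into memory inside the delayed call.
    GetWithDelay = (url as text) as binary =>
        Function.InvokeAfter(
            () => Binary.Buffer(Web.Contents(url)),
            #duration(0, 0, 0, 2)),
    // One delayed request per URL
    Responses = List.Transform(Urls, GetWithDelay)
in
    Responses
[/sourcecode]

Because M is lazy, each delay only happens when the corresponding item in Responses is actually evaluated, so it is worth checking the timing of the requests with a tool such as Fiddler before relying on this.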

Analysing SSAS Extended Event Data With Power Query: Part 2, Storage Engine Activity

In part 1 of this series I showed how to use Power Query to extract Extended Event data generated by SSAS. Having done that, I now want to show the first (of many, I hope) examples of how this data can be used for performance tuning: analysing activity in the Storage Engine, the part of SSAS that reads data from disk and aggregates it up.

I won’t go into the technical details of how I’ve used Power Query to crunch this data; you can download the sample workbook here and see for yourself. There’s nothing particularly complex going on. In brief, what I’ve done is the following:

  • Called the function shown in part 1 to get the raw Extended Event data
  • Filtered that data so that only the Query End, Query Subcube Verbose and Progress Report End events are left
  • Calculated the start time of each event relative to the start time of the earliest recorded event, to make plotting these events on a waterfall chart possible
  • Built an Excel report, including various Power Pivot measures, some normal slicers to make it easy to filter the data, some disconnected slicers for filtering so you only see events that started within a given time range, and a PivotChart showing the waterfall chart (since Excel doesn’t support this type of chart natively, I’ve used this technique to reproduce a waterfall chart with a stacked bar chart)
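
The filtering and relative start time steps above can be sketched roughly as follows. This is not the code from the sample workbook: the rows are made up, the StartTime column name is an assumption, and the event names other than QueryEnd are guessed by analogy with the object_name values returned by the function in part 1.

[sourcecode language="text"]
let
    // Made-up events; in practice this would be the output of the
    // part 1 function with the relevant columns expanded
    Events = #table(
        type table [object_name = text, StartTime = datetime],
        {
            {"QueryBegin", #datetime(2015, 4, 1, 10, 0, 0)},
            {"QuerySubcubeVerbose", #datetime(2015, 4, 1, 10, 0, 1)},
            {"ProgressReportEnd", #datetime(2015, 4, 1, 10, 0, 2)},
            {"QueryEnd", #datetime(2015, 4, 1, 10, 0, 5)}
        }),
    // Assumed event names; check them against your own trace
    EventsToKeep = {"QueryEnd", "QuerySubcubeVerbose", "ProgressReportEnd"},
    Filtered = Table.SelectRows(Events,
        each List.Contains(EventsToKeep, [object_name])),
    // Earliest recorded event across the whole trace, not just the filtered rows
    EarliestStart = List.Min(Events[StartTime]),
    // Seconds elapsed since the earliest event, for plotting on the chart
    AddRelativeStart = Table.AddColumn(Filtered, "RelativeStartSeconds",
        each Duration.TotalSeconds([StartTime] - EarliestStart), type number)
in
    AddRelativeStart
[/sourcecode]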

Here’s an example screenshot of the result, showing Storage Engine activity for a single query:

image

Though it’s hard to see the details at this resolution, the yellow line is the Query End event associated with the query, the grey lines are the Query Subcube Verbose events associated with the query, and the brown lines are the Progress Report End events associated with each Query Subcube Verbose event.

What could this be used for? Here are some ideas:

  • Looking for times when there are a lot of queries running simultaneously – and which, as a result, may be performing poorly.
  • Looking for long-running Query Subcube Verbose and Progress Report End events which could be optimised by the creation of aggregations.
  • Visualising the amount of parallelism inside the Storage Engine, in particular the number of Progress Report End events that are running in parallel. This would be very interesting for queries using distinct count measures when you are testing different ways of partitioning your measure group.
  • Highlighting situations where calculations are being evaluated in cell-by-cell mode. When this happens you typically see a very large number of Query Subcube Verbose events being fired off within a query.

I’d like to stress once again that the object of this exercise is not to show off a ‘finished’ tool, but to show how Power Query, Power Pivot and Excel can be used for self-service analysis of this data. This workbook is just a starting point: if you wanted to use this on your own data it’s extremely likely you’d need to change the Power Query queries, the Power Pivot model and the report itself. Hopefully, though, this workbook will save you a lot of time if you do need to understand what’s going on in the Storage Engine when you run an MDX query.

Power Query Announcements At The PASS BA Conference

There were a couple of big (well, big if you’re a Power Query fan like me) announcements made today by Miguel Llopis at the PASS BA Conference:

  • Today Power Query is available only to people who have Excel Professional Plus or Excel standalone, but as of May a version of Power Query will be available on every Excel SKU. There will be some limitations around data sources that are supported if you don’t have Excel Professional Plus, but that’s ok – this change will make it much easier for people to learn about and use Power Query, and I’m really happy about that.
  • Other new features coming in the May update of Power Query include the ability to turn off prompts about native database queries (useful in this scenario, for example), OData v4.0 support, the ability to use alternative Windows credentials to run queries, and a couple of new transformations such as removing empty rows.
  • Excel 2016 – where Power Query is now native to Excel – will have support for creating Power Query queries using VBA and macro recording. I understand you won’t be able to edit individual steps in a query, but you’ll be able to create and delete queries programmatically and change where they load their data too.
  • Excel 2016 will also support undo/redo for Power Query and give you the ability to copy/paste queries (even from workbook to workbook).
  • There was a commitment that Power Query in Excel 2016 will keep getting updates on a regular basis, rather than get tied to the much slower Office release cycle, so it retains parity with the Power Query functionality in the Power BI Dashboard Designer.

All very cool stuff!

Building A Reporting Solution Using Power Query

The video of my SQLBits conference session “Building a reporting solution using Power Query” is now available to view (for free) on the SQLBits website:

http://sqlbits.com/Sessions/Event14/Building_A_Reporting_Solution_Using_Power_Query

It’s not your normal Power Query session about self-service ETL – instead it’s about using Power Query to create an SSRS-like reporting solution inside Excel. This is a topic I’ve been thinking about for a while, and while I have blogged about some of the tricks I show in the session (like this one about using slicers to pass parameters to Power Query) there’s a lot of new material in there too that should interest all you Power Query fans.

Of course there are literally hundreds of other great videos to watch for free at http://sqlbits.com/content/ including many others on Power BI, Power Pivot and Power Query. Alas my “Amazing Things You Can Do With Power BI” session video hasn’t been posted yet though…

[Don’t forget I’m running public Power BI and Power Query training courses in London next month! Full details at http://technitrain.com/courses.php]

Analysing SSAS Extended Event Data With Power Query: Part 1

The other day, while I was reading this post by Melissa Coates, I was reminded of the existence of extended events in SSAS. I say ‘reminded’ because although this is a subject I’ve blogged about before, I have never done anything serious with extended events because you can get the same data from Profiler much more easily, so I had pretty much forgotten about them. But… while Profiler is good, it’s a long way from perfect and there’s a lot of information that you can get from a trace that is still very hard to analyse. I started thinking: what if there was a tool we could use to analyse the data captured by extended events easily? [Lightbulb moment] Of course, Power Query!

I’m not going to go over how to use Extended Events in SSAS because the following blog posts do a great job already:
http://byobi.com/blog/2013/06/extended-events-for-analysis-services/
http://markvsql.com/2014/02/introduction-to-analysis-services-extended-events/
https://francescodechirico.wordpress.com/2012/08/03/identify-storage-engine-and-formula-engine-bottlenecks-with-new-ssas-xevents-5/

You may also want to check out these (old, but still relevant) articles on performance tuning SSAS taken from the book I co-wrote with Marco and Alberto, “Expert Cube Development”:

http://www.packtpub.com/article/query-performance-tuning-microsoft-analysis-services-part1
http://www.packtpub.com/article/query-performance-tuning-microsoft-analysis-services-part2

What I want to concentrate on in this series of posts is how to make sense of this data using Power BI in general and Power Query in particular. The first step is to be able to load data from the .xel file using Power Query, and that’s what this post will cover. In the future I want to explore how to get at and use specific pieces of text data, such as that given by the Query Subcube Verbose, Calculation Evaluation and Resource Usage events, and to show how this data can be used to solve difficult performance problems. I’m only going to talk about SSAS Multidimensional, but of course a lot of what I show will be applicable to (or easily adapted for) Tabular; I guess you could do something similar for SQL Server Extended Events too. I’m also going to focus on ad hoc analysis of this data rather than building a more generic performance monitoring solution; the latter is a perfectly valid thing to want to build, but why build one yourself when companies like SQL Sentry have great tools for this purpose that you can buy off the shelf?

Anyway, let’s get on. Here’s a Power Query function that can be used to get data from one or more .xel files generated by SSAS:

[sourcecode language="text" padlinenumbers="true"]
(servername as text,
initialcatalog as text,
filename as text)
as table =>
let
//Query the xel data
Source = Sql.Database(servername,
initialcatalog,
[Query="SELECT
object_name, event_data, file_name
FROM sys.fn_xe_file_target_read_file ( '"
& filename & "', null, null, null )"]),
//Treat the contents of the event_data column
//as XML
ParseXML = Table.TransformColumns(Source,
{{"event_data", Xml.Tables}}),
//Expand that column
Expandevent_data = Table.ExpandTableColumn(ParseXML,
"event_data",
{"Attribute:timestamp", "data"},
{"event_data.Attribute:timestamp",
"event_data.data"}),
//A function to transpose the data held in the
//event_data.data column
GetAttributeData = (AttributeTable as table) as table =>
let
RemoveTextColumn = Table.RemoveColumns(AttributeTable,
{"text"}),
SetTypes = Table.TransformColumnTypes(RemoveTextColumn ,
{{"value", type text}, {"Attribute:name", type text}}),
TransposeTable = Table.Transpose(SetTypes),
ReverseRows = Table.ReverseRows(TransposeTable),
PromoteHeaders = Table.PromoteHeaders(ReverseRows)
in
PromoteHeaders,
//Use the function above
ParseAttributeData = Table.TransformColumns(Expandevent_data,
{"event_data.data", GetAttributeData})
in
ParseAttributeData
[/sourcecode]

This function can be thought of as the starting point for everything else: it allows you to load the raw data necessary for any SSAS performance tuning work. Its output can then, in turn, be filtered and transformed to solve particular problems.

The function takes three parameters:

  • The name of a SQL Server relational database instance – this is because I’m using sys.fn_xe_file_target_read_file to actually read the data from the .xel file. I guess I could try to parse the binary data in the .xel file, but why make things difficult?
  • The name of a database on that SQL Server instance
  • The file name (including the full path) or pattern for the .xel files

The only other thing to mention here is that the event_data column contains XML data, which of course Power Query can handle quite nicely, but even then the data in the XML needs to be cleaned and transposed before you can get a useful table of data. The GetAttributeData function in the code above does this cleaning and transposing but, when invoked, the function still returns an unexpanded column called event_data.data as seen in the following screenshot:

image

There are two reasons why the function does not expand this column for you:

  1. You probably don’t want to see every column returned by every event
  2. Expanding all the columns in a nested table, when you don’t know what the names of these columns are, is not trivial (although this post shows how to do it)
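
For the second point, one common approach is to collect the column names from every nested table and pass the combined list to Table.ExpandTableColumn(). The following is a minimal sketch on made-up data rather than the code from the linked post; the nested column values are invented for illustration.

[sourcecode language="text"]
let
    // Made-up data: each row holds a nested table with different columns
    Source = #table(
        {"object_name", "event_data.data"},
        {
            {"QueryEnd",
                #table({"Duration", "TextData"}, {{"15", "query A"}})},
            {"ProgressReportEnd",
                #table({"Duration", "ObjectPath"}, {{"7", "path B"}})}
        }),
    // Collect the distinct column names found across all nested tables
    AllNames = List.Union(
        List.Transform(
            Table.Column(Source, "event_data.data"),
            Table.ColumnNames)),
    // Expand all of them, prefixing each new column name;
    // a column missing from a given nested table should come back as null
    Expanded = Table.ExpandTableColumn(
        Source, "event_data.data", AllNames,
        List.Transform(AllNames, each "event_data.data." & _))
in
    Expanded
[/sourcecode]

On real trace data this can produce a very wide table, which is the first reason given above for leaving the column unexpanded by default.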

Here’s an example of how the function can be used:

[sourcecode language="text"]
let
//Invoke the GetXelData function
Source = GetXelData(
"localhost",
"adventure works dW",
"C:\SSAS_Monitoring*.xel"),
//Only return Query End events
#"Filtered Rows" = Table.SelectRows(Source,
each ([object_name] = "QueryEnd")),
//Expand Duration and TextData columns
#"Expand event_data.data" = Table.ExpandTableColumn(
#"Filtered Rows", "event_data.data",
{"Duration", "TextData"},
{"event_data.data.Duration",
"event_data.data.TextData"}),
//Set some data types
#"Changed Type" = Table.TransformColumnTypes(
#"Expand event_data.data",
{{"event_data.Attribute:timestamp", type datetime},
{"event_data.data.Duration", Int64.Type}}),
//Sort by timestamp
#"Sorted Rows" = Table.Sort(#"Changed Type",
{{"event_data.Attribute:timestamp", Order.Ascending}}),
//Add an index column to identify each query
#"Added Index" = Table.AddIndexColumn(#"Sorted Rows", "Query Number", 1, 1),
//Remove unwanted columns
#"Removed Columns" = Table.RemoveColumns(#"Added Index",
{"object_name", "file_name"})
in
#"Removed Columns"
[/sourcecode]

All that’s happening here is that the function is being called in the first step, Source, and then I’m filtering by the Query End event, expanding some of the columns in event_data.data and setting column data types. You won’t need to copy all this code yourself though – you just need to invoke the function and then expand the event_data.data column to reveal whatever columns you are interested in. When you run a query that calls this function for the first time, you may need to give Power Query permission to connect to SQL Server and also to run a native database query.

Here’s an example PivotChart showing query durations built from this data after it has been loaded to the Excel Data Model:

image

Not very useful, for sure, but in the next post you’ll see a more practical use for this function.

You can download the sample workbook for this post here.

Documentation For New Excel 2016 DAX Functions

Microsoft has published documentation for the new DAX functions in the Excel 2016 preview here:

https://support.office.com/en-us/article/New-DAX-functions-for-Excel-2016-Preview-8192a787-aa91-4d7f-9a82-4e2c607e629a

There’s a lot of detail, including examples (although the ConcatenateX() page isn’t live at the time of writing – but I’ve blogged about that already), so it’s well worth reading through.

BI Survey 15

It’s BI Survey time again! BI Survey is the largest annual survey of BI users in the world, so if you want to share your feelings on Microsoft BI tools or whatever else you’re using then this is the opportunity to do it. As in the past, in return for promoting the survey I get access to the results when they appear later in the year, and they always make for interesting reading and a good blog post. This year I’m curious to find out what people are saying about Power BI…

Anyway, if you do want to take part (it should only take 20 minutes and you’ll also be entered in a draw for some Amazon vouchers) then here’s the link:

https://digiumenterprise.com/answer/?link=2419-3RFFUGEB

NaturalInnerJoin And NaturalLeftOuterJoin DAX Functions In Excel 2016

Continuing my series on new DAX functions in Excel 2016, here are two more: NaturalInnerJoin() and NaturalLeftOuterJoin(). Both do pretty much what you’d expect.

Consider the following two tables in an Excel worksheet, called ColourFruit and FruitPrice:

image

With these tables loaded into the Excel Data Model as linked tables, the next step is to create a relationship between the tables on the Fruit column:

image

Both functions take two tables from the Excel Data Model as parameters, and both only work when those two tables have an active relationship between them. Once the relationship is in place you can use these functions in a DAX query.

The queries

evaluate naturalinnerjoin(ColourFruit,FruitPrice)

and

evaluate naturalinnerjoin(FruitPrice,ColourFruit)

…both perform an inner join between the two tables on the Fruit column and both return the same table:

image

The query

evaluate naturalleftouterjoin(ColourFruit,FruitPrice)

…returns

image

The query

evaluate naturalleftouterjoin(FruitPrice,ColourFruit)

…returns:

image

For NaturalLeftOuterJoin() the table given in the first parameter is on the left-hand side of the left outer join, so all rows from it are returned, whereas the table in the second parameter is on the right-hand side of the join so only the matching rows are returned.
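
For readers more at home in Power Query, the same left outer join semantics can be sketched in M. This is only an analogy, not a description of how the DAX functions work internally: the rows below are made up (the real tables are only shown in the screenshots), and the Colour and Price column names are assumed from the table names.

[sourcecode language="text"]
let
    // Made-up rows standing in for the two worksheet tables
    ColourFruit = #table(
        type table [Colour = text, Fruit = text],
        {{"Red", "Apple"}, {"Yellow", "Banana"}, {"Green", "Kiwi"}}),
    FruitPrice = #table(
        type table [Fruit = text, Price = number],
        {{"Apple", 10}, {"Banana", 20}, {"Orange", 30}}),
    // Left outer join: every ColourFruit row is kept, and Price is
    // null where FruitPrice has no matching Fruit
    Joined = Table.NestedJoin(
        ColourFruit, {"Fruit"}, FruitPrice, {"Fruit"},
        "FruitPrice", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Joined, "FruitPrice", {"Price"})
in
    Expanded
[/sourcecode]

Swapping the two tables keeps every FruitPrice row instead, and changing JoinKind.LeftOuter to JoinKind.Inner keeps only the rows with a match on both sides.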

Benford’s Law And Power Query

Probably my favourite session at SQLBits the other week was Professor Mark Whitehorn on exploiting exotic patterns in data. One of the things he talked about was Benford’s Law, something I first heard about several years ago (in fact I’m sure I wrote a blog post on implementing Benford’s Law in MDX but I can’t find it), about the frequency distribution of digits in data. I won’t try to explain it myself but there are plenty of places you can read up on it, for example: http://en.wikipedia.org/wiki/Benford%27s_law . I promise, it’s a lot more interesting than it sounds!

Anyway, it struck me that it would be quite useful to have a Power Query function that could be used to find the distribution of the first digits in any list of numbers, for example for fraud detection purposes. The first thing I did was write a simple query that returned the expected distributions for the digits 1 to 9 according to Benford’s Law:

[sourcecode language="text" padlinenumbers="true"]
let
//function to find the expected distribution of any given digit
Benford = (digit as number) as number => Number.Log10(1 + (1/digit)),
//get a list of values between 1 and 9
Digits = {1..9},
// get a list containing these digits and their expected distribution
DigitsAndDist = List.Transform(Digits, each {_, Benford(_)}),
//turn that into a table
Output = #table({"Digit", "Distribution"}, DigitsAndDist)
in
Output
[/sourcecode]

image
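
For reference, the formula that the Benford function in the query above implements, giving the expected proportion of values whose first digit is d, is:

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d \in \{1, \dots, 9\}$$

So P(1) = log10(2) ≈ 0.301 and P(9) = log10(10/9) ≈ 0.046. The nine values sum to exactly 1, because the product of (d + 1)/d for d from 1 to 9 is 10 and log10(10) = 1.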

Next I wrote the function itself:

[sourcecode language="text"]
//take a single list of numbers as a parameter
(NumbersToCheck as list) as table=>
let
//remove any non-numeric values
RemoveNonNumeric = List.Select(NumbersToCheck,
each Value.Is(_, type number)),
//remove any values that are less than or equal to 0
GreaterThanZero = List.Select(RemoveNonNumeric, each _>0),
//turn that list into a table
ToTable = Table.FromList(GreaterThanZero,
Splitter.SplitByNothing(), null, null,
ExtraValues.Error),
RenameColumn = Table.RenameColumns(ToTable,{{"Column1", "Number"}}),
//function to get the first digit of a number
FirstDigit = (InputNumber as number) as
number =>
Number.FromText(Text.Start(Number.ToText(InputNumber),1))-1,
//get the distributions of each digit
GetDistributions = Table.Partition(RenameColumn,
"Number", 9, FirstDigit),
//turn that into a table
DistributionTable = Table.FromList(GetDistributions,
Splitter.SplitByNothing(), null, null, ExtraValues.Error),
//add column giving the digit
AddIndex = Table.AddIndexColumn(DistributionTable, "Digit", 1, 1),
//show how many times each first digit occurred
CountOfDigits = Table.AddColumn(AddIndex,
"Count", each Table.RowCount([Column1])),
RemoveColumn = Table.RemoveColumns(CountOfDigits ,{"Column1"}),
//merge with table showing expected distributions
Merge = Table.NestedJoin(RemoveColumn,{"Digit"},
Benford,{"Digit"},"NewColumn",JoinKind.Inner),
ExpandNewColumn = Table.ExpandTableColumn(Merge, "NewColumn",
{"Distribution"}, {"Distribution"}),
RenamedDistColumn = Table.RenameColumns(ExpandNewColumn,
{{"Distribution", "Expected Distribution"}}),
//calculate actual % distribution of first digits
SumOfCounts = List.Sum(Table.Column(RenamedDistColumn, "Count")),
AddActualDistribution = Table.AddColumn(RenamedDistColumn,
"Actual Distribution", each [Count]/SumOfCounts)
in
AddActualDistribution
[/sourcecode]

There’s not much to say about this code, apart from the fact that it’s a nice practical use case for the Table.Partition() function I blogged about here. It also references the first query shown above, called Benford, so that the expected and actual distributions can be compared.
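
One caveat on the FirstDigit function: it takes the first character of the number’s default text form, so a value between 0 and 1 such as 0.05 yields 0 rather than its first significant digit. If your data contains values like that, a variant along the following lines might help. It is only a sketch, and it assumes that Number.ToText() accepts the "E15" exponential format string, which is worth confirming in your version of Power Query.

[sourcecode language="text"]
let
    // First significant digit of a positive number, read from its
    // scientific-notation text, where 0.05 is rendered with a leading "5"
    FirstSignificantDigit = (InputNumber as number) as number =>
        Number.FromText(
            Text.Start(Number.ToText(InputNumber, "E15"), 1)),
    // A few sample values to try the function on
    Examples = List.Transform({0.05, 7, 42, 1234.5}, FirstSignificantDigit)
in
    Examples
[/sourcecode]

To use this with Table.Partition() in the function above you would still need to subtract 1 from the result, as the original FirstDigit does, so that the digits 1 to 9 map to partitions 0 to 8.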

Since this is a function that takes a list as a parameter, it’s very easy to pass it any column from any other Power Query query that’s in the same workbook (as I showed here) for analysis. For example, I created a Power Query query on this dataset in the Azure Marketplace showing the number of minutes that each flight in the US was delayed in January 2012. I then invoked the function above, and pointed it at the column containing the delay values like so:

image

The output is a table (to which I added a column chart) which shows that this data follows the expected distribution very closely:

image

You can download my sample workbook containing all the code from here.