Benford’s Law And Power Query

Probably my favourite session at SQLBits the other week was Professor Mark Whitehorn’s on exploiting exotic patterns in data. One of the things he talked about was Benford’s Law, something I first heard about several years ago (in fact I’m sure I wrote a blog post on implementing Benford’s Law in MDX, but I can’t find it): the law describes the expected frequency distribution of first digits in many real-world datasets. I won’t try to explain it in detail myself because there are plenty of places you can read up on it, for example: http://en.wikipedia.org/wiki/Benford%27s_law . I promise, it’s a lot more interesting than it sounds!

Anyway, it struck me that it would be quite useful to have a Power Query function that could be used to find the distribution of the first digits in any list of numbers, for example for fraud detection purposes. The first thing I did was write a simple query that returned the expected distributions for the digits 1 to 9 according to Benford’s Law:

[sourcecode language=”text” padlinenumbers=”true”]
let
//function to find the expected distribution of any given digit
Benford = (digit as number) as number => Number.Log10(1 + (1/digit)),
//get a list of values between 1 and 9
Digits = {1..9},
// get a list containing these digits and their expected distribution
DigitsAndDist = List.Transform(Digits, each {_, Benford(_)}),
//turn that into a table
Output = #table({"Digit", "Distribution"}, DigitsAndDist)
in
Output
[/sourcecode]


image
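As you’d expect from the formula, the digit 1 has the highest expected frequency: Number.Log10(1 + 1/1) is Number.Log10(2), which is roughly 0.301, so in a dataset that follows Benford’s Law around 30% of the numbers should have 1 as their first digit, while only around 4.6% (Number.Log10(10/9)) should start with a 9.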

Next I wrote the function itself:

[sourcecode language=”text”]
//take a single list of numbers as a parameter
(NumbersToCheck as list) as table =>
let
//remove any non-numeric values
RemoveNonNumeric = List.Select(NumbersToCheck,
each Value.Is(_, type number)),
//remove any values that are less than or equal to 0
GreaterThanZero = List.Select(RemoveNonNumeric, each _>0),
//turn that list into a table
ToTable = Table.FromList(GreaterThanZero,
Splitter.SplitByNothing(), null, null,
ExtraValues.Error),
RenameColumn = Table.RenameColumns(ToTable,{{"Column1", "Number"}}),
//function to get the first digit of a number
//(1 is subtracted because Table.Partition() expects a 0-based
//group index: digit 1 maps to group 0, digit 2 to group 1, and so on;
//note this assumes values of 1 or more, since for a value between
//0 and 1 the first text character would be "0")
FirstDigit = (InputNumber as number) as number =>
Number.FromText(Text.Start(Number.ToText(InputNumber),1))-1,
//get the distributions of each digit
GetDistributions = Table.Partition(RenameColumn,
"Number", 9, FirstDigit),
//turn that into a table
DistributionTable = Table.FromList(GetDistributions,
Splitter.SplitByNothing(), null, null, ExtraValues.Error),
//add column giving the digit
AddIndex = Table.AddIndexColumn(DistributionTable, "Digit", 1, 1),
//show how many times each first digit occurred
CountOfDigits = Table.AddColumn(AddIndex,
"Count", each Table.RowCount([Column1])),
RemoveColumn = Table.RemoveColumns(CountOfDigits,{"Column1"}),
//merge with table showing expected distributions
Merge = Table.NestedJoin(RemoveColumn,{"Digit"},
Benford,{"Digit"},"NewColumn",JoinKind.Inner),
ExpandNewColumn = Table.ExpandTableColumn(Merge, "NewColumn",
{"Distribution"}, {"Distribution"}),
RenamedDistColumn = Table.RenameColumns(ExpandNewColumn,
{{"Distribution", "Expected Distribution"}}),
//calculate actual % distribution of first digits
SumOfCounts = List.Sum(Table.Column(RenamedDistColumn, "Count")),
AddActualDistribution = Table.AddColumn(RenamedDistColumn,
"Actual Distribution", each [Count]/SumOfCounts)
in
AddActualDistribution
[/sourcecode]

There’s not much to say about this code, apart from the fact that it’s a nice practical use case for the Table.Partition() function I blogged about here. It also references the first query shown above, called Benford, so that the expected and actual distributions can be compared.

Since this is a function that takes a list as a parameter, it’s very easy to pass it any column from any other Power Query query in the same workbook (as I showed here) for analysis. For example, I created a Power Query query on this dataset in the Azure Marketplace showing the number of minutes that each flight in the US was delayed in January 2012. I then invoked the function above and pointed it at the column containing the delay values like so:

image
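Incidentally, if you prefer to do this in code rather than through the menus, the invocation would look something like the following. Note that the query name DelayData, the column name ArrDelayMinutes and the function name BenfordDistribution are all just placeholder names for this sketch:

[sourcecode language=”text”]
let
//get the column of delay values from the source query as a list
//(DelayData and ArrDelayMinutes are placeholder names)
DelayValues = Table.Column(DelayData, "ArrDelayMinutes"),
//pass that list to the Benford distribution function
Output = BenfordDistribution(DelayValues)
in
Output
[/sourcecode]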

The output is a table (to which I added a column chart) which shows that this data follows the expected distribution very closely:

image

You can download my sample workbook containing all the code from here.

ConcatenateX() DAX Function In Excel 2016

This is the first of many posts on the new DAX functions that have appeared in Excel 2016 (for a full list see this post). Today: the ConcatenateX() function.

The mdschema_functions schema rowset gives the following description of this function:

Evaluates expression for each row on the table, then return the concatenation of those values in a single string result, separated by the specified delimiter

Its signature is:

CONCATENATEX(Table, Expression, [Delimiter])

It’s easier to understand what it does using a simple example though. Consider the following table on a worksheet in Excel 2016:

image

When you add this table to the Excel Data Model (I called the table Sales) you can add the following measure:

[sourcecode language=”text” padlinenumbers=”true”]
Purchasing Customers:=
CONCATENATEX(
VALUES(Sales[Customer]),
Sales[Customer],
","
)
[/sourcecode]

If you then use this measure in a PivotTable, you see the following:

image

As you can see, the measure returns a comma-delimited list of all of the customers who have bought each product. Very useful…

What’s New In The Excel 2016 Preview For BI?

Following on from my recent post on Power BI and Excel 2016 news, here are some more details about the new BI-related features in the Excel 2016 Preview. Remember that more BI-related features may appear before the release of Excel 2016, and that with Office 365 click-to-run significant new features can appear in between releases. This is therefore not a definitive list of what Excel 2016 will be able to do at RTM, but a snapshot of the functionality available as of March 2015, as outlined in this document and as found in my own investigations. When I find out more, or when new functionality appears, I’ll either update this post or write a new one.

Power Query

Yesterday, in the original version of my post, I mistakenly said that Power Query was a native add-in in Excel 2016. That’s not true: it’s not an add-in at all, it’s native Excel functionality. Indeed you can see that there is no separate Power Query tab any more; instead there is a Power Query section on the Data tab:

DataTab

Obviously I’m a massive fan of Power Query so I’m biased, but I think this is a great move because it makes all the great Power Query functionality a lot easier to discover. There’s nothing to enable – it’s there by default – although I am a bit worried that users will be confused by having the older Data tab features next to their Power Query equivalents.

There are no new features for Power Query here compared to the latest version for Excel 2013, but that’s what I expected.

Excel Forecasting Functions

I don’t pretend to know anything about forecasting, but I had a brief play with the new Forecast.ETS function and got some reasonable results out of it as seen in the screenshot below:

image
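For reference, the basic call takes a target date, a range of historic values and a matching timeline. The cell ranges below are just the ones from my test worksheet, so treat this as an illustrative sketch rather than anything definitive:

=FORECAST.ETS(A25, $B$2:$B$24, $A$2:$A$24)

There are also optional arguments covering seasonality, how to treat missing data points, and how to aggregate values that share the same timestamp.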

Slicer Multiselect

There’s a new hammer icon on a slicer, which, when you click it, changes the way selection works. The default behaviour is the same as Excel 2013: every time you click on an item, that item is selected and any previous selection is lost (unless you were holding control or shift to multiselect). However with the hammer icon selected each new click adds the item to the previously selected items. This is meant to make slicers easier to use with a touch-screen.

Slicer

Time Grouping in PivotTables

Quite a neat feature this, I think. If you have a table in the Excel Data Model that has a column of type date in it, you can add extra calculated columns to that table from within a PivotTable to group by things like Year and Month. For example, here’s a PivotTable I built on a table that contains just dates:

Group1

Right-clicking on the field containing the dates and clicking Group brings up the following dialog:

Group2

Choosing Years, Quarters and Months creates three extra fields in the PivotTable:

Group3

And these fields are implemented as calculated columns in the original table in the Excel Data Model, with DAX definitions as seen here:

Group4
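To give you a flavour of what these generated definitions look like, here’s a rough paraphrase – to be clear, this is my own approximation and not necessarily the exact DAX that Excel generates:

[sourcecode language=”text”]
//illustrative approximations only, not the exact DAX Excel generates
Year = YEAR([Date])
Quarter = "Qtr " & ROUNDUP(MONTH([Date])/3, 0)
Month = FORMAT([Date], "MMM")
[/sourcecode]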

Power View on SSAS Multidimensional

At-bloody-last. I haven’t installed SSAS on the VM I’m using for testing Excel 2016, but I assume it just works. Nothing new in Power View yet, by the way.

Power Map data cards

Not sure why this is listed as new in Excel 2016 when it seems to be the same feature that appeared in Excel 2013 Power Map recently:

https://support.office.com/en-za/article/Customize-a-data-card-in-Power-Map-797ab684-82e0-4705-a97f-407e4a576c6e

Power Pivot

There isn’t any obvious new functionality in the Power Pivot window, but it’s clear that the UI in general and the DAX formula editor experience in particular has been improved.

image

Suggested Relationships

When you use fields from two Excel Data Model tables that have no relationship between them in a PivotTable, you get a prompt to either create new relationships yourself or let Excel detect the relationships:

image

Renaming Tables and Fields in the Power Pivot window

In Excel 2013, when you renamed tables or fields in the Excel Data Model, any PivotTables that used those objects would lose them. Now, in Excel 2016, the PivotTable retains the reference to the table or field and just displays the new name. What’s even better is that when you create a measure or a calculated column that refers to a table or column, the DAX definition of the measure or calculated column gets updated after a rename too.

DAX

There are lots of new DAX functions in this build. With the help of the mdschema_functions schema rowset and Power Query I was able to compare the list of DAX functions available in 2016 with those in 2013 and create the following list of new DAX functions and descriptions:

[sourcecode language=”text” wraplines=”true” gutter=”false”]
FUNCTION NAME: DESCRIPTION

DATEDIFF: Returns the number of units (unit specified in Interval) between the two input dates
CONCATENATEX: Evaluates expression for each row on the table, then returns the concatenation of those values in a single string result, separated by the specified delimiter
KEYWORDMATCH: Returns TRUE if there is a match between the MatchExpression and Text
ADDMISSINGITEMS: Adds the rows with empty measure values back
CALENDAR: Returns a table with one column of all dates between StartDate and EndDate
CALENDARAUTO: Returns a table with one column of dates calculated from the model automatically
CROSSFILTER: Specifies the cross-filtering direction to be used in the evaluation of a DAX expression; the relationship is defined by naming, as arguments, the two columns that serve as endpoints
CURRENTGROUP: Gives access to the (sub)table representing the current group in the GROUPBY function; can only be used inside GROUPBY
GROUPBY: Creates a summary of the input table grouped by the specified columns
IGNORE: Tags a measure expression specified in the call to the SUMMARIZECOLUMNS function to be ignored when determining the non-blank rows
ISONORAFTER: A boolean function that emulates the behaviour of a Start At clause and returns true for a row that meets all the conditions mentioned as parameters
NATURALINNERJOIN: Joins the left table with the right table using inner join semantics
NATURALLEFTOUTERJOIN: Joins the left table with the right table using left outer join semantics
ROLLUPADDISSUBTOTAL: Identifies a subset of columns specified in the call to the SUMMARIZECOLUMNS function that should be used to calculate groups of subtotals
ROLLUPISSUBTOTAL: Pairs up the rollup groups with the column added by ROLLUPADDISSUBTOTAL
SELECTCOLUMNS: Returns a table with selected columns from the table and new columns specified by the DAX expressions
SUBSTITUTEWITHINDEX: Returns a table representing the semijoin of the two tables supplied, in which the common set of columns is replaced by a 0-based index column; the index is based on the rows of the second table sorted by the specified order expressions
SUMMARIZECOLUMNS: Creates a summary table for the requested totals over a set of groups
GEOMEAN: Returns the geometric mean of a given column reference
GEOMEANX: Returns the geometric mean of an expression evaluated for each row in a table
MEDIANX: Returns the 50th percentile of an expression evaluated for each row in a table
PERCENTILE.EXC: Returns the k-th (exclusive) percentile of values in a column
PERCENTILE.INC: Returns the k-th (inclusive) percentile of values in a column
PERCENTILEX.EXC: Returns the k-th (exclusive) percentile of an expression evaluated for each row in a table
PERCENTILEX.INC: Returns the k-th (inclusive) percentile of an expression evaluated for each row in a table
PRODUCT: Returns the product of a given column reference
PRODUCTX: Returns the product of an expression evaluated for each row in a table
XIRR: Returns the internal rate of return for a schedule of cash flows that is not necessarily periodic
XNPV: Returns the net present value for a schedule of cash flows
[/sourcecode]

Plenty of material for future blog posts there, I think – there are lots of functions here that will be very useful. I bet Marco and Alberto are excited…

VBA

We now have support for working with Power Query in VBA.
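I haven’t dug into the object model properly yet, but there is now a Queries collection on the Workbook object. A minimal, untested sketch of creating a new query from VBA might look like this:

[sourcecode language=”text”]
'a minimal, untested sketch: creates a new Power Query query from VBA
'using the new Workbook.Queries collection in Excel 2016
Sub CreateQuery()
    ThisWorkbook.Queries.Add _
        Name:="MyTestQuery", _
        Formula:="let Source = {1..10} in Source"
End Sub
[/sourcecode]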

Power BI And Excel 2016 BI News

There have been quite a few Power BI and Office BI-related announcements over the last few weeks, and while I’ve tweeted about them (I’m @Technitrain if you’re not following me already) I thought it would be a good idea to summarise them all in one post.

Power BI Announcements at Convergence and SQLBits

You’ve probably already seen the announcement today on the Power BI blog that Power BI is FINALLY available to those of us outside the USA:

http://blogs.msdn.com/b/powerbi/archive/2015/03/16/power-bi-preview-now-available-worldwide.aspx

At last! I’m sure MS had very good reasons why they couldn’t make the Power BI Preview available worldwide back in December, but this decision caused a lot of frustration in the MS BI community and I hope it’s not something that happens again. I can also confirm that the Power BI iPhone app is now available in the UK as well. The new data sources for Power BI that are coming soon – especially Google Analytics – will be very popular I think.

While I’m on the topic of Power BI, a few interesting nuggets about upcoming functionality emerged at SQLBits last week. Kasper mentioned that there will be some new DAX functions appearing in Power BI soon: Median, Percentile, DateDiff and XNPV. Presumably they will appear when we get the ability to create DAX measures and calculated columns in the Power BI Dashboard Designer. Also, following on from the bidirectional relationships functionality I blogged about earlier this year, there was the news that Power BI will also understand 1:1 relationships as well as 1:many, many:1 and many:many.

Office 2016 Preview BI Features

The Office 2016 preview went public today too:

http://blogs.office.com/2015/03/16/announcing-the-office-2016-it-pro-and-developer-preview/

There’s a great overview of what’s new for BI in Office 2016 here:

https://support.office.com/en-gb/article/Whats-new-in-Office-2016-Preview-4841f061-d019-45cc-af74-3e89c8cff1c4#data

The main points are:

  • Power Query is now a native feature of Excel 2016
  • Power View works on SSAS Multidimensional (this is only going to work on versions of SSAS Multidimensional that support DAX queries, i.e. SSAS 2014 or SSAS 2012 SP2)
  • New Excel forecasting functions
  • Time grouping functionality in PivotTables

I’ll be writing a more detailed blog on all of this at some point soon, once I know what’s officially public and what isn’t.

The Power Query announcement is interesting because, as things stand at the moment, we’ll be able to use full Power Query, Power Pivot and Power View functionality for free in the Power BI Dashboard Designer, but in Excel the same functionality is restricted to users of the Professional Plus SKUs. This is crazy, and I hope Microsoft makes the Power add-ins available for every SKU of Excel 2016. Have you signed the petition for this yet?

Power Map

Last week the Power Map team released a new video showcasing functionality from an upcoming release:

https://www.youtube.com/watch?v=aP-vZfC3Fd4&feature=youtu.be

Although there are no details about what is shown in the video, it certainly looks like the ability to use custom shapes (the main missing feature in Power Map up to now) will be coming soon.

PowerMap

Wow, psychedelic…

Surface Hub

Finally, BI is clearly one of the main use-cases of the new Surface Hub (see also this video):

SufaceHubPowerBI_small

I wonder if I can justify buying one for demo purposes?

SSAS Multidimensional Cube Design Video Training

I’ve been teaching my SSAS Cube Design training course for several years now (there are still a few places free for the London course next month if you’re interested) and I have now recorded a video training version of it for Project Botticelli.

The main page for the course is here:

https://projectbotticelli.com/cubes?pk_campaign=tt2015cwb

There’s also a free, short video on using the SSAS Deployment Wizard that you can see here:

https://projectbotticelli.com/knowledge/using-deployment-wizard-ssas-cube-design-video-tutorial?pk_campaign=tt2015cwb

clip_image001

If you register before the end of March using the code TECHNITRAIN2015MARCH you’ll get a 15% discount.

Using Excel Slicers To Pass Parameters To Power Query Queries

Power Query is great for filtering data before it gets loaded into Excel, and when you do that you often need to provide a friendly way for end users to choose exactly what data gets loaded. I showed a number of different techniques for doing this last week at SQLBits, but here’s my favourite: using Excel slicers.

Using the Adventure Works DW database in SQL Server as an example, imagine you wanted to load only the rows for a particular date or set of dates from the FactInternetSales table. The first step is to create a query that gets all of the data from the DimDate table (the date dimension you want to use for the filtering). Here’s the code for that query – there’s nothing interesting happening here, all I’m doing is removing unnecessary columns and renaming those that are left:

[sourcecode language=”text” padlinenumbers=”true”]
let
Source = Sql.Database("localhost", "adventure works dw"),
dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,
{"DateKey", "FullDateAlternateKey", "EnglishDayNameOfWeek",
"EnglishMonthName", "CalendarYear"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Other Columns",{
{"FullDateAlternateKey", "Date"}, {"EnglishDayNameOfWeek", "Day"},
{"EnglishMonthName", "Month"}, {"CalendarYear", "Year"}})
in
#"Renamed Columns"
[/sourcecode]


Here’s what the output looks like:

image

Call this query Date and then load it to a table on a worksheet. Once you’ve done that you can create Excel slicers on that table (slicers can be created on tables as well as PivotTables in Excel 2013 but not in Excel 2010) by clicking inside it and then clicking the Slicer button on the Insert tab of the Excel ribbon:

image

Creating three slicers on the Day, Month and Year columns allows you to filter the table like so:

image

The idea here is to use the filtered rows from this table as parameters to control what is loaded from the FactInternetSales table. However, if you try to use Power Query to load data from an Excel table that has any kind of filter applied to it, you’ll find that you get all of the rows from that table. Luckily there is a way to determine whether a row in a table is visible or not and I found it in this article written by Excel MVP Charley Kyd:

http://www.exceluser.com/formulas/visible-column-in-excel-tables.htm

You have to create a new calculated column on the table in the worksheet with the following formula:

=(AGGREGATE(3,5,[@DateKey])>0)+0

image

This calculated column returns 1 on a row when it is visible and 0 when it is hidden by a filter: in the AGGREGATE function, the first argument 3 specifies a COUNTA aggregation and the second argument 5 tells it to ignore hidden rows, so the expression only counts the current row’s DateKey when that row is visible. You can then load the table back into Power Query and, in your new query, filter it so that it only returns the rows where the Visible column contains 1 – that’s to say, the rows that are visible in Excel. Here’s the code for this second query, called SelectedDates:

[sourcecode language=”text”]
let
Source = Excel.CurrentWorkbook(){[Name="Date"]}[Content],
#"Filtered Rows" = Table.SelectRows(Source, each ([Visible] = 1)),
#"Removed Columns" = Table.RemoveColumns(#"Filtered Rows",{"Visible"})
in
#"Removed Columns"
[/sourcecode]


image

This query should not be loaded to the Excel Data Model or to the worksheet.

Next, you must use this table to filter the data from the FactInternetSales table. Here’s the code for a query that does that:

[sourcecode language=”text”]
let
Source = Sql.Database("localhost", "adventure works dw"),
dbo_FactInternetSales = Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
#"Removed Other Columns" = Table.SelectColumns(dbo_FactInternetSales,
{"ProductKey", "OrderDateKey", "CustomerKey", "SalesOrderNumber",
"SalesOrderLineNumber", "SalesAmount", "TaxAmt"}),
Merge = Table.NestedJoin(#"Removed Other Columns",{"OrderDateKey"},
SelectedDates,{"DateKey"},"NewColumn",JoinKind.Inner),
#"Removed Columns" = Table.RemoveColumns(Merge,
{"ProductKey", "OrderDateKey", "CustomerKey"}),
#"Expand NewColumn" = Table.ExpandTableColumn(#"Removed Columns",
"NewColumn", {"Date"}, {"Date"}),
#"Reordered Columns" = Table.ReorderColumns(#"Expand NewColumn",
{"Date", "SalesOrderNumber", "SalesOrderLineNumber",
"SalesAmount", "TaxAmt"}),
#"Renamed Columns" = Table.RenameColumns(#"Reordered Columns",{
{"SalesOrderNumber", "Sales Order Number"},
{"SalesOrderLineNumber", "Sales Order Line Number"},
{"SalesAmount", "Sales Amount"},
{"TaxAmt", "Tax Amount"}}),
#"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",
{{"Date", type date}})
in
#"Changed Type"
[/sourcecode]


Again, most of what this query does is fairly straightforward: removing and renaming columns. The important step where the filtering takes place is called Merge, and here the data from FactInternetSales is joined to the table returned by the SelectedDates query using an inline merge (see here for more details on how to do this):

image

The output of this query is a table containing rows filtered by the dates selected by the user in the slicers, which can then be loaded to a worksheet:

image

The last thing to do is to cut the slicers from the worksheet containing the Date table and paste them onto the worksheet containing the Internet Sales table:

image

You now have a query that displays rows from the FactInternetSales table filtered according to the selection made in the slicers. It would be nice if Power Query supported using slicers as a data source directly, without this workaround; you can vote for that feature to be implemented here.

You can download the sample workbook for this post here.

Submit Your Feedback On BI Features In SQL Server V.Next

Following on from last month’s post on ideas for new features in SSAS Multidimensional, if you are interested in telling Microsoft what features you think should be added to the on-prem SQL Server BI tools in the next version you can do so here:

http://support.powerbi.com/forums/282523-bi-in-sql-vnext/filters/top

Unsurprisingly, there are plenty of pleas for SSRS to get some love. My suggestion is to integrate Power Query with SSRS: it would add a lot of new data sources that SSRS desperately needs; it would add data transformation and calculation capabilities; and it would also provide the beginnings of a common developer experience for corporate and self-service BI tools – Power Query integrated with Report Builder would be a useful companion to the Power BI Dashboard Designer.

Handling Added Or Missing Columns In Power Query

A recent conversation in the comments of this blog post brought up the subject of how to handle columns that have either been removed from or added to a data source in Power Query. Anyone who has worked with csv files knows that they have a nasty habit of changing format even when they aren’t supposed to, and added or removed columns can cause all kinds of problems downstream.

Ken Puls (whose excellent blog you are probably already reading if you’re interested in Power Query) pointed out that it’s very easy to protect yourself against new columns in your data source. When creating a query, select all the columns that you want, then right-click and select Remove Other Columns:

image

This means that if any new columns are added to your data source in the future, they won’t appear in the output of your query. In the M code the Table.SelectColumns() function is used to do this.
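As an aside, Table.SelectColumns() also has an optional third parameter that controls what happens when one of the selected columns isn’t there. Here’s a minimal sketch (the table and column names are made up):

[sourcecode language=”text”]
//a minimal sketch with made-up table and column names:
//MissingField.UseNull returns any missing selected column filled
//with nulls instead of raising an error
//(MissingField.Ignore would skip the missing column entirely)
Table.SelectColumns(
Source,
{"Month", "Product", "Sales"},
MissingField.UseNull)
[/sourcecode]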

Dealing with missing columns is a little bit more complicated. In order to find out whether a column is missing, you first need a list of the columns that should be present in your query. You can of course store these column names in a table in Excel and enter them manually, or you can generate the list in M fairly easily by creating a query that connects to your data source and uses the Table.ColumnNames() function, something like this:

[sourcecode language=”text” padlinenumbers=”true”]
let
//Connect to CSV file
Source = Csv.Document(
File.Contents(
"C:\Users\Chris\Documents\Power Query demos\SampleData.csv"
),null,",",null,1252),
//Use first row as headers
FirstRowAsHeader = Table.PromoteHeaders(Source),
//Get a list of column names
GetColumns = Table.ColumnNames(FirstRowAsHeader),
//Turn this list into a table
MakeATable = Table.FromList(
GetColumns,
Splitter.SplitByNothing(),
null,
null,
ExtraValues.Error),
//Rename this table’s sole column
RenamedColumns = Table.RenameColumns(
MakeATable ,
{{"Column1", "ColumnName"}})
in
RenamedColumns
[/sourcecode]

Given a csv file that looks like this:

image

…the query above returns the following table of column names:

image

You can then store the output of this query in an Excel table for future reference – just remember not to refresh the query!

Having done that, you can then look at the columns returned by your data source and compare them with the columns you are expecting by using the techniques shown in this post. For example, here’s a query that reads a list of column names from an Excel table and compares them with the columns returned from a csv file:

[sourcecode language=”text”]
let
//Connect to Excel table containing expected column names
ExcelSource = Excel.CurrentWorkbook(){[Name="GetColumnNames"]}[Content],
//Get list of expected columns
ExpectedColumns = Table.Column(ExcelSource, "ColumnName"),
//Connect to CSV file
CSVSource = Csv.Document(
File.Contents(
"C:\Users\Chris\Documents\Power Query demos\SampleData.csv"
),null,",",null,1252),
//Use first row as headers
FirstRowAsHeader = Table.PromoteHeaders(CSVSource),
//Get a list of column names in csv
CSVColumns = Table.ColumnNames(FirstRowAsHeader),
//Find missing columns
MissingColumns = List.Difference(ExpectedColumns, CSVColumns),
//Find added columns
AddedColumns = List.Difference(CSVColumns, ExpectedColumns),
//Report what has changed
OutputMissing = if List.Count(MissingColumns)=0 then
"No columns missing" else
"Missing columns: " & Text.Combine(MissingColumns, ","),
OutputAdded = if List.Count(AddedColumns)=0 then
"No columns added" else
"Added columns: " & Text.Combine(AddedColumns, ","),
Output = OutputMissing & " " & OutputAdded
in
Output
[/sourcecode]

Given a csv file that looks like this:

image

…and an Excel table like the one above containing the three column names Month, Product and Sales, the output of this query is:

image

It would be very easy to convert this query to a function that you could use to check the columns expected by multiple queries, and also to adapt the output to your own needs. Also, in certain scenarios (such as when you’re importing data from SQL Server) you might also want to check the data types used by the columns; I’ll leave that for another blog post though. In any case, data types aren’t so much of an issue with CSV files because it’s Power Query that imposes the types on the columns within a query, and any type conversion issues can be dealt with by Power Query’s error handling functionality (see Gerhard Brueckl’s post on this topic, for example).
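As a rough sketch (not included in the sample workbook, and untested beyond the happy path), the function version might look something like this, taking the list of expected column names and the table to check as parameters:

[sourcecode language=”text”]
//rough sketch of a reusable version of the query above:
//takes a list of expected column names and a table to check,
//and returns a text description of any differences
(ExpectedColumns as list, TableToCheck as table) as text =>
let
//get the column names actually present in the table
ActualColumns = Table.ColumnNames(TableToCheck),
//find expected columns that are missing
MissingColumns = List.Difference(ExpectedColumns, ActualColumns),
//find columns present that weren't expected
AddedColumns = List.Difference(ActualColumns, ExpectedColumns),
OutputMissing = if List.Count(MissingColumns)=0 then
"No columns missing" else
"Missing columns: " & Text.Combine(MissingColumns, ","),
OutputAdded = if List.Count(AddedColumns)=0 then
"No columns added" else
"Added columns: " & Text.Combine(AddedColumns, ",")
in
OutputMissing & " " & OutputAdded
[/sourcecode]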

You can download a workbook containing the two queries from this post here.

Optimising SSAS Many-To-Many Relationships By Adding Redundant Dimensions

The most elegant way of modelling your SSAS cube doesn’t always give you the best query performance. Here’s a trick I used recently to improve the performance of a many-to-many relationship going through a large fact dimension and large intermediate measure group…

Consider the following cube, built from the Adventure Works DW database and showing a many-to-many relationship:

image

The Fact Internet Sales measure group contains sales data; the Product, Date and Customer dimensions are what you would expect; Sales Order is a fact dimension with one member for each sales transaction and therefore one member for each row in the fact table that Fact Internet Sales is built from. Each Sales Order can be associated with zero to many Sales Reasons, and the Sales Reason dimension has a many-to-many relationship with the Fact Internet Sales measure group through the Fact Internet Sales Reason measure group. Only the Sales Order dimension connects directly to both the Fact Internet Sales Reason and Fact Internet Sales measure groups.

There’s nothing obviously wrong with the way this is modelled – it works and returns the correct figures – and the following query shows how the presence of the many-to-many relationship means you can see the Sales Amount measure (from the Fact Internet Sales measure group) broken down by Sales Reason:

[sourcecode language=”text” padlinenumbers=”true”]
select
{[Measures].[Sales Amount]} on 0,
non empty
[Sales Reason].[Sales Reason].[Sales Reason].members
on 1
from m2m1
where([Date].[Calendar Year].&[2003],
[Product].[Product Category].&[3],
[Customer].[Country].&[United Kingdom])
[/sourcecode]


image

However, to understand how we can improve the performance of a many-to-many relationship you have to understand how SSAS resolves the query internally. At a very basic level, in this query, SSAS starts with all of the Sales Reasons and then, for each one, finds the list of Sales Orders associated with it by querying the Fact Internet Sales Reason measure group. Once it has the list of Sales Orders for each Sales Reason, it queries the Fact Internet Sales measure group (which is also filtered by the Year 2003, the Product Category Clothing and the Customer Country UK) and sums up the value of Sales Amount for those Sales Orders, getting a single value for each Sales Reason. A Profiler trace shows this very clearly:

image

The Resource Usage event gives the following statistics for this query:

READS, 7
READ_KB, 411
WRITES, 0
WRITE_KB, 0
CPU_TIME_MS, 15
ROWS_SCANNED, 87299
ROWS_RETURNED, 129466

Given that the Sales Order dimension is a large one (in this case around 60000 members – and large fact dimensions are quite common with many-to-many relationships) it’s likely that one Sales Reason will be associated with thousands of Sales Orders, and therefore SSAS will have to do a lot of work to resolve the relationship.

In this case the optimisation comes with the realisation that we can add the other dimensions present in the cube to the Fact Internet Sales Reason measure group, to try to reduce the number of Sales Orders that each Sales Reason resolves to. Since Sales Order is a fact dimension with one member for each sales transaction, and each sales transaction also has a Date, a Product and a Customer associated with it, we can add the keys for these dimensions to the fact table on which Fact Internet Sales Reason is built and join these dimensions to it directly:

image

This is not an assumption you can make for all many-to-many relationships, for sure, but it’s certainly true for a significant proportion.

The Product, Date and Customer dimensions don’t need to be present for the many-to-many relationship to work, but adding a Regular relationship between them and Fact Internet Sales Reason helps SSAS speed up the resolution of the many-to-many relationship when they are used in a query. With the original design, in the test query, the selection of a single member on Sales Reason becomes a selection on all of the Sales Orders that have ever been associated with that Sales Reason. With the new design, the selection of a single member on Sales Reason becomes a selection on a combination of Dates, Customers, Products and Sales Orders; and since the query itself also applies a slice on Date, Customer and Product, this is a much smaller selection than before. For the query shown above, with the new design, the Resource Usage event now shows:

READS, 11
READ_KB, 394
WRITES, 0
WRITE_KB, 0
CPU_TIME_MS, 0
ROWS_SCANNED, 47872
ROWS_RETURNED, 1418

The much lower numbers for ROWS_SCANNED and ROWS_RETURNED show that the Storage Engine is doing a lot less work. For the amount of data in Adventure Works the difference in query performance is negligible, but in the real world I’ve seen this optimisation make a massive difference, resulting in queries running up to 15 times faster.

Don’t forget that there are many other ways of optimising many-to-many relationships, such as those described in this white paper. Also, if you have a large fact dimension that does not need to be visible to the end user and is only needed to make the many-to-many relationship work, you can reduce the overhead of processing it by breaking it up into multiple smaller dimensions as described here.

I’m speaking at the PASS BA Conference

I haven’t been shy about stating my support for the PASS BA conference and the associated efforts by PASS to reach out beyond its traditional audience to analysts and other power users (see here for example). I won’t bore you with my opinions again, except to say that at the third attempt I think PASS have got the balance of session topics right at the upcoming PASS BA conference in Santa Clara this April. There’s a stellar team of Excel speakers, including Mr Excel and Chandoo. There’s David Smith from Revolution Analytics, the company bought by Microsoft recently; plenty of sessions on predictive analytics; various Microsoft dev teams will be out in force; and Marco Russo and I will be speaking too. I think it promises to be a great conference, definitely not a PASS BI conference, and very different from the PASS Summit.

You can register here, and using the code BASPCHR will give you a $150 discount.