Fourteenth Blog Birthday

Every year, on the anniversary of the first-ever post on this blog, I write a post summarising my thoughts on what’s happening in the world of Microsoft BI and what I’m up to professionally.

This year, rather than go on about how Power BI is taking over the world (which we all know already), I thought it might be interesting to consider how the focus of this blog – and by implication the focus of my work – has changed over the last few years by looking at the viewing statistics of some of my most popular posts.

As you probably know, for a long time the only product I cared about was Analysis Services Multidimensional and MDX: it was pretty much all I blogged about and the only thing I did consultancy and training on. The following graph shows how the number of hits on four of the most-viewed posts on this subject from 2014 to 2018: Aggregating the result of an MDX calculation using scoped assignments; Joining the results of two MDX queries together; Last Ever Non Empty – a new, fast MDX approach; and One Cube vs Multiple Cubes.

image

 

None of these posts are, in technical terms, out of date but the downward trend is the same for all of them. The decline in traffic is matched by the decline in demand for consultancy and training on SSAS MD and MDX. While I still spend around 20% of my time doing SSAS MD and MDX consultancy, I do very little training on them at all these days – I guess because no-one is building new solutions using SSAS MD, although there are a still a large number of SSAS MD solutions in production that need maintenance and improvement. I expect the launch SSAS MD in the cloud as part of Power BI Premium will lead to a spike in the amount of work I do on it as I help my customers migrate but that will only be short-lived.

In contrast, look at the trend for four of my most-popular Power Query/M related posts: Referencing individual cell values from tables in Power QueryWorking with web services in Power Query; Creating tables in Power BI/Power Query M code using #table(); and Web.Contents(), M functions and dataset refresh errors in Power BI. These are not necessarily new posts (the earliest dates from 2014) but again they are all still technically relevant and the steep increase in the amount of hits over the last few years that they receive is clear:

image

Power Query and M is a bit of a niche topic, though; right now my most popular posts are on general Power BI data modelling and DAX – a topic I don’t actually blog about all that often, but which I nevertheless spend a lot of consultancy and training time on. The following graph shows the trends for the posts Comments and descriptions in DAX; Creating current day, week, month and year reports in Power BI using bi-directional cross-filtering and M; Dynamic chart titles in Power BI; and (although I’ve never really understood the popularity of this one) Using DateDiff() to calculate time intervals in DAX.

image

Perhaps I should blog about this more? The reason I don’t is twofold: first, there are a lot of people out there such as Marco and Alberto who specialise in DAX, have covered all the obvious angles and do a much better job than I ever could; second, my philosophy has always to blog about what I’m interested in and excited about, and frankly I have always enjoyed Power Query and M more than DAX.

One last graph is needed for context, showing the most popular posts from the three graphs above next to each other. The following graph shows how Aggregating the result of an MDX calculation using scoped assignments, Working with web services in Power Query and Dynamic chart titles in Power BI compare against each other:

image

It goes to show how the “Dynamic chart titles” post is now much more popular that the “Aggregating the result of an MDX calculation” post was, even at the peak of its popularity. I guess Power BI is a safe bet for my future.

Data Privacy Settings And Data Refresh Performance In Excel And Power BI

I have known for a long time that data privacy checks can have an impact on the performance of data refresh in Power BI and Excel, but on a recent performance tuning engagement I had the chance to observe just how much of a difference changing these settings can make. Unfortunately I can’t share the M code for the queries I was working with but the scenario is very common:

  • The only data source was a single Excel workbook with multiple worksheets
  • There were multiple Power Query/M queries reading data from these worksheets; the output of these queries was not being loaded into the Power BI dataset
  • There were further Power Query/M queries that referenced these source queries, none of which were being loaded into the Power BI dataset, and they all fed into a single query whose output was being loaded into a dataset
  • Overall data volumes were very small, less than 1MB, and the final query only returned approximately 5000 rows
  • Data refresh times were exceedingly slow, even with the “allow data preview to download in the background” setting turned off, and there was a long period where the Power Query engine was stuck in the “Evaluating” phase of the query during refresh in Power BI Desktop before the dialog showed the amount of data being loaded:

    image

  • Data refresh times also varied by +/-25% (I have a theory why this is happening – if I can confirm it I will blog about this) from one refresh to another

As a benchmark, here are some typical refresh timings for the same table in Power BI Desktop (December 2018 version) and Excel for Office 365 (build 11126.20118 click-to-run) with the default setting of data privacy checks turned on and the data privacy level set for the source Excel file set to Public:

Power BI Desktop Excel
Evaluating Phase (Seconds) 122 173
Data Load Phase (Seconds) 140 113
Total Refresh Time (Seconds) 262 286

Here are the same timings with the Power Query engine set to combine data according to each file’s data privacy settings, and the data privacy checks for the current file turned off:

image

image

Power BI Desktop Excel
Evaluating Phase (Seconds) 69 91
Data Load Phase (Seconds) 108 112
Total Refresh Time (Seconds) 177 203

Clearly there are some substantial query performance gains to be had by turning off data privacy checks and these gains are made in the Evaluating phase. In this case the queries in both Power BI Desktop and Excel were around 30% faster with data privacy checks turned off.

BUT is it safe to turn off data privacy checks? I strongly advise you to read my series on data privacy settings starting here, in particular this post on disabling checks, to understand the issues involved. You should not turn off data privacy checks without understanding what they do because you may end up sending sensitive data to somewhere it shouldn’t go and maybe even break the law. What’s more, if you have published a Power BI dataset you cannot disable data privacy checks for scheduled refresh unless you are using the gateway in Personal mode and have configured it with the “Fast Combine” option turned on, as shown here. Therefore turning off privacy checks is only really useful for Power Query users in Excel or to speed up development in Power BI Desktop.

In this particular case there is only one data source, an Excel workbook, a source which does not support query folding, so as far as I can see there is no risk in turning off data privacy checks because there is no way data from this Excel workbook can be sent anywhere else. Now why couldn’t the Power Query engine work this out?

Making Power BI Drillthrough Return The “Right” Rows When You Use It With Complex Measures

When an end user sees a strange value in a Power BI report, their first reaction is usually to want to see the detail-level data from the underlying table. Power BI’s drillthrough feature is a great way of  letting them do this, but it only returns meaningful results if you use it on measures that do simple aggregations such as sums or counts; if you have more complex calculations then usually what the drillthrough returns won’t be the rows that go to make up the value the user has clicked on.

Here’s an example. Say you have a simple Power BI model with a Sales table that contains the following data:

image

There is also a Date table with date and month columns, and the entire model looks like this:

image

Let’s say you create a measure called Sales Value that sums up the contents of the Sales column:

Sales Value = SUM('Sales'[Sales])

You could use this in a column chart to show sales by month, like so:

image

If the user wants to see the underlying data for one of the bars in this chart, drillthrough will work well – you just need to create another page (called, in this case, Month Drillthrough), put a table on it that displays the full contents of the Sales table:

image

[It’s important to note that it’s the Date column from the Sales table that’s shown here, not the Date column from the Date table]

Then drag the Month column from the Date table into the Drillthrough filter area:

image

…and you will be able to drillthrough from one of the columns in the chart, in this case the bar for May 2018:

image

…and that filter will be passed over to the Date Drillthrough page, so you only see the row in the table showing sales for May 5th 2018:

image

But what happens if you want to display year-to-date values in your column chart? If you create the following measure:

YTD Sales = CALCULATE([Sales Value], DATESYTD('Date'[Date]))

…and use it in the bar chart, you will see the following:

image

The problem comes when the user does the same drillthrough on May 2018 – which now shows the value 16 – and gets exactly the same table that they did before, showing only the sales transactions for May:

image

In this case, because the user clicked on the year-to-date value for May 2018 they would expect to see all the rows from the Sales table that went to make up that YTD value for May, that’s to say all the rows from the Sales table where the date was in the range January 2018 to May 2018.

The solution is to use some DAX that takes the month filter passed by the drillthrough and ensures that it filters the table shown not by the selected month, but all months in the year-to-date (similar to, but not exactly the same as, what I describe here).

Here’s a measure that does the job:

SalesIgnoringDate =
var CurrentDateFromSales =
CALCULATE(
    SELECTEDVALUE('Sales'[Date]),
    CROSSFILTER(
        'Date'[Date],
        Sales[Date],
        None
        )
    )
return
IF(
    CONTAINS(
        DATESYTD('Date'[Date]),
        'Date'[Date],
        CurrentDateFromSales
        ),
    CALCULATE(
        [Sales Value],
        CROSSFILTER(
            'Date'[Date],
            Sales[Date],
            None)
            )
    )

What this does is:

  • Uses the DAX Crossfilter() function to disable the relationship between the Date and Sales table, and then use the SelectedValue() function to find the date from the Sales table shown on the current row of the table on the drillthrough report page, and store it in the CurrentDateFromSales variable.
  • Constructs a table using the DatesYTD() function and the Date column of the Date table, which contains all of the dates from the beginning of the current year up to and including the last date in the filter context – which will be the last date in the month selected in the drillthrough.
  • Uses the Contains() function to see if the date saved in the CurrentDateFromSales appears in the table returned in the previous step.
  • If it does appear, return the value of the Sales Value measure. Once again, this needs to have the relationship between the Sales and Date table disabled using the CrossFilter() function.

This measure can be used in the table on the drillthrough page instead of the Sales Value measure. Last of all, since your users will not want to see a measure called SalesIgnoringDate in their report, you can rename the SalesIgnoringDate column on the table to Sales Value.

Here’s the end result (in this case I created a new drillthrough page called YTD Drillthrough with the new measure on):

YTDDrillthrough

You can download the sample pbix file here.

This is just one example, and different types of calculation on your source page will require vastly different DAX measures on your drillthrough page to ensure that a meaningful set of rows is returned. The basic concepts will remain the same whatever the calculation, though: you need to create a measure that ignores the filter applied by the drillthrough and instead returns a value when you want a row to appear in your drillthrough table and returns a blank value when you don’t want a row to appear.

It’s a shame that drillthrough in the SSAS Tabular sense is not available in Power BI, because being able to set a the Detail Rows Expression property on a measure in Power BI would make this problem a lot easier to solve.

Power BI And Column Order In M Queries And Dataset Tables

The order of the columns in a table in a Power BI dataset doesn’t matter all that much, especially because the Fields pane in Power BI Desktop ignores the original column order and lists the columns in a table in alphabetical order. However there are a few situations where it is important, for example when you are using the DAX Union() function in a calculated table: as the documentation states, when you use Union() “Columns are combined by position in their respective tables”. You might also find it irritating if the columns you see in the Data or Relationships panes in the main Power BI Desktop window make it hard to browse the data or create relationships. You can control the order of columns in a table in the Power Query Editor but it’s not completely straightforward to get any changes to column order you make in your queries to be applied to the order of the columns in a table after it has already been loaded into a dataset.

Let’s say you have an M query that returns a table with three columns, Column1, Column2 and Column3, that looks like this:

image

Here’s the M code for this query:

#table(
    type table [Column1=number, Column2=number, Column3=number],
    {{1,2,3}}
)

The output you see in the Data pane in the main Power BI window is, unsurprisingly this:

image

However, if you subsequently edit the query above so the same columns are present but in a different order, like so:

image

#table(
    type table [Column2=number, Column3=number, Column1=number],
    {{2,3,1}}
)

…you will find that it does not affect the order of the columns shown in the Data pane in Power BI Desktop, which are fixed in the order they were when they were first created.

To get the column order change to be applied to the table in the dataset, you need to go back into the Query Editor, right-click on the Query and deselect the Enable Load option:

image

When you do this, you get the following warning:

image

If you subsequently click Continue and then Close & Apply, the dire warnings will come true: you will not only break any visuals that use this table but any measures or calculated columns associated with this table will be lost. If you instead click Continue and then click the Enable Load property on the query so loading is enabled again, you will find that when you then click Close & Apply the visuals will still work, the columns have been reordered, but the measures and calculated columns will still be lost. You’ll also see something that I think is a bug: there will be no data in the table in the dataset

image

You will have to manually refresh the query in the main Power BI Desktop screen to see the data in the table:

image

Credentials, Data Privacy Settings And Data Sources In Power Query/Power BI

Recently I’ve been doing some more investigations into how data privacy settings work in Power BI. This is a subject I’ve blogged about in great detail already in a series of posts last year, but this functionality is so complex that there is always more to learn. I don’t have any profound new insights to offer; instead this blog post is a write up of a series of experiments whose results shed light onto how the process of setting data privacy levels works end-to-end.

Consider the following M query:

let
    Source = 
        Json.Document(
            Web.Contents(
                "https://data.gov.uk/api/3/action/package_search?q=cows"
                )
            ),
    result = Source[result],
    results = result[results],
    results1 = results{0},
    id = results1[id],
    output = 
        Json.Document(
            Web.Contents(
                "https://data.gov.uk/api/3/action/package_show?id=" & id
                )
            )
in
    output

The query does the following:

  • Runs a query against the UK government’s open data search API (the same API I use in this post on the RelativePath and Query options for the Web.Contents() function) to get a list of datasets related to the search term “cows” via the Package Search endpoint
  • Gets the first dataset returned by the search and finds its ID
  • Sends this ID to the Package Show endpoint in the same API to get the full JSON representation of this data set. Note that the entire URL is dynamically generated and that the Query option of Web.Contents() is not used here.

It’s a typical scenario where data privacy settings can cause problems: data from one data source, the package_search endpoint, is sent to another data source, the package_show endpoint. My series from last year on data privacy settings provides some useful background information on why this is such an important thing for the Power Query engine.

Assuming that you have never used this API before, when you try to run this query in the Power Query Editor in Power BI Desktop, you’ll see the following prompt to edit the credentials used:

image

Before you click the Edit Credentials button, there are two interesting things to point out. First, in the Query Dependencies view, you see this:

image

Notice that the Package Search endpoint is shown but not the Package Show endpoint.

Second, if you click the Data Source Settings button, you’ll see the following in the dialog that appears:

image

Not only does it only show the Package Search endpoint, there is a warning that says:

“Some data sources may not be listed because of hand-authored queries”

This refers to the output step in the query that calls the Package Show endpoint with the dynamically-generated url.

Closing this dialog and going back to the Query Editor, if you click the Edit Credentials button, you can set credentials for the data source (anonymous access is fine in this case). These credentials can be set at all levels in the path down to https://data.gov.uk/api/3/action/package_search.

image

Setting credentials at the level of https://data.gov.uk means you only get prompted once; however if you select https://data.gov.uk/api/3/action/package_search from the dropdown list and click Connect you will get prompted again to set credentials, this time with a dropdown that shows all paths down to package show:

image

Assuming you set credentials at the level of https://data.gov.uk and click Connect, then Fiddler shows that a call is made to https://data.gov.uk/api/3/action/package_search?q=cows, presumably to check whether the credentials entered actually work and you move back to the Query Editor.

image

Next, in the Query Editor, you see the data privacy settings prompt:

image

Clicking Continue brings up the data privacy levels dialog:

image

You have the choice to ignore privacy levels for this file, but of course you should always try to avoid doing that. You also have two dropdown boxes that both show https://data.gov.uk on the left-hand side and another two dropdown boxes next to them, although only the top one of these is enabled.

In the first column of dropdown boxes, in the first dropdown, you can see all points in the path from https://data.gov.uk to https://data.gov.uk/api/3/action/package_search:

image

In the dropdown box immediately underneath you can see for the first time all points in the path from https://data.gov.uk to https://data.gov.uk/api/3/action/package_show:

image

If you select https://data.gov.uk in the top-left dropdown only the top-right dropdown is enabled, and in the top-right dropdown you can set the data privacy levels Public, Organizational and Private.

image

The meanings of these three levels are described in my earlier series and here, and I won’t go into detail about what they mean in this post. The bottom-right dropdown is disabled because if you set privacy levels for https://data.gov.uk then all urls that start with this path inherit this setting. This is similar to what happens with the None data privacy setting that I describe here, I think.

Setting the data privacy level to Public on https://data.gov.uk in the top-right dropdown means the query runs successfully:

image

The expected activity is shown in Fiddler:

image

And at last the Package Show endpoint is shown in the Query Dependencies view:

image

The Data Source Settings dialog shows the same as it does above in the “Data sources in current file” tab, including the warning about hand-authored queries, but on the “Global permissions” tab there is now an entry for https://data.gov.uk :

image

Although you only set a privacy level for https://data.gov.uk earlier, it’s interesting to note that the entry for https://data.gov.uk/api/3/action/package_search has a privacy level set explicitly to Public and not to None:

image

Stepping back a few steps to the Privacy Levels dialog, if you set a privacy level of Private for https://data.gov.uk like so:

image

…then the query fails with the error “Formula.Firewall: Query ‘Query1’ (step ‘output’) is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.”:

image

From my point of view, this is the first really interesting finding: the two endpoints, https://data.gov.uk/api/3/action/package_search and https://data.gov.uk/api/3/action/package_show, are considered as separate data sources (which tallies with what is shown in the Query Dependencies view) even though they have both inherited their data privacy level setting from https://data.gov.uk. Since they are both Private then data cannot be sent from one source to the other, hence the error.

The second interesting finding becomes apparent if you follow the steps above with a slightly different version of the original query that uses the Query option in the call to the Package Show endpoint:

let
    Source = 
        Json.Document(
            Web.Contents(
                "https://data.gov.uk/api/3/action/package_search?q=cows"
                )
            ),
    result = Source[result],
    results = result[results],
    results1 = results{0},
    id = results1[id],
    output = 
        Json.Document(
            Web.Contents(
                "https://data.gov.uk/api/3/action/package_show",
                [Query=[#"id"=id]]
                )
            )
in
    output

In this case when you look in the Data Source Settings dialog you now see both endpoints listed and you no longer see the “hand-authored queries” warning:

image

It looks like whatever method it is that the Power Query engine searches for data sources inside a query is confused by dynamically generated urls – which might also explain why data sources that use dynamic urls can’t be refreshed after publishing.

Listing Windows Language Code Identifiers And Their Associated Date And Number Formats With M In Power BI/Power Query

In a comment on my blog post about international date and number formats and changing data types with the “using locale” option in Power Query/Power BI, Jan Karel Pieterse asked if there was any way to get a list of the thousand and decimal separators used for number formatting by each language and region. Since this is exactly the kind of geeky question that fascinates me I decided to write an M query to answer it and – for bonus points – to find the default date format used too.

To start off, I found a table of all Windows Language Code Identifiers on this page:

https://msdn.microsoft.com/en-us/library/cc233982.aspx

Of course this can be loaded into Power Query easily using the “From Web” source. After that it’s easy to add a column to the table that takes a sample date (March 22nd 2018) and number (one hundred thousand and one tenth) and converts it to text using the language code identifier on each row: the Text.From() function does this for dates, and for numbers you have to use Number.ToText() if you want to get thousand separators and decimal separators. There are a few minor problems to deal with, such as the fact that Power Query doesn’t know what to do with dates for the “Congo Swahili” language code identifier and some rows have multiple language tags, but nothing serious.

Here’s the full code:

let
    //Sample dates and numbers to show
    SampleDate = #date(2018,3,22),
    SampleNumber = 100000+(1/10),
    //MS web page with list of language tags
    LocaleWebPage =
        Web.Page(
        Web.Contents(
         "https://msdn.microsoft.com/en-us/library/cc233982.aspx"
         )
         ),
    LocaleList = LocaleWebPage{1}[Data],
    RemoveColumns =
        Table.SelectColumns(
            LocaleList,
            {"Language", "Location (or type)", "Language tag"}
            ),
    SplitColumn =
        Table.SplitColumn(
            RemoveColumns,
            "Language tag",
            Splitter.SplitTextByAnyDelimiter(
                {",","or"},
                QuoteStyle.Csv
                ),
                {"Language tag"}
                ),
    //Create example columns
    DateExample =
        Table.AddColumn(
            SplitColumn,
            "Date",
            each Text.From(SampleDate, [Language tag])
            , Text.Type),
    NumberExample =
        Table.AddColumn(
            DateExample,
            "Number",
            each Number.ToText(SampleNumber,"N", [Language tag])
            , Text.Type),
    //Remove any rows containing errors
    RemoveErrors = Table.RemoveRowsWithErrors(NumberExample)
in
    RemoveErrors

Here’s some of the output:

image

So, if you’ve ever wondered how the Cornish speakers of south-west England like to format their dates or whether the Oromo speakers of Ethiopia use a comma or a full stop as a decimal separator, wonder no more. And if you are not interested in M at all and just want to download an Excel workbook with a list of all LCIDs and how numbers and dates are formatted for them, you can do so here.

BI Survey 18 Highlights

Every year, in return for publicising the BI Survey (the largest survey of BI tool customers and vendors in the world) here on my blog, I get a free copy of the results and the chance to blog about some of the more interesting findings. Here are a few points that stood out for me this year in the BI Survey 18:

  • The top 5 tools acquired by respondents were Power BI, Excel, Qlikview, Tableau and SSRS in that order – so Microsoft has three of the top 5 most popular BI tools. Given that users often have trouble understanding the different types of BI that Power BI, Excel and SSRS are suited to, Microsoft’s long-standing strategy of close ties between Power BI and Excel and the forthcoming integration of SSRS reports into Power BI Premium looks very wise.
  • Power BI has – unsurprisingly – one of the best price-to-value ratios reported of all BI tools. The flip side of this is that customers are less happy with the level of support that Microsoft offers; I guess this is all relative to more expensive BI tools which need armies of people from the vendor or from a BI consultancy to implement them, and who are more likely to bend over backwards for their customers.
  • Power BI is now the number two client tool used for SSAS after SSRS, beating Excel. I find that hard to believe but I guess it’s possible.
  • Compared to its obvious competitors, and indeed to all other BI tools, Power BI gets some very good scores. There is no clear leader (and if there was, I would be very suspicious about the survey’s methodology) but it’s clear that Microsoft is now one of the leading BI vendors and given that it’s sustaining the levels of investment that brought it to this position, I think it Power BI will continue to grow and prosper in 2019.
%d bloggers like this: