12th Blog Birthday

Today is the 12th anniversary of the first post on this blog, and as in previous years I’m going to use this as an opportunity to sum up my thoughts on what’s been going on in my corner of the Microsoft BI world in the last twelve months.

Power BI

I think it’s fair to say that 2016 was the year that Power BI became the big commercial success that many of us hoped it would be. After the achingly slow uptake of Power Pivot and the failure of the original Office 365 Power BI it’s great to see Microsoft BI with a hit on its hands. Many of my existing customers have started using it alongside the rest of the SQL Server BI stack, especially SSAS, because it’s much easier to build reports and share them via the browser or mobile devices than with SSRS or Excel. I’ve also started working with a new type of customer, one that I’ve never worked with before: small and medium organisations (including many not-for-profits) that have Office 365 but no existing BI solution, the kind of organisation that does not have the money or resources for a SQL Server BI solution or indeed any other kind of traditional BI solution. This, I believe, is where the real opportunity for Power BI lies and where the majority of the new growth will come from.

Apart from my own customers, there’s plenty of other evidence for the success of Power BI. The energy of the Power BI community, on forums and at user groups, is amazing – and once again, the people that I meet at user groups are completely different to the crowd you get at a normal SQL Server user group. The analysts love it too: for example, Microsoft is now in the Leaders section of the Gartner Magic Quadrant. There’s also the fact that competitors like Tableau have started attacking Power BI in their marketing, so I guess they must consider it a major threat.

Why has it been such a success? The underlying technology is great, but then again the technology was always great. The pace of change is incredible and it’s good to see Microsoft throwing its vast resources behind a product with some potential, rather than another Zune or Windows Phone. There’s still some catching up to do but at this rate any existing gaps will have been closed by the end of 2017. The willingness to listen to customer feedback and act on it is refreshing. The Excel/Power Query/Power Pivot and SSAS crossover adds an angle that the competition doesn’t have. Finally, the licensing is almost perfect: it’s simple (compared to the thousands of SKUs that Microsoft usually comes up with) and cheap/free, although organisations with thousands of users who all need Pro subscriptions find the costs escalate rapidly. I’d like to see special deals for large numbers of users, and some recognition that many users who need to view reports built with Pro-level features don’t need to create reports with those features themselves. I know Microsoft has already heard this from a lot of people, though, and has taken it on board.

Probably the only criticism I can make that Microsoft doesn’t seem to be actively addressing is that the data visualisation functionality is relatively weak. If you know what you’re doing and you have the patience, you can create good-looking reports. For people like me, with minimal artistic talent and limited patience, the experience of building reports can be frustrating. There are some features, like small multiples, that I can’t believe are still not implemented in the core product, and there’s nothing to help users follow good data visualisation practice. R visuals and custom visuals help fill the gap (I was quite impressed by this one by Business Solution Group, for example, which isn’t available in the gallery) but really Microsoft need to put some more thought into this area.

Analysis Services

There’s been a lot of good news in the world of Analysis Services this year too. SSAS Tabular 2016 dealt with a lot of the shortcomings that dogged it in 2012 and 2014: a much faster and less buggy development experience; many-to-many relationships supported using bi-directional cross filtering; and powerful new DAX functions and features like variables. SSAS Tabular v.next promises even more great new features such as the integration of the M language. These changes and the fact it’s now available in Standard Edition mean that Tabular should be the default choice over Multidimensional for almost all new SSAS projects.

Sadly, it looks like the neglect of Multidimensional will continue for the foreseeable future. I stopped being angry about this a long time ago and I understand that Microsoft need to concentrate their resources on SSAS Tabular and Power BI, but a lot of Multidimensional customers are now wondering where they stand. Either Microsoft needs to show some commitment to Multidimensional by adding new features – it wouldn’t take much to make a difference – or add features to Tabular that make it possible for more Multidimensional users to migrate over to it, for example equivalents to Scope statements or calculated members on non-Measures dimensions.

Last of all, Azure SSAS opens up a lot of exciting new possibilities for both on-premises SSAS users and Power BI users. Kasper does a great job of summing them up here and I won’t repeat what he has to say; once again I’m seeing a lot of interest from my customers and I’m sure I’ll be helping a few to migrate to the cloud very soon. The pricing seems a bit excessive at the moment, even when you take into account the ability to pause servers, and I hope it changes before RTM. Also, it’s SSAS Tabular only at this stage, but support for Multidimensional is by far the top-voted request on the feedback forum, with more than five times as many votes as the next highest request, so maybe this will be Microsoft’s opportunity to show some love to the Multidimensional world?

The Show Hidden Cubes SSAS Connection String Property

If you need to write queries in SQL Server Management Studio against an SSAS cube that has its Visible property set to false, you have a problem because when you connect to your database you can’t see the cube or its metadata! Luckily you can override this and make the cube visible by setting the following connection string property when you connect:

Show Hidden Cubes=true

image

Connection string properties can be set in the Additional Connection Parameters tab of the connection dialog that appears when you open a new MDX query window.
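
If you need to set more than one property at the same time you can separate them with semicolons, just as in a regular connection string. For example (the second property here is purely for illustration – use whatever properties you actually need):

Show Hidden Cubes=true;Locale Identifier=1033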

Unfortunately this doesn’t make any hidden objects inside the cube, such as measures or dimensions, visible again – it just makes the cube itself visible. However, if you’re working on the Calculations tab of the Cube Editor in SSDT it is possible to make all hidden objects visible, as I show here.

First Thoughts On The Integration Of Power Query/M Into Analysis Services Tabular

Last Friday’s big news was the release of the first CTP for Analysis Services v.next. Among several major new pieces of functionality (Ragged hierarchies! Drillthrough that works properly, even for calculations! Table-level security!) probably the biggest is the integration of Power Query/M into Analysis Services. As you can probably guess, I’m incredibly pleased that my two favourite technologies have got together. The technical details are given in this blog post, which I suggest you read if you haven’t done so already, but what I think is missing is an explanation of why this is so important and what kind of opportunities it opens up – hence this post. Of course this is just my take on the subject and not what Microsoft may actually be thinking; it’s also very early days, so as the functionality develops and I have more chance to think about it my opinions may change. If you have any ideas on this subject I would be interested to hear them, so please leave a comment!

Why this had to happen: Power BI

There is an obvious reason why Microsoft decided to integrate Power Query/M into SSAS, and that is because it needs to support the conversion of Power BI models into Analysis Services Tabular models. There are two scenarios where this will be necessary.

The first is the ability to convert a Power BI model into an Azure Analysis Services Tabular model (listed as ‘planned’ here), something that will be a key selling point for Azure Analysis Services when it releases. The engine behind Power BI is essentially the same as the one used in Analysis Services so migrating the data model should be straightforward, but since Power BI uses Power Query/M to load data then a migrated Azure Analysis Services model will also have to use Power Query/M.

The second scenario is similar to the first. We now know that on-premises Power BI will be delivered through Reporting Services v.next, and it’s reasonable to assume Reporting Services will need a database engine to store the data for published Power BI reports. That engine will have to be an Analysis Services instance of some kind (either standalone or running in-process inside Reporting Services) and again for that to work Analysis Services will have to support the same data access mechanisms as Power BI.

Better support for a larger number of data sources

I’ve just argued why Microsoft was obliged to include this functionality in SSAS v.next but in fact there are many positive reasons for doing this too. The most obvious one is to do with support for more data sources. At the moment SSAS Tabular supports a pretty good range of data sources, but the world of BI is getting more and more diverse and in order to stay relevant SSAS needs to support far more than it does today. By using Power Query/M as its data access mechanism, SSAS v.next will immediately support a much larger number of data sources and this number is going to keep on growing: any investment that Microsoft or third parties make for Power BI in this area will also benefit SSAS. Also, because Power Query/M can query and fold to more than just relational databases, I suspect that in the future this will allow for DirectQuery connections to many of these non-relational data sources too.

Different data sources for partitions in the same table

Another benefit of this change is that we’ll have a lot more flexibility with partitioning tables in an SSAS Tabular model. As the blog post says:

As long as a partition’s M query adheres to the column mappings of the table, you are free to perform any transformations and pull in data from any data source defined in the model.

In SSAS 2016 the partitions in a table all have to get data from the same data source, whereas in v.next we’ll be able to get data from different data sources in different partitions, and this opens up some interesting new possibilities. For example, I can imagine a simple budgeting application where the partitions in a table get data from different Excel workbooks stored in OneDrive for Business, and where each partition gets processed automatically when changes are saved to one of these workbooks.
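
I haven’t been able to try this yet, and the details may well change before release, but I imagine the M queries behind two partitions of the same table could look something like the following sketch – the OneDrive for Business URLs, the workbook table name and the column names are all invented for illustration:

[sourcecode language="text" padlinenumbers="true"]
//Partition 1: gets data from the January budget workbook
let
    Source = Excel.Workbook(
        Web.Contents("https://contoso-my.sharepoint.com/budgets/January.xlsx"),
        null, true),
    Budget = Source{[Item="Budget", Kind="Table"]}[Data],
    ChangedTypes = Table.TransformColumnTypes(
        Budget, {{"Month", type text}, {"Amount", type number}})
in
    ChangedTypes

//Partition 2: a different workbook, but the query returns
//exactly the same columns as the partition above
let
    Source = Excel.Workbook(
        Web.Contents("https://contoso-my.sharepoint.com/budgets/February.xlsx"),
        null, true),
    Budget = Source{[Item="Budget", Kind="Table"]}[Data],
    ChangedTypes = Table.TransformColumnTypes(
        Budget, {{"Month", type text}, {"Amount", type number}})
in
    ChangedTypes
[/sourcecode]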

Does this replace SSIS and my data warehouse? 

The short answer is no. Power Query/M is not a full-featured ETL tool and I don’t think it ever will be; it certainly does not have the kind of functionality needed to perform enterprise-level ETL right now. My view is that Microsoft have built Power Query/M into SSAS for the reasons above and not to encourage enterprise SSAS users to do their own quick-and-dirty ETL when loading data (although there is a risk that that will happen anyway). That said, I think the dividing line between corporate and self-service BI will become increasingly blurred over the next few years as the Microsoft BI stack develops, and we’ll see Analysis Services being used in self-service scenarios as well as the more traditional corporate ones.

Centralised data source objects

One last thing to point out is that the way SSAS v.next makes a distinction between data sources and other queries is very interesting. In Power BI and Power Query it’s easy to end up with data source connection information duplicated across multiple queries unless you know what you’re doing, and this can cause no end of problems later on in a project. As far as I can see, in SSAS v.next a “data source object” is an M query that only contains the connection to external data, while all other queries have to reference a data source to be able to access external data. This means, as the blog post says:

Referring to data source objects helps to centralize data source settings for multiple queries and simplifies deployments and maintenance if data source definitions must be updated later on. When updating a data source definition, all M queries that refer to it automatically use the new settings.
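
Based purely on my reading of that post – so treat this as a sketch rather than the finished syntax, with made-up server, database and query names – I imagine the split looking something like this:

[sourcecode language="text" padlinenumbers="true"]
//A "DW" data source object: an M query containing only
//the connection to the external data source
let
    Source = Sql.Database("myserver", "AdventureWorksDW")
in
    Source

//Any other query then references the data source object
//rather than calling Sql.Database() itself
let
    Source = DW,
    DimDate = Source{[Schema="dbo", Item="DimDate"]}[Data]
in
    DimDate
[/sourcecode]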

I wonder whether this concept is coming to Power BI and Power Query at some point? I hope so – it makes a lot of sense.

Power BI Model Size Bloat And Auto Date/Time Tables

Opinion is split over Power BI’s ability to automatically create Date hierarchies in your model. Personally it drives me mad and I always turn it off, but I know a lot of people love the convenience of it. Whatever your feelings, though, it is important to be aware of the problems it can cause with the size of your model.

Imagine you have a .pbix file and you load just this one table into the data model:

image

Three columns, each containing three dates with long gaps in between, but only nine values overall. No other tables or queries, no measures, no visuals at all. When you save, the result is a file that is a massive 4.7MB – but why, when there’s hardly any data?

Actually, there is a lot of data hidden in this file. If you connect to the .pbix file with DAX Studio you can see that the Auto Date/Time functionality has built three hidden Date tables whose names are prefixed with “LocalDateTable”, one for each date column in the original table above:

image

These tables can be queried in DAX Studio, and the following query reveals more about them (if you try this on your model you will need to alter the table name used in the query to match the name of one of the tables in your model):

[sourcecode language="text" padlinenumbers="true"]
EVALUATE
ROW (
    "Hidden Date Table Rowcount",
        COUNTROWS ( 'LocalDateTable_17eac8aa-f559-4ade-971f-9a1ad5258fbe' ),
    "Min Date",
        MIN ( 'LocalDateTable_17eac8aa-f559-4ade-971f-9a1ad5258fbe'[Date] ),
    "Max Date",
        MAX ( 'LocalDateTable_17eac8aa-f559-4ade-971f-9a1ad5258fbe'[Date] )
)
[/sourcecode]

image

In this case each of the three tables has 109,938 rows. That’s one row for each date between the beginning of the year containing the earliest date in the source column and the end of the year containing the latest date in the source column – which is the best practice for building a Date table, but results in three very large tables in this case. (For what it’s worth, 109,938 is the number of days from 1st January 1900 to 31st December 2200, which gives you an idea of the range of dates in the source table.)

To stop Power BI automatically building these tables for you, in Power BI Desktop go to the File menu, select Options, then Data Load and deselect the Auto Date/Time option:

image

When you do this the automatically created date tables are removed from the model. In this case, after saving, the .pbix file shrinks to 181KB! Remember that, by doing this, you won’t get automatic date hierarchies created for you when you are designing your reports and you will have to build any Date tables and hierarchies you need manually.

This is an extreme example of course, but overall you should probably turn off Auto Date/Time if your model size is an issue and:

  • You have a lot of date columns in your tables, or
  • Your date columns contain large ranges of dates. Remember that some systems use 1/1/1900 as an ‘unknown’ date value, which can make things a lot worse.

Passing Parameters To SQL Queries With Value.NativeQuery() In Power Query And Power BI

I first came across the Value.NativeQuery() M function about six months ago, but it didn’t do anything useful then so I didn’t blog about it. I checked it again recently though and now it does something very handy indeed: it allows you to pass parameters to SQL queries. Before this, if you wanted to use parameters in your SQL, you had to do some nasty string manipulation in the way Ken Puls shows here. Now, with Value.NativeQuery(), you can handle SQL query parameters properly in M.

Here’s a simple example that shows how it works, passing two parameters to a SQL query on the Adventure Works DW database:

[sourcecode language="text" padlinenumbers="true"]
let
    Source = Sql.Database("localhost", "Adventure Works DW"),
    Test = Value.NativeQuery(
            Source,
            "SELECT * FROM DimDate
            WHERE EnglishMonthName=@MonthName
            AND
            EnglishDayNameOfWeek=@DayName",
            [MonthName="March", DayName="Tuesday"])
in
    Test
[/sourcecode]

image

Some explanation of what’s happening here:

  • The Source step returns a reference to a SQL Server database, and this is passed to the first parameter of Value.NativeQuery().
  • The second parameter of the Value.NativeQuery() function is the SQL query to be executed. It contains two parameters called @MonthName and @DayName.
  • The parameters to the SQL query are passed using a record in the third parameter of Value.NativeQuery(). Note how the names of the fields in the record match the names of the parameters in the SQL query.
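
One nice side-effect of this is that it becomes easy to wrap the pattern up in your own M function. Here’s a quick sketch – not part of the original example, and the function and parameter names are just made up – that takes a month name and a day name and runs the same parameterised query:

[sourcecode language="text" padlinenumbers="true"]
//hypothetical wrapper function: the month and day names passed in
//are handed to the SQL query as parameters
(MonthName as text, DayName as text) as table =>
let
    Source = Sql.Database("localhost", "Adventure Works DW"),
    Result = Value.NativeQuery(
            Source,
            "SELECT * FROM DimDate
            WHERE EnglishMonthName=@MonthName
            AND
            EnglishDayNameOfWeek=@DayName",
            [MonthName=MonthName, DayName=DayName])
in
    Result
[/sourcecode]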

It looks like, eventually, this will be the way that any type of ‘native’ query (ie a query that you write and give to Power Query, rather than a query that is generated for you) is run against any kind of data source – instead of the situation we have today where different M functions are needed to run queries against different types of data source. I guess at some point the UI will be updated to use this function. I don’t think it’s ‘finished’ yet either, because it doesn’t work on Analysis Services data sources, although it may work with other relational data sources – I haven’t tested it on anything other than SQL Server and SSAS. There’s also a fourth parameter for Value.NativeQuery() that can be used to pass data-source-specific options, but I have no idea what these could be and I don’t think any are supported for SQL Server. It will be interesting to see how it develops over the next few releases.

[Update February 2021]

For relational databases that use question marks (?) instead of @ names for parameter placeholders, you need to use Value.NativeQuery in a slightly different way, passing a list of values rather than a record to the third parameter. Here’s an example:

[sourcecode language="text" padlinenumbers="true"]
Value.NativeQuery(
  Source,
  "SELECT * FROM DimDate
   WHERE EnglishMonthName=?
   AND
   EnglishDayNameOfWeek=?",
   {"March", "Tuesday"})
[/sourcecode]

 

Technitrain SQL Server & Microsoft BI Training Course Schedule For 2017

If you’re looking for expert-led, classroom-based training on SQL Server and Microsoft BI in London, check out the training courses we have coming up in 2017 at Technitrain:

Introduction to Power BI with Chris Webb, March 13-15 2017
A three-day introduction to building and managing a BI solution using Power BI, suitable for both analysts and BI developers.

Analysis Services Tabular Workshop with Alberto Ferrari, March 20-21 2017
A two-day course for BI developers who want to learn to build Analysis Services Tabular models.

Mastering DAX with Alberto Ferrari, March 22-24 2017
Learn DAX from the master! Suitable for anyone who needs to know how to write calculations and queries for Analysis Services Tabular, Power Pivot or Power BI.

Mission Critical SQL Server with Allan Hirt, March 27-30 2017
This advanced 4-day workshop, designed for SQL Server professionals and IT admins alike, will teach you how to build solid high availability and disaster recovery solutions for SQL Server.

Real-World Cube Design And Performance Tuning With Analysis Services Multidimensional with Chris Webb, May 8-10 2017
A course aimed at intermediate-to-experienced Analysis Services Multidimensional developers, looking at more advanced cube design topics and query performance tuning.

From Zero To BIML with Andy Leonard, June 19-22 2017
A four-day course aimed at BI developers who want to use Business Intelligence Markup Language (BIML) to generate SQL Server Integration Services packages.

Introduction To MDX with Chris Webb, September 25-27 2017
A three-day introduction to writing queries and calculations in MDX for Analysis Services Multidimensional developers and report writers.

Database DevOps with Alex Yates, September 25-27 2017
Aimed at anyone who is responsible for SQL Server schema changes, this course will teach you how to deliver more frequently and more reliably as well as how to solve common technical (and cultural) problems.

 

If you’d like to keep up-to-date with all our new course announcements, please join our mailing list!

Power Query, Power BI And The “Allow Data Preview To Download In The Background” Option

Recently I was asked by a customer to do some tuning on an Excel workbook with a lot of Power Query queries in it. Although all of the data used in the queries was coming from tables in the workbook itself and the data volumes were small, there were fifty Power Query queries and clicking Refresh All resulted in a large, prolonged spike in CPU and memory usage by Excel.

Only a small number of these fifty queries were being loaded into the workbook and none were being loaded into the Excel Data Model. The queries that were being loaded into the workbook were referencing several other queries that in turn referenced several other queries, and indeed there were some reference chains that were over ten queries long. To give you an idea of the complexity here’s what the Query Dependencies view looked like:

image

I’m a big fan of using references to split complex logic up into separate queries, and in this case it was absolutely the right thing to do because otherwise the workbook would have been unmaintainable. That said, there was clearly something going wrong with the refresh in this case.

On further investigation I found that if I individually refreshed the small number of queries that actually loaded data into the workbook, they all refreshed very quickly and with none of the ill-effects seen with a Refresh All. So if it wasn’t the queries that were being loaded into the workbook, what was the problem? It turns out it was the queries that weren’t being loaded into the workbook.

Both Power Query and Power BI load previews of the data returned by a query for display in the Query Editor; clicking Refresh All in the workbook was obviously triggering a refresh of these previews and this was what was using all the memory and CPU. I found that to prevent this happening I had to use an option that was introduced in Power BI in January 2016 and is also now present in Power Query/Get & Transform in Excel: Allow Data Preview To Download In The Background.

You can find this option in Excel by going to the Query Options dialog:

image

…and then going to Current Workbook/Data Load. If you then deselect “Allow data preview to download in the background”:

image

…you will find that the background refresh of query previews stops.

In my case, after I had done this – and with a bit of other tuning using Table.Buffer() – the workbook refreshed very quickly indeed and there was no spike in CPU or memory after a Refresh All.
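
For what it’s worth, the Table.Buffer() tuning was nothing exotic. It was roughly the following pattern – the table and column names here are invented – applied to a couple of the intermediate queries whose results were used many times:

[sourcecode language="text" padlinenumbers="true"]
let
    //an intermediate query built on a table in the workbook
    Source = Excel.CurrentWorkbook(){[Name="InputTable"]}[Content],
    ChangedTypes = Table.TransformColumnTypes(
        Source, {{"Amount", type number}}),
    //buffer the result in memory so that when it is used several
    //times during a single evaluation the steps above aren't repeated
    Buffered = Table.Buffer(ChangedTypes)
in
    Buffered
[/sourcecode]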

Other people have run into the same problem in Excel and also in Power BI Desktop (see here and here), so it looks like this is an important property to change if you have a large number of queries in a single workbook or pbix file.

Sharing Power Query Queries With Azure Data Catalog

About a week ago, without any warning, a much-awaited new feature lit up in Azure Data Catalog: the ability to share Power Query queries between workbooks and users. In fact it’s not really a new feature but the reappearance of something that was present in the original version of Power BI for Office 365; it works in a very similar way, although some functionality, like the option to search public data sources, has now disappeared and other things seem to have changed.

How It Works

First, make sure you have an Azure Data Catalog subscription. You can sign up here and a free subscription is fine. If you want to learn more about Azure Data Catalog you can read my post from earlier this year which has a quick overview.

Now imagine that you have just created a really cool Power Query query that you think all of your colleagues will want to use. In Excel, right-click the query that you want to share in the Query Pane, then select Send To Data Catalog:

image

You may need to sign in at this point – use the Organizational account that is associated with your Azure Data Catalog subscription.

image

Next you’ll see the Send to Data Catalog dialog. On the Query tab you can edit the description of the query and supply a URL to documentation:

image

You can also specify who the query is shared with:

image

Click Send and you have shared your query. At this point it will be visible in the Azure Data Catalog web portal along with all of your other assets:

image

Here you can also manage sharing, add more documentation, look at the columns returned and see a preview (if you enabled it when you shared the query). Unfortunately the Open In option is disabled at the time of writing, so you can’t open a new Excel workbook containing this query yet.

Back in Excel, if you want to use a shared query in a new workbook, you have two options on the New Query dropdown menu on the Data tab:

image

You can either search the catalog:

image

When you do this a new Search tab appears on the Excel ribbon, giving several different search options:

image

Alternatively, the My Data Catalog option allows you to see the queries you have shared:

image

Once you’ve found your query, you have two ways to consume it and it’s not immediately obvious what the differences are between them.

First you have the Load/Load To options that copy the query into your workbook and load its output to your destination of choice. At this point the new query runs like any other query, but when you open the Query Editor you’ll see it only has one step:

image

If you look at the M code you’ll see something like this:

[sourcecode language="text" padlinenumbers="true"]
let
    Source = Embedded.Value("959d482b-3b06-483c-84dd-f6fee2900bf9")
in
    Source
[/sourcecode]

The actual query is embedded somewhere in the workbook but the M source code isn’t available for you to view or edit; you can only run it.

If you want to edit the query or see the M code you have to use the Open option in the Shared Queries pane:

image

If you do this a new workbook is created with this query in it, and in the Query Editor you’ll see you can edit this query as normal: all the steps and the M code are visible.

Finally, if you do change the query, you can update the definition or share it as a new query by using the Send To Data Catalog option again. When the Send To Data Catalog dialog appears you have two new options to update the existing shared query in the Data Catalog or to create a new shared query:

image

How Could It Be Improved?

While I’m really happy to have this functionality back, and I think a lot of people will find it useful, there’s still a lot of room for improvement. Some thoughts:

  • This really needs to be extended to work with Power BI Desktop too. In fact, it’s such an obvious thing to do it must be happening soon…?
  • Both Power Query and Power BI should also extend their integration with Azure Data Catalog: you should be able to search for all types of data source and be able to create new queries from them. I know you can create new Power BI Desktop files and Excel files with Power Query queries using the Open In functionality in the Azure Data Catalog web portal, but that’s the wrong place to start the process in my opinion.
  • I find the difference between Load/Load To (where the query isn’t editable) and Open (where it is) confusing. It would be clearer to have options to download editable and non-editable versions.
  • It would be useful for Azure Data Catalog to store different versions of queries, so when you uploaded a query it didn’t overwrite the previous version and so you could roll back to an earlier version if you needed to. Source control, basically.
  • I’d like to see some kind of message appear in Excel or Power BI Desktop if a new version of a query I was using had been published, and then have the option to upgrade to the new version.
  • While it’s great to share queries in this way, it would also be cool to publish queries up to some kind of central place (a server, something cloudy) where they could also be executed and made available as a new data source. That way, other people could just consume the output of the query and not have to copy the query into their workbooks or Power BI reports. Maybe if you could publish an M query as an Azure Function…?

Converting Lists Of Numbers To Text Ranges In Power Query

In a comment on my post on Creating Sequences of Integers And Characters In Power BI/Power Query Lists a reader, Paul G, asked me the following question:

can you reverse this? e.g i have a list (1,2,3,5,7,8,9,12,13,14,15) can i convert this to (1-3, 5 ,7-9,12-15)

This got me thinking… I was sure it could be done in M, but would it be possible using just the UI? As far as I can see, it isn’t – there’s one crucial thing I can’t do – but I would be interested to see if anyone else can come up with a no-code solution.

The Almost UI-Only Solution

Here’s the source data, an Excel table called SourceTable:

image

Here’s what I did to create the ranges in a new Power Query query:

  1. Load the table into Power Query and set the Numbers column to have a data type of Whole Number
    image
  2. Sort the Numbers column in ascending order (even though in this case it already is)
    image
  3. Add an Index Column. It doesn’t matter what number it starts at (though I started it at one) but it does matter that it has a consecutive list of whole numbers in it:
    image
  4. Select both columns in the table, go to the Add Column tab in the query editor and under the Standard button select Subtract to add a new column containing the value of [Numbers]-[Index]:
    image
    Each distinct value in this column equates to a range of numbers in our final output – for example the rows that have the value 0 in this table will become the 1-3 range in the output.
  5. Do a Group By on this table, grouping by the Inserted Subtraction column and finding the minimum and maximum values in each range:
    image
    image
  6. Remove the Inserted Subtraction column because we don’t need it any more:
    image
  7. Select the Min and Max columns and then do a Merge Columns to combine these two sets of values using a dash:
    image
    image
  8. You now have your range names, but in the case of 5-5 you need to just return 5, so add a Conditional Column to handle this like so:
    image
    image
  9. Set the resulting column to have a data type of text:
    image
  10. Remove all other columns apart from Custom:
    image
  11. And finally, combine all the values together into a single comma-delimited list. This is where the UI lets you down, at the last step! Gil Raviv shows one way to solve this problem in a recent post, and in a follow-up he posts a much more elegant M-based solution from Imke Feldmann, which is my preferred approach too. I won’t repost all the steps (which are very well described here anyway) but basically you do another Group By, then alter the code of the resulting step to use Text.Combine() to aggregate the text – there’s a rough sketch of this after the list below.
    image
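
To give an idea of what that last change involves – this is only a sketch, and the step and column names will differ in your query (the grouping here is on a hypothetical constant column added beforehand) – you let the Group By dialog generate a Table.Group() step and then edit its aggregation by hand so that it uses Text.Combine():

[sourcecode language="text" padlinenumbers="true"]
//step generated by the Group By dialog, with the aggregation edited
//by hand so that it concatenates the values in the Custom column
#"Grouped Rows" = Table.Group(
    #"Removed Other Columns",
    {"Dummy"},
    {{"Ranges", each Text.Combine([Custom], ", "), type text}}
)
[/sourcecode]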

Job done! If anyone from the Power Query team is reading this blog, here’s a message for you: please give us a new option in the Group By dialog to concatenate all the text values in a column, rather like the DAX ConcatenateX() function. Thanks!

The Pure M Solution

Of course you would like to see a pure M solution too, wouldn’t you? Here you go:

[sourcecode language="text" padlinenumbers="true"]
let
    //The list to find ranges in
    Source = {1,2,3,5,7,8,9,12,13,14,15},
    //Sort in ascending order
    SortedSource = List.Sort(Source, Order.Ascending),
    //Get a list of all the positions in the list
    //rather like adding an index column
    Positions = List.Positions(SortedSource),
    //Create a list of all the groups by subtracting
    //each number from its position in the list
    Groups = List.Transform(Positions, each SortedSource{_}-_),
    //The function to use with List.Accumulate
    RangeAccFunction = (state, current) =>
        let
            //Get the current number from the iteration
            CurrentNumber = Text.From(SortedSource{current}),
            //Get the current group number
            CurrentGroup = Groups{current},
            //Get the previous group number
            LastGroupIndex = if current=0 then 0 else current-1,
            LastGroup = Groups{LastGroupIndex},
            //Get the next group number
            NextGroupIndex = if current=List.Max(Positions) then 0 else current+1,
            NextGroup = Groups{NextGroupIndex},
            //Generate the text for this iteration
            TextToAdd =
                if current=0 then CurrentNumber
                else if CurrentGroup<>LastGroup then ", " & CurrentNumber
                else if CurrentGroup<>NextGroup then "-" & CurrentNumber
                else ""
        in
            //Return the previous text plus
            //the newly generated text
            state & TextToAdd,
    //Use List.Accumulate to iterate over the list of positions
    //and generate the text for the ranges
    Output = List.Accumulate(Positions, "", RangeAccFunction)
in
    Output
[/sourcecode]

The output in this case is just the text we’re looking for:

image

This turned out to be a great opportunity to use the List.Accumulate() function; once again, Gil Raviv has a great post describing this function and how it can be used. In this case I’m using a very similar approach to the one above to group the numbers, then declaring a function called RangeAccFunction that can be passed to List.Accumulate() to generate the output text. Again, I would be curious to see other solutions!

You can download the sample workbook for this post here.

Referenced Queries And Caching In Power BI And Power Query

Last week, Maxim Zelensky (whose blog is well worth checking out) tweeted about a very interesting answer he had received to a question he posted on the Power Query MSDN forum, on the subject of caching and referenced queries in Power Query. You can read the thread here:

https://social.technet.microsoft.com/Forums/en-US/8d5ee632-fdff-4ba2-b150-bb3591f955fb/queries-evaluation-chain?forum=powerquery

…but since this is such important information – it’s certainly something I’ve wondered about myself – I thought I would share Maxim’s question and the response from Ehren of the Power Query dev team here in full so it gets the wider visibility it deserves. I’m very grateful to Maxim for letting me share this and to Ehren for writing such a detailed response.

First, Maxim’s original question:

There are two different scenarios I am working with:

1) Query1 connects to the data source (flat file) and make basic cleaning and transformations.

Then Query2 and Query3 reference to Query1, performing other transformations needed.

Query3 also take some data from Query2.

Query2 and Query3 then exported to sheet, Query1 – connection only.

As far as I can understand, PQ can define refresh chain like this: evaluate Query1, then evaluate Query2, then evaluate Query3 (as it need the results from Query2). The question is: When PQ performs calculation of Query3, will it recalculate Query1? Or, as it was evaluated in the chain before, Query3 will use cached results of previous Query1 calculation (performed when Query2 was evaluated)?

2) I have a set of flat files, and I take data from them with one Query1. Query1 also performs some transformations. Then I have two independent Query2 and Query3, both connected to Query1 performing different transformations of source data. Results of Query2 and Query3 evaluations then used in Query4, which exports its results to the sheets, Query1, Query2 and Query3 – connection only

The second question is: performing "Refresh" on Query4, how much times will be Query1 evaluated – one, two or three ? Or there also will be chain: calculate Q1, caching, then Q2 or Q3, calculate next using cached results of Q1 evaluation, and then – Q4?

3) Is there is a difference with connection to database?

4) Is there any rules of evaluation chain (like each expression/query will be calculated once in the evaluation chain)?

And here’s Ehren’s reply:

There’s a lot involved in answering your question, so let me back up and explain a few things first.

Caching
Power Query (in both Excel and Power BI Desktop) utilizes a "persistent cache", stored on disk, when refreshing queries. But what exactly does that mean?
First, let’s look at what gets cached. The persistent cache does not store the results of your M Queries (Q1, Q2, etc. in your example). Instead, it stores the results of the behind-the-scenes requests sent to data sources.

So if Q1 queries a SQL Server database called "MyServer/MyDatabase" and returns a single unfiltered table called "MyTable", the query sent to the server might be "select [Col1] from [MyTable]". In this case, the persistent cache will now know the result of sending "select [Col1] from [MyTable]" to "MyServer/MyDatabase". If another M query (whether through referencing Q1, or by querying the same table directly) needs the same result, the persistent cache can provide it, and the result won’t have to be fetched a second time from the SQL Server.
"Great," you might say. "So if I’m pulling from a flat file in Q1, and in a few places in Q2 I need to do Table.RowCount(Q1), the file should only be read from disk once, right?" And the answer would be…no. This is because not all data sources are cached. Specifically, the results of calls to File.Contents are not stored in the persistent cache. Why not? Well, the cache is stored on disk, and caching local files (which are already on disk) elsewhere on disk doesn’t really make sense. (Using Table.Buffer in this context may help…see more on Table.Buffer below.)
"Okay", you might say. "But if Q1 is pulling from a SQL table, and in a few places in Q2 I reference Q1, that should hit the persistent cache, right?" Maybe. It depends on how Q2 is using Q1, since doing additional operations on Q1 (such as filtering or merging) might cause the M engine to compute a different SQL query, resulting in the server being hit again.
Next, let’s look at the scope of caching. The scope of caching differs depending on what you’re doing, as well as what tool you’re using.

Previewing
If you’ve opened the Power Query editor in Excel or Power BI Desktop, you might have seen warnings like "This preview may be up to 3 days old". This is because there is a persistent cache used for interactive previewing of query results. As you can imagine from the fact that we have warnings about preview results being days old, this cache is long-lived and is intended to make the experience of working in the editor faster.

Loading to Excel
If you load/refresh three queries in Excel, each of them gets their own persistent cache. So the fact that a SQL result is cached during the load of Q2 won’t benefit the loading of Q3, even if it needs the same result.

Loading to Power BI Desktop
If you load/refresh three queries in PBI Desktop, they all share a single persistent cache. When you refresh multiple times, each refresh operation gets its own cache (shared by all the queries being refreshed at that particular time). This means that if SQL result is cached during the load of Q2, it will still be cached during the loading of Q3 (assuming they’re both being loaded at the same time).

What about Table.Buffer?
Table.Buffer can be useful if you want to store an intermediate result in memory and avoid pulling content from disk, a remote file share, a SQL Server, a website, or any other data source multiple times during an evaluation.  Think of Table.Buffer as, "load this table into memory, and stop folding subsequent operations back to the data source".

However, because buffering happens in memory and is not persisted on disk, buffering during the load of one query does not affect the load of another query. If Q1 is buffered when Q2 is loaded, Q1 will be re-buffered when Q3 is loaded.

And now to answer your question…
Now let’s take a look at your example (Q4 references Q2 and Q3, and Q2 and Q3 both reference Q1).
Since you’re pulling from a flat file, and File.Contents results aren’t cached, the flat file will be read each time Q1 is referenced (twice in Q4, once in Q3, Q2, and Q1). If you buffered the result of Q1, then Q4 would only read the file once. But when Q1, Q2, and Q3 are loaded (even in PBI Desktop), they will still each also read the file.

What about immutability?
You asked about the fact that M values are supposed to be immutable. This is true for the "pure" parts of the language, but breaks down when you introduce external data sources and folding. (In fact, you could think of Table.Buffer as transferring a table from the fuzzy unpredictable world of folding to the immutable world of pure M values.) You can see this in action by doing the following test, using a query called "MyTextFileQuery" that pulls from a local file on disk.

Reads the file five times
= Table.RowCount(MyTextFileQuery) + Table.RowCount(MyTextFileQuery) + Table.RowCount(MyTextFileQuery) + Table.RowCount(MyTextFileQuery) + Table.RowCount(MyTextFileQuery)

Reads the file once
= let rowCount = Table.RowCount(MyTextFileQuery) in rowCount + rowCount + rowCount + rowCount + rowCount

I’ve read this response several times and it’s still sinking in, but clearly there are some important implications here for anyone doing more advanced data loading work in Power Query and Power BI. I’m sure it will be the inspiration for many future blog posts on tuning Power Query query performance.