The Pros And Cons Of Modelling Measures As A Dimension In Power BI

You probably know that dimensional modelling and building a star schema are very important in Power BI. In a star schema you’ll have at least one fact table, and in your fact table you’ll have two main types of column: dimension keys, which link the fact table to dimension tables, and measures, which store the data you want to aggregate and analyse. Here’s a simple example of a fact table called Sales with three dimension key columns (Product, Store and Customer) and three measure columns (Sales, Tax and Volume Sold):

Quite often, though, I see people taking a slightly different approach to modelling their fact tables: instead of having separate columns for each measure they unpivot their data, create one row in their fact table for each measure value, use a single column to store all the measure values and create a new dimension to allow the user to select which measure values they want. Here’s an example of another fact table, called Sales Unpivot, showing how the data from the Sales fact table above can be remodelled using this technique:

In this fact table the dimension keys remain the same, but the Value column stores all the data from the Sales, Tax and Volume Sold measures in the original table and the Measure Name column tells you what type of measure value is stored on any given row. Let’s call this approach the Measures Dimension approach.
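By the way, if you wanted to build the Sales Unpivot table yourself in Power Query, the reshaping can be done with a single unpivot step – something like this sketch, which assumes a source query called Sales with the column names shown above:

let
    // "Sales" is assumed to be the original fact table query shown above
    Source = Sales,
    // turn the three measure columns into Measure Name/Value pairs,
    // keeping the dimension key columns as they are
    #"Unpivoted Measures" = Table.UnpivotOtherColumns(
        Source,
        {"Product", "Store", "Customer"},
        "Measure Name",
        "Value"
    )
in
    #"Unpivoted Measures"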

There are some advantages to building fact tables using the Measures Dimension approach, for example:

  • You can now use a slicer in a report to select the measures that appear in a visual
  • You can now easily add new measures without having to add new columns in your fact table
  • You can use row-level security to control which measures a user has access to

Generally speaking, though, any time you deviate from a conventional dimensional model you risk running into problems later on and this is no exception. Let’s go through the disadvantages of modelling data using a Measures Dimension.

Formatting

Notice that the Sales and Tax measure columns from the Sales fact table are currency values and that Volume Sold contains decimal values with four decimal places. It’s quite easy to set different formats for different measures when each measure is a separate column:

However, when all your values are stored in one column, as in the Measures Dimension example, formatting is not so straightforward. You might be able to get away with using one generic format for all your data:

…but that isn’t ideal. Of course you can create DAX measures and format them appropriately but then you lose some of the flexibility of this approach; you could also use a calculation group and dynamic format strings as Kasper describes here.
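For example, the format string expression on a calculation item (or a dynamic format string) might look something like this sketch – the format strings themselves are just ones I’ve made up to suit the data above:

SWITCH (
    SELECTEDVALUE ( 'Sales Unpivot'[Measure Name] ),
    "Sales", "$#,0.00",
    "Tax", "$#,0.00",
    "Volume Sold", "#,0.0000",
    "#,0.00"
)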

Compression

More seriously, Power BI does a much more efficient job of storing and compressing the data in a conventional fact table compared to when the Measures Dimension approach is used and this has consequences for query performance. Using the View Metrics button in DAX Studio to see the details of how the data is stored for each table is revealing. Here are some selected highlights:

First of all, notice that the Sales Unpivot table (which uses the Measures Dimension approach) is 66% larger than the Sales table. Notice also that in the Sales table the Sales and Tax measure columns, which contain currency values, can use the Currency data type (which, confusingly, shows up as Decimal here), which in turn means they can use Value encoding; only the Volume Sold column needs to be stored using the Decimal Number data type (which shows up as Double here) and must use Hash encoding. In the Sales Unpivot table, since all the measure values are stored in the Value column, this column has to use the Decimal Number data type and Hash encoding. As this article explains (and the Definitive Guide To DAX goes into a lot more detail), Value encoding can give you significant performance benefits.

Calculation Complexity

When you start to build more complex DAX calculations then the disadvantages of the Measures Dimension approach become even more apparent. Let’s say you want a visual in your report that shows Sales, Tax and a measure that subtracts Tax from Sales called Sales After Tax:

Here’s the DAX needed for this visual:

Sales Measure = SUM('Sales'[Sales])
Tax Measure = SUM('Sales'[Tax])
Sales After Tax = [Sales Measure] - [Tax Measure]

To achieve the same result with the Measures Dimension approach, though, you need to know how to use the DAX CALCULATE() function, something like this:

Sales Measure 2 =
CALCULATE (
    SUM ( 'Sales Unpivot'[Value] ),
    KEEPFILTERS ( 'Sales Unpivot'[Measure Name] = "Sales" )
)
Tax Measure 2 =
CALCULATE (
    SUM ( 'Sales Unpivot'[Value] ),
    KEEPFILTERS ( 'Sales Unpivot'[Measure Name] = "Tax" )
)
Sales After Tax 2 = [Sales Measure 2] - [Tax Measure 2]

[Note that in most cases I’d create a separate dimension table for the Measures dimension, but to keep things simple here I’ve not done that]
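If you did create that separate dimension table, one quick way to do it is a DAX calculated table as simple as this sketch, which you would then relate to the Measure Name column on the fact table:

Measures Dim = DISTINCT ( 'Sales Unpivot'[Measure Name] )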

If you expect other people to build measures on your dataset then this additional complexity can be a significant barrier to overcome. CALCULATE isn’t an easy function to use properly.

Calculation Performance

Last of all, there’s also a performance penalty to pay with the Measures Dimension approach. Taking the Sales After Tax example from the previous section, here’s what the Server Timings tab in DAX Studio shows for the query associated with the visual showing Sales, Tax and Sales After Tax:

Notice that there’s just one Storage Engine query: DAX fusion has kicked in so that the Sales and Tax values required can be retrieved in the same scan.

However, here’s what the Server Timings tab shows for the same visual using the Measures Dimension approach and the second set of measures using the Sales Unpivot table shown above:

Not only is this query slower but there are now two Storage Engine queries: one to get the Sales data and one to get the Tax data. Since separate scans are needed to get each measure value, the more measures you have in a visual or the more measures needed by your calculations, the more scans are needed. This can very quickly add up to a big performance problem, especially if each scan is relatively slow – which is more likely to be the case since the Measures Dimension approach means Power BI is less able to compress data effectively.

Conclusion

As you’ve probably guessed by now I’m not a big fan of the Measures Dimension approach. While there are definitely some advantages to using it I think the disadvantages – which aren’t always immediately obvious – outweigh them.

Dynamically Generated Lines On A Map In Power BI Using DAX, WKT And The Icon Map Custom Visual

I don’t generally blog about data visualisation in Power BI because it’s not my strong point, and I don’t blog about maps and geospatial analysis because I know even less about that subject and there are people like David Eldersveld who cover it so well. I do like playing around with maps though and recently I’ve been having fun with the Icon Map custom visual developed by James Dales. Probably the thing I like most about it is that so many properties accept measures as inputs, which means that you can use DAX to do some interesting things.

In this post I’ll show you a very basic example of how to display dynamically generated lines on a map – specifically, I’ll show you how to draw lines between any four cities that a user selects from a slicer. Pre-calculating all the possible combinations of cities when you load your data is not feasible because of the sheer number, so this is a scenario where being able to dynamically generate the lines between the four selected cities in DAX is really useful.

To start off, I downloaded a table with a list of cities in the UK and their latitudes and longitudes from https://simplemaps.com/data/gb-cities and then loaded this data into four identical tables in Power BI called Start, Second Stop, Third Stop and End.

I then created four slicers from the City column from these four tables, so the user can select the four different cities they want to draw lines between.

The key to making this work is to use the Icon Map visual’s ability to display geometric data in Well Known Text format; documentation on this can be found here. To achieve this I wrote some (slightly over-engineered) DAX in a measure that generates a LINESTRING object that has four lines to connect the four selected cities:

Route = 
var CityTable = 
{
    (SELECTEDVALUE('Start'[Longitude]), 
    SELECTEDVALUE('Start'[Latitude])),
    (SELECTEDVALUE('Second Stop'[Longitude]), 
    SELECTEDVALUE('Second Stop'[Latitude])),
    (SELECTEDVALUE('Third Stop'[Longitude]), 
    SELECTEDVALUE('Third Stop'[Latitude])),
    (SELECTEDVALUE('End'[Longitude]), 
    SELECTEDVALUE('End'[Latitude]))
}
var FilteredCityTable = 
FILTER(
    CityTable, 
    NOT(ISBLANK([Value1])) && 
    NOT(ISBLANK([Value2])))
return
"LINESTRING ( " & 
CONCATENATEX(
    FilteredCityTable, 
    [Value1] & " " & 
    [Value2], ", ") & 
")"

Here’s what the measure returns when the cities of Bath, Birmingham, Banbridge and Bradford are selected:

LINESTRING ( -2.365556 51.379444, -1.916667 52.466667, -6.26701 54.35091, -1.75 53.783333)

Finally, I dragged this measure into the Icon URL / WKT / SVG well in the Icon Map visual (NB: for now you need to use the version of the visual from http://www.icon-map.com/ and not the version from AppSource for this to work). Here’s the result:

You can download the sample pbix file here.

There are far more practical things you can do with this: for example, Mimoune Djouallah has a great blog post here on how to use Well Known Text in Icon Map to display 1.2 million points on a map. Let me know if you have a real-world application for this by leaving a comment.

Returning Annotations From OData Sources In Power BI/Power Query

The Power Query OData.Feed function has an option called IncludeAnnotations that allows you to return annotation values from an OData data source. It’s not obvious how to use it though – even if you use this option when connecting, you won’t see the annotation values by default because they are returned as metadata. Here’s an example of how to get annotation values using some simple M code.

The following query gets data from the statuscode column of the bookableresourcecategories table in Dynamics CRM via OData:


let
    Source = OData.Feed(
        "https://xyz.crm.dynamics.com/api/data/v9.1",
        null,
        [Implementation = "2.0"]
    ),
    bookableresourcecategories_table = Source{[
        Name = "bookableresourcecategories",
        Signature = "table"
    ]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(
        bookableresourcecategories_table,
        {"statuscode"}
    )
in
    #"Removed Other Columns"


To get the option set labels associated with these values, you first of all need to edit the record in the third parameter of OData.Feed and use the IncludeAnnotations option to get the FormattedValue annotation like so:


OData.Feed(
    "https://xyz.crm.dynamics.com/api/data/v9.1",
    null,
    [
        Implementation = "2.0",
        IncludeAnnotations = "OData.Community.Display.V1.FormattedValue"
    ]
)

Then you need to add a custom column in the Power Query Editor that gets the metadata from each cell in the statuscode column using the Value.Metadata function:



Value.Metadata([statuscode])[OData.Community.Display.V1.FormattedValue]?

By the way, if you’re wondering what the question mark does in this expression, it stops an error occurring if there is no OData.Community.Display.V1.FormattedValue field in the metadata record; this post has more details.
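To illustrate what the ? does, here’s a tiny standalone sketch you can paste into a blank query:

let
    ExampleRecord = [statuscode = 1]
in
    // ExampleRecord[FormattedValue] would raise a "field not found" error,
    // but adding the ? makes the expression return null instead
    ExampleRecord[FormattedValue]?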

The full query looks like this:


let
    Source = OData.Feed(
        "https://xyz.crm.dynamics.com/api/data/v9.1",
        null,
        [
            Implementation = "2.0",
            IncludeAnnotations = "OData.Community.Display.V1.FormattedValue"
        ]
    ),
    bookableresourcecategories_table = Source{[
        Name = "bookableresourcecategories",
        Signature = "table"
    ]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(
        bookableresourcecategories_table,
        {"statuscode"}
    ),
    #"Added Custom" = Table.AddColumn(
        #"Removed Other Columns",
        "Custom",
        each Value.Metadata([statuscode])[OData.Community.Display.V1.FormattedValue]?
    )
in
    #"Added Custom"

…and here’s the output of the query showing the option set labels in a new column:


[Thanks again to Matt Masson for providing this information]

Adding Tooltips To Column Headers In The Power Query Editor In Power BI And Excel

Here’s something that will excite all you M nerds out there! Did you know you can make tooltips appear when you hover over column headers in a table in the Power Query Editor in either Power BI Desktop or Excel?

It’s all done with M metadata, and here’s an example query:


let
    source = #table(
        {"firstname", "lastname"},
        {{"matt", "masson"}}
    ),
    tableType =
        type table [firstname = Text.Type, lastname = Text.Type]
        meta [
            Documentation.FieldDescription = [
                firstname = "Given Name",
                lastname = "Family Name"
            ]
        ],
    replaceType = Value.ReplaceType(source, tableType)
in
    replaceType

Here’s what you see when your mouse hovers over the firstname column in the Power Query Editor:


…and here’s what you see when your mouse hovers over the lastname column:


How does this work? Here’s what each of the steps do:

  • The source step creates a simple table with two columns called firstname and lastname using #table (see here for more details on that).
  • The tableType step declares a new table type with two columns (the same two columns as in the table from the previous step) and then adds a metadata record to this type. In that record the Documentation.FieldDescription field contains the text values that will appear as tooltips when you hover over each column.
  • The replaceType step replaces the type of the table returned by source with the type declared in tableType.

The Power Query Editor UI then looks for a Documentation.FieldDescription field in any metadata associated with a table and displays the values in that field when you hover over the appropriate column.
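If you want to check that the metadata really has been attached, you can read it back from the table’s type – for example by adding a step like this sketch to the end of the query above:

// returns the record [firstname = "Given Name", lastname = "Family Name"]
checkDescriptions =
    Value.Metadata(Value.Type(replaceType))[Documentation.FieldDescription]?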

Note that if you add any steps to your query after this that change the table type (for example that add or remove columns), the metadata is removed ☹. That said I still feel like this might be a useful feature for anyone building a custom connector, for example.

[Thanks to Matt Masson for telling me how all this works]

Migration From Analysis Services Multidimensional – Your Feedback Needed!

Do you have Analysis Services Multidimensional cubes in production? Although I know it’s a long time since I last posted any Multidimensional/MDX content here I hope I still have some readers who do. If so, then you may be able to help me.

The reason I ask is that in my current job at Microsoft I’m working with some colleagues to investigate what it is that prevents people from migrating away from Analysis Services Multidimensional to Analysis Services Tabular, Azure Analysis Services, Power BI or indeed any other BI platform. Is it missing features? Is it organisational inertia? Cost? Is it the fact that your Multidimensional cubes still work well and there’s no point in migrating when you wouldn’t see much benefit? Something else? Has the idea of migration ever even crossed your mind?

In particular, what I need is:

  • Examples of Analysis Services Multidimensional cubes you have in production. All I want is the Visual Studio project or an XMLA script of the database, I do not need or want your data. Please leave a message for me here if you’re willing to do this and I’ll let you know where to send your cubes to.
  • Your thoughts on this subject – please leave a comment below. You know how I love a good argument discussion!

I already have plenty of ideas and theories regarding this topic, but what I need is hard evidence (hence the request for the cube definitions) and quotes from actual customers.

Last of all, don’t read too much into this: it’s a research project, nothing more. I can’t comment on, or make any promises about, the future of Multidimensional or new features that might be added to Analysis Services Tabular or Power BI.

Why Is Power BI Running My SQL Query Twice?

When you import data from a relational database like SQL Server in Power BI you have the option of entering your own SQL query to use as a starting point:


Here’s the M code for a query that does this:


let
    Source = Sql.Database(
        "localhost",
        "AdventureWorksDW2017",
        [Query = "SELECT [DateKey]#(lf) ,[FullDateAlternateKey]#(lf) ,[DayNumberOfWeek]#(lf) ,[EnglishDayNameOfWeek]#(lf) FROM [AdventureWorksDW2017].[dbo].[DimDate]"]
    )
in
    Source

If you’re confident writing SQL this might seem like a good option, but as I said in this blog post it has the side-effect of disabling query folding inside the Power Query query, so if you add any other transformations they will always be performed inside the Power Query engine – which may be less efficient than performing them in the data source.

There’s also another drawback: when you refresh your dataset in Power BI Desktop (although not in the Power BI Service) you’ll see that your SQL query is run twice. Here’s the evidence from SQL Server Profiler showing what happens when the query above is refreshed in Power BI Desktop:


If your query is slow, or if each query execution costs you money, then this is something you want to avoid.

Why is this happening? It turns out this is just another example of what I blogged about here: Power BI wants to know the schema of the table before the query actually runs, so it asks Power Query to return the top 0 rows. Unfortunately, in this case query folding can’t take place and the top 0 filter can’t be pushed back to the database, so the entire query gets run once to get the schema and once to get the data.

The solution is the same as the blog post I just mentioned too: use the Table.View M function to hard-code the schema returned by the query and implement query folding manually. Here’s the adapted version of the new query:


let
    Source = Sql.Database(
        "localhost",
        "AdventureWorksDW2017",
        [Query = "SELECT [DateKey]#(lf) ,[FullDateAlternateKey]#(lf) ,[DayNumberOfWeek]#(lf) ,[EnglishDayNameOfWeek]#(lf) FROM [AdventureWorksDW2017].[dbo].[DimDate]"]
    ),
    OverrideZeroRowFilter = Table.View(
        null,
        [
            GetType = () =>
                type table [
                    DateKey = Int32.Type,
                    FullDateAlternateKey = DateTime.Type,
                    DayNumberOfWeek = Byte.Type,
                    EnglishDayNameOfWeek = Text.Type
                ],
            GetRows = () => Source,
            OnTake = (count as number) =>
                if count = 0 then
                    #table(GetType(), {})
                else
                    Table.FirstN(Source, count)
        ]
    )
in
    OverrideZeroRowFilter


Generally speaking, I think there’s a lot to be said for creating views (if possible) instead of embedding your own SQL into a Power BI dataset – it makes maintenance and tuning much easier, and of course if you can connect straight to the view without writing any SQL in Power BI, then query folding will work and Power BI Desktop will only query the view once when you refresh.
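For example, if you created a view containing just the four columns used above – I’ll call it dbo.DimDateSubset, a name I’ve made up for illustration – the M needed would be as simple as this, and folding would work as normal:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2017"),
    // connect straight to the (hypothetical) view rather than embedding SQL
    DimDateSubset = Source{[Schema = "dbo", Item = "DimDateSubset"]}[Data]
in
    DimDateSubset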

Naming Tables, Columns And Measures In Power BI

I see a lot of Power BI datasets in the course of my work, and as a result I see a lot of datasets that look like this:


What’s wrong with this picture? Look at the names:

  • The tables and columns have the same names that they had in the data source, in this case a SQL Server database. Note the table name prefixes of “Dim” for dimensions and “Fact” for fact tables.
  • The column and measure names either don’t have spaces or use underscores instead of spaces.
  • What on earth does the measure name _PxSysF even mean?

Datasets like this seem to work perfectly well and are often built by professional BI developers but these names are a mess – and this can cause a lot of problems later on.

This is an issue I’ve been moaning about for years, but I wanted to blog about it again because it’s just as important today for Power BI as it was ten years ago for Analysis Services. My advice is to make naming a top priority when you’re building a dataset. If you have already published your dataset it will be difficult to change the names you’ve used – if you do so, you risk breaking reports and calculations that you and other people have built on it – so this is something that should be dealt with as early in the development process as possible.

In my opinion there are three things to consider when naming a table, column or measure:

  • You should use human-readable names rather than any kind of technical naming convention, with spaces where you would expect to have spaces and all vowels present. For example, that means having names like [Sales Amount] rather than [Sales_Amount] or [SlsAmt]; similarly, prefixes like “Dim” and “Fact” might make sense to you but won’t mean anything to your users. (A quick Power Query sketch of this kind of renaming follows this list.)
  • You should use the correct business terminology, the terminology that your users will know and understand, rather than just make up some names that seem appropriate. Your users might not understand what [Total Sales Value] is if the generally accepted term is [Net Sales Amount].
  • The names you use should be consistent across all datasets that contain the same data. That means that if you have a table called Sales in one dataset it should be called Sales in every other dataset that you build from the same data source, not Transactions, FactSales or something else.
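To illustrate the first point, renaming in Power Query is a one-line job – here’s a trivial sketch using made-up technical column names:

let
    // a hypothetical source with technical column names
    Source = #table({"SlsAmt", "TaxAmt"}, {{100, 5}}),
    #"Renamed Columns" = Table.RenameColumns(
        Source,
        {{"SlsAmt", "Sales Amount"}, {"TaxAmt", "Tax Amount"}}
    )
in
    #"Renamed Columns"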

This advice might be controversial to some people, especially those with a database background, but to me designing a dataset is more like designing a user interface than designing a database. Indeed, the consequences of a dataset with no thought put into naming are similar to the consequences of a poor user interface:

  • If your end users don’t understand what your report is trying to show, what’s the point of even building a report?
  • Reuse of datasets is a good thing and ideally any dataset you build should be easy for other people to build reports from. If those other people don’t understand what the names of your tables, columns and measures mean they won’t be able to build new Power BI reports from your dataset using Live connections or use Analyze in Excel.
  • It’s not just other people who are building reports from your dataset that you need to think about. If you want to use the new visual personalisation feature or Q&A then you’ll only be able to if your end users can understand the names you’ve used.
  • Even if you’re the only person building datasets and reports in your organisation, you owe it to yourself to make your code as readable as possible and to save yourself the effort of having to rename columns and measures when you use them in a visual.

That’s enough ranting for now. Good naming is only one part of good data modelling but it’s something that’s too often neglected!

Video: Power BI Data Privacy Settings Deep Dive

Over the past year or so I’ve been delivering a presentation on the Power Query engine’s data privacy settings at various conferences and Power BI user groups, in an attempt to try to pull together all the knowledge I have on this complex topic. Luckily, when I presented this session at the London Power BI User Group recently, they recorded it and posted it on YouTube here:

If you’re struggling with data privacy errors like:

Formula.Firewall: Query ‘Query1’ (step ‘xyz’) is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.

or

Formula.Firewall: Query ‘Query1’ (step ‘xyz’) references other queries or steps, so it may not directly access a data source. Please rebuild this data combination.

…in either the Power Query Editor in Power BI or Power Query/Get&Transform in Excel, then I hope this video will help you understand why you’re getting these errors and what you can do to avoid them.

A lot of what I show in this session draws on other material, such as:

  • My five-part series of posts on the Power Query data privacy settings that starts here
  • My post on how credentials and data privacy settings are stored for dynamic data sources here
  • My post here on the performance overhead of applying data privacy checks
  • Ehren von Lehe’s detailed paper (available here) on how the engine partitions data sources while applying data privacy checks. One of the things I show in the video is that it’s now possible to see these partitions using Power Query Query Diagnostics (see here for some details – although I should probably devote a whole blog post to this in the future)

Lastly, one minor correction to something I said in the video: at the 44:32 mark I do a demo that shows how you can use M functions to avoid a Formula Firewall error. While this trick works in Power BI Desktop and Excel, it results in a dataset that can’t be refreshed in the Power BI Service unfortunately.

 

Optimising The Performance Of Power Query Merges In Power BI, Part 4: Table.Join And Other Join Algorithms

In the previous post in this series I showed how you can use the Table.Join function and the SortMerge algorithm to get much better performance for merge operations in Power Query – but only if your data is sorted. But what if your data isn’t sorted? Can those other join algorithms give you better performance? The answer is… you guessed it, it depends.

As a reminder, the seven join algorithms that can be used with Table.Join are:

  • JoinAlgorithm.Dynamic
  • JoinAlgorithm.LeftHash
  • JoinAlgorithm.LeftIndex
  • JoinAlgorithm.PairwiseHash
  • JoinAlgorithm.RightHash
  • JoinAlgorithm.RightIndex
  • JoinAlgorithm.SortMerge

The first thing to say is that if you don’t specify a join algorithm in the sixth parameter of Table.Join (it’s an optional parameter), Power Query will try to decide which algorithm to use based on some undocumented heuristics. The same thing also happens if you use JoinAlgorithm.Dynamic in the sixth parameter of Table.Join, or if you use the Table.NestedJoin function instead, which doesn’t allow you to explicitly specify an algorithm.

There are going to be some cases where you can get better performance by explicitly specifying a join algorithm instead of relying on JoinAlgorithm.Dynamic but you’ll have to do some thorough testing to prove it. From what I’ve seen there are lots of cases where explicitly setting the algorithm will result in worse performance, although there are enough cases where doing so results in better performance to make all that testing worthwhile.

For example, using the same CSV file that I’ve been using in my previous posts, I created two source queries called First and Second that only returned column A and the first 300,000 rows. Here’s the M code for First (the code for Second only differs in that it renames the only column to A2):


let
    Source = Csv.Document(
        File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
        [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    #"Removed Other Columns" = Table.SelectColumns(#"Promoted Headers", {"A"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Other Columns", {{"A", "A1"}}),
    #"Kept First Rows" = Table.FirstN(#"Renamed Columns", 300000)
in
    #"Kept First Rows"

Here’s a query that uses Table.Join and JoinAlgorithm.Dynamic to merge these two queries:


let
    Source = Table.Join(First, {"A1"}, Second, {"A2"}, JoinKind.Inner, JoinAlgorithm.Dynamic)
in
    Source

The average timings for this query on my PC were:

  • Progress Report End/25 Execute SQL – 2.0 seconds
  • Progress Report End/17 Read Data – 0.4 seconds

Changing this query to use JoinAlgorithm.LeftHash instead, like so:


let
    Source = Table.Join(First, {"A1"}, Second, {"A2"}, JoinKind.Inner, JoinAlgorithm.LeftHash)
in
    Source

…resulted in the following average timings:

  • Progress Report End/25 Execute SQL – 0.9 seconds
  • Progress Report End/17 Read Data – 0.6 seconds

An improvement of almost one second – but I’ve not included here all the other test results for algorithms that performed worse (I had to cancel the query that used JoinAlgorithm.LeftIndex because it was so slow). And just to be clear: I’m not saying that using JoinAlgorithm.LeftHash is always better than JoinAlgorithm.Dynamic, just that it happened to perform better in this case with these queries and this data. With different data and different queries then different algorithms may perform better. If you find other scenarios where specifying an algorithm improves performance then please let me know by leaving a comment.

Optimising The Performance Of Power Query Merges In Power BI, Part 3: Table.Join And SortMerge

In the last two posts in this series I showed how removing columns from the tables used in a Power Query merge operation can improve refresh performance. In this post I’ll show you a different trick that – when you can use it – can give you an equally significant performance boost.

When you merge data from two queries in the Power Query Editor the M code generated uses the Table.NestedJoin function. There is, however, another M function that can be used to merge data: Table.Join. The interesting thing about this function is that it has a parameter that Table.NestedJoin doesn’t have: the joinAlgorithm parameter allows you to specify the algorithm used by the Power Query engine for the merge. The documentation doesn’t tell you what the possible values for this parameter are, but #shared shows them:

JoinAlgorithm.Dynamic, JoinAlgorithm.LeftHash, JoinAlgorithm.LeftIndex, JoinAlgorithm.PairwiseHash, JoinAlgorithm.RightHash, JoinAlgorithm.RightIndex and JoinAlgorithm.SortMerge
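One way to get this list for yourself is to paste something like the following sketch into a blank query, which filters #shared down to the JoinAlgorithm values:

let
    // #shared is a record of all the values visible to the current query
    SharedValues = Record.ToTable(#shared),
    JoinAlgorithms = Table.SelectRows(
        SharedValues,
        each Text.StartsWith([Name], "JoinAlgorithm.")
    )
in
    JoinAlgorithms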

The SortMerge algorithm, last in the list above, is the focus of this blog post. I mentioned in my earlier posts that the reason that merge operations on non-foldable data sources are often slow is that both of the tables used in the merge need to be held in memory. There is an exception though: if you know that the data in the columns used to join the two tables is sorted in ascending order, you can use the Table.Join function and the SortMerge algorithm and the data from both sources can be streamed rather than held in memory, which in turn results in the merge being much faster.

Here’s an example. As before there are two source queries that take data from a one million row CSV file, but for this test no columns are removed and there’s no filter on rows. The two source queries, called First and Second, are almost identical (in First the columns are named A1 to G1 and in Second the columns are named A2 to G2) and are not loaded into the dataset. Here’s the M code for the query called First:


let
    Source = Csv.Document(
        File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
        [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    #"Changed Type" = Table.TransformColumnTypes(
        #"Promoted Headers",
        {
            {"A", Int64.Type},
            {"B", Int64.Type},
            {"C", Int64.Type},
            {"D", Int64.Type},
            {"E", Int64.Type},
            {"F", Int64.Type},
            {"G", Int64.Type}
        }
    ),
    #"Renamed Columns" = Table.RenameColumns(
        #"Changed Type",
        {{"A", "A1"}, {"B", "B1"}, {"C", "C1"}, {"D", "D1"}, {"E", "E1"}, {"F", "F1"}, {"G", "G1"}}
    )
in
    #"Renamed Columns"


Here’s a query that merges these two queries using Table.NestedJoin, returns all columns from the source queries and has its load enabled:


let
    Source = Table.NestedJoin(First, {"A1"}, Second, {"A2"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(
        Source,
        "Second",
        {"A2", "B2", "C2", "D2", "E2", "F2", "G2"},
        {"Second.A2", "Second.B2", "Second.C2", "Second.D2", "Second.E2", "Second.F2", "Second.G2"}
    )
in
    #"Expanded Second"

The timings for refreshing the merge query are:

  • Progress Report End/25 Execute SQL – 54 seconds
  • Progress Report End/17 Read Data – 58 seconds

[As I mentioned before, these timings may vary by a few seconds each way from run to run because paging is taking place]

Now it just so happens that in this case I know the data in the A columns in both the source queries is sorted in ascending order, so this means I can rewrite the merge query using Table.Join and the SortMerge algorithm like so:


let
    Source = Table.Join(
        First, {"A1"},
        Second, {"A2"},
        JoinKind.Inner, JoinAlgorithm.SortMerge
    )
in
    Source

Here are the timings for this version:

  • Progress Report End/25 Execute SQL – 0.1 seconds
  • Progress Report End/17 Read Data – 21 seconds

This new query is clearly much faster than the original version of the merge!

Now let’s talk about the limitations of this approach. First of all, if the data in the columns used to join the two tables together is not sorted in ascending order you won’t get an error message, you’ll just get incorrect data returned, so you really need to be sure that the data is indeed sorted. Secondly, if your data is not sorted, then you can sort it in Power Query before the merge – but since sorting itself takes time and sorting for non-foldable data sources is another one of those operations which requires the table to be held in memory, you’re unlikely to get any performance improvement.
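For completeness, here’s roughly what sorting before the merge might look like (a sketch based on the First and Second queries above) – although, for the reasons just mentioned, it’s unlikely to make things any faster for a non-foldable source like a CSV file:

let
    // sort both inputs on the join columns so SortMerge returns correct results
    SortedFirst = Table.Sort(First, {{"A1", Order.Ascending}}),
    SortedSecond = Table.Sort(Second, {{"A2", Order.Ascending}}),
    Merged = Table.Join(
        SortedFirst, {"A1"},
        SortedSecond, {"A2"},
        JoinKind.Inner, JoinAlgorithm.SortMerge
    )
in
    Merged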

That said, I can see that there are going to be plenty of scenarios where you can use this technique. For example, if you’re extracting data to a CSV file it may be possible to specify that you want to sort the data when the extract takes place. Also, if you’re trying to merge data from two different data sources (say, SQL Server and Oracle) that both support query folding for sorting, then the time it takes to sort the data may be less than the performance gain from using the SortMerge algorithm.

 

 
