A Function To Generate The M Code For A Table Type

This is going to sound obscure, and it is, but recently I’ve been using the #table() intrinsic function in M a lot – specifically the version that takes a table type as its first parameter (as I describe here). Because it’s a bit of a pain to write the M code for a table type by hand, I’ve written an M function that takes a table and returns the text of the M code needed to define that table’s type. Here it is:

(InputTable as table) as text =>
let
    //get the name, position and type of each column in the input table
    Source = 
        Table.Schema(InputTable),
    //sort the columns into their original order
    SortRows = 
        Table.Sort(
            Source,
            {{"Position", Order.Ascending}}),
    //keep only the column name and type name
    RemoveColumns = 
        Table.SelectColumns(
            SortRows,
            {"Name", "TypeName"}),
    //build a "Name = Type" string for each column,
    //quoting the column name where necessary
    AddCustom = 
        Table.AddColumn(
            RemoveColumns, 
            "TypeNames", 
            each 
            Expression.Identifier([Name]) & " = " & [TypeName]),
    //combine the strings into a single record-syntax field list
    Output = 
        "[" & Text.Combine(AddCustom[TypeNames], ", ") & "]"
in
    Output

Nothing complex here, but now that I’ve posted this I know that in the future I’ll be able to Google for it when I’m working onsite with a customer and need it!

To give you an idea of how it works, take the table that is returned by the following M expression, which calls the public TripPin OData web service:

OData.Feed(
    "https://services.odata.org/TripPinRESTierService/Airports", 
    null, 
    [Implementation="2.0"])

[Image: the table returned by this expression, with Name, IcaoCode, IataCode and Location columns]

Passing this table to the function above returns the following text, which lists the names of the columns in this table and their data types in record syntax, suitable for use in a table type with #table:

[Name = Text.Type, IcaoCode = Text.Type, 
IataCode = Text.Type, Location = Record.Type]
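
To use this with #table you just need to prefix the generated text with type table. Here’s a minimal sketch that builds a one-row table from it – note that the row values below are invented for illustration:

#table(
    type table [Name = Text.Type, IcaoCode = Text.Type, 
    IataCode = Text.Type, Location = Record.Type],
    {{"Heathrow Airport", "EGLL", "LHR", [City = "London"]}})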

Converting Decimal Numbers To Hexadecimal In Power Query M

This is a very short post! A lot of people have blogged about how to convert numbers between different bases in M (see for example Maxim Zelensky’s very elegant solution for converting from binary to decimal), but today I noticed there was a very easy way to convert a decimal number to hexadecimal using the Number.ToText() function: you just need to use “x” in the second parameter. For example:

Number.ToText(12, "x") //returns c
Number.ToText(123, "x") //returns 7b
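
The format strings here follow .NET conventions, so – assuming the usual .NET formatting behaviour – an uppercase “X” gives uppercase output and a digit after the letter pads the result with leading zeros; and since 0x7b is a valid M literal, Expression.Evaluate() offers one way back to decimal:

Number.ToText(123, "X")     //returns 7B
Number.ToText(123, "x4")    //returns 007b
Expression.Evaluate("0x7b") //returns 123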

I’m sure this will come in handy somewhere…

Invoking M Functions In Parallel Using List.ParallelInvoke()

I was looking at the list of M functions supported in custom connectors and not in Power BI Desktop (using the technique I blogged about here) in the latest version of the Power Query SDK when I came across an intriguing new function: List.ParallelInvoke(). It doesn’t seem to be documented anywhere, but I think I’ve worked out what it does and it’s very exciting!

Consider the following M function, declared in a custom connector:

SlowFunction = () as number =>
    Function.InvokeAfter(()=>1, #duration(0,0,0,5));

When you call it, it waits 5 seconds and returns the value 1. If you call it three times and sum up the results, as follows:

List.Sum({SlowFunction(), SlowFunction(), SlowFunction()})

…then after 15 seconds you get the value 3 back.

Now, consider the following expression:

List.Sum(
 List.ParallelInvoke(
  {SlowFunction, SlowFunction, SlowFunction}
 )
)

When this is evaluated in a custom connector, you get the value 3 back after 5 seconds – so it looks like List.ParallelInvoke() allows you to invoke a list of functions in parallel. There’s also an optional second parameter called concurrency, which seems to control the amount of parallelism. So, for example:

List.Sum(
 List.ParallelInvoke(
  {SlowFunction, SlowFunction, SlowFunction},
  2
 )
)

…returns after 10 seconds, suggesting that only two function calls at a time are invoked in parallel.
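
One thing to note is that List.ParallelInvoke() takes a list of functions, not a list of function calls, so if your functions take parameters you would presumably need to wrap each call in a zero-argument function. A sketch, using a hypothetical SlowFunction2 that takes a single parameter:

List.Sum(
 List.ParallelInvoke(
  {
   () => SlowFunction2(1),
   () => SlowFunction2(2),
   () => SlowFunction2(3)
  }
 )
)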

I can imagine all kinds of uses for this, for example making multiple parallel calls to data sources or doing expensive calculations in parallel. I wonder if it will ever be allowed to be used outside custom connectors?

UPDATE: see Curt Hagenlocher’s comment below for some important information about this function.

Using Html.Table() To Extract URLs From A Web Page In Power BI/Power Query M

Last year I blogged about how to use the Text.BetweenDelimiters() function to extract all the links from the href attributes in the source of a web page. The code was reasonably simple but there’s now an even easier way to solve the same problem using the new Html.Table() function. This function doesn’t seem to be documented online yet, but the built-in documentation available in the Query Editor is up-to-date:

[Image: the built-in documentation for Html.Table in the Query Editor]

Miguel Escobar also has a great post showing how to use it and the new Web.BrowserContents function here.

Here’s an example M query that extracts all the links that start with the letters “http” from my company homepage:

let
    Source = 
        Web.BrowserContents("https://www.crossjoin.co.uk/"),
    Links = 
        Html.Table(
            Source, 
            {{
                "Link", 
                "a[href^=""http""]", 
                each [Attributes][href]}})
in
    Links

[Image: the table of links returned by the query]

To explain what’s going on here:

  • Web.BrowserContents returns the text of the HTML DOM for the web page
  • In the second step Html.Table takes that text and searches for all <a> elements whose href attribute starts with the letters “http”, returning the value of the href attribute for each one. I found this CSS selector here.
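
As a variation, here’s a sketch that returns the text of each link alongside its URL by adding a second column specification – this assumes, as the example above does, that the record passed to each column’s function exposes a [TextContent] field as well as [Attributes]:

let
    Source = 
        Web.BrowserContents("https://www.crossjoin.co.uk/"),
    Links = 
        Html.Table(
            Source, 
            {
                {"Link", "a[href^=""http""]", each [Attributes][href]},
                {"Link Text", "a[href^=""http""]", each [TextContent]}
            })
in
    Links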

Using Process Monitor To Find Out How Much Data Power Query Reads From A File

This post is really just a quick follow-on from my post earlier this week on using Process Monitor to troubleshoot Power Query performance issues with file-based data sources, which I suggest you read before carrying on. I realised, after playing around with Process Monitor some more, that the ReadFile operation actually tells you how much data is being read from a file when a Power Query query is running. For example, here’s a sample of some of the ReadFile operations captured while running the unoptimised version of the query I talked about in my last post:

[Image: ReadFile operations shown in Process Monitor]

Since Process Monitor can export captured events to a CSV file, it’s pretty easy to load the events into Power BI, filter them down to only the ReadFile operations, parse the Detail column to extract the Offset values (which I’m sure you can work out how to do if you’re reading a post like this, but there’s a sketch below anyway), and then draw a graph showing how much data gets read from a file when a query is run. Here’s what the graph looks like for the unoptimised version of the query from my previous blog post, with relative time on the x axis and the amount of data read in bytes on the y axis:

[Image: graph with relative time on the x axis and bytes read from the file on the y axis]

In that post I noted that there were six reads of the file – and while that’s clear from the graph above, it’s also possible to see that the first read does not read the whole contents of the file while the next five do (the file is 149MB). So maybe I was right that there is one complete read of the file for each row in the output query? What is that first, partial read for, I wonder?
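
For reference, here’s a rough sketch of the Detail-column parsing mentioned above. It assumes the captured events have been exported to a (hypothetical) CSV file and that the Detail column for ReadFile operations contains text like “Offset: 12,345, Length: 65,536”:

let
    Source = 
        Csv.Document(
            File.Contents("C:\ProcMonExport.csv"), 
            [Delimiter = ",", Encoding = 65001]),
    PromotedHeaders = 
        Table.PromoteHeaders(Source),
    //keep only the ReadFile operations
    ReadFileOnly = 
        Table.SelectRows(
            PromotedHeaders, 
            each [Operation] = "ReadFile"),
    //extract the Offset value from the Detail column,
    //stripping out the thousands separators
    AddOffset = 
        Table.AddColumn(
            ReadFileOnly, 
            "Offset", 
            each Number.FromText(
                Text.Replace(
                    Text.BetweenDelimiters(
                        [Detail], "Offset: ", ", Length"), 
                    ",", 
                    "")))
in
    AddOffset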

More Details On Creating Tables In Power BI/Power Query M Code Using #table()

About two years ago I wrote a blog post describing how the #table M function can be used to generate tables, but in that post I only covered the functionality I used regularly – namely using #table with a list of column names or a table type in the first parameter. However there are two other variations on #table that I have used recently that I thought were worth pointing out.

For example, if you need to generate a table with a set number of columns but you don’t care what the columns are called, you can use an integer in the first parameter to get a table with that number of columns. The following expression returns a table with four columns of data type Any called Column1, Column2, Column3 and Column4, and no rows:

#table(4,{})

[Image: an empty table with four columns, Column1 to Column4]

Also, if you have a list of lists with an unknown number of items in them and you want to use each nested list for the row values in a table, you can use a null value in the first parameter of #table. The following expression returns a table with four columns like the one above, but with two rows of integer values:

#table(null, {{1,2,3,4},{2,3,4,5}})

[Image: a table with four columns and two rows of integer values]

Improving The Performance Of Aggregation After A Merge In Power BI And Excel Power Query/Get&Transform

A long time ago someone – probably from the Power Query dev team – told me that adding primary keys to M tables could improve the performance of certain transformations in Power BI and Excel Power Query/Get&Transform. Indeed, I mentioned this in chapter 5 of my book “Power Query for Power BI and Excel”, but it wasn’t until this week that I found a scenario where this was actually the case. In this post I’ll describe the scenario and try to draw some conclusions about when adding primary keys to tables might make a difference to performance.

Imagine you have two queries that return data from two different files. First of all there’s a query called #”Price Paid”, which reads data from a 140MB csv file containing some of my favourite demo data: all 836,933 property transactions in the year 2017 from the UK Land Registry Price Paid dataset. The query does nothing special, just reads all the data, renames some columns, removes a few others, and sets data types on columns. Here’s what the output looks like:

[Image: the output of the #”Price Paid” query]

Second, you have a query called #”Property Types” that reads data from a very small Excel file and returns the following table:

[Image: the #”Property Types” table]

As you can see, the Property Type column from the #”Price Paid” query contains single letter codes describing the type of property sold in each transaction; the Property Type column from #”Property Types” contains a distinct list of the same codes and acts as a dimension table. Again there’s nothing interesting going on in this query.

The problems start when you try to join data from these two queries using a Merge and then, for each row in #”Property Types”, show the sum of the Price Paid column from #”Price Paid”. The steps to do this are:

1. Click on the Merge Queries/Merge Queries as New button on the Home tab in the Power Query Editor:

[Image: the Merge Queries button on the Home tab]

2. In the Merge dialog, do a left outer join between #”Property Types” and #”Price Paid” on the Property Type column of each table:

[Image: the Merge dialog]

3. Back in the Power Query Editor window, click on the Expand/Aggregate button in the top right-hand corner of the column created by the previous step:

[Image: the Expand/Aggregate button]

4. Click on the Aggregate radio button and select Sum of Price Paid:

[Image: the Aggregate options with Sum of Price Paid selected]

The output of the query is this table:

[Image: the output table showing Sum of Price Paid for each property type]

Here’s the M code for this query, as generated by the UI:

let
    Source = 
        Table.NestedJoin(
            #"Property Types",
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
    #"Aggregated Price Paid" = 
        Table.AggregateTableColumn(
            Source, 
            "Price Paid", 
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"

It’s a very common thing to do, and in this case the query is extremely slow to run: on my laptop, when you refresh the preview in the Power Query Editor window in Power BI Desktop it takes 55 seconds. What’s more, in the bottom left-hand corner of the screen where it displays how much data is being read from the source file, it shows a value that is a lot more than the size of the source file. In fact it seems like the source file for #”Price Paid” is read once for each row in the #”Property Types” query: 140MB * 5 rows = 700MB.

[Image: the amount of data read from the file shown in Power BI Desktop]

Not good. But what if you specify that the Property Type column from the #”Property Types” query is the primary key of the table? After all it does contain unique values. Although it isn’t widely known, and it’s not shown in the UI, tables in M can have primary and foreign keys defined on them whatever data source you use (remember that in this case the data sources are csv and Excel). One way to do this is using Table.AddKey as follows:

let
    WithAddedKey = 
        Table.AddKey(
            #"Property Types", 
            {"Property Type"}, 
            true),
    Source = 
        Table.NestedJoin(
            WithAddedKey,
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
    #"Aggregated Price Paid" = 
        Table.AggregateTableColumn(
            Source, 
            "Price Paid", 
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"

And guess what? After making this change, the query only takes 12 seconds to run instead of 55 seconds! What’s more, the amount of data read from disk shown in the UI suggests that the source file for #”Price Paid” is only read once. But making this change involves writing M code, and not everyone is comfortable making changes in the Advanced Editor. The good news is that it’s possible to get the same performance benefit in another way without writing any M code.

As I mentioned in this blog post, using the Remove Duplicates transformation on a column has the side-effect of marking that column as a primary key. Therefore if you right-click on the Property Type column and select Remove Duplicates in between steps 2 and 3 above, before you click on the Expand/Aggregate button:

[Image: the Remove Duplicates option on the Property Type column]

…then you also get the performance benefit. Here’s the M code generated by the UI for this query:

let
    Source = 
        Table.NestedJoin(
            #"Property Types",
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
    #"Removed Duplicates" = 
        Table.Distinct(
            Source, 
            {"Property Type"}),
    #"Aggregated Price Paid" = 
        Table.AggregateTableColumn(
            #"Removed Duplicates", 
            "Price Paid", 
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"

It’s #”Removed Duplicates” that is the new step here, and it uses the Table.Distinct function.

All this suggests that if you are doing merge operations on queries that get data from large csv or Excel files and then aggregating that data, setting a primary key on one of the columns used in the join can make a massive difference to performance. However this is just one example, and I would be really interested to hear whether any of you can reproduce these findings with your own data and queries – if you can, please leave a comment. I suspect this could be a very important discovery for M query performance tuning!
