Adding Tooltips To Column Headers In The Power Query Editor In Power BI And Excel

Here’s something that will excite all you M nerds out there! Did you know you can make tooltips appear when you hover over column headers in a table in the Power Query Editor in either Power BI Desktop or Excel?

It’s all done with M metadata, and here’s an example query:


let
    source =
        #table({"firstname", "lastname"}, {{"matt", "masson"}}),
    tableType =
        type table [firstname = Text.Type, lastname = Text.Type]
        meta [
            Documentation.FieldDescription =
                [firstname = "Given Name", lastname = "Family Name"]
        ],
    replaceType = Value.ReplaceType(source, tableType)
in
    replaceType

Here’s what you see when your mouse hovers over the firstname column in the Power Query Editor:

Capture1

…and here’s what you see when your mouse hovers over the lastname column:

Capture2

How does this work? Here’s what each of the steps does:

  • The source step creates a simple table with two columns called firstname and lastname using #table (see here for more details on that).
  • The tableType step declares a new table type with two columns (the same two columns as in the table from the previous step) and then adds a metadata record to this type. In that record the Documentation.FieldDescription field contains the text values that will appear as tooltips when you hover over each column.
  • The replaceType step replaces the type of the table returned by source with the type declared in tableType.

The Power Query Editor UI then looks for a Documentation.FieldDescription field in any metadata associated with a table and displays the values in that field when you hover over the appropriate column.

Note that if you add any steps to your query after this that change the table type (for example, steps that add or remove columns), the metadata is removed ☹. That said, I still feel like this might be a useful feature for anyone building a custom connector, for example.
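
If you do need the tooltips to survive a step like that, one option – my own sketch rather than something from the original example – is to declare a new type that matches the changed table and re-apply it with Value.ReplaceType afterwards:

let
    source = #table({"firstname", "lastname"}, {{"matt", "masson"}}),
    tableType =
        type table [firstname = Text.Type, lastname = Text.Type]
        meta [
            Documentation.FieldDescription =
                [firstname = "Given Name", lastname = "Family Name"]
        ],
    replaceType = Value.ReplaceType(source, tableType),
    // this step changes the table's columns, which strips the metadata...
    removeLastName = Table.SelectColumns(replaceType, {"firstname"}),
    // ...so declare a type matching the new shape and re-apply it
    newType =
        type table [firstname = Text.Type]
        meta [
            Documentation.FieldDescription = [firstname = "Given Name"]
        ],
    reapplyType = Value.ReplaceType(removeLastName, newType)
in
    reapplyType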

[Thanks to Matt Masson for telling me how all this works]

 

 

Optimising The Performance Of Power Query Merges In Power BI, Part 3: Table.Join And SortMerge

In the last two posts in this series I showed how removing columns from the tables used in a Power Query merge operation can improve refresh performance. In this post I’ll show you a different trick that – when you can use it – can give you an equally significant performance boost.

When you merge data from two queries in the Power Query Editor the M code generated uses the Table.NestedJoin function. There is, however, another M function that can be used to merge data: Table.Join. The interesting thing about this function is that it has a parameter that Table.NestedJoin doesn’t have: the joinAlgorithm parameter allows you to specify the algorithm used by the Power Query engine for the merge. The documentation doesn’t tell you what the possible values for this parameter are, but #shared shows them:

Algs
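
If you want to list these values yourself rather than rely on the screenshot, here’s one way to do it – a quick sketch of my own – by filtering #shared down to the fields whose names start with “JoinAlgorithm”:

let
    // #shared is a record containing everything defined in the current
    // environment, so pick out just the JoinAlgorithm values
    AllNames = Record.FieldNames(#shared),
    JoinAlgorithmNames = List.Select(AllNames, each Text.StartsWith(_, "JoinAlgorithm.")),
    JoinAlgorithms = Record.SelectFields(#shared, JoinAlgorithmNames)
in
    JoinAlgorithms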

The SortMerge algorithm, last in the list above, is the focus of this blog post. I mentioned in my earlier posts that the reason that merge operations on non-foldable data sources are often slow is that both of the tables used in the merge need to be held in memory. There is an exception though: if you know that the data in the columns used to join the two tables is sorted in ascending order, you can use the Table.Join function and the SortMerge algorithm and the data from both sources can be streamed rather than held in memory, which in turn results in the merge being much faster.

Here’s an example. As before there are two source queries that take data from a one million row CSV file, but for this test no columns are removed and there’s no filter on rows. The two source queries, called First and Second, are almost identical (in First the columns are named A1 to G1 and in Second the columns are named A2 to G2) and are not loaded into the dataset. Here’s the M code for the query called First:


let
    Source = Csv.Document(
        File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
        [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    #"Changed Type" = Table.TransformColumnTypes(
        #"Promoted Headers",
        {
            {"A", Int64.Type},
            {"B", Int64.Type},
            {"C", Int64.Type},
            {"D", Int64.Type},
            {"E", Int64.Type},
            {"F", Int64.Type},
            {"G", Int64.Type}
        }
    ),
    #"Renamed Columns" = Table.RenameColumns(
        #"Changed Type",
        {{"A", "A1"}, {"B", "B1"}, {"C", "C1"}, {"D", "D1"}, {"E", "E1"}, {"F", "F1"}, {"G", "G1"}}
    )
in
    #"Renamed Columns"

First

Here’s a query that merges these queries using Table.NestedJoin, returns all columns from the source queries and, unlike them, is loaded into the dataset:


let
    Source = Table.NestedJoin(First, {"A1"}, Second, {"A2"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(
        Source,
        "Second",
        {"A2", "B2", "C2", "D2", "E2", "F2", "G2"},
        {"Second.A2", "Second.B2", "Second.C2", "Second.D2", "Second.E2", "Second.F2", "Second.G2"}
    )
in
    #"Expanded Second"

The timings for refreshing the merge query are:

  • Progress Report End/25 Execute SQL – 54 seconds
  • Progress Report End/17 Read Data – 58 seconds

[As I mentioned before, these timings may vary by a few seconds each way from run to run because paging is taking place]

Now it just so happens that in this case I know the data in the A columns in both the source queries is sorted in ascending order, so this means I can rewrite the merge query using Table.Join and the SortMerge algorithm like so:


let
    Source =
        Table.Join(
            First, {"A1"}, Second, {"A2"},
            JoinKind.Inner, JoinAlgorithm.SortMerge
        )
in
    Source

Here are the timings for this version:

  • Progress Report End/25 Execute SQL – 0.1 seconds
  • Progress Report End/17 Read Data – 21 seconds

This new query is clearly much faster than the original version of the merge!

Now let’s talk about the limitations of this approach. First of all, if the data in the columns used to join the two tables together is not sorted in ascending order you won’t get an error message, you’ll just get incorrect data returned, so you really need to be sure that the data is indeed sorted. Secondly, if your data is not sorted, then you can sort it in Power Query before the merge – but since sorting itself takes time and sorting for non-foldable data sources is another one of those operations which requires the table to be held in memory, you’re unlikely to get any performance improvement.
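
For completeness, here’s what sorting before the merge would look like – a sketch only, because as I’ve just said the in-memory sort on a non-foldable source is likely to cancel out any gain:

let
    // sort both sides on the join columns first; for non-foldable sources
    // this happens in memory, so it may negate the benefit of SortMerge
    SortedFirst = Table.Sort(First, {{"A1", Order.Ascending}}),
    SortedSecond = Table.Sort(Second, {{"A2", Order.Ascending}}),
    Merged = Table.Join(
        SortedFirst, {"A1"}, SortedSecond, {"A2"},
        JoinKind.Inner, JoinAlgorithm.SortMerge
    )
in
    Merged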

That said, I can see that there are going to be plenty of scenarios where you can use this technique. For example, if you’re extracting data to a CSV file it may be possible to specify that you want to sort the data when the extract takes place. Also, if you’re trying to merge data from two different data sources (say, SQL Server and Oracle) that both support query folding for sorting, then the time it takes to sort the data may be less than the performance gain from using the SortMerge algorithm.

 

 

Speed Up Data Refresh Performance In Power BI Desktop Using Table.View

It can sometimes be frustrating to work with slow data sources or complex Power Query queries in Power BI Desktop: you open the Power Query Editor, make some changes, click Close & Apply and then wait a loooong time for your data to refresh. In this post I’m going to show you a technique that can cut this wait by up to 50%. It involves some fairly complex M code but I promise you, the effort is worth it!

In fact, what I’m going to describe is more or less what I showed towards the end of my appearance on Guy In A Cube last year and in a few other posts, but at that time I didn’t understand properly why it worked or what the performance implications actually were. Now, thanks to a lot of help from Curt Hagenlocher of the Power Query development team I have all the details I need to blog about it.

Let’s see a simple example. Say you have a large JSON file – for this test I generated one that is 67MB containing random data using this handy online tool – and you want to load it into Power BI. You’ll end up with a query that looks something like this:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  )
in
  #"Changed Type"

Here’s what the output of this query looks like:

Output

On my laptop this query takes around 7-8 seconds to load. Using Process Monitor to see how much data is being read from the file (see this post for more details) shows that Power Query is reading the data from this file twice. This graph, generated using data from Process Monitor, has time on the x axis and amount of data read from the file on the y axis, and the two peaks indicate that the data is being read twice:

FirstLoad

Why? When you refresh a table in Power BI Desktop two things happen:

  • First, Power BI has to check what columns are present in the table. To do this, it runs the query for the table but asks the Power Query engine to filter the table so it returns zero rows.
  • Secondly the query is run again but this time with no filter, so all rows are returned, and this is when the data is actually loaded into the table in Power BI.
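
You can think of that first run as the engine evaluating something like the following against your query – my simplification, not literally what Power BI executes:

// the schema-check run: ask for the whole query but with zero rows
Table.FirstN(#"Changed Type", 0)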

It’s on this first run of the query where problems can occur. If you’re using a data source like SQL Server and you’re not doing any complex transformations then query folding will take place, so Power BI’s request to return the table but with no rows can be handled very efficiently. However, if you’re using a data source where query folding is not possible (such as Excel, CSV or JSON files) or you have complex transformations that stop folding taking place then the only way Power Query can work out what columns the query returns is by running the entire query. That’s what is happening in the example above.

This only happens in Power BI Desktop, though. When you refresh a dataset in the Power BI Service the query is only run once.

The solution is to trick the Power Query engine into running these zero-row filter queries instantly, and you can do that using the Table.View M function. I blogged about Table.View here (and I strongly suggest you read that post before carrying on) but that example didn’t quite get to the solution you need here. Here’s a new version of the query above with an extra step:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  ),
  OverrideZeroRowFilter = Table.View(
    null,
    [
      GetType = () =>
        type table [
          _id = Text.Type,
          index = Int64.Type,
          guid = Text.Type,
          isActive = Text.Type,
          balance = Text.Type
        ],
      GetRows = () =>
        #"Changed Type",
      OnTake = (count as number) =>
        if count = 0 then
          #table(
            type table [
              _id = Text.Type,
              index = Int64.Type,
              guid = Text.Type,
              isActive = Text.Type,
              balance = Text.Type
            ],
            {}
          )
        else
          Table.FirstN(#"Changed Type", count)
    ]
  )
in
  OverrideZeroRowFilter

The OverrideZeroRowFilter step is where the magic happens. It uses Table.View to override query folding behaviour by intercepting what happens when the table returned by the query is filtered. There are three fields in the record in the second parameter of Table.View that you need to change:

  • GetType returns a table type that describes the columns and their data types present in the output of the query. The five columns listed here are the five columns you can see in the screenshot of the query output above. It’s very easy to generate the required M code when you use the custom function that I blogged about here.
  • GetRows returns all the rows that the query can return, that’s to say the table returned by the #”Changed Type” step.
  • OnTake is used when a top n filter is applied to the table returned by the query.  In this case the code looks at the number of rows that are being requested (specified in the count parameter) and if that’s zero, it just returns an empty table with the same columns as the #”Changed Type” step; if it’s more than zero then it uses Table.FirstN to get the actual number of rows requested from #”Changed Type”. This means that when Power BI does that zero-row filter it can now happen immediately because there’s no need to go back to the data source or execute any of the transformations in the query.
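
If you find yourself doing this for several queries you could wrap the pattern up in a reusable function. This is my own generalisation rather than something from the technique above, so treat it as a sketch:

// Given any table and a table type describing its columns, return a view
// over that table that can answer the zero-row schema check instantly.
(InputTable as table, InputType as type) as table =>
    Table.View(
        null,
        [
            GetType = () => InputType,
            GetRows = () => InputTable,
            OnTake = (count as number) =>
                if count = 0 then
                    #table(InputType, {})
                else
                    Table.FirstN(InputTable, count)
        ]
    )

You would then call this function in the final step of a query, passing in the output of the last transformation step and the hard-coded table type.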

This version of the query now runs in 4-5 seconds, and Process Monitor now shows that the JSON file is only read once and, obviously, reading the data once is a lot faster than reading it twice:

SecondLoad

Look at how the graph flattens at the end… but that’s something for another blog post.

There is a downside to this approach: you have to hard-code the table schema that you expect your query to return, and if you change your query to return different columns you’ll have to update the table type in the last step.

I’ve used this technique on other slow queries and it has made a massive difference to the development experience in Power BI Desktop – one query that was taking five minutes to refresh when I closed the Power Query Editor went down to two and a half minutes. If you try this out yourself please let me know if it makes a difference by leaving a comment.

[Don’t forget that there are other things you can do that will also speed up the development experience for slow queries: you should definitely turn off the “Allow data preview to download in the background” option, and you might want to consider turning off data privacy checks so long as you fully understand what the implications are]

 

Power Query Formatter

If you’ve ever used DAX Formatter to format your DAX code and wondered why there isn’t an equivalent for Power Query/M, then wonder no more: the nice people at Power Pivot Insights in Germany (read their German-language blog here) have built one. You can find it here:

https://powerqueryformatter.com/

PQ Formatter

They also have an API you can use.

Bonus information: did you know that Microsoft also has an open source parser for M, available here?

Understanding The “The key didn’t match any rows in the table” Error In Power Query In Power BI Or Excel

One of the most common errors you’ll see when working with Power Query in Power BI or Excel is this:

Expression.Error: The key didn’t match any rows in the table

KeyErrorScreenshot

It can occur with almost any data source and in a wide variety of different circumstances, and for new users of Power Query it can be very confusing. In this post I’ll explain what the error message means and when you’re likely to encounter it using a simple example.

TL;DR You’re probably getting this error because your Power Query query is trying to connect to a table or worksheet or something else in your data source that has been deleted or renamed.

Say you have an Excel workbook with three worksheets in it called Sheet1, Sheet2 and Sheet3:

WorksheetsBefore

On Sheet1 there’s some data you want to load into Power BI or Excel using Power Query:

Data

It’s very easy to do this: after you click on the Excel data source and select your Excel workbook, you see the Navigator window showing the contents of the workbook, where you select Sheet1:

Nav

…click OK and your query is ready:

PQQuery

What actually happens in this query though? If you click on each of the four steps in the Applied Steps pane on the right-hand side of the screen, you’ll see what each step does. In particular, notice that the first step in the query (called Source) returns a table with one row for each worksheet, table and named range in the workbook:

SourceStep

This is something that happens with many types of data source: after you have connected you have to choose what data inside the data source you want to connect to in the first step, and the list of things you can connect to is returned in a table. This table lists the same things displayed in the Navigator dialog shown above. In the same way, if you connect to a SQL Server database this first step returns a table with all the tables and views in the database; if you connect to an OData feed the first step shows all the resources available through the feed.

The other thing to notice in the table in the screenshot is the Data column, which contains nested table values. If you were to click on one of these nested tables you would navigate to the data in the worksheet named on that row and that is in fact what the second step in the query does.

Here’s the M code for the whole query:


let
    Source =
        Excel.Workbook(
            File.Contents("C:\MyExcelFile.xlsx"),
            null,
            true
        ),
    Sheet1_Sheet =
        Source{[Item="Sheet1",Kind="Sheet"]}[Data],
    #"Promoted Headers" =
        Table.PromoteHeaders(
            Sheet1_Sheet,
            [PromoteAllScalars=true]
        ),
    #"Changed Type" =
        Table.TransformColumnTypes(
            #"Promoted Headers",
            {{"Month", type text},
             {"Sales", Int64.Type}}
        )
in
    #"Changed Type"

If you want to understand how the code in the second step of this query gets the contents of the cell containing the nested table, I have a blog post here that goes into a lot more detail. Basically the second step (called Sheet1_Sheet in the code, but shown as Navigation in the Applied Steps pane) returns the nested table in the Data column from the row in the table where the Item column contains the value “Sheet1” and the Kind column contains the value “Sheet”. It’s the contents of the Item and Kind columns that are used to identify the row in the table that contains the data you want to see – in database terminology these columns are the key columns on the table.

What happens if you change the name of the worksheet that contains the data you want, from “Sheet1” to “Hello”?

WorksheetsAfter

The result is that the first step of the query now returns a table that looks like this:

AfterNav

Since there isn’t a row where the Item column contains “Sheet1” any more, the second step can no longer find the row it’s looking for – the key value it’s looking for no longer exists in the table – which is why you see the error message you do:

KeyErrorScreenshot

There are two ways to fix the error:

  1. Go to the Excel workbook and change the name of the sheet called “Hello” back to “Sheet1” or
  2. In either the Advanced Editor or the Formula Bar edit the M code and replace the reference to “Sheet1” with “Hello” in the second step of the query:
    FormulaBar
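
For example, after making that edit the second step of the query would read:

    Sheet1_Sheet =
        Source{[Item="Hello",Kind="Sheet"]}[Data],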

In summary, when you’re working with data sources in Power Query you need to be aware that if something in your data source is renamed or deleted then your query will break. You can of course write some clever M code to deal with situations like this but my recommendation is to try to fix the problem in the data source and not in your query.
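
For what it’s worth, here’s one example of the kind of “clever M code” I mean – a sketch of my own, with the obvious downside that it silently picks up whichever worksheet happens to come first rather than a specific one:

let
    Source = Excel.Workbook(File.Contents("C:\MyExcelFile.xlsx"), null, true),
    // keep only the worksheets, ignoring tables and named ranges...
    OnlySheets = Table.SelectRows(Source, each [Kind] = "Sheet"),
    // ...and take the data from the first one, whatever it is called
    FirstSheetData = OnlySheets{0}[Data],
    #"Promoted Headers" = Table.PromoteHeaders(FirstSheetData, [PromoteAllScalars = true])
in
    #"Promoted Headers"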

An M Function To Help You Explore Power Query Diagnostics Data

Last week’s post showed an M function that took Power Query diagnostics data and formatted it in a way that made it suitable for visualisation in a Power BI Decomposition Tree visual. This is great for understanding what’s going on at a high level, but by doing this you also lose a lot of detailed information from the diagnostics logs that could be useful for performance tuning. This week I have an M function for you that takes a different approach to solving the same problem that might be more appealing for the hardcore Power Query fans out there.

Here’s the code for the function:

As before you need to pass a query created by the Power Query diagnostics functionality to the function:

InvokeFunction

[Note: I recommend calling the function on the “detailed” diagnostics query and not the summarised view, because I’ve found what might be a bug with the summarised view]

The function will then return a table with one row for each query that is evaluated while the diagnostics were running:

QueryList

[Note: You may find that some query evaluations have nulls instead of the name of the query; that’s an issue with the diagnostics data that will be addressed soon]

You could get the same thing simply by filtering the raw diagnostics data down to the rows where the Path column contains 0, but I’ve added two extra things that make the data much easier to understand.
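
For reference, that simple filter would be something along these lines – the query name here is hypothetical and the Path column may need converting to text depending on how the diagnostics data comes through:

let
    // Diagnostics_Detailed stands in for whatever your diagnostics query is called
    Source = Diagnostics_Detailed,
    // keep only the top-level operations, i.e. the query evaluations themselves
    RootOperations = Table.SelectRows(Source, each Text.From([Path]) = "0")
in
    RootOperations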

First of all, the original data contains columns called “Exclusive Duration” and “Exclusive Duration (%)”, which tell you the amount of time each operation took in seconds and what percentage this was of the entire query evaluation. However, the data itself is in a parent/child structure so while a parent operation may have a very small duration it may have many slow descendant operations. Therefore my function adds two new columns called “Exclusive Duration (Including Child Operations)” and “Exclusive Duration (%)(Including Child Operations)”. The first of these new columns returns the total duration of each operation and all its child operations, while the second returns the percentage that the duration of each operation and all its child operations make up of the whole query duration. At the root level, this means the first of these new columns shows the total duration of each query in seconds and the second of these new columns should always show 100%.

Durations

Secondly, the function adds a new column called Child Rows that contains a nested table containing the child operations associated with each operation:

ChildRows

You can either click on the Table link in this column to navigate to a new table containing all the child operations, or use the Expand/Aggregate button in the top-right hand corner to keep the current operations and join the nested rows onto them (I strongly recommend you do the former). Experienced Power Query developers will be familiar with the concept of nested tables and I think this provides an easy way of navigating the parent/child operation structure that is similar to what you see with a Decomposition Tree but keeps all the detail columns that the diagnostics data provides you with and means you don’t have to leave the Power Query Editor.

I would love to hear your feedback about this and know if there are any bugs or other information that could be added. You can download a sample pbix file for this function here.

 

Why Does Power BI Query My Data Source More Than Once?

This week I was honoured to be a guest on the Guy In A Cube channel, and for my topic I decided to tackle one of life’s eternal questions: why does Power BI query my data source more than once when I refresh my dataset?

You can watch the video here:

Although I’ve never answered this question directly in a blog post before, almost every technique I showed in the video is something I’ve blogged about, so I thought it would be useful to collect all the links to these posts in one place to provide some background to what I show in the video.

The first thing to say is that this is really a Power Query question, not just a Power BI question; a lot of what I show will therefore be relevant to Excel’s Get&Transform, Power BI dataflows and every other manifestation of Power Query out there, although I can’t guarantee that all the options and behaviour will be the same in these places.

If you’re developing in Power BI Desktop and you think that refresh is taking a long time, you should definitely check whether the Power Query engine is hitting your data source more than once. There are lots of ways to do this. Some data sources have tools that show when they are queried, such as the Run History screen in Microsoft Flow that I show in the video or SQL Server Profiler. Other ways include using Fiddler for web services or Process Monitor for files.

Next, if you find that Power Query is indeed querying your data source multiple times, the most common explanation is that it’s doing a “background refresh”. This happens by default when you refresh a dataset and is Power Query refreshing all the data previews that you see when you click on a step in the Power Query Editor window. In most cases this is so quick you don’t notice it but when you have lots of queries with lots of steps it can be a gigantic problem. You can turn it off in the Options dialog in Power BI Desktop, as shown in the video, or in the Excel Query Options dialog. I blogged about this issue here.

Another possible reason is the data privacy settings you have applied. This is a gigantic, complex topic and something that I and other people have blogged about many times. I wrote a multi-part series of posts explaining data privacy settings starting here and there’s also a great explanation of what’s going on in behind the scenes written by the dev team that I link to here; I also wrote a post here showing an example of the performance impact of data privacy checks. As I say in the post, however, do not play around with these settings unless you really know what you’re doing.

If you’re ok with writing some M code, using the technique I blogged about here to implement basic query folding on a web data source can be important – in the example in the video, it was doing this that reduced the number of calls to the web service from six to three.

Some transformations can also lead to data being read from a data source more than once. For example, in this post I showed how a merge transformation leads to multiple reads from the same Excel file and how to stop this happening and drastically improve performance.

Moving on, another important lesson to learn is how referenced queries are evaluated. This is one of the most counter-intuitive things about Power Query! With the set of referenced queries used in my demo:

…where three queries called Referenced Query 1, Referenced Query 2 and Referenced Query 3 each reference a query called Call Web Service and are loaded into a dataset (and where Call Web Service is not loaded into the dataset), most people would assume that when a refresh takes place the following happens:

  1. The Call Web Service query is run, getting the data from the web service
  2. The data returned is then passed to Referenced Query 1, Referenced Query 2 and Referenced Query 3

This is wrong. In fact what happens is this:

  1. Referenced Query 1 is refreshed. Because it references the Call Web Service query, that means the Call Web Service query is also evaluated.
  2. Referenced Query 2 is refreshed. Because it references the Call Web Service query, that means the Call Web Service query is also evaluated.
  3. Referenced Query 3 is refreshed. Because it references the Call Web Service query, that means the Call Web Service query is also evaluated.

This explains why the Call Web Service query is still evaluated three times even after the query folding technique mentioned above has been applied to it. You can read a detailed explanation of how referenced queries are evaluated here.

There is something we can do to help here though: the Power Query engine can cache the results returned from some types of data source, such as web services, but to take full advantage of this caching in this particular scenario you have to turn off the Enable Parallel Loading Of Tables option on the Options dialog as I show here. With that done, each of the three referenced queries runs one after the other, instead of in parallel. When the first of these queries runs it calls the Call Web Service query, which gets data from the web service; when the next two queries run, though, and they trigger two more evaluations of Call Web Service, the data for the call to the web service has been cached. This caching can be turned off using the IsRetry option in Web.Contents as Curt Hagenlocher explains in a comment here.
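
As a quick illustration of that last point, the option goes in the options record of Web.Contents – the URL here is just a placeholder:

// a single step within a query: with IsRetry = true the engine will not
// reuse a previously cached response for this request
Source = Web.Contents(
    "https://api.example.com/some/endpoint",  // placeholder URL
    [IsRetry = true]
)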

The final thing shown in the video is that the infamous Table.Buffer M function does not improve performance in this case and in fact makes performance worse. If you use Table.Buffer inside the Call Web Service query it will indeed buffer the result of the call to the web service into memory, but the data that is buffered cannot be shared between separate query executions – so the buffering happens three times, once for each time Call Web Service is evaluated by the three referenced queries. Again this is something discussed in this post. Of course there are plenty of other scenarios where Table.Buffer will help performance, but these will only be in cases where the same query requests data from the same data source more than once.

Phew, that was a lot. I don’t pretend to know everything about when and why Power Query gets data from a data source multiple times but this is the sum total of my knowledge right now. I hope it’s useful!

[Update: it’s just been pointed out to me, quite rightly, that the names of my queries are confusing. For example, “Referenced Query 1” is the query doing the referencing, not the query being referenced – which is Call Web Service. I haven’t changed the names because I want the blog post to remain consistent with the video, and I hope the diagrams and the context make everything clear]

Power BI And Case Sensitivity

Case sensitivity is one of the more confusing aspects of Power BI: while the Power Query engine is case sensitive, the main Power BI engine (that means datasets, relationships, DAX etc) is case insensitive. Imke Feldmann shows how to make Power Query case insensitive through custom M here; in this blog post I’m going to concentrate on what case insensitivity means in datasets and DAX and show a way to (kind of) work around it.

Starting with a simple example, let’s say you have a data source that contains data that looks like this:

image

This is what the data looks like in the Power Query Editor which is, as I said, case sensitive – which means that it sees "A" and "a" as different characters, so you see six distinct characters, one in each row of the table. This could be data from any data source; to make things easy I’ve used the following M query to generate this table:

let
    Source = 
    #table(
        type table 
        [MyTextColumn = text], 
        {
            {"A"},
            {"a"},
            {"B"},
            {"b"},
            {"C"},
            {"c"}
        })
in
    Source

When this data gets loaded into a Power BI dataset, however, you’ll see the following in the Data pane of the main Power BI window:

image

Because the main Power BI engine is case insensitive – so for example “a” and “A” are considered as the same character – when this data is loaded in, it only sees three distinct characters and you can’t be sure whether you’ll get the lower case or upper case character stored twice. This is just the way it works; while an instance of Analysis Services Tabular, which is basically the same engine that is found in Power BI, can be either case sensitive or case insensitive, this option isn’t available in Power BI. There is an idea to support case sensitivity as an option in Power BI here and while there are some valid scenarios where you need this, in my experience case sensitivity causes a lot of problems so I can see why it wasn’t a priority.

What’s more you’ll find that DAX calculations are case insensitive too. If you try the following DAX expression to create a calculated table:

Demo Table = 
DATATABLE(
    "MyTextColumn",
    STRING,
    {
        {"A"},
        {"a"}
    }
)

You’ll get this:

image

DAX measures behave in a similar way. Given the following table:

image

If you use the following measure in a table visual:

Demo Measure = 
IF(
    SELECTEDVALUE('Demo Table 2'[MyIntColumn])=1,
    "A",
    "a"
)

…you’ll see:

image

The only way you can work around this case insensitivity is to make text values that would otherwise look the same to the Power BI engine somehow different. One way of doing this would be to add some extra characters to your text. You might think adding some extra spaces would be the way to go; revisiting the first M query shown above, you could add a space to every lower case character in the table like so:

let
    Source = 
    #table(
        type table 
        [MyTextColumn = text], 
        {
            {"A"},
            {"a "},
            {"B"},
            {"b "},
            {"c "},
            {"C"}
        })
in
    Source

But this doesn’t work because another little-known behaviour of the Power BI engine is that all leading and trailing spaces are trimmed when text is loaded. Anyway, spaces may not be visible but they still take up… well space. A better option – and one that actually works – is to use the Unicode Zero-Width Space character and the M Character.FromNumber function instead like so:

let
    Source = 
    #table(
        type table 
        [MyTextColumn = text], 
        {
            {"A"},
            {"a" & Character.FromNumber(8203)},
            {"B"},
            {"b" & Character.FromNumber(8203)},
            {"c" & Character.FromNumber(8203)},
            {"C"}
        })
in
    Source

The great thing about this character is that although it’s there, it’s invisible and takes no space. When you load this last M query into Power BI you see the following table:

image

Let’s finish off with a more advanced example of how to use this rather excellent character. Say you have the following source data returned by an M query called SourceData:

image

Of course, when this table is loaded into Power BI, you see the following:

image

BUT, the following M query takes this table and for each row goes through the text in the OriginalText column and adds a zero-width space after each lower case character:

let
    Source = 
    SourceData,
    ToList = 
    Table.AddColumn(
        Source, 
        "Chars", 
        each Text.ToList([OriginalText])
        ),
    LowerCaseChars = 
    {"a".."z"},
    AddInvisibleChars = 
    Table.AddColumn(
        ToList, 
        "AddInvisibleChars", 
        each 
        List.Transform(
            [Chars], 
            each 
            if 
            List.Contains(LowerCaseChars, _) 
            then 
            _ & Character.FromNumber(8203) 
            else _
            )
            ),
    RecombineList = 
    Table.AddColumn(
        AddInvisibleChars, 
        "OutputText", 
        each 
        Text.Combine([AddInvisibleChars]), 
        type text
        ),
    RemovedOtherColumns = 
    Table.SelectColumns(
        RecombineList,
        {"OutputText"}
        )

in
    RemovedOtherColumns

When you load the output of this query into Power BI, because the zero-width spaces after each lower case character make Power BI see each piece of text as being different, it looks like case has been preserved even when it hasn’t:

image

You can download the Power BI file for this last example here.

[Thanks to Greg Galloway, Akshai Mirchandani and Jeffrey Wang for their help with this post]

Hierarchical Navigation For Azure Data Lake Storage Gen2 In Power BI/Power Query

While the documentation on how to import data from Azure Data Lake Gen2 Storage into Power BI is pretty detailed, the connector (which at the time of writing is in beta) that supports this functionality in the Power Query engine has some useful options that aren’t so obvious. If you look at the built-in documentation on the AzureStorage.DataLake M function in the Power Query Editor you’ll see there are a lot of options that aren’t in the documentation on the web yet:

image

These options are:

  • BlockSize: the number of bytes to read before waiting on the data consumer. The default value is 4MB.
  • RequestSize: the number of bytes to read in a single HTTP request to the server. The default value is 4MB.
  • ConcurrentRequests: The ConcurrentRequests option supports faster download of data by specifying the number of requests to be made in parallel, at the cost of memory utilization. The memory required is (ConcurrentRequest * RequestSize). The default value is 16.
  • HierarchicalNavigation: A logical (true/false) that controls whether the files are returned in a tree-like directory view or in a flat list. The default value is true.

All of these options deserve more detailed examination, but in this post I’m going to focus on the HierarchicalNavigation property.

Say you have the following set of files and folders in ADLSGen2: at the root level there’s a csv file called SimpleSales.csv and a folder called ParentFolder; inside ParentFolder there’s a folder called ChildFolder; and inside ChildFolder there’s another csv file called SimpleSales2.csv.

image

image

image

When you first connect in the Power Query Editor you’ll see a table that looks like this (there are some other columns but I’ve removed them to make the screenshot legible):

image

In this table there are two rows, one for each csv file, and a Folder Path column that shows where each file sits within the folder structure. Here’s the M code for this query:

let
    Source = 
    AzureStorage.DataLake("https://x.dfs.core.windows.net/powerbi")
in
    Source

If you alter this to use the HierarchicalNavigation option, like so:

let
    Source = 
    AzureStorage.DataLake(
        "https://x.dfs.core.windows.net/powerbi",
        [HierarchicalNavigation=true]
        )
in
    Source

…you’ll see a different table is returned by the query:

image

In this case the two rows show ParentFolder and SimpleSales.csv; if you click on the Table link in the first row of the Content column you can drill down to ChildFolder; if you click on the Table link for ChildFolder you’ll see SimpleSales2.csv:

image

If you have a large number of files and folders in ADLSGen2 this way of viewing them is likely to be much easier to work with, I think.
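
If you prefer to do the same drill-down in code rather than by clicking, the navigation steps end up looking something like this – a sketch on my part, and the exact key column used by the generated navigation steps may differ:

let
    Source = AzureStorage.DataLake(
        "https://x.dfs.core.windows.net/powerbi",
        [HierarchicalNavigation = true]
    ),
    // navigate into ParentFolder, then ChildFolder, via the nested tables
    // in the Content column
    ParentFolder = Source{[Name = "ParentFolder"]}[Content],
    ChildFolder = ParentFolder{[Name = "ChildFolder"]}[Content]
in
    ChildFolder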

Tables, Numbers, Immutability And Power Query Performance

Following on from my last two posts on Power Query performance, I thought I would write about another trick that can make a big difference to your query execution times. It’s actually something that Ehren from the dev team mentioned in a forums thread that I blogged about here but it deserves to be better known.

Here’s one of the almost-optimised queries from my previous posts:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    Column2 = 
    #"Sorted Rows"[Column2]{0}
in
    Column2

This query returns a number (the maximum value in the column called Column2 of the table loaded in from a csv file) and takes about 5 seconds to run. Now consider the following query, which is the same as the previous query but with only the last line different:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    Column2 = 
    #"Sorted Rows"[Column2]{0}
in
    Column2 + Column2 + Column2 + Column2

This query returns the number returned by the previous query added together four times. As you would expect – and hope – this query also takes about 5 seconds to run. Now consider this query:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        )
in
    #"Sorted Rows"[Column2]{0} +
    #"Sorted Rows"[Column2]{0} +
    #"Sorted Rows"[Column2]{0} +
    #"Sorted Rows"[Column2]{0}

Instead of taking the value #”Sorted Rows”[Column2]{0} and storing it in the variable Column2 then adding Column2 four times, I’m  adding the expression #”Sorted Rows”[Column2]{0} together four times. The query returns the same number as the previous query. However this query takes 20 seconds to run! Why?

As well as the brief remarks at the end of the post I mentioned earlier, more details can be found in section 3.6 of the Power Query Language Specification on Immutability:

Once a value has been calculated, it is immutable, meaning it can no longer be changed. This simplifies the model for evaluating an expression and makes it easier to reason about the result since it is not possible to change a value once it has been used to evaluate a subsequent part of the expression. For instance, a record field is only computed when needed. However, once computed, it remains fixed for the lifetime of the record. Even if the attempt to compute the field raised an error, that same error will be raised again on every attempt to access that record field.


An important exception to the immutable-once-calculated rule applies to list and table values. Both have streaming semantics. That is, repeated enumeration of the items in a list or the rows in a table can produce varying results. Streaming semantics enables the construction of M expressions that transform data sets that would not fit in memory at once.

In the second query above the variable Column2 returns a number and after that number has been assigned to Column2 it cannot be changed. However in the last line of the third query when I add #”Sorted Rows”[Column2]{0} together four times, #”Sorted Rows” is a variable that returns a table which means it gets evaluated four times during the addition.

In summary, if you’re getting a table or list from a data source and you want to use a value inside that table or list multiple times, store it in a variable in a let expression before you do so.
