Hierarchical Navigation For Azure Data Lake Storage Gen2 In Power BI/Power Query

While the documentation on how to import data from Azure Data Lake Gen2 Storage into Power BI is pretty detailed, the connector that supports this functionality in the Power Query engine (which at the time of writing is in beta) has some useful options that aren't so obvious. If you look at the built-in documentation for the AzureStorage.DataLake M function in the Power Query Editor you'll see there are a lot of options that aren't in the documentation on the web yet:

[Screenshot: the built-in documentation for AzureStorage.DataLake in the Power Query Editor]

These options are:

  • BlockSize: the number of bytes to read before waiting on the data consumer. The default value is 4MB.
  • RequestSize: the number of bytes to read in a single HTTP request to the server. The default value is 4MB.
  • ConcurrentRequests: the number of requests to be made in parallel, which allows faster downloads at the cost of memory utilisation. The memory required is (ConcurrentRequests * RequestSize). The default value is 16.
  • HierarchicalNavigation: a logical (true/false) value that controls whether the files are returned in a tree-like directory view or in a flat list. The default value is false.

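To give you an idea of how they are used, here's a sketch that sets all four options at once – the values are purely illustrative, and the URL is the one used in the examples later in this post:

let
    Source = AzureStorage.DataLake(
        "https://x.dfs.core.windows.net/powerbi",
        [
            //8MB block and request sizes, specified in bytes
            BlockSize = 8 * 1024 * 1024,
            RequestSize = 8 * 1024 * 1024,
            //use eight parallel requests when downloading
            ConcurrentRequests = 8,
            //return files in a tree-like directory view
            HierarchicalNavigation = true
        ]
        )
in
    Source
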
All of these options deserve more detailed examination, but in this post I'm going to focus on the HierarchicalNavigation option.

Say you have the following set of files and folders in ADLSGen2: at the root level there’s a csv file called SimpleSales.csv and a folder called ParentFolder; inside ParentFolder there’s a folder called ChildFolder; and inside ChildFolder there’s another csv file called SimpleSales2.csv.

[Screenshots: at the root level, SimpleSales.csv and ParentFolder; inside ParentFolder, ChildFolder; inside ChildFolder, SimpleSales2.csv]

When you first connect in the Power Query Editor you'll see a table that looks like this (there are some other columns but I've removed them to make the screenshot legible):

[Screenshot: the flat list of files returned by default]

In this table there are two rows, one for each csv file, and a Folder Path column that shows where each file sits within the folder structure. Here’s the M code for this query:

let
    Source = 
    AzureStorage.DataLake("https://x.dfs.core.windows.net/powerbi")
in
    Source

If you alter this to use the HierarchicalNavigation option, like so:

let
    Source = 
    AzureStorage.DataLake(
        "https://x.dfs.core.windows.net/powerbi",
        [HierarchicalNavigation=true]
        )
in
    Source

…you’ll see a different table is returned by the query:

[Screenshot: the table returned with HierarchicalNavigation=true]

In this case the two rows show ParentFolder and SimpleSales.csv; if you click on the Table link in the first row of the Content column you can drill down to ChildFolder, and if you click on the Table link for ChildFolder you'll see SimpleSales2.csv:

[Screenshot: drilling down through ParentFolder and ChildFolder to SimpleSales2.csv]
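
As an aside, if you want to reference a file deep inside the folder structure directly in code, the clicks above translate into something like the following sketch (it assumes, as the screenshots suggest, that the navigation tables have Name and Content columns, and it uses the folder and file names from this post):

let
    Source = AzureStorage.DataLake(
        "https://x.dfs.core.windows.net/powerbi",
        [HierarchicalNavigation=true]
        ),
    //each step selects a row by name and returns the nested table
    //in that row's Content column
    ParentFolder = Source{[Name="ParentFolder"]}[Content],
    ChildFolder = ParentFolder{[Name="ChildFolder"]}[Content],
    //for a file, the Content column contains the binary file itself
    SimpleSales2 = ChildFolder{[Name="SimpleSales2.csv"]}[Content]
in
    SimpleSales2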

If you have a large number of files and folders in ADLSGen2, I think this way of viewing them is likely to be much easier to work with.

Tables, Numbers, Immutability And Power Query Performance

Following on from my last two posts on Power Query performance, I thought I would write about another trick that can make a big difference to your query execution times. It's actually something that Ehren from the dev team mentioned in a forums thread that I blogged about here, but it deserves to be better known.

Here’s one of the almost-optimised queries from my previous posts:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    Column2 = 
    #"Sorted Rows"[Column2]{0}
in
    Column2

This query returns a number (the maximum value in the column called Column2 of the table loaded in from a csv file) and takes about 5 seconds to run. Now consider the following query, which is the same as the previous query but with only the last line different:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    Column2 = 
    #"Sorted Rows"[Column2]{0}
in
    Column2 + Column2 + Column2 + Column2

This query returns the number returned by the previous query, added together four times. As you would expect – and hope – this query also takes about 5 seconds to run. Now consider this query:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        )
in
    #"Sorted Rows"[Column2]{0} +
    #"Sorted Rows"[Column2]{0} +
    #"Sorted Rows"[Column2]{0} +
    #"Sorted Rows"[Column2]{0}

Instead of taking the value #"Sorted Rows"[Column2]{0} and storing it in the variable Column2, then adding Column2 four times, I'm adding the expression #"Sorted Rows"[Column2]{0} together four times. The query returns the same number as the previous query. However this query takes 20 seconds to run! Why?

As well as the brief remarks at the end of the post I mentioned earlier, more details can be found in section 3.6 of the Power Query Language Specification on Immutability:

Once a value has been calculated, it is immutable, meaning it can no longer be changed. This simplifies the model for evaluating an expression and makes it easier to reason about the result since it is not possible to change a value once it has been used to evaluate a subsequent part of the expression. For instance, a record field is only computed when needed. However, once computed, it remains fixed for the lifetime of the record. Even if the attempt to compute the field raised an error, that same error will be raised again on every attempt to access that record field.


An important exception to the immutable-once-calculated rule applies to list and table values. Both have streaming semantics. That is, repeated enumeration of the items in a list or the rows in a table can produce varying results. Streaming semantics enables the construction of M expressions that transform data sets that would not fit in memory at once.

In the second query above the variable Column2 returns a number, and once that number has been calculated it cannot change. However, in the last line of the third query, where I add #"Sorted Rows"[Column2]{0} together four times, #"Sorted Rows" is a variable that returns a table – and because tables have streaming semantics, that table is re-evaluated each of the four times it is referenced.

In summary, if you’re getting a table or list from a data source and you want to use a value inside that table or list multiple times, store it in a variable in a let expression before you do so.
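
To make that pattern explicit, here's a minimal sketch – the file and column names are hypothetical:

let
    Source = Csv.Document(File.Contents("C:\Data\SomeFile.csv")),
    Typed = Table.TransformColumnTypes(Source, {{"Column1", Int64.Type}}),
    //the sorted table has streaming semantics, so...
    Sorted = Table.Sort(Typed, {{"Column1", Order.Descending}}),
    //...evaluate it once here and store the scalar, immutable result
    MaxValue = Sorted[Column1]{0}
in
    //MaxValue is a number: referencing it repeatedly causes no re-evaluation
    MaxValue + MaxValue + MaxValue + MaxValue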

Power Query/M Optimisation: Getting The Maximum Value From A Column, Part 2

In part 1 of this series – which I strongly recommend you read before reading this post – I showed how removing columns from a table can make a dramatic improvement to the performance of certain transformations in Power Query. In this post I’ll show some tricks taught to me by Curt Hagenlocher of the dev team that can improve performance even more.

First of all, let me remind you of my original, totally unoptimised query which takes over a minute to run:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    Column2 = 
    #"Sorted Rows"{0}[Column2]
in
    Column2

In particular, I'd like to draw your attention to the last step, where the position and lookup operators (I blogged about them here) are used to return the value in the Column2 column of the first row of the sorted table:

#"Sorted Rows"{0}[Column2]

Breaking this expression down:

  • #"Sorted Rows" returns the sorted table from the previous step in the query
  • {0} returns the first row of that table as a record
  • [Column2] gives the value in the Column2 field of that record

It turns out that changing the order of {0} and [Column2] in this expression has a big impact on performance. So, using the following expression in the last step instead:

#"Sorted Rows"[Column2]{0}

…reduces query execution time to 5-7 seconds! I'm told this is because asking for the column first does the same thing internally as manually removing all the other columns in the table which, as I showed in my last post, has a dramatic impact on performance.

It's important to understand that this is not a general recommendation to request the column before the row in expressions like this, because in other scenarios requesting the row before the column might perform better. What you need to do is test expressions like this written both ways to see which gives you the best performance.

One piece of general advice that Curt did give me, though, was that most of the optimisations that the Power Query engine can use only happen on tables – not records or lists – so you should always try to work with tables whenever you can. For this particular query, using the Table.FirstN function to get the first row of the table like so:

Table.FirstN(#"Sorted Rows",1)[Column2]{0}

…allows for another internal optimisation to kick in, taking the query execution time down to around 2 seconds – the same performance as the original query in my previous post.
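Putting all of this together, here's the fully optimised version of the query – simply the snippets above assembled into one query:

let
    Source = Csv.Document(
        File.Contents("C:\Users\chwebb\Downloads\pp-2018.csv"),
        [Delimiter=",", Columns=16, Encoding=1252, QuoteStyle=QuoteStyle.None]
        ),
    #"Changed Type" = Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    //Table.FirstN keeps the expression in terms of tables, allowing
    //the extra internal optimisation to kick in
    Column2 = Table.FirstN(#"Sorted Rows", 1)[Column2]{0}
in
    Column2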

Adding More Aggregate Columns To The Output Of Table.Profile

A few years ago I blogged about the Table.Profile M function and how you could use it to create a table of descriptive statistics for your data:

https://blog.crossjoin.co.uk/2016/01/12/descriptive-statistics-in-power-bim-with-table-profile/

Since that post was written a new optional second parameter called additionalAggregates has been added to the function, which allows you to add your own custom columns containing aggregate values to the output of Table.Profile, so I thought I'd write a follow-up on how to use it.

Consider the following query:

let
    Source = 
    #table(
        type table[Month = text, Sales = number],
        {
            {"Jan",1},
            {"Feb",5},
            {"Mar",17},
            {"Apr",1}
        }
    ),
    Profile = 
    Table.Profile(
        Source
    )
in
    Profile

There are two steps here. Source returns the following table using the #table function:

[Screenshot: the table returned by the Source step]

The second step uses Table.Profile to return a table of summary statistics:

[Screenshot: the table of summary statistics returned by Table.Profile]

[Several columns returned aren’t shown in the screenshot above]

If you want some extra columns on this table you can now add them with the additionalAggregates parameter of Table.Profile, like so:

let
    Source = 
    #table(
        type table[Month = text, Sales = number],
        {
            {"Jan",1},
            {"Feb",5},
            {"Mar",17},
            {"Apr",1}
        }
    ),
    Profile = 
    Table.Profile(
        Source,
        {
        {"Median", each Type.Is(_, type number), List.Median},
        {"Mode", each Type.Is(_, type number), List.Mode}
        }
    )
in
    Profile

In this example I’ve added two extra columns called Median and Mode, which return the median and mode values of every numeric column:

[Screenshot: the output of Table.Profile with the Median and Mode columns added]

The additionalAggregates parameter takes a list, where each item is itself a list of three items:

  • The name of the new column to add.
  • A function that determines whether the aggregate will be calculated for a given column in the source table. In this case I'm checking to see whether a column is of type number, and only returning a median or mode if it is.
  • A function that calculates the aggregate value across the column: in this case List.Median and List.Mode. To ensure fast performance, it's a good idea to use an aggregate function here that does not require storing every value in the column in memory (ideally one that can be folded, if you're using a data source that supports query folding). See the sketch below for another example.
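
As another example, here's a sketch that uses the same three-item list pattern to add a hypothetical column showing the distinct count for every text column:

let
    Source = #table(
        type table[Month = text, Sales = number],
        {{"Jan",1},{"Feb",5},{"Mar",17},{"Apr",1}}
        ),
    Profile = Table.Profile(
        Source,
        {
        //name, type check and aggregate for the hypothetical extra column
        {"TextDistinctCount",
         each Type.Is(_, type text),
         each List.Count(List.Distinct(_))}
        }
        )
in
    Profile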

[Thanks to Curt Hagenlocher for the information here]

Detailed Article On The Power Query Formula Firewall And Data Privacy Settings

While writing my Microsoft Business Applications Summit session on Power Query-related reasons for Power BI dataset refresh failures (which you can watch online here for free, along with all the other sessions), I remembered something important that I have been meaning to blog about for a while: an excellent article by Ehren von Lehe of the Power Query dev team that contains the first fully detailed explanation of how the Power Query Formula Firewall and data privacy settings actually work. Called "Behind the scenes of the Data Privacy Firewall", it can be read on the Power Query MSDN forum here:

https://social.technet.microsoft.com/Forums/en-US/ca434e2d-88fe-4962-b46a-a1db51e8bd89/feedback-wanted-behind-the-scenes-of-the-data-privacy-firewall?forum=powerquery

It’s a must-read for any Power Query fan, and it answers a number of the questions I raised in my series of blog posts on the data privacy settings here.

Removing HTML Tags From Text In Power Query/Power BI

Recently I found I needed to remove all the HTML tags from some text in Power Query. I searched and found a great – if complex – function from Marcel Beug here, but I realised that since that post was written the Html.Table M function has been added to the language, and it makes this task very easy. In fact it's basically the same as the solution I blogged about here for extracting URLs from a web page.

Here’s an example of how to do it:

let
    Source = 
        "<!DOCTYPE html>
        <html>
        <body>
        <h1>A Heading</h1>
        <p>A paragraph</p>
        </body>
        </html>",
    RemoveTags = Html.Table(Source, {{"text", ":root"}}),
    GetText = RemoveTags{0}[text]
in
    GetText

Here’s the output of this query:

[Screenshot: the output of the query – the text with all the HTML tags removed]

The important part is the call to Html.Table using the ":root" CSS selector:

Html.Table(Source, {{"text",":root"}})

This returns a table with one column and one row containing the text; the GetText step in the query above then gets the contents of the only cell in that table.
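
If you need to strip tags from a whole column of HTML values rather than from a single piece of text, you can wrap the same logic in a function and apply it with Table.TransformColumns. Here's a sketch where the table and column names are hypothetical:

let
    //function that strips all HTML tags from a single text value
    RemoveHtmlTags = (html as text) as text =>
        Html.Table(html, {{"text", ":root"}}){0}[text],
    Source = #table(
        type table[SomeHtml = text],
        {
            {"<p>A paragraph</p>"},
            {"<h1>A Heading</h1>"}
        }
        ),
    Cleaned = Table.TransformColumns(
        Source,
        {{"SomeHtml", RemoveHtmlTags, type text}}
        )
in
    Cleaned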

Power Query Performance And Expanding Columns

As a footnote to my previous post on storing large images in Power BI, I thought all you M nerds out there would be interested to read about a strange performance quirk I noticed while writing the code for that post.

My original attempt to write an M query to convert a folder full of images to text looked something like this:

let
    Source = 
        Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
    SplitText = 
        (ImageBinaryText as text) => 
        let
            SplitTextFunction = 
                Splitter.SplitTextByRepeatedLengths(30000),
            SplitUpText = 
                SplitTextFunction(ImageBinaryText)
        in
            SplitUpText,
    #"Added Custom" = 
        Table.AddColumn(
            Source, 
            "Pic1", 
            each SplitText(
                Binary.ToText(
                    [Content], 
                    BinaryEncoding.Base64
                    )
                )
            ),
    #"Expanded Pic" = 
        Table.ExpandListColumn(
            #"Added Custom", 
            "Pic1"
            )
in
    #"Expanded Pic"

The approach I took was the one that seemed natural to me at the time:

  1. Use the Folder data source to connect to the folder containing the image files
  2. Define a function called SplitText that takes a long piece of text and splits it up into a list of text values no longer than 30000 characters
  3. Call the function once per row on the table returned by step (1)
  4. Use the Expand/Aggregate button to expand the new column created by step (3) and get a table with one row for each of the split-up text values

When I ran this query, though, I caught sight of something that is every Power Query developer’s worst nightmare:

[Screenshot: Power Query reporting 118MB read from a 1.6MB file]

Power Query had read 118MB of data from a file that is only 1.6MB: the old problem of multiple file reads. Using Process Monitor (as I describe here) confirmed it. I suspect the query was reading the whole file once for each of the split sections of text the function returned, although I admit I didn't confirm this.

I can't say I knew what I was doing, but I rewrote the query from scratch and came up with the code I gave in the blog post, which only reads from each file once (and without using any buffering, I should point out). What's the difference? I guess it must be the pattern of calling the function once per row in a table and then expanding using Table.ExpandListColumn that was to blame; I tried returning a table from the function instead of a list and the same thing happened. Maybe this is a pattern we should avoid? More research is needed and, as always, I would be interested to hear about your experiences – it is, after all, a fairly common pattern.

Storing Large Images In Power BI Datasets

Jason Thomas and Gerhard Brueckl have both blogged on the subject of storing images as text inside a Power BI dataset:

http://sqljason.com/2018/01/embedding-images-in-power-bi-using-base64.html

https://blog.gbrueckl.at/2018/01/storing-images-powerbi-analysis-services-data-models/

Since they wrote those posts, however, Power BI has added the ability to set the Data Category property on measures as well as columns in tables. This means it is now possible to have the output of a DAX measure displayed as an image in a Power BI report, and this in turn opens up a lot of new possibilities – including the ability to work around the maximum length of a text value that can be loaded into Power BI (see my previous blog post for more details) and therefore to work with larger images.

Here’s a rather lovely picture of a rose:

[Image: a photo of a rose]

The original is about 2.1MB; I have a folder on my PC where different versions of this picture, saved at different percentages of the original size, are stored:

[Screenshot: the folder of image files saved at different sizes]

Using the technique that Gerhard wrote about, where a picture is stored as text in a single cell of a Power BI dataset and then displayed (in this case with the Image by CloudScope custom visual), some truncation of the image occurs even with the smallest files because of the 32766 character limit on the length of a text value that can be loaded into Power BI. Here's what you see when you display the version of the picture that is 20% of the original size, a file of only 113KB:

[Screenshot: the truncated image displayed by the Image by CloudScope visual]

To work around this, you need to split the text representation of the image up into multiple smaller text values stored across multiple rows, each of which is less than the 32766 character limit, and then reassemble them in a DAX measure after the data has been loaded.

Splitting the text up in M is actually not that hard, but it is hard to do efficiently. Here’s an example of an M query that reads all the data from all of the files in the folder above and returns a table:

let
    //Get list of files in folder
    Source = Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
    //Remove unnecessary columns
    RemoveOtherColumns = Table.SelectColumns(Source,{"Content", "Name"}),
    //Creates Splitter function
    SplitTextFunction = Splitter.SplitTextByRepeatedLengths(30000),
    //Converts table of files to list
    ListInput = Table.ToRows(RemoveOtherColumns),
    //Function to convert binary of photo to multiple
    //text values
    ConvertOneFile = (InputRow as list) =>
        let
            BinaryIn = InputRow{0},
            FileName = InputRow{1},
            BinaryText = Binary.ToText(BinaryIn, BinaryEncoding.Base64),
            SplitUpText = SplitTextFunction(BinaryText),
            AddFileName = List.Transform(SplitUpText, each {FileName,_})
        in
            AddFileName,
    //Loops over all photos and calls the above function
    ConvertAllFiles = List.Transform(ListInput, each ConvertOneFile(_)),
    //Combines lists together
    CombineLists = List.Combine(ConvertAllFiles),
    //Converts results to table
    ToTable = #table(type table[Name=text,Pic=text],CombineLists),
    //Adds index column to output table
    AddIndexColumn = Table.AddIndexColumn(ToTable, "Index", 0, 1)
in
    AddIndexColumn

In my next post I’ll show you my original version of this query, explain why it was slow, and try to explain how the version above works and why it is much faster.

Here’s what the query above returns:

[Screenshot: the table returned by the query, with Name, Pic and Index columns]

The Pic column contains the split text values, each of which is less than the 32766 character limit, so when this table is loaded into Power BI no truncation occurs. The Index column is necessary because without it we wouldn't be able to recombine all the split values in the correct order.

The only thing left to do is to create a measure that uses the DAX ConcatenateX() function to concatenate all of the pieces of text back into a single value, like so:

Display Image = 
IF(
    HASONEVALUE('PQ Pics'[Name]),
    "data:image/jpeg;base64, " & 
    CONCATENATEX(
        'PQ Pics', 
        'PQ Pics'[Pic],
        ,
        'PQ Pics'[Index],
        ASC
    )
)

…set the data category of this measure to be “Image URL”:

[Screenshot: setting the Data Category of the measure to Image URL]

…and then display the value of the image in a report:

[Screenshots: the image displayed in a Power BI report]

Unfortunately, as I also mentioned in my previous post, most DAX functions (and that includes ConcatenateX()) have a limit of around 2.1 million characters, so the original 2.1MB file still can't be displayed:

[Screenshot: the error displayed when trying to show the full-size image]

However, I do think this technique will be useful because it allows you to work with much larger pictures than before.

It can be useful in other situations too. I recently came across a great new custom visual called PDF Viewer that can display PDF files stored in text form in a Power BI report:

[Screenshot: the PDF Viewer custom visual]

The example file for this visual shows how a large PDF file can be split across two columns in a table; the technique I describe here is a more practical solution to this problem.

What Is The Maximum Length Of A Text Value In Power BI?

What is the maximum length of a text value in Power BI? It turns out that this is a more complex question than you might think!

The maximum length of a text value that the Power Query engine can load into a single cell in a table in a dataset is 32766 characters – any more than that and the text will be silently truncated. However, inside the Power Query engine itself you can work with much longer text values. To illustrate this, consider the following M query:

let
    Source = 
        #table(
            type table[charcount = number],
            {
                {1},
                {10000},
                {30000},
                {40000}
                }
                ),
    #"Added Custom" = 
        Table.AddColumn(
            Source, 
            "LongText", 
            each Text.Repeat("1", [charcount]),
            type text
            ),
    #"Inserted Text Length" = 
        Table.AddColumn(
            #"Added Custom", 
            "Length", 
            each Text.Length([LongText]), 
            Int64.Type
            )
in
    #"Inserted Text Length"

It creates a table with four rows and three columns. The first column contains the numbers 1, 10000, 30000 and 40000; the second column contains the character “1” repeated the number of times given in the first column; the third column returns the length of the text in the second column using the Text.Length() M function. Here’s the output in the Power Query Editor, which is pretty much as you’d expect:

[Screenshot: the output of the query in the Power Query Editor]

I'm not sure whether there is a maximum length for text values in M; I experimented with adding an extra row to the table above with a 900,000,000 character text value, and Text.Length() was able to return the correct value, albeit after a bit of a wait.
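
For reference, that experiment was just the query above with one extra row added in the Source step (a sketch – be prepared for a long wait if you run it):

let
    Source = #table(
        type table[charcount = number],
        //the extra 900,000,000 character row is the last one
        {{1}, {10000}, {30000}, {40000}, {900000000}}
        ),
    #"Added Custom" = Table.AddColumn(
        Source, "LongText",
        each Text.Repeat("1", [charcount]), type text
        ),
    #"Inserted Text Length" = Table.AddColumn(
        #"Added Custom", "Length",
        each Text.Length([LongText]), Int64.Type
        )
in
    #"Inserted Text Length"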

Load the original four-row table above into a Power BI dataset, though, and add a DAX calculated column with the following expression:

DAX Length = LEN('LengthsDemo'[LongText])

…and you can see in the Data pane of the main Power BI Desktop window that the long text value in the last row has been truncated to 32766 characters:

[Screenshot: the Data pane showing the value in the last row truncated to 32766 characters]

Once you've loaded your data into Power BI, the documentation says that the maximum length of a text value is "268,435,456 Unicode characters (256 mega characters) or 536,870,912 bytes". The bad news is that many DAX functions, such as ConcatenateX(), limit the length of the text values they can work with to around 2.1 million characters (thank you Jeffrey Wang for providing this information – it isn't documented anywhere at the moment). If you exceed this limit you'll get the following error:

Function ‘PLACEHOLDER’ encountered a Text that exceeds the maximum allowable length.

In summary, then, there are two different practical limits on the maximum length of a text value in Power BI: the 32766 character limit on text being loaded into Power BI, and the 2.1 million character limit in DAX functions. The first of these can be worked around with some clever M – you need to split long text values up into multiple smaller values stored in different columns or rows – but even if you do this, the second limit may stop you recreating the original value after the data has been loaded.
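
Ignoring efficiency for now, here's a minimal sketch of the splitting approach using the Splitter.SplitTextByRepeatedLengths function, which breaks a long text value into a list of chunks below the first limit:

let
    //a 100,000 character text value - too long for a single cell
    LongText = Text.Repeat("1", 100000),
    //function that splits text into chunks of at most 30,000 characters
    SplitFunction = Splitter.SplitTextByRepeatedLengths(30000),
    Chunks = SplitFunction(LongText),
    //one row per chunk, plus an index to preserve the original order
    ToTable = Table.FromList(Chunks, Splitter.SplitByNothing(), {"Chunk"}),
    AddIndex = Table.AddIndexColumn(ToTable, "Index", 0, 1)
in
    AddIndex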

Why is this useful or important? How can you split text values up in M in the most efficient way? I’ll come to that in my next two posts!

Power Query Comes To Azure Data Factory With Wrangling Data Flows

One of the many big announcements at Build this week, and one that caused a lot of discussion on Twitter, was about Wrangling Data Flows in Azure Data Factory. You can read the blog post here:

https://azure.microsoft.com/en-us/blog/analytics-in-azure-remains-unmatched-with-new-innovations/

…but what isn’t clear from this is that it’s basically Power Query Online integrated into ADF. You can see it in action by watching the following video – the demo of Wrangling Data Flows starts at around the 21 minute mark:

https://mybuild.techcommunity.microsoft.com/sessions/76997

[Screenshot: the Wrangling Data Flows demo from the Build session]

As the presenter says, the Power Query Online editor generates M in the background as you would expect and “we are going to take this M and translate it into Spark and run it over big data”. Query folding to Spark, basically. More technical detail about all this is available here:

https://github.com/gauravmalhot/wranglingdataflow

…including a document discussing which M functions currently support query folding and which ones don't yet. Obviously, this feature will only work well if as much query folding as possible takes place.

This feels like a much more significant win for team Power Query than the recently-announced integration with SSIS, if only because SSIS is a bit legacy and ADF is the cool new thing. I wonder whether this opens up the possibility of integration between Power BI dataflows and ADF in the future, as another example of how self-service BI solutions can be transitioned into centrally-managed, enterprise-grade BI solutions. If that happens, I hope someone sorts out the dataflow/data flow naming mess.

You can sign up for the preview of Wrangling Data Flows here.
