DAX Median() Function Does Not Work On Tables With More Than 2 Billion Rows

An interesting – if obscure – fact I learned recently is that a small number of DAX functions such as Median() do not work on tables with more than 2 billion rows in Analysis Services Tabular, Azure AS and Power BI.

It’s quite easy to reproduce in Power BI. The following M expression returns a table with two billion and four rows:

let
    //Create a list containing 1,2,3,4 repeated 500,000,001 times,
    //giving 2,000,000,004 items in total
    Source = 
    List.Repeat(
        {1,2,3,4},
        500000001
        ),
    //Convert the list to a single-column table
    #"Converted to Table" = 
    Table.FromList(
        Source, 
        Splitter.SplitByNothing(), 
        null, 
        null, 
        ExtraValues.Error
        ),
    //Set the column's data type to whole number
    #"Changed Type" = 
    Table.TransformColumnTypes(
        #"Converted to Table",
        {{"Column1", Int64.Type}}
        )
in
    #"Changed Type"

It takes some time to load this table – around twenty minutes – but because there are only four distinct values in the table, the resulting .pbix file is only 31KB thanks to the way Power BI compresses data.

If you load this table into your dataset, call it VeryBigTable, and create the following measure:

Median Test = MEDIAN(VeryBigTable[Column1])

…and use the measure in a visual, you’ll see the following error:


The current query cannot be evaluated for the ‘VeryBigTable (42)’ table, because the table contains more than two billion rows.

What’s more, the error always occurs even if you apply a filter to the table that returns fewer than two billion rows. The same problem occurs with some other functions, such as Percentile(), but it’s worth pointing out that the vast majority of DAX functions work as normal on tables with more than two billion rows: in the .pbix file used here, for example, the Sum() and CountRows() functions not only work fine but return instantly.

Luckily, in the case of the Median() function, there is an easy workaround: you can calculate a median in other ways, such as the one described on the DAX Patterns site here. The code is a lot more verbose but it works on a table with more than two billion rows.
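To give an idea of what such a workaround looks like, here’s a minimal sketch of the cumulative-count approach – it isn’t the exact code from the DAX Patterns article, and it assumes the single-column VeryBigTable shown above:

Median Test Workaround =
VAR N =
    COUNTROWS ( VeryBigTable )
VAR LowerMid =
    //The smallest distinct value whose cumulative row count
    //reaches position CEILING(N/2)
    MINX (
        FILTER (
            VALUES ( VeryBigTable[Column1] ),
            VAR CurrentValue = VeryBigTable[Column1]
            RETURN
                CALCULATE (
                    COUNTROWS ( VeryBigTable ),
                    VeryBigTable[Column1] <= CurrentValue
                ) >= ROUNDUP ( N / 2, 0 )
        ),
        VeryBigTable[Column1]
    )
VAR UpperMid =
    //The smallest distinct value whose cumulative row count
    //reaches position FLOOR(N/2) + 1
    MINX (
        FILTER (
            VALUES ( VeryBigTable[Column1] ),
            VAR CurrentValue = VeryBigTable[Column1]
            RETURN
                CALCULATE (
                    COUNTROWS ( VeryBigTable ),
                    VeryBigTable[Column1] <= CurrentValue
                ) >= TRUNC ( N / 2 ) + 1
        ),
        VeryBigTable[Column1]
    )
RETURN
    //For an odd row count both positions hold the same value;
    //for an even count this averages the two middle values
    ( LowerMid + UpperMid ) / 2

Because the iteration happens over the handful of distinct values rather than the individual rows, this kind of approach sidesteps the row-count limit.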


Storing Large Images In Power BI Datasets

Jason Thomas and Gerhard Brueckl have both blogged on the subject of storing images as text inside a Power BI dataset:

http://sqljason.com/2018/01/embedding-images-in-power-bi-using-base64.html

https://blog.gbrueckl.at/2018/01/storing-images-powerbi-analysis-services-data-models/

Since they wrote those posts, however, Power BI has added the ability to set the Data Category property on measures as well as on columns in tables. This means it is now possible to have the output of a DAX measure displayed as an image in a Power BI report, which in turn opens up a lot of new possibilities – including the ability to work around the maximum length of a text value that can be loaded into Power BI (see my previous blog post for more details) and therefore to work with larger images.

The picture I’m going to use is a rather lovely photo of a rose.


The original is about 2.1MB; I have a folder on my PC storing different versions of this picture, saved at different percentages of the original size.


Using the technique that Gerhard wrote about, where a picture is stored as text in a single cell in a Power BI dataset and then displayed (in this case I’m using the Image by CloudScope custom visual), some truncation of the image occurs even with the smallest files because of the 32766 character limit on the length of a text value that can be loaded into Power BI: even the version of the picture saved at 20% of the original size, a file of only 113KB, is visibly cut off when displayed.
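For reference, Gerhard’s single-cell approach boils down to something like the following M sketch (the file name here is hypothetical – see his post for the full version):

let
    //Read the image file as a binary value
    Source = File.Contents("C:\Users\Chris\Documents\PQ Pics\rose.jpg"),
    //Encode the binary as base64 text - anything over 32766 characters
    //will be silently truncated when loaded into the dataset
    Base64Text = Binary.ToText(Source, BinaryEncoding.Base64)
in
    Base64Text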


To work around this, you need to split the text representation of the image into multiple smaller text values stored across multiple rows, each of which is under the 32766 character limit, and then reassemble them in a DAX measure after the data has been loaded.

Splitting the text up in M is actually not that hard, but it is hard to do efficiently. Here’s an example of an M query that reads all the data from all of the files in the folder above and returns a table:

let
    //Get list of files in folder
    Source = Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
    //Remove unnecessary columns
    RemoveOtherColumns = Table.SelectColumns(Source,{"Content", "Name"}),
    //Creates Splitter function
    SplitTextFunction = Splitter.SplitTextByRepeatedLengths(30000),
    //Converts table of files to list
    ListInput = Table.ToRows(RemoveOtherColumns),
    //Function to convert binary of photo to multiple
    //text values
    ConvertOneFile = (InputRow as list) =>
        let
            BinaryIn = InputRow{0},
            FileName = InputRow{1},
            BinaryText = Binary.ToText(BinaryIn, BinaryEncoding.Base64),
            SplitUpText = SplitTextFunction(BinaryText),
            AddFileName = List.Transform(SplitUpText, each {FileName,_})
        in
            AddFileName,
    //Loops over all photos and calls the above function
    ConvertAllFiles = List.Transform(ListInput, each ConvertOneFile(_)),
    //Combines lists together
    CombineLists = List.Combine(ConvertAllFiles),
    //Converts results to table
    ToTable = #table(type table[Name=text,Pic=text],CombineLists),
    //Adds index column to output table
    AddIndexColumn = Table.AddIndexColumn(ToTable, "Index", 0, 1)
in
    AddIndexColumn

In my next post I’ll show you my original version of this query, explain why it was slow, and show how the version above works and why it is much faster.

The query above returns a table with three columns – Name, Pic and Index – with one row for each piece of split text.


The Pic column contains the split text values, each of which is under the 32766 character limit, so when this table is loaded into Power BI no truncation occurs. The Index column is necessary because without it we wouldn’t be able to recombine the split values in the correct order.

The only thing left to do is to create a measure that uses the DAX ConcatenateX() function to concatenate all of the pieces of text back into a single value, like so:

Display Image =
IF (
    HASONEVALUE ( 'PQ Pics'[Name] ),
    "data:image/jpeg;base64, " &
        CONCATENATEX (
            'PQ Pics',
            'PQ Pics'[Pic],
            ,
            'PQ Pics'[Index],
            ASC
        )
)

…set the data category of this measure to be “Image URL”…


…and then display the value of the measure as an image in a report.


Unfortunately, as I also mentioned in my previous post, most DAX functions (and that includes ConcatenateX()) put a limit of around 2.1 million characters on the length of the text values they can work with, so the original 2.1MB file still can’t be displayed, alas.


However, I do think this technique will be useful because it allows you to work with much larger pictures than before.

It can be useful in other situations too. I recently came across a great new custom visual called PDF Viewer that can display PDF files stored in text form in a Power BI report.


The example file for this visual shows how a large PDF file can be split across two columns in a table; the technique I describe here is a more practical solution to this problem.

What Is The Maximum Length Of A Text Value In Power BI?

What is the maximum length of a text value in Power BI? It turns out that this is a more complex question than you might think!

The maximum length of a text value that the Power Query engine can load into a single cell in a table in a dataset is 32766 characters – any more than that and the text will be silently truncated. However, if you’re working with text inside the Power Query engine you’ll find that you can work with much longer text values. To illustrate this, consider the following M query:

let
    Source = 
        #table(
            type table[charcount = number],
            {
                {1},
                {10000},
                {30000},
                {40000}
            }
            ),
    #"Added Custom" = 
        Table.AddColumn(
            Source, 
            "LongText", 
            each Text.Repeat("1", [charcount]),
            type text
            ),
    #"Inserted Text Length" = 
        Table.AddColumn(
            #"Added Custom", 
            "Length", 
            each Text.Length([LongText]), 
            Int64.Type
            )
in
    #"Inserted Text Length"

It creates a table with four rows and three columns. The first column contains the numbers 1, 10000, 30000 and 40000; the second column contains the character “1” repeated the number of times given in the first column; the third column returns the length of the text in the second column using the Text.Length() M function. The output in the Power Query Editor is pretty much as you’d expect: the Length column matches the charcount column on every row, even beyond 32766 characters.


I’m not sure if there is a maximum length for text values in M; I experimented with adding an extra row to the table above with a 900,000,000 character text value and Text.Length() was able to return the correct value, albeit after a bit of a wait.
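For anyone who wants to try this themselves, the experiment looked something like this (a sketch rather than the exact query I used):

let
    //Build a 900,000,000 character text value
    BigText = Text.Repeat("1", 900000000),
    //Return its length - this eventually returns 900000000
    TextLength = Text.Length(BigText)
in
    TextLength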

Load the table above into your Power BI dataset though, and add a DAX calculated column with the following expression:

DAX Length = LEN('LengthsDemo'[LongText])

…and you can see in the Data pane of the main Power BI Desktop window that the long text value in the last row has been truncated to 32766 characters.


Once you’ve loaded your data into Power BI, the documentation says that the maximum length of a text value is “268,435,456 Unicode characters (256 mega characters) or 536,870,912 bytes”. The bad news is that many DAX functions, such as ConcatenateX(), put a limit of around 2.1 million characters on the length of the text values they can work with (thank you to Jeffrey Wang for providing this information – it isn’t documented anywhere at the moment). If you exceed this limit you’ll get the following error:

Function ‘PLACEHOLDER’ encountered a Text that exceeds the maximum allowable length.

In summary, then, there are two different practical limits on the maximum length of a text value in Power BI: the 32766 character limit on text being loaded into Power BI, and the 2.1 million character limit in DAX functions. The first of these can be worked around with some clever M – you need to split long text values up into multiple smaller values stored in different columns or rows – but even if you do this, the second limit may stop you recreating the original value after the data has been loaded.
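Here’s a minimal sketch of that splitting idea in M, using a hypothetical 100,000 character text value and a 30000 character piece size:

let
    //A text value well over the 32766 character limit
    LongText = Text.Repeat("x", 100000),
    //Break the text into pieces of at most 30000 characters
    Pieces = Splitter.SplitTextByRepeatedLengths(30000)(LongText),
    //Store one piece per row
    ToTable = Table.FromList(Pieces, Splitter.SplitByNothing(), {"Piece"}),
    //Add an index so the pieces can be reassembled in the right order
    AddIndex = Table.AddIndexColumn(ToTable, "Index", 0, 1)
in
    AddIndex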

Why is this useful or important? How can you split text values up in M in the most efficient way? I’ll come to that in my next two posts!

DAX Machine Learning Functionality Used By The Key Influencers Visual In Power BI

I’m one of those people who can’t resist peeking behind the scenes, and so when the Key Influencers visual appeared in Power BI I couldn’t help wondering how it worked its machine learning magic. Using DAX Studio to look at the DAX queries generated by the visual proved to be very revealing: it turns out that it uses a number of new DAX functions that are undocumented and probably not meant to be used outside Microsoft. For example, one DAX query generated by the Key Influencers visual uses functions called AI.SampleStratified, AI.Train, AI.KeyDrivers and AI.ExtractProfileFilters.


Using Profiler (in a similar way to what I describe in this post) to go into even more detail about what happens when these queries run shows that they raise the DAX Extension events I’ve been wondering about for a long time now.

So Power BI can train and query machine learning models inside its own database engine – which, when you think about it, is pretty darned cool. And then I thought: hold on, other visuals have had machine learning features for a long time. For example, the Line Chart visual can create forecasts, but although DAX Studio shows yet another undocumented function called SampleAxisWithLocalMinMax(), this function does not actually seem to perform the forecasting, which I assume must be done inside the code of the visual itself.


My guess is that, in contrast, the machine learning functionality used by the Key Influencers visual is genuinely new functionality in the engine itself.

A fascinating insight into how Power BI works, but is this of any practical use to us? Let me be clear: I don’t think you should be using any of these functions yourself in a real-world report – I’m sure all this would be documented and publicised if Microsoft wanted us to use it ourselves! Another consideration is that these new functions return tables, which makes them awkward to use in regular .pbix Power BI reports – I guess we could create calculated tables, although that’s not as flexible as returning a table from a query. That said, even though we can’t write our own DAX queries in regular Power BI reports, we can write our own DAX queries in Paginated Reports, and we can now create Paginated Reports that use a Power BI dataset as a data source. I tested putting one of the queries generated by the Key Influencers visual into a Paginated Report connected to the same dataset and it worked OK (even after publishing). You can embed DAX queries connected to a published dataset in Excel too, as I show here. Hmm, plenty to think about then…
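To be clear about what I mean by a standalone DAX query: it’s a query of the following general form, which you could paste into the dataset of a Paginated Report (the table and column names here are hypothetical):

EVALUATE
SUMMARIZECOLUMNS (
    'Internet Sales'[CustomerKey],
    "Total Sales", SUM ( 'Internet Sales'[SalesAmount] )
)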

SSAS Tabular 2019, Calculation Groups And Migration From SSAS Multidimensional

With the release of CTP 2.3 of SQL Server 2019 today there was big news for Analysis Services Tabular developers: Calculation Groups. You can read all about them in detail in this blog post:

https://blogs.msdn.microsoft.com/analysisservices/2019/03/01/whats-new-for-sql-server-2019-analysis-services-ctp-2-3/

In my opinion this is the most important new feature in DAX since… well, forever. It allows you to create a new type of calculation – which in most cases will be a time intelligence calculation like a year-to-date or a previous-period growth – that can be applied to multiple measures; basically the same thing we have been doing in SSAS Multidimensional for years with the time utility/shell/date tool dimension technique. It’s certainly going to solve a lot of problems for a lot of SSAS Tabular implementations, many of which have hundreds or even thousands of measures, one for every combination of base measure and calculation type needed.
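To give a flavour of what this looks like, the expression for a year-to-date calculation item follows the pattern below, where SELECTEDMEASURE() stands in for whatever base measure is in context (the Date table and column names here are assumptions – see the Microsoft blog post above for the full details):

CALCULATE ( SELECTEDMEASURE (), DATESYTD ( 'Date'[Date] ) )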

I’m not going to repeat any of the detailed technical information in the blog post here, though. Instead the point I want to make is that this is very big news for SSAS Multidimensional users too. In the past couple of years many people who have existing SSAS Multidimensional implementations have thought about migrating to SSAS Tabular so they can take advantage of its new features or move to the cloud, and indeed many of them have already migrated successfully. However, up to now, the biggest blocker for those wanting to migrate from Multidimensional to Tabular has been the fact that some complex calculations that can be expressed in MDX cannot be recreated (or recreated easily and efficiently) in DAX, because DAX has not had an equivalent of calculated members not on the Measures dimension or the MDX SCOPE statement.

Calculation groups do not remove this roadblock completely, but they do remove the roadblock for a large group of existing SSAS Multidimensional users whose only complex calculation requirement is a time utility/shell/date tool dimension. As a result these SSAS Multidimensional users will now be able to migrate to SSAS Tabular 2019, Azure Analysis Services or Power BI if they want to. Only those people who have more exotic uses for calculated members not on the Measures dimension (which are not very common at all) and those who use SCOPE statements (a larger group – many people working with financial data use SCOPE statements heavily) will find that Multidimensional is still the correct platform for them.

The Is Nullable Column Property In Power BI

The new Relationships view (or Modeling view – it seems to have two names) in Power BI Desktop, which has been in preview since November 2018, not only makes it easier to work with complex models and set properties, it also exposes a brand new property on a column: the “Is nullable” property. It’s visible at the bottom of the new Properties pane when you click on a column.


I was, naturally, curious about what it did and I couldn’t find any documentation so I did a bit of investigation of my own and asked a few people at Microsoft.

It turns out that it is primarily intended for validation purposes: if you know a column should never contain a null value, you can turn the property off, and if a null value does appear in that column at a later date you’ll get the following error when you try to refresh a table in Import mode:


Column ‘MyColumn’ in Table ‘TestTable’ contains blank values and this is not allowed for columns on the one side of a many-to-one relationship or for columns that are used as the primary key of a table

The only other use I have found for it is when it is used in combination with the DAX CombineValues() function in DirectQuery mode. In his excellent article on this subject, Marco notes that the SQL generated when you use CombineValues() to concatenate values from multiple columns and create a calculated column that can be used in a relationship includes a check for null values in those columns. I can confirm that if you turn off the ‘is nullable’ property on all the columns used by CombineValues(), there are no checks for null values in the SQL queries.
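As a reminder of what that function does, CombineValues() concatenates values from multiple columns with a delimiter, and in DirectQuery mode it is typically used in a calculated column something like this (the table and column names here are hypothetical):

Combined Key = COMBINEVALUES ( "|", 'FactSales'[OrderNumber], 'FactSales'[LineNumber] )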

There may be a few other things that it does, but I’ve been told they are likely to be there to ensure correctness rather than to optimise performance. Nonetheless, if you do find other scenarios where it has an effect, please let me know in a comment.

[Thanks to Akshai Mirchandani and Jeffrey Wang for providing much of the information used in this post]

Complex Filter Conditions In Power BI Reports Using Visual-Level Filters And Measures

While Power BI has a lot of great functionality for filtering the data that appears in a visual using visual-level filters, there are going to be situations where you need more control over how filtering works. In this blog post I’m going to show you how you can use visual-level filters and measures in combination to achieve this. The example I show here may be quite specific but I think the general technique has a lot of other interesting applications.

Let’s say you have a table containing customer purchases, with potentially more than one purchase per customer, and want to display a table in Power BI showing only the most recent purchase for each customer relative to a given date.

Using data from the Adventure Works DW database, here’s the Internet Sales table from my Power BI example dataset that contains the customer purchases. Each purchase has a Sales Order Number and a Sales Order Line Number, and there may be multiple line items in a single sales order.


Here’s some data from this table, filtered down to show the purchases of one customer: Abby Subram.


[Note: all dates shown are in dd/mm/yyyy format]

The requirement is that the report user should be able to select a date in a slicer and show the details for the most recent sales order that occurs before the selected date. So, for example, if the user selects 1/1/2003 then no rows should appear; if 16/6/2003 is selected then only SO50934 should appear; if 30/9/2003 is selected only the four rows relating to SO54975 (and not SO50934) should appear; if 28/6/2004 is selected only the four rows relating to SO73938 should appear, and so on.

The first step is to create a date table that contains the ‘as of’ dates the user can select from, using the disconnected slicer technique: basically a date table that has no relationship at all to the Internet Sales table shown above.


Next comes the tricky part. Visual-level filters can be used to filter the contents of visuals in a Power BI report based on the value of a measure, and that measure does not need to be displayed in the report. What you therefore have to do is create a measure that returns the value 1 for data that should be displayed (in this case, the most recent sales order for the customer) and blank otherwise, then apply a visual-level filter using this measure and filter on the measure equalling 1. Marco and Alberto have a great article going into the details of how measures behave when they are used in visual-level filters here, which I recommend you read before carrying on.

For this particular scenario here’s my measure:

Is Latest SO =
VAR CurrentAsOfDate =
    SELECTEDVALUE ( 'As Of Date'[DateKey] )
VAR CurrentSODate =
    SELECTEDVALUE ( 'Internet Sales'[OrderDateKey] )
VAR CurrentCustomer =
    SELECTEDVALUE ( 'Internet Sales'[CustomerKey] )
VAR SameCustomerSOs =
    FILTER (
        ALL ( 'Internet Sales'[CustomerKey], 'Internet Sales'[OrderDateKey] ),
        'Internet Sales'[CustomerKey] = CurrentCustomer
            && 'Internet Sales'[OrderDateKey] <= CurrentAsOfDate
            && 'Internet Sales'[OrderDateKey] > CurrentSODate
    )
VAR NoLaterSOs =
    IF ( COUNTROWS ( SameCustomerSOs ) > 0, BLANK (), 1 )
RETURN
    IF ( CurrentSODate <= CurrentAsOfDate, NoLaterSOs )

The logic here is:

  • Assume that this measure will be used to filter a table visual in a Power BI report, where each row in the table visual displays data from a single row in the Internet Sales table in the dataset.
  • For each row in the Internet Sales table, find the current values for OrderDateKey and CustomerKey, as well as the date selected in the ‘as of date’ slicer, and store them in variables.
  • Create a table of all the combinations of CustomerKey and OrderDateKey that exist in the Internet Sales table, ignoring the current filter context, then filter it so you get all the sales orders for the current customer where the order date is greater than the date of the current sales order and less than or equal to the selected ‘as of’ date. This is stored in the SameCustomerSOs variable.
  • If there are rows in the SameCustomerSOs table then the current row in Internet Sales should not be displayed, because it is not the most recent sales order as of the selected date; if SameCustomerSOs is empty then the current row should be displayed, because there are no later sales orders for the current customer on or before the selected ‘as of’ date.
  • Finally, return 1 if SameCustomerSOs is empty and the current row in Internet Sales has an order date on or before the selected ‘as of’ date.

You can then create a table (or some other visual) in your Power BI report and, in the visual-level filters for that table, drag in the measure and apply a filter on the value of the measure equalling 1.


The result, for Abby Subram, is that the table now shows only the rows relating to her most recent sales order as of the selected date.

The one thing you do need to be careful of when using this technique is that the DAX you use in your measure must be as efficient as possible: inefficient DAX, complex filters and large data volumes will make your report unusably slow. I’m sure the DAX used in my measure above could be tuned to perform better (it currently takes around half a second to filter the full 60,000 rows in the Internet Sales table; Power BI only queries for and displays 500 rows at a time, which makes things faster) but the real point I wanted to make here is that using measures and visual-level filters in this way allows you to apply almost any complex filter condition you want to the data displayed in a visual.

You can download the sample pbix file for this post here.
