Splitting Text By Character Transition In Power BI And Excel Power Query

The February 2019 release of Power BI Desktop includes some new functionality not mentioned in the summary blog post, functionality that is already in the latest Excel 365 builds (thanks Ed Hansberry for pointing this out) and will be extremely useful. It takes the form of four new options under the Split Column button in the Query Editor:

image

The four options allow you to split text by:

  • Lowercase to uppercase
  • Uppercase to lowercase
  • Digit to non-digit
  • Non-digit to digit

They are intended to solve problems like the one Gil Raviv blogged about here where he shows how to split camel case text in M.

Here’s an example of the lowercase to uppercase split. With the following input table:

image

The output is:

image

Here’s the M code behind:

let
    Source = 
    #table(
        {"MyData"},
        {
            {"CamelCase"},
            {"DimDate"}
            }
            ),
    #"Split Column by Character Transition" = 
    Table.SplitColumn(
        Source, 
        "MyData", 
        Splitter.SplitTextByCharacterTransition(
            {"a".."z"}, 
            {"A".."Z"}), 
            {"MyData.1", "MyData.2"}
            )
in
    #"Split Column by Character Transition"

It turns out that the new Splitter function that’s behind this, Splitter.SplitTextByCharacterTransition, has been in Power BI for at least a month already (if you want some background on Splitter functions, see my blog post here), but I don’t think anyone noticed. This new function is very flexible, and I can see it will be useful for a lot more scenarios than just these new four options in the Query Editor: it allows you to split text when there is a transition from any one character or characters to any other character or characters.

Finding Power Query Query Execution Times In Power BI And Excel Using SQL Server Profiler

Working out how long a Power Query M query takes to run – the first step to troubleshooting refresh performance problems in Power BI or Excel – is something I have blogged about several times (see here, here and here for example). However, there are problems with all of these methods such as the need to alter the M code of your query. Luckily there is another method that I have found, using a combination of DAX Studio and SQL Server Profiler.

How can you get hold of these tools? You can download DAX Studio for free here:

https://daxstudio.org/

Any serious Power BI developer should have it installed already, but if you have an older version you may need to update it to get the features I talk about in this post.

SQL Server Profiler is part of SQL Server Management Studio, and you can download it for free from here:

https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-2017

You can generate SQL Server Profiler trace files for diagnostic purposes from Power BI Desktop, as Kasper shows here, but these traces don’t show any information about refreshes as far as I can see. You can also hook Profiler up to Power BI Desktop direct, as Sam Lester shows here, but that’s a bit fiddly to do. Luckily DAX Studio now has a preview feature which makes connecting Profiler to Power BI Desktop much easier.

First of all, open up your report in Power BI Desktop, then open up DAX Studio and connect it to the instance of Power BI Desktop you just opened; if you want to use this technique with Excel Power Query queries then you will need to launch DAX Studio from inside Excel. Then, in DAX Studio, open the Options dialog and enable the “Show External Tools” preview feature:

image

Once you have done this, on the Advanced tab on DAX Studio’s ribbon you’ll see an option to launch SQL Profiler and connect it to the Power BI report or Excel file that DAX Studio is currently connected to:

image

Clicking the button opens up Profiler with a trace running. You’ll see that a lot of trace events are selected – probably too many to make sense of – so I suggest that you stop the trace, click the Properties button on the toolbar and then, on the Events Selection tab of the Trace Properties dialog select just the following four trace events:

image

  • Command Begin
  • Command End
  • Progress Report Begin
  • Progress Report End

Restart the trace and then refresh your dataset in Power BI Desktop. You’ll still see a lot of events generated in Profiler, but look for the following:

  • A Command Begin event followed by a lot of Progress Report Begin/End events, which is the command that marks the start of the refresh.
  • After that there will be a series of Progress Report Begin events that have an  EventSubclass of “25 – ExecuteSQL” – one for each of the tables that are being loaded into your dataset and therefore one for each of the Power Query M queries associated with those tables. Don’t be confused by the name of the EventSubclass, these events will appear whatever data source you are using: it looks like the Vertipaq engine requests data from the Power Query engine using a basic form of SQL. For example, here’s a screenshot showing the three events I get when refreshing a dataset with three tables in it called “Property Type”, “Date” and “Price Paid 2017”:
    image
  • Finally, below that there will be a series of Progress Report End events (one for each Progress Report Begin event) with the EventSubclass of “25 – ExecuteSQL”, and the Duration column here will tell you how long it took to get data from the Power Query M query associated with the table. For example, this screenshot shows that it took 5460ms (about 5.5 seconds) to get the data from the “Price Paid 2017” Power Query query:
    image

 

And there you have it, exact timings for each of the Power Query M queries associated with each of the tables in your dataset. Remember that the time taken by each Power Query M query will include the time taken by any other queries that it references, and it does not seem to be possible to find out the amount of time taken by any individual referenced query in Profiler.

There is a lot more interesting information that can be found in this way: for example, dataset refresh performance is not just related to the performance of the Power Query M queries that are used to load data; time is also needed to build all of the structures inside the dataset by the Vertipaq engine once the data has been returned, and Profiler gives you a lot of information on these operations too. Discussion of that will have to wait until a future blog post though…

Excel.Workbook() And The delayTypes Option In Power Query/Power BI

A while ago I found myself wondering – as I often do about this kind of thing – about what the undocumented third parameter of the Excel.Workbook() M function (called delayTypes) actually does. I found a forums post from Guy Hunkin of the Excel Power Query team in 2017 here, which gives the following summary:

Originally, Excel.Workbook used to read the entire workbook data to accurately assign types to the columns. This was EXTREMELY slow on large workbooks.

Having this in mind, we added this flag to delay the behavior. When set to “true”, we don’t infer any column types. Instead, the UI uses its normal inference algorithm to generate a Table.TransformColumnTypes step with the inferred types and if it inferred wrongly, the user can update the transformation explicitly.

I also posted a question on the Power Query MSDN forum and got some more details from Colin Banfield, one of the many M experts who hang out there, as well as doing some investigations of my own. This post summarises what I found out.

First of all, a simple demonstration of what it does. Consider the following Excel table:

image

Connecting to this table in the December 2018 release of Power BI Desktop generates the following M query:

let
    Source = 
    Excel.Workbook(
        File.Contents(
            "C:\DelayTypesTest.xlsx"
            ), 
            null, 
            true
            ),
    SourceData_Table = 
    Source{[Item="SourceData",Kind="Table"]}[Data],
    #"Changed Type" = 
    Table.TransformColumnTypes(
        SourceData_Table,
        {
            {"TextColumn", type text}, 
            {"IntegerColumn", Int64.Type}, 
            {"DecimalColumn", type number}, 
            {"DateColumn", type date}
            }
            )
in
    #"Changed Type"

Notice that the delayTypes option is now automatically set to true and that there is a “Changed Type” step. The output of this query, as seen in the Power Query Editor, is what you would expect:

image

Compare this with the output of the following query which has delayTypes set to false and has no “Changed Type” step:

let
    Source = 
    Excel.Workbook(
        File.Contents(
            "C:\DelayTypesTest.xlsx"
            ), 
            null, 
            false
            ),
    SourceData_Table = 
    Source{[Item="SourceData",Kind="Table"]}[Data]
in
    SourceData_Table

image

There are two things to notice:

  • With delayTypes set to false no “Changed Type” step is necessary, the data types on the columns are set by Excel.Workbook(). With delayTypes set to true all the columns returned by Excel.Workbook() have their data type set to Any and a subsequent “Changed Type” step is necessary to set data types to something useful.
  • With delayTypes set to true the IntegerColumn and the DecimalColumn columns have different data types assigned to them by the “ChangedType” step: IntegerColumn becomes Int64.Type whereas DecimalColumn becomes type number. With delayTypes set to false both columns are set to number. So it looks like Excel.Workbook with delayTypes set to false will only convert columns to primitive data types.

The second question is: if it is, primarily, a performance optimisation how much of an impact does it have? I built four queries that read all the data from an 80MB xlsx file to test the performance of all of the variations of delayTypes true/false and having a “Changed Type” step present or not (just in case there was an overhead to having a “Changed Type” step). Here are the results – the main timings are in minutes and seconds, and in brackets is the length of the “Evaluating” phase of query execution:

  delayTypes=false delayTypes=true
no “Changed Type” step 2:08
(1:21)
1:06
(0:00)
“Changed Type” present 2:19
(1:30)
1:08
(0:00)

I wouldn’t read too much into these exact timings because in my experience timings for the same query can fluctuate quite a lot, but one thing is clear: setting delayTypes to true results in much better performance. Indeed with delayTypes set to false nearly half the time was spent in the “Evaluating” phase whereas with delayTypes set to true there was no “Evaluating” phase at all. As a result I can’t think of a reason not to set the delayTypes option to true.

It’s worth noting that delayTypes is set to false by default if you don’t specify this option, so be aware of this if you are writing your own M code.

Power BI Roadmap Announcements In The April 2019 Release Notes

The latest version of the “release notes” document that details the roadmap for Dynamics 365 and the whole Power Platform – which includes Power BI – has just been released. You can view it online here:

https://docs.microsoft.com/en-us/business-applications-release-notes/april19/

There are lots of very exciting announcements for Power BI. Here are a few of the features that are promised for later this year that caught my eye, that haven’t already been announced elsewhere (although some of them were discussed in public sessions at the last PASS Summit):

  • Expression-based formatting in Power BI reports – the ability to use a DAX expression to control properties such as colour and line style within a visual. One of those features that advanced Power BI users will love, and which will allow all kinds of interesting customisations.
  • Performance analyzer for reports – a tool in Power BI Desktop to tell you how long each visual on a page takes to refresh, making it easier to understand where your performance bottlenecks are. I wish we could have something similar for Power Query queries…
  • Customised navigation through apps
  • Shared and certified datasets – this is going to be incredibly useful! As the docs say “With shared datasets in Power BI, we are allowing a single dataset to be used by multiple reports, across workspaces”. I can’t wait. Lots of other cool, related features too here.
  • Deeper integration of paginated reports in Power BI – including the ability to create paginated reports on top of Power BI datasets.
  • Large-scale compute engine for dataflows – this is somewhat cryptic. The docs say “We are enhancing the dataflow calculation engine in Power BI Premium to enable high-performing large-scale data transformations (TBs of data).”. Are these enhancements to the Power Query engine, or is it being replaced with something else entirely (eg to make it more like Azure Data Factory?). I don’t know.
  • New dataflow viewer to visualize dataflow dependencies – I guess something similar to what we have in the “Query Dependencies” view in the Power Query Editor?
  • More new Q&A features: auto-suggestions and did you mean. I’m glad Microsoft is continuing to invest in Q&A, I think it’s going to be one of those features that suddenly goes from being a demo feature that few people actually use to something that will be very powerful indeed. In fact, I would argue that it’s already a lot more powerful than most people realise.
  • Python support in the Power BI service
  • Power BI Template Apps – the ability to create multiple apps in different workspaces from a single template. This was something that used to be possible with content packs (remember them?) so I guess anyone still using them should be able to finally migrate away. Useful for end-user customisation of apps.
  • Better integration between Flow and Power BI, for example integrated alert creation, and the ability to query a dataset and trigger refresh of a dataset from Flow. This last feature will be really important – there are lots of scenarios where you only want to refresh a dataset when you know the data is ready.

The Is Nullable Column Property In Power BI

The new Relationships view (or Modeling view – it seems to have two names) in Power BI Desktop that has been in preview since November 2018 not only makes it easier to work with complex models and set properties more easily, it also exposes a brand new property on a column: the “Is nullable” property. It’s visible at the bottom of the new Properties pane when you click on a column:

image

I was, naturally, curious about what it did and I couldn’t find any documentation so I did a bit of investigation of my own and asked a few people at Microsoft.

It turns out that it is primarily intended for validation purposes, so that if you know a column should never contain a null value and then, at a later date, a null value does appear in that column then you’ll get the following error when you try to refresh a table in Import mode:

image

Column ‘MyColumn’ in Table ‘TestTable’ contains blank values and this is not allowed for columns on the one side of a many-to-one relationship or for columns that are used as the primary key of a table

The only other use I have found for it is when it is used in combination with the DAX CombineValues() function in DirectQuery mode. In his excellent article on this subject, Marco notes that the SQL generated when you use CombineValues() to concatenate values from multiple columns and create a calculated column that can be used in a relationship includes a check for null values in those columns. I can confirm that if you turn off the ‘is nullable’ property on all the columns used by CombineValues(), there are no checks for null values in the SQL queries.

There may be a few other things that it does, but I’ve been told they are likely to be there to ensure correctness rather than performance optimisations. Nonetheless if you do find other scenarios where it has an effect please let me know in a comment.

[Thanks to Akshai Mirchandani and Jeffrey Wang for providing much of the information used in this post]

Power BI Data Privacy Levels And Cloud /Web-Based Data Sources Or Dataflows

I have already blogged in great detail many times about Power BI/Power Query data privacy settings (see this series for example) but there’s always something new to learn. Recently I was asked a question by Ian Eckert about how Power BI handles data privacy for cloud or web-based data sources after a dataset has been published, and it prompted yet more revelations…

Consider the following M query:

let
    Source = Xml.Tables(
        Web.Contents("https://blog.crossjoin.co.uk/feed/")
        ),
    channel = Source{0}[channel],
    language = channel{0}[language],
    out = Json.Document(
        Web.Contents(
            "https://data.gov.uk/api", 
            [
                RelativePath="3/action/package_search", 
                Query=[q=language]
            ]
            )
            ),
    result = out[result],
    results = result[results],
    #"Converted to Table" = Table.FromList(
        results, 
        Splitter.SplitByNothing(), 
        null, 
        null, 
        ExtraValues.Error
        ),
    Column1 = #"Converted to Table"{0}[Column1],
    #"Converted to Table1" = Record.ToTable(Column1)
in
    #"Converted to Table1"

It doesn’t do anything particularly interesting, but it does take data from one web-based data source (the RSS feed for this blog) and sends it to another (the UK government’s open data metadata search web service). As a result, in Power BI Desktop, if you set the data privacy settings for both data sources to Public then the query runs, but if you set the data privacy settings for both data sources to Private:

image

image

…As expected, you get the following error:

image

Formula.Firewall: Query ‘Test’ (step ‘Converted to Table1’) is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.

Now the strange thing is that, when you publish the dataset that contains this M query, refresh always works. Why? What’s more, other datasets that do something similar will always fail when refreshed.

It turns out that when you publish a dataset that uses cloud or web-based data sources like the two used here, the Power BI service does not use the data privacy settings you have set in Power BI Desktop but instead it automatically assigns data privacy levels as follows:

  • Data sources, like the ones used here, that use Anonymous authentication are automatically given the privacy level Public
  • All other data sources are given the privacy level Private.

Interestingly, Power BI dataflows also count as cloud-based data sources and because they do not use Anonymous authentication they default to Private too, so if you are combining data from a dataflow with another data source in your dataset then you need to be careful of this.

What’s more there is at the time of writing no way to change these data privacy levels in the Power BI web-based portal. Hopefully this will change soon.

There are some workarounds though!

First of all, you can force refresh to take place through a gateway. This might sound strange because in theory, if you’re only using cloud or web-based data sources, a gateway should not be necessary. However there are already similar scenarios where a gateway is needed, for example if you are scraping data from a web page you need to use a gateway, and if you are combining data from a cloud-based data source with an on-premises data source you also need to use a gateway. If you add your cloud/web-based data sources as data sources in your gateway (unfortunately it does not seem to be possible to add a dataflow as a data source in a gateway, though) you can set their data privacy levels in the Advanced Settings section in the Manage Gateways screen:

image

You will also need to set the “Use a data gateway” option to On in the Settings dialog for your dataset after it has been published:

image

The other workaround is to copy the M code for your query and paste it into a new blank M query in an entity in a dataflow, as Matthew Roche shows here. While it does not seem to be possible to set data privacy levels for individual data sources when creating an entity, it is possible to turn off data privacy checks for an entity completely. If you create a query that sends data from one data source to another (regardless, as far as I can see, of the authentication mechanism used), you will see the following message in the Power Query Online query editor:

image

The evaluation was cancelled because combining data from multiple sources may reveal data from one source to another. Click Continue if the possibility of revealing data is okay.

If you click Continue, data privacy checks are turned off and the query runs; you can also click the Options button on the ribbon and check the “Allow combining data from multiple sources” option:

image

If one of your data sources is already itself a dataflow you may need to do some editing of the M query to make things work, but as Matthew Roche shows here it is possible to have an entity in a dataflow refer to another entity without using a computed entity (which is a Premium-only feature).

[Thanks to Arthi Ramasubramanian Iyer from Microsoft for providing background information for this post]

Power BI/Power Query/M Pre-cons At SQLKonferenz, SQLBits and SQL Saturday Iceland

If you’re learning Power BI and you want to deepen your understanding of its data transformation and loading functionality, the Power Query engine and the M language, I’m doing one-day pre-conference seminars on this subject at three conferences over the next few months:

You’ll learn about all the functionality in the Power Query Editor window plus concepts such as merging, appending, parameters, functions, data privacy and M. I hope to see you at one of them!

%d bloggers like this: