Splitting Text By Character Transition In Power BI And Excel Power Query

The February 2019 release of Power BI Desktop includes some new functionality not mentioned in the summary blog post, functionality that is already in the latest Excel 365 builds (thanks Ed Hansberry for pointing this out) and will be extremely useful. It takes the form of four new options under the Split Column button in the Query Editor:

image

The four options allow you to split text by:

  • Lowercase to uppercase
  • Uppercase to lowercase
  • Digit to non-digit
  • Non-digit to digit

They are intended to solve problems like the one Gil Raviv blogged about here, where he shows how to split camel-case text in M.

Here’s an example of the lowercase to uppercase split. With the following input table:

MyData
---------
CamelCase
DimDate

The output is:

MyData.1   MyData.2
--------   --------
Camel      Case
Dim        Date

Here’s the M code behind it:

let
    Source = #table(
        {"MyData"},
        {
            {"CamelCase"},
            {"DimDate"}
        }
    ),
    #"Split Column by Character Transition" = Table.SplitColumn(
        Source,
        "MyData",
        Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"}),
        {"MyData.1", "MyData.2"}
    )
in
    #"Split Column by Character Transition"

It turns out that the new Splitter function behind this, Splitter.SplitTextByCharacterTransition, has been in Power BI for at least a month already (if you want some background on Splitter functions, see my blog post here), but I don’t think anyone noticed. This new function is very flexible, and I can see it will be useful for a lot more scenarios than just these four new options in the Query Editor: it allows you to split text wherever there is a transition from any one character or characters to any other character or characters.
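For example, here’s a minimal sketch (the sample text and character ranges are my own, not taken from the UI-generated code above) that calls a splitter function directly to break text at every digit-to-letter transition:

let
    // Splitter.* functions return a function that splits a text value into a list
    SplitAtDigitToLetter =
        Splitter.SplitTextByCharacterTransition(
            {"0".."9"},
            {"a".."z", "A".."Z"}
        ),
    // The split happens after every digit that is immediately followed by a letter
    Result = SplitAtDigitToLetter("123abc456def")
    // Result is the list {"123", "abc456", "def"}
in
    Result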

Finding Power Query Query Execution Times In Power BI And Excel Using SQL Server Profiler

Working out how long a Power Query M query takes to run – the first step in troubleshooting refresh performance problems in Power BI or Excel – is something I have blogged about several times (see here, here and here for example). However, there are problems with all of these methods, such as the need to alter the M code of your query. Luckily there is another method I have found, using a combination of DAX Studio and SQL Server Profiler.

How can you get hold of these tools? You can download DAX Studio for free here:

https://daxstudio.org/

Any serious Power BI developer should have it installed already, but if you have an older version you may need to update it to get the features I talk about in this post.

SQL Server Profiler is part of SQL Server Management Studio, and you can download it for free from here:

https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-2017

You can generate SQL Server Profiler trace files for diagnostic purposes from Power BI Desktop, as Kasper shows here, but as far as I can see these traces don’t show any information about refreshes. You can also hook Profiler up to Power BI Desktop directly, as Sam Lester shows here, but that’s a bit fiddly to do. Luckily, DAX Studio now has a preview feature which makes connecting Profiler to Power BI Desktop much easier.

First of all, open up your report in Power BI Desktop, then open up DAX Studio and connect it to the instance of Power BI Desktop you just opened; if you want to use this technique with Excel Power Query queries then you will need to launch DAX Studio from inside Excel. Then, in DAX Studio, open the Options dialog and enable the “Show External Tools” preview feature:

image

Once you have done this, on the Advanced tab on DAX Studio’s ribbon you’ll see an option to launch SQL Profiler and connect it to the Power BI report or Excel file that DAX Studio is currently connected to:

image

Clicking the button opens up Profiler with a trace already running. You’ll see that a lot of trace events are selected – probably too many to make sense of – so I suggest that you stop the trace, click the Properties button on the toolbar and then, on the Events Selection tab of the Trace Properties dialog, select just the following four trace events:

image

  • Command Begin
  • Command End
  • Progress Report Begin
  • Progress Report End

Restart the trace and then refresh your dataset in Power BI Desktop. You’ll still see a lot of events generated in Profiler, but look for the following:

  • A Command Begin event followed by a lot of Progress Report Begin/End events; this command marks the start of the refresh.
  • After that there will be a series of Progress Report Begin events that have an EventSubclass of “25 – ExecuteSQL” – one for each of the tables being loaded into your dataset, and therefore one for each of the Power Query M queries associated with those tables. Don’t be confused by the name of the EventSubclass: these events will appear whatever data source you are using, because it looks like the Vertipaq engine requests data from the Power Query engine using a basic form of SQL. For example, here’s a screenshot showing the three events I get when refreshing a dataset with three tables in it called “Property Type”, “Date” and “Price Paid 2017”:
    image
  • Finally, below that there will be a series of Progress Report End events (one for each Progress Report Begin event) with the EventSubclass of “25 – ExecuteSQL”, and the Duration column here will tell you how long it took to get data from the Power Query M query associated with the table. For example, this screenshot shows that it took 5460ms (about 5.5 seconds) to get the data from the “Price Paid 2017” Power Query query:
    image


And there you have it: exact timings for each of the Power Query M queries associated with each of the tables in your dataset. Remember that the time taken by each Power Query M query includes the time taken by any other queries that it references, and it does not seem to be possible to find out the amount of time taken by any individual referenced query in Profiler.

There is a lot more interesting information to be found this way. For example, dataset refresh performance is not just about the performance of the Power Query M queries used to load data: the Vertipaq engine also needs time to build all of the structures inside the dataset once the data has been returned, and Profiler gives you a lot of information on these operations too. Discussion of that will have to wait until a future blog post though…

ODBC Data Sources, The SqlCapabilities Option And Power Query/Power BI Data Refresh Performance

A few weeks ago I received a great tip from Simon Nuss about a Power Query feature relating to ODBC data sources that I hadn’t seen before (although it’s in Power BI Desktop and Excel 365, so it can’t be that new) and which is potentially important for the performance of data refresh and of the Power Query Editor’s preview pane. It’s only visible if you click “Advanced options” when creating your ODBC data source:

image

It’s the “Supported row reduction clauses” option:

image

What does it do? Well, first of all, let’s see what happens if you don’t set it. Let’s say you have an ODBC data source that uses the Microsoft ODBC Driver for SQL Server to connect to the Adventure Works DW database. If you create a new query that uses this data source to get the contents of the DimDate table, it works pretty much as you’d expect; here’s the M code:

let
    Source =
        Odbc.DataSource(
            "dsn=Test SQL",
            [HierarchicalNavigation=true]
        ),
    #"Adventure Works DW_Database" =
        Source{
            [Name="Adventure Works DW",Kind="Database"]
            }[Data],
    dbo_Schema =
        #"Adventure Works DW_Database"{
            [Name="dbo",Kind="Schema"]
            }[Data],
    DimDate_Table =
        dbo_Schema{[Name="DimDate",Kind="Table"]}[Data]
in
    DimDate_Table

image

However, if you look in Profiler when the contents of the preview window are refreshed, you’ll see that the Power Query engine is running a query that gets the entire contents of the table:

image

Obviously, if you are connecting to a big table this could make the Power Query Editor slow. What’s more, as this blog post shows, the Power Query Editor only actually wants to display 1000 rows, and the fact that the query is requesting all the rows from the table suggests that even basic query folding is not happening on this ODBC data source.

Now let’s see what happens if you do select something from the “Supported row reduction clauses” dropdown. As the screenshot above shows, there are four values in the dropdown, each relating to a form of SQL your ODBC source might support for getting only the first N rows from a table. If you don’t know which one to choose, you can click the Detect button and it will select one for you. In this case I selected TOP:

image

Having done this, when you connect to the same DimDate table, you’ll see that now the Power Query Editor is only requesting the top 1000 rows from the table:

image

Also, if you look at the M code for the query, you’ll see that the Odbc.DataSource function now has the as-yet undocumented SqlCapabilities option set in its second parameter:

Odbc.DataSource(
    "dsn=Test SQL",
    [
        HierarchicalNavigation = true,
        SqlCapabilities = [LimitClauseKind = LimitClauseKind.Top]
    ]
)

Actually it’s not quite undocumented because it is mentioned here as something that is only available if you’re using Odbc.DataSource in a custom connector; obviously things have changed.

Setting this option can make a big difference to the responsiveness of the Power Query Editor when you’re working with large tables and/or slow data sources – I saw this myself this week when working with one of my customers. It’s also going to be important if you use the Keep Rows/Keep Top Rows button on the ribbon or the Table.FirstN() M function, as shown below; I suspect there are other, more obscure, reasons why it might speed refresh up even if you aren’t filtering the rows of the table, but I’m not 100% sure.
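For example, with LimitClauseKind.Top set as above, adding a step like the following to the end of the earlier DimDate query should allow it to fold to a TOP clause at the source – a sketch based on the behaviour described in this post, not something taken from the original screenshots:

// Assuming the DimDate_Table step from the earlier query: with
// SqlCapabilities = [LimitClauseKind = LimitClauseKind.Top] set on the
// data source, this step can fold to a "SELECT TOP 1000 ..." query
FirstThousandRows = Table.FirstN(DimDate_Table, 1000)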

One last thing to mention is that there is another undocumented function which is, I think, what the Detect button in the screenshots above uses to find out how to set the SqlCapabilities option. It’s called Odbc.InferOptions and it can be used like so:

Odbc.InferOptions("dsn=Test SQL")

This function returns a record containing a field called LimitClauseKind that tells you the type of limit clause that is supported:

image

image
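This suggests you could wire the two functions together, letting Odbc.InferOptions choose the LimitClauseKind for you. Here’s a minimal sketch, assuming the same “Test SQL” DSN used earlier (I haven’t tested this against other drivers):

let
    // Ask the driver what kind of row-limiting SQL it supports...
    InferredOptions = Odbc.InferOptions("dsn=Test SQL"),
    // ...and pass the inferred LimitClauseKind straight into Odbc.DataSource
    Source = Odbc.DataSource(
        "dsn=Test SQL",
        [
            HierarchicalNavigation = true,
            SqlCapabilities = [LimitClauseKind = InferredOptions[LimitClauseKind]]
        ]
    )
in
    Source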

Here’s a list of the possible LimitClauseKind values I found using #shared:

  • LimitClauseKind.None
  • LimitClauseKind.Top
  • LimitClauseKind.Limit
  • LimitClauseKind.LimitOffset
  • LimitClauseKind.AnsiSql2008

Data Privacy Settings And Data Refresh Performance In Excel And Power BI

I have known for a long time that data privacy checks can have an impact on the performance of data refresh in Power BI and Excel, but on a recent performance tuning engagement I had the chance to observe just how much of a difference changing these settings can make. Unfortunately I can’t share the M code for the queries I was working with but the scenario is very common:

  • The only data source was a single Excel workbook with multiple worksheets
  • There were multiple Power Query/M queries reading data from these worksheets; the output of these queries was not being loaded into the Power BI dataset
  • There were further Power Query/M queries that referenced these source queries, none of which were being loaded into the Power BI dataset, and they all fed into a single query whose output was being loaded into a dataset
  • Overall data volumes were very small, less than 1MB, and the final query only returned approximately 5000 rows
  • Data refresh times were exceedingly slow, even with the “allow data preview to download in the background” setting turned off, and there was a long period where the Power Query engine was stuck in the “Evaluating” phase of the query during refresh in Power BI Desktop before the dialog showed the amount of data being loaded:

    image

  • Data refresh times also varied by +/-25% from one refresh to another (I have a theory about why this happens – if I can confirm it I will blog about it)

As a benchmark, here are some typical refresh timings for the same table in Power BI Desktop (December 2018 version) and Excel for Office 365 (build 11126.20118 click-to-run) with the default setting of data privacy checks turned on and the data privacy level set for the source Excel file set to Public:

                              Power BI Desktop   Excel
Evaluating Phase (Seconds)    122                173
Data Load Phase (Seconds)     140                113
Total Refresh Time (Seconds)  262                286

Here are the same timings with the Power Query engine set to combine data according to each file’s data privacy settings, and the data privacy checks for the current file turned off:

image

image

                              Power BI Desktop   Excel
Evaluating Phase (Seconds)    69                 91
Data Load Phase (Seconds)     108                112
Total Refresh Time (Seconds)  177                203

Clearly there are some substantial query performance gains to be had by turning off data privacy checks, and these gains are made in the Evaluating phase. In this case the queries in both Power BI Desktop and Excel were around 30% faster with data privacy checks turned off.

BUT is it safe to turn off data privacy checks? I strongly advise you to read my series on data privacy settings, starting here – in particular this post on disabling checks – to understand the issues involved. You should not turn off data privacy checks without understanding what they do, because you may end up sending sensitive data somewhere it shouldn’t go and maybe even breaking the law. What’s more, if you have published a Power BI dataset you cannot disable data privacy checks for scheduled refresh unless you are using the gateway in Personal mode and have configured it with the “Fast Combine” option turned on, as shown here. Therefore turning off privacy checks is only really useful for Power Query users in Excel, or for speeding up development in Power BI Desktop.

In this particular case there is only one data source, an Excel workbook – a source which does not support query folding – so as far as I can see there is no risk in turning off data privacy checks: there is no way data from this Excel workbook can be sent anywhere else. Now why couldn’t the Power Query engine work this out for itself?

Finding All Selected Items In An Excel Slicer Connected To SSAS, Power BI Or the Excel Data Model Using Dynamic Arrays

The big news in the world of Excel right now is the introduction of dynamic arrays. They’re only available in the Office 365 click-to-run version of Excel and, at the time of writing, only to people on the Office Insiders programme, but eventually they’ll be available to anyone running Excel for Office 365 on their desktop. There are already lots of blog posts about them, including this overview by Jon Acampora, and you probably also want to download Bill Jelen’s detailed mini-book on them here, which is free for the rest of 2018. Now I’m not an Excel expert by any stretch of the imagination, but I’m as excited as anyone else about them because they will be incredibly useful for anyone building reports using Excel cube functions against Analysis Services, the Excel Data Model/Power Pivot and Power BI. Bill Jelen’s book has a short section on this subject, but the possibilities are limitless…

Here’s one example of how they can be used. A while ago I blogged about how to use a regular array formula and the TextJoin() Excel function to get all the selected items from a slicer. Dynamic arrays make this problem much easier to solve. Take the following table loaded into the Excel Data Model:

Capture1

Now, say you have a PivotTable built from this and a slicer (called Slicer_Fruit) connected to it:

Capture2

It’s possible to use the CubeSet() function to get the set of selected items in a slicer using the following formula:

=CUBESET("ThisWorkbookDataModel",Slicer_Fruit,"Slicer Set")

Assuming this formula is in cell H1, you can then get the number of items in this set using CubeSetCount():

=CUBESETCOUNT($H$1)

Assuming this is in cell H2, all you need to do to get a comma-delimited list of all the selected items in the slicer via this set is use the following formula:

=TEXTJOIN(
    ", ",
    TRUE,
    CUBERANKEDMEMBER(
        "ThisWorkbookDataModel",
        $H$1,
        SEQUENCE($H$2,1)
    )
)

Capture3

Here it is in action:

demo

It works as follows:

  • The new Sequence() function is used to create a dynamic array of numbers from one to the number returned by the CubeSetCount() function.
  • The CubeRankedMember() function gets the name of each of the items in the set, using the numbers returned by Sequence().
  • TextJoin() then concatenates all of the names returned by CubeRankedMember() into a single comma-delimited list.

You can download the sample workbook here (remember it will only work if you have a version of Excel installed that includes dynamic arrays!).

Can You Use Power Query In Power BI Or Excel To Write Data To A Data Source?

The short answer: no.

The long answer:

It is, technically, possible to use Power Query (in either Power BI or Excel) to write data to a data source; for example, here’s an old post I wrote showing how to run an UPDATE statement to update data in a table in SQL Server. However, because the Power Query engine may evaluate a value or statement more than once when a query is executed, it is not safe or supported to use it to create, insert, update, delete or otherwise change data in a data source – it should only be used to read data. You may see some functions listed in the M function reference that suggest otherwise, but they aren’t supported either, and indeed may not work at all.
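To make the risk concrete, here’s a minimal (and deliberately unsupported) sketch of the kind of thing that old post describes – the server, database and table names here are made up:

let
    Source = Sql.Database("localhost", "MyDatabase"),
    // Value.NativeQuery can send arbitrary SQL to the source,
    // including an UPDATE statement...
    RunUpdate = Value.NativeQuery(
        Source,
        "UPDATE dbo.MyTable SET MyColumn = 1"
    )
in
    // ...but because the engine may evaluate this step more than once
    // during a single refresh, the UPDATE could run any number of times –
    // which is exactly why this pattern is not safe or supported
    RunUpdate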

Thanks to Matt Masson of the dev team for providing the official position on this.

Comparing The Performance Of CSV And Excel Data Sources In Power Query

My posts from two weeks ago (see here and here) on using Process Monitor to troubleshoot the performance of Power Query queries made me wonder about another question: how does the performance of reading data from CSV files compare to the performance of reading data from Excel files? I think most experienced Power Query users in either Power BI or Excel know that Excel data sources perform pretty badly, but I had never done any proper tests. I’m not going to pretend that this post is a definitive answer to this question (and once again, I would be interested to hear your experiences), but hopefully it will be thought-provoking.

To start off, I took the 153.6MB CSV file used in my last few posts and built a simple query that applied a filter on one text column, then removed all but three columns.
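A minimal sketch of this kind of query – the file path, column names and filter value here are hypothetical, not the exact ones used in the test – looks something like this:

let
    Source = Csv.Document(
        File.Contents("C:\Data\PricePaid.csv"),
        [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]
    ),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Filter on one text column (hypothetical column name and value)...
    FilteredRows = Table.SelectRows(PromotedHeaders, each [County] = "THURROCK"),
    // ...then keep just three of the columns
    RemovedOtherColumns = Table.SelectColumns(FilteredRows, {"Price", "Date", "County"})
in
    RemovedOtherColumns

The query ran in 9 seconds and, using the technique from my last post, I was able to draw the following graph from a Process Monitor log file showing how Power Query reads data from the file: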

image

Nothing very surprising there. Then I opened the same CSV file in Excel and saved the data as an xlsx file with one worksheet and no tables or named ranges; the resulting file was only 80.6MB. Finally, I created a duplicate of the first query and pointed it at the Excel file instead. The resulting query ran in 59 seconds – around six times slower! Here’s a comparison of the performance of this query with that of the first query:

image

The black line in the graph above is the amount of data read (actually the offset values showing where in the file the data is read from, which is the same thing as a running total when Power Query is reading all the data) from the Excel file; the green line is the amount of data read from the CSV file (the same data shown in the first graph above). A few things to mention:

  • Running Process Monitor while this second query was refreshing had a noticeable impact on its performance – in fact it was almost 20 seconds slower
  • The initial values of 80 million bytes seem to be where data is read from the end of the Excel file. Maybe this is Power Query reading some file metadata – an xlsx file is a zip archive, and the zip format stores its central directory at the end of the file. In any case, it seems as though it takes 5 seconds before it starts to read the data needed by the query.
  • There’s a plateau between the 10 and 20 second mark where not much is happening; this didn’t happen consistently and may have been connected to the fact that Process Monitor was running

In any case, you don’t need to study this too hard to understand that the performance of reading data from an xlsx-format Excel file is terrible compared to the performance of reading data from a CSV file. So, if you have a choice between these two formats – for example when downloading data – it seems fair to say that you should always choose CSV over xlsx.
