Working With CSV Files That Contain Rogue Line Breaks In Power Query And Power BI

CSV files often contain text fields, and text fields often contain line breaks that shouldn’t be there (especially if the text is entered by end users). For example, here is are the contents of a simple CSV file that should contain three fields(Product, Comment and Sales) and six rows:

Product,Comment,Sales
Apples,This is a commment,10
Oranges,Another comment,20
Pears,A comment with a
line break in,30
Grapes,A normal comment,40
Bananas,A comment
with
two line breaks,50
Pineapples,A normal comment,60

However, as you can see, the Comment field contains line breaks in some cases. If you try to load this file into Power Query/Power BI, you’ll see that it doesn’t handle things well:

image

The easiest way of solving this problem is to make sure you have double quotes around the text in the Comment field, so that Power Query/Power BI knows to ignore the line breaks:

Product,Comment,Sales
Apples,"This is a commment",10
Oranges,"Another comment",20
Pears,"A comment with a
line break in",30
Grapes,"A normal comment",40
Bananas,"A comment
with
two line breaks",50
Pineapples,"A normal comment",60

image

[By the way, no text in the screenshot above has been truncated. The line breaks are still there in the Comment field; you can see the full text in the bottom pane of the Query Editor by clicking on a cell in the table.]

But what if you can’t fix the source data? I was asked this question the other week, and since I had been asked about it before and not come up with a good answer, I decided to spend some time researching the problem.

What I found was that it was relatively easy to write some M code that gave me the correct results, but very hard to write code that performed acceptably well on a large data set (I was testing on a CSV file containing almost half a million rows). Here’s the code for the function I ended up with:

(FileName as text, DelimiterCharacter as text, DelimitersPerLine as number, UseFirstRowAsHeaders as logical) =>
let
    // Load file and buffer in memory
    BufferedBinary = Binary.Buffer(File.Contents(FileName)),
    // Convert to a single block of text
    Source = Text.FromBinary(BufferedBinary),
    // Find the position of all line breaks
    LineBreaks = List.Buffer(Text.PositionOf(Source, "#(cr,lf)", Occurrence.All)),
    // Get a list of numbers from 0 to the number of line breaks-1
    LinePositions = List.Positions(LineBreaks),
    // Count the number of line breaks
    TotalLines = List.Count(LineBreaks),
    // Split the text by line breaks
    FirstSplit = Text.Split(Source, "#(cr,lf)"),
    // On each line, count the number of column delimiter characters
    DelimiterCounts = List.Buffer(
                           List.Transform(
                            LinePositions, 
                            each List.Count(Text.PositionOf(FirstSplit{_}, 
                            DelimiterCharacter, 
                            Occurrence.All)))),
    // Loop through each line and
    // a) Calculate a running total of the number of column delimiter characters on each line
    // b) Divide the running total by the expected number of delimiters per line, and round down to the nearest integer
    RunningSums = List.Buffer(
                           List.Generate(
                            ()=>[Counter=0, RunningTotal=0], 
                            each [Counter]<TotalLines, 
                            each [Counter=[Counter]+1, RunningTotal=[RunningTotal]+DelimiterCounts{[Counter]}],
                            each Number.IntegerDivide([RunningTotal]+DelimiterCounts{[Counter]},DelimitersPerLine))),
    // If the result of the division calculated in the previous step is
    // the same as on the previous line, then the line break can be ignored
    // so return null, otherwise return the position of the line break
    FindFullLines = List.Transform(LinePositions, each if _=0 or RunningSums{_}<>RunningSums{_-1} then LineBreaks{_} else null),
    // Remove nulls from the list
    RemoveNulls = List.RemoveNulls(FindFullLines),
    // Split the original text by the positions of the line breaks left
    LineSplitterFunction = Splitter.SplitTextByPositions(List.Combine({{0},RemoveNulls})),
    ActualLines = LineSplitterFunction(Source),
    // Convert the result to a table
    ConvertToTable = Table.FromList(ActualLines, Splitter.SplitTextByDelimiter(DelimiterCharacter), null, null, ExtraValues.Ignore),
    // Use the first rows as headers if desired
    PromoteHeaders = if UseFirstRowAsHeaders then Table.PromoteHeaders(ConvertToTable) else ConvertToTable
in
    PromoteHeaders

Assuming that you called this function LoadFile(), here’s an example of how to call it:

LoadFile(
//File name
"C:\MyFile.csv", 
//Column delimiter character - in this case a comma
",", 
//The expected number of commas to be found per line
//ie one less than the number of columns
2, 
//If true then first row contains header values
true)

The comments in the code should give you an idea of how it all works. A few other things to note:

  • If any of your lines contain extra columns, this approach won’t work
  • I’ve noticed a difference between how the Power BI Desktop and Excel 2016 behave with this code: the current Excel 2016 Query Editor will keep any line breaks after splitting on them, whereas Power BI Desktop does not. It’s easy to fix the Excel output though: just use the Clean transform to remove the line breaks.

I’m not really sure why it was so hard to come up with code that performed well, even after liberal use of List.Buffer(). It seemed like any time I referenced the original text (as returned by the Source step) in the List.Generate() function then performance died – so you’ll notice in the code above that I’m only referencing the list that contains the number of delimiters found per line.

Disabling Excel PivotTable Grouping And Session Cubes In SSAS Multidimensional

The Excel PivotTable grouping functionality that is available when you are connected to an SSAS Multidimensional cube (but not a Tabular model) is a Very Bad Thing indeed. In my experience it is a major cause of query performance problems – not just for the person running the query, but because it is so resource intensive for everyone else trying to query the cube too. This post from a few years ago gives some more details about why custom grouping, and the CREATE SESSION CUBE MDX statements that it uses behind the scenes, are so expensive.

Up until now it was impossible to prevent users from using custom grouping but in a recent cumulative update this changed. First of all, I encourage you to read the details of the fix because you will probably want to install the relevant CU for security reasons anyway:

https://support.microsoft.com/en-us/kb/3080856

[And if you are wondering whether you should be installing CUs I strongly suggest you read this post by Aaron Bertrand]

Once the CU has been installed the following new property is added to the msmdrv.ini file:

\OLAP\Query\SessionCubesMode

This will be set to 1. Setting this property to 0 will prevent users from creating session cubes and therefore prevent them from using Excel’s custom grouping functionality.

My opinion is that it’s a good idea to disable session cubes and custom grouping even if you don’t know whether your users are using these features. Yes, your users will lose some functionality and some reports might even break, but you will also save yourself and your users a lot of problems. If your users need to do custom grouping in a report that is usually an indication that you have missed something in your dimension design, and that an extra attribute hierarchy or two is necessary.

[Thanks to Akshai for this information]

New Ways To Concatenate Text In Excel 2016 With CONCAT() And TEXTJOIN()

I don’t normally blog about Excel topics outside of Power Query and Power Pivot, but I think anyone who has ever done any serious work in Excel has found how difficult it is to concatenate more than two text values (although there are plenty of blog posts like this one showing how to do it). Well, not any more – Microsoft have finally got around to addressing this problem with two new Excel functions, TEXTJOIN() and CONCAT(). The announcement, plus news of other new Excel functions and a new funnel chart type, is here:

https://support.office.com/en-us/article/What-s-new-in-Excel-2016-for-Windows-5fdb9208-ff33-45b6-9e08-1f5cdb3a6c73

First, the bad news: at the time of writing you’ll only be able to use these functions in Excel on the desktop if you’re using the click-to-run version of Excel 2016 that you get through an Office 365 subscription (ie the version installed through the Office 365 portal that gets updated by Microsoft automatically – probably not the same version that you’re running on your desktop if you work for a big company). It is available in Excel Online and Excel Mobile too. I guess they’ll appear in the regular, Windows-installer version of Excel 2016 in a service pack at some point in the future.

How about some examples? First of all, CONCAT() is a successor function for the old CONCATENATE() function – as far as I can see it does everything CONCATENATE() does but crucially also supports ranges as inputs as well as individual text values. So

=CONCAT("The ", "cat ", "sat ", "on ", "the ", "mat.")

Returns

The cat sat on the mat.

image

…but also, if you have each of your words in different cells (nb I’ve added spaces at the end of each word here), you can concatenate all the values in a range like this:

=CONCAT(B2:B7)

image

The TEXTJOIN() function is more flexible and I suspect will be very popular. It gives you two benefits over CONCAT():

  1. The ability to specify a delimiter – a character or characters (for example a comma or a space) to insert between each item of text you want to concatenate
  2. The ability to ignore empty values

Some examples…

First, using a space in the first parameter:

=TEXTJOIN(" ",FALSE,"The","cat","sat","on","the","mat.")

Returns once again

The cat sat on the mat.

…without having to add spaces to the end of each word, as I did with the CONCAT() example above:

image

And

=TEXTJOIN(",",FALSE,"The","cat","sat","on","the","mat.")

Returns

The,cat,sat,on,the,mat.

image

Finally, here’s an example of how TEXTJOIN() can be used with a range with the second parameter being set to TRUE to ignore empty cells in a range:

=TEXTJOIN(",", TRUE,B2:B9)

With the data in the screenshot below again returns

The,cat,sat,on,the,mat.

image

You can see an Excel Online worksheet with all these examples here.

Power BI Desktop, Sort By Column And DAX Calculations That Use The All() Function

Recently I came across a problem where a DAX measure gave different results in Excel and Power BI Desktop. It turned out not to be a bug or even a feature, but since it confused me for a few minutes I thought it was worth writing up in case anyone else ran into it.

Consider a model built in Excel using Power Pivot from the following two tables:

image

image

With two measures defined in the model as follows:

Sum of Sales:=SUM(Sales[Sales])

Share:=DIVIDE([Sum of Sales], CALCULATE([Sum of Sales], ALL(Month[Month Name])))

…and, importantly, the Sort By Column property on the Month Name column set to Month Number:

image

…it’s possible to build a PivotTable that looks like this:

image

However, when you import the same model into Power BI Desktop and recreate the PivotTable above in the Report view you’ll see that the Share calculation no longer gives the same values:

image

What’s the problem here? It’s all down to the way Power BI Desktop generates DAX queries when you set the Sort By Column property. The Excel PivotTable above generates the following MDX:

SELECT 
{[Measures].[Sum of Sales],[Measures].[Share]} 
DIMENSION PROPERTIES PARENT_UNIQUE_NAME,MEMBER_VALUE,
HIERARCHY_UNIQUE_NAME ON COLUMNS , 
NON EMPTY 
Hierarchize(
{DrilldownLevel({[Month].[Month Name].[All]},,,INCLUDE_CALC_MEMBERS)}) 
DIMENSION PROPERTIES PARENT_UNIQUE_NAME,MEMBER_VALUE,
HIERARCHY_UNIQUE_NAME ON ROWS  
FROM [Model] 
CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, 
BACK_COLOR, FORE_COLOR, FONT_FLAGS

On the rows axis, as you would expect, the only hierarchy you see is Month Name.

However, if you run a Profiler trace (you can find out how to do this here although it’s much easier to get the Process ID from DAX Studio) to look at the DAX query generated by Power BI you’ll see

EVALUATE
TOPN (
    102,
    SUMMARIZECOLUMNS (
        ROLLUPADDISSUBTOTAL (
            ROLLUPGROUP ( 'Month'[Month Name], 'Month'[Month Number] ), 
            "IsGrandTotalRowTotal"
        ),
        "Share", 'Sales'[Share],
        "Sum_of_Sales", 'Sales'[Sum of Sales]
    ),
    [IsGrandTotalRowTotal], 0,
    'Month'[Month Number], 1,
    'Month'[Month Name], 1
)
ORDER BY
    [IsGrandTotalRowTotal] DESC,
    'Month'[Month Number],
    'Month'[Month Name]

The difference here is that the Month Number and Month Name fields are both present – they have to be since the query has to sort by Month Number. In MDX the order of members on a hierarchy can be set inside the model; in a DAX query you can only sort using an ORDER BY clause and for that to work, the field you’re ordering by must be present in the query.

The Share measure calculation needs to be changed in order to fix this, then. Here’s one way of doing this:

Share =

DIVIDE([Sum of Sales],

CALCULATE([Sum of Sales], ALL(Month[Month Name], ‘Month'[Month Number])))

Drillthrough On Multiselect Now Works In Excel 2016 And SSAS 2016

One unadvertised – but still very welcome – feature of Excel 2016 is that it is now possible to do a drillthrough in a PivotTable when there is a multiselect on a filter or a slicer. It only works if you are using SSAS 2016 on the server, or if you’re querying the Excel Data Model/Power Pivot, because the fix needed changes both in Excel and on the server.

In Excel 2013 and earlier, when you try to do a default drillthrough where there is a multiselect on a filter or a slicer, you get the following error message:

image

Show Details cannot be executed when multiple items are selected in a report filter or in a slicer. Select a single item for each field in the report filter area and for each slicer connected to this PivotTable before performing a drillthrough.

For drillthrough actions, where there is a multiselect, you won’t see the action listed under the Additional Actions right-click menu at all.

This is the result of two limitations. First, there’s the issue with the MDSCHEMA_ACTIONS schema rowset that I blogged about here. In SSAS 2016 you can now pass in multiple members from the same hierarchy in the COORDINATE restriction column, as shown in this example I captured in Profiler:

<RestrictionList xmlns="urn:schemas-microsoft-com:xml-analysis" 
xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<CUBE_NAME>Model</CUBE_NAME>
<ACTION_TYPE>401</ACTION_TYPE>
<COORDINATE>
([DimDate].[DateKey].&amp;[20010701],
[DimDate].[DateKey].&amp;[20010702],
[Measures].[Sum of SalesAmount])
</COORDINATE>
<COORDINATE_TYPE>6</COORDINATE_TYPE>
</RestrictionList>

Second, subselects on a drillthrough MDX query are ignored in SSAS 2014 and earlier. For example, here’s a drillthrough query generated by an Excel 2016 PivotTable with a multiselect slicer connected to an SSAS Tabular model:

DRILLTHROUGH MAXROWS 1000 
SELECT [Measures].[Sum of SalesAmount] ON COLUMNS 
FROM 
(SELECT  FROM (
SELECT 
({[DimDate].[DateKey].&[20010702],[DimDate].[DateKey].&[20010701]}) 
ON COLUMNS  
FROM [Model]))

When run against SSAS 2014, this drillthrough returns records that are not filtered by date; when run against the same model in SSAS 2016, the subselect is respected and the resultset is filtered by the selected dates.

Excel 2016 BI Branding Changes

Office 2016 is on the verge of being released, and although Power BI is the cool new thing Excel 2016 has added several new BI-related features too. What is also interesting – and less well publicised – is that several of the BI features in Excel 2016 have been rebranded. Specifically:

  • Power Query (which is now no longer an add-in, but native Excel functionality) is not called Power Query any more, but “Get & Transform”. It has also been squeezed onto the Data tab, next to the older data import functionality:
    image

    image

  • Power Map is not called Power Map any more, but “3D Maps”
    image
  • Power View is still Power View, but as John White points out here it is no longer visible on the ribbon by default, hidden from new users, although it’s easy to add the Power View button back onto the ribbon. Power View in Excel 2016 is unchanged from Power View in Excel 2013. Read into this what you will.
  • Although Power Pivot still has its own tab on the ribbon (and has finally got a space in the middle of its name), there’s also a “Manage Data Model” button on the Data tab in the ribbon that is visible even when the Power Pivot add-in has not been enabled:
    image
    Clicking this button opens the Power Pivot window. There’s a subtle distinction between Power Pivot the add-in and the Excel Data Model (which is the database engine behind Power Pivot, and which is present in all Windows desktop editions of Excel regardless of whether the add-in is enabled or not) that has existed since Excel 2013 and which is generally unknown or misunderstood. The fact this button is titled “Manage Data Model” rather than “Power Pivot” is telling.
  • All the add-ins now have the collective name “Data Analysis add-ins” and can be enabled with a single click:
    image

So, clearly Excel has moved away from branding all its BI functionality as Power-something. My guess, informed by various conversations with various people in the know, is that this has happened for a couple of reasons:

  • The ‘Power’ prefix was intimidating for regular Excel users, who thought it represented something difficult and therefore not for them; it also made it look like this was functionality alien to Excel rather than a natural extension of Excel.
  • Having separate Power add-ins led to a disjointed experience, rather than giving the impression that all of these tools could and should be used together. It also made comparisons, by analysts and in corporate bake-offs, with other competing tools difficult – were the Power-addins separate tools, or should they be considered a single tool along with Excel?
  • Previously there was a lot of confusion about whether these add-ins are anything to do with ‘Power BI’ or not. Up to now, depending on who you talked to, they either were or weren’t officially part of Power BI. Now there is a clear distinction between Excel and Power BI, despite the close technical relationships that remain.

The new names certainly address these problems and on balance I think this change was the right thing to do, even if I was quite annoyed when I first found out about them. There are significant downsides too: for example, changing the names means that several years of books, blog posts, articles and conference presentations about Power Query and Power Map now won’t be found by new users when they search the internet for help. Similarly, it won’t be obvious to new users that a lot of content is relevant for both Power BI Desktop and Excel. Now that the Power Query name has been de-emphasised, why should anyone looking at my old blog posts on that subject know that what I’ve written is still relevant for Excel 2016’s “Get & Transform” and Power BI Desktop? What would I call a second edition of my Power Query book, if I wrote one, given that Power Query exists only as the relatively nondescript “Get & Transform” in Excel 2016 and “Get Data” in Power BI Desktop?

Exporting All M Code From Power Query In Excel 2013

UPDATE – there’s now a better way to do this using copy/paste. See
https://blog.crossjoin.co.uk/2017/02/25/exporting-your-queries-m-code-from-power-query-and-power-bi-using-copypaste/

Here’s a tip that I found out about on the Power Query Technet forum that I thought was worth repeating. If you ever need a quick way of exporting all the M code for all of the queries in an Excel 2013 workbook, just open the Power Query window and click the Send a Frown button shown here:

image

Then, when the Send Feedback dialog appears, make sure Include Formulas is ticked then click OK:

image

When you do that, you’ll get an email created for you that contains a whole lot of debugging information, plus all of the M code for your queries:

image

Obviously, don’t send this email to Microsoft!

It’s quite easy to see the individual queries. You then need to go to your new workbook, create a new query by selecting the Blank Query option under Other Sources, and then open the Advanced Editor window and paste the code for each query in. However, when you do that you will need to modify the code a bit. There are three pieces of code you will need to delete:

  • At the beginning of the M code, where it says
    section Section1;
  • At the beginning of each query, a piece of code that contains the original name of the query:
    shared MyOriginalQueryName =
  • At the very end, a semi-colon

Of course in Excel 2016 we’ll have much better options for copying, pasting and moving queries in VBA, but until then we’ll have to live with hacks like this.