As a reward for participating you’ll get a summary of the results and be entered into a draw for some Amazon vouchers. As a reward for plugging the BI Survey here I get to see the full results and blog about them later on in the year, and the results are always fascinating. Last year Power BI was breathing down the necks of more established vendors like Tableau and Qlik; this year I expect Power BI to be in an even stronger position.
There have been a lot of important Power Apps and Flow announcements today, as well as the announcement about the Common Data Service for Analytics which is undoubtedly massive news for Power BI users, but buried in the very large “Dynamics 365 Spring ‘18 Release Notes” pdf file (downloadable from https://aka.ms/businessappsreleasenotes) are a number of equally significant revelations about the Power BI roadmap for the next few months. I’ve summarised them here, along with the relevant page numbers; I’ve also highlighted what I think are the most important ones.
(P177) Control over linguistic schema – it looks like the phrasing and synonyms functionality for Q&A that was in the old Power BI service has been added back
(P178) User experiences for Q&A in reports – report authors will be able to allow report consumers (I assume people who have had reports shared with them via Apps) to use Q&A, as well as to provide suggested questions
(P178) Incremental refresh policies – the first sighting of incremental refresh in Power BI! But as the screenshot shows, this is for Premium only (which is not a surprise). It looks like it will be a lot easier to use than managing incremental refresh using partitions in SSAS.
(P185) Performance reporting for Power BI Premium – seems to be more detailed metrics for Power BI usage and performance
(P186) Data source setup improvements – better UI for configuring datasets and links to gateways
(P186) Query acceleration for large datasets – this will allow you to create DirectQuery datasets but then create some in-memory aggregate tables to improve query performance
(P186) Power BI metadata translations – translations like we have in SSAS Tabular today, I guess, allowing table and column names to be translated and users to see these translations when they connect
(P187) Smart alerts – seemingly an improvement on the existing alert functionality, with some extra AI thrown in?
(P187) Slideshow mode – cycle through pages in a report when the report is in full-screen mode
(P187) Workspaces with Azure AD Groups – workspaces are being separated from O365 Groups (at last!) and instead permissions can be controlled using Azure AD security groups or Office 365 modern groups
(P188) Report Snapshots for Power BI Premium – more subscription options for Premium users
(P188) SSRS Reports in Power BI Premium – run SSRS reports inside Premium, with no separate installation of SSRS required. Azure Reporting Services has been resurrected!?
(P188) Subscribe other users for email subscriptions – at last!
(P189) XMLA connectivity for Power BI Premium – connect to Premium workspaces as if they were Analysis Services instances (which they pretty much are, I guess)
(P189 and P209) Common Data Services for Analytics capability in Power BI – lots more detail on this new service in this section than is available in the post on the Power BI blog linked to above
(P192) Power BI Insight apps – pre-built Power BI solutions for services like Dynamics 365 and Salesforce
(P194) Details on the roadmap for Power BI Embedded
(P204) Details on the roadmap for the Power BI Mobile apps
Not in this document, but mentioned in this blog post today is the fact that the web-based version of Power Query that can be used to load data into the Common Data Service is out of Preview.
All I can say is wow – there’s so much to take in here, and it seems like the pace of innovation is only getting faster. It’s also nice to see Microsoft publishing a comprehensive roadmap document like this: it’s something many of my customers have wanted for a long time, and really helps them with their planning.
A long time ago someone – probably from the Power Query dev team – told me that adding primary keys to M tables could improve the performance of certain transformations in Power BI and Excel Power Query/Get&Transform. Indeed, I mentioned this in chapter 5 of my book “Power Query for Power BI and Excel”, but it wasn’t until this week that I found a scenario where this was actually the case. In this post I’ll describe the scenario and try to draw some conclusions about when adding primary keys to tables might make a difference to performance.
Imagine you have two queries that return data from two different files. First of all there’s a query called #”Price Paid”, which reads data from a 140MB csv file containing some of my favourite demo data: all 836933 property transactions in the year 2017 from the UK Land Registry Price Paid dataset. The query does nothing special: it just reads all the data, renames some columns, removes a few others, and sets data types on columns. Here’s what the output looks like:
Second, you have a query called #”Property Types” that reads data from a very small Excel file and returns the following table:
As you can see, the Property Type column from the #”Price Paid” query contains single letter codes describing the type of property sold in each transaction; the Property Type column from #“Property Types” contains a distinct list of the same codes and acts as a dimension table. Again there’s nothing interesting going on in this query.
The problems start when you try to join data from these two queries using a Merge and then, for each row in #”Property Types”, show the sum of the Price Paid column from #”Price Paid”. The steps to do this are:
1. Click on the Merge Queries/Merge Queries as New button on the Home tab in the Power Query Editor:
2. In the Merge dialog, do a left outer join between #”Property Types” and #”Price Paid” on the Property Type column on each table
3. Back in the Power Query Editor window, click on the Expand/Aggregate button in the top right-hand corner of the column created by the previous step.
4. Click on the Aggregate radio button and select Sum of Price Paid
The output of the query is this table:
Here’s the M code for this query, as generated by the UI:
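The generated code will look something like the sketch below. It assumes the two queries are named #"Property Types" and #"Price Paid" as described above, and that the nested table column created by the merge is called "Price Paid"; the step names are what I would expect the UI to produce and may differ in your version.

```
let
    // Left outer join from the small dimension table to the large fact table
    Source =
        Table.NestedJoin(
            #"Property Types",
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",          // assumed name for the nested table column
            JoinKind.LeftOuter),
    // Aggregate the nested tables: sum the Price Paid column for each property type
    #"Aggregated Price Paid" =
        Table.AggregateTableColumn(
            Source,
            "Price Paid",
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"
```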
It’s a very common thing to do, and in this case the query is extremely slow to run: on my laptop, when you refresh the preview in the Power Query Editor window in Power BI Desktop it takes 55 seconds. What’s more, in the bottom left-hand corner of the screen where it displays how much data is being read from the source file, it shows a value that is a lot more than the size of the source file. In fact it seems like the source file for #”Price Paid” is read once for each row in the #”Property Types” query: 140MB * 5 rows = 700MB.
Not good. But what if you specify that the Property Type column from the #”Property Types” query is the primary key of the table? After all it does contain unique values. Although it isn’t widely known, and it’s not shown in the UI, tables in M can have primary and foreign keys defined on them whatever data source you use (remember that in this case the data sources are csv and Excel). One way to do this is using Table.AddKey as follows:
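A sketch of how this might look, assuming the same query and column names as above (the step names are my own choice). Table.AddKey takes the table, a list of the key columns, and a logical value that says whether the key is the primary key:

```
let
    // Mark Property Type as the primary key of the dimension table
    WithAddedKey =
        Table.AddKey(
            #"Property Types",
            {"Property Type"},
            true),
    // Same left outer join as before, but starting from the keyed table
    Source =
        Table.NestedJoin(
            WithAddedKey,
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",          // assumed name for the nested table column
            JoinKind.LeftOuter),
    #"Aggregated Price Paid" =
        Table.AggregateTableColumn(
            Source,
            "Price Paid",
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"
```

If you want to check that the key has been applied, passing the WithAddedKey step to Table.Keys should list it.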
And guess what? After making this change, the query only takes 12 seconds to run instead of 55 seconds! What’s more the amount of data read from disk shown in the UI suggests that the source file for #”Price Paid” is only read once. But making this change involves writing M code, and not everyone is comfortable making changes in the Advanced Editor. The good news is that it’s possible to get this performance benefit in another way without writing any M code.
As I mentioned in this blog post, using the Remove Duplicates transformation on a column has the side-effect of marking that column as a primary key. Therefore if you right-click on the Property Type column and select Remove Duplicates in between steps 2 and 3 above, before you click on the Expand/Aggregate button:
…then you also get the performance benefit. Here’s the M code generated by the UI for this query:
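A sketch of what that generated code is likely to look like, again assuming the query and column names used in this post and a nested table column called "Price Paid":

```
let
    Source =
        Table.NestedJoin(
            #"Property Types",
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",          // assumed name for the nested table column
            JoinKind.LeftOuter),
    // Remove Duplicates on Property Type, applied before the aggregation
    #"Removed Duplicates" =
        Table.Distinct(
            Source,
            {"Property Type"}),
    #"Aggregated Price Paid" =
        Table.AggregateTableColumn(
            #"Removed Duplicates",
            "Price Paid",
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"
```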
It’s #”Removed Duplicates” that is the new step here, and it uses the Table.Distinct function.
All this suggests that if you are doing merge operations on queries that get data from large csv or Excel files and then aggregating data, if you can set a primary key on one of the columns used in the join it can make a massive difference to performance. However this is just one example, and I would be really interested to hear if any of you can reproduce these findings with your own data and queries – if you can, please leave a comment. I suspect this could be a very important discovery for M query performance tuning!
CSV files are one of the most commonly used data sources in Power BI and Power Query/Get&Transform, and yet the documentation for the Csv.Document M function is very limited and in some cases incorrect. In this rather long post I’ll show you as many of the capabilities of this useful function as I’ve been able to discover.
The Source parameter
The Csv.Document function returns a table, and the first (and only non-optional) parameter of this function is the source data in CSV format. Normally this is a binary value returned by the File.Contents function. For example take this simple CSV file with no column headers and one row of data:
The following M code uses File.Contents to read the contents of the file, and then passes the contents to Csv.Document to be interpreted as a CSV file:
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV = Csv.Document(Source)
in
ToCSV
The output is this:
However it is also possible to pass text to the first parameter of Csv.Document, for example:
let
SourceText = "February,Oranges,2",
ToCSV = Csv.Document(SourceText)
in
ToCSV
The output of this query is:
In both of these examples I’m relying on the default behaviour of the Csv.Document function with regard to delimiters and other properties, which I’ll explain in more detail below.
Using a record in the second parameter
The second parameter of Csv.Document can be used in several different ways. In code generated by the Query Editor UI it usually takes the form of a record, and the different fields in the record specify how the function behaves in different scenarios. For example, if you connect to the CSV file shown above by selecting the Text/CSV source in the Query Editor UI, you’ll see the following window appear showing a preview of the data and three options:
This results in the following M query:
let
Source =
Csv.Document(
File.Contents("C:\CSVTests\SourceFile.csv"),
[
Delimiter=",",
Columns=3,
Encoding=1252,
QuoteStyle=QuoteStyle.None
]),
#"Changed Type" =
Table.TransformColumnTypes(
Source,
{
{"Column1", type text},
{"Column2", type text},
{"Column3", Int64.Type}
})
in
#"Changed Type"
The query above shows the Csv.Document function with a record in its second parameter containing four fields: Delimiter, Columns, Encoding and QuoteStyle. There is also a fifth field that can be added to the record, CsvStyle, but this cannot be set anywhere in the UI.
The Data Type Detection option shown in the screenshot gives you three options for detecting the data types in each column of your file: by default it looks at the first 200 rows in the dataset, but you can also ask it to look at the entire dataset (which may be slower) or not to detect data types at all, in which case it will treat all columns as text. In this case data types are not set in the Csv.Document function but in the #”Changed Type” step with the Table.TransformColumnTypes function, but as we will see later it is possible to set column names and data types in a single step with Csv.Document instead.
The Encoding field
The File Origin dropdown menu shown above corresponds to the Encoding field in the Csv.Document function. This integer value specifies the code page used to encode the contents of the file:
In the M query in the previous section the 1252 code page is set explicitly. The following M query sets the (incorrect) 1200 code page for the CSV file shown above:
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV = Csv.Document(Source,[Encoding=1200])
in
ToCSV
…with the following result:
The Delimiter field
The Delimiter dropdown allows you to specify the delimiter used to separate the columns in each row of data. There are a number of options available through the UI, including commas and tabs, and the Custom option allows you to enter your own delimiter:
If you specify a single character delimiter at this point then the Delimiter field of the record in the second parameter of Csv.Document is set; the Custom and Fixed Width options shown here use a different form of the Csv.Document function described below. If the Delimiter field is not set then a comma is used as the delimiter. If you want to use a special character like a tab then you need to use an M escape sequence; for example to use a tab character as delimiter you need to use the text “#(tab)” which returns a text value containing just a single tab character.
For example, the following query:
let
Source = "123a456a789",
ToCSV = Csv.Document(Source, [Delimiter="a"])
in
ToCSV
Returns:
And this query:
let
Source = "789#(tab)456#(tab)123",
ToCSV = Csv.Document(Source, [Delimiter="#(tab)"])
in
ToCSV
Returns:
The Columns field
The Columns field specifies the number of columns in the table returned by Csv.Document, regardless of how many columns are actually present in the source data. For example, the following query:
let
Source = "a,b,c",
ToCSV = Csv.Document(Source, [Delimiter=",", Columns=3])
in
ToCSV
…returns a table with three columns:
While the following query returns a table with four columns, even though only three columns are present in the data:
let
Source = "a,b,c",
ToCSV = Csv.Document(Source, [Delimiter=",", Columns=4])
in
ToCSV
And the following query returns a table with only two columns, discarding the third column of data present in the data:
let
Source = "a,b,c",
ToCSV = Csv.Document(Source, [Delimiter=",", Columns=2])
in
ToCSV
The Columns field is not explicitly set by the user when you first connect to a CSV file via the UI, but the UI infers its value from the number of columns it finds in the CSV file. This can be a problem if the number of columns in the source data changes in the future because, as shown above, when the Columns field is set the table returned always has a fixed number of columns. As a result if the number of columns in the data source increases in the future you will find columns on the right-hand side of the table are not returned; similarly if the number of columns decreases you’ll see unwanted empty columns. Several people such as Prathy Kamasani have blogged about this problem and it may be better to delete the Columns field from the record, or not set the Columns field in the first place, in order to avoid it. If you do not set the Columns field then Csv.Document returns a table with the number of columns that are present in the first row of your source data.
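To illustrate that last point, here is a small sketch that uses inline text instead of a file; the sample values are made up purely for illustration. The record has no Columns field, so based on the behaviour described above I would expect the column count to come from the first row. I haven't shown what happens to the extra value on the second row, so try it against your own data before relying on it.

```
let
    // Two rows of made-up data: three values on the first row, four on the second
    Source = "a,b,c#(cr,lf)d,e,f,g",
    // No Columns field in the record, so the number of columns is inferred
    ToCSV = Csv.Document(Source, [Delimiter=","]),
    // Count the columns in the resulting table
    ColumnCount = Table.ColumnCount(ToCSV)
in
    ColumnCount
```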
The QuoteStyle field
The QuoteStyle field can take two possible values of type QuoteStyle: QuoteStyle.None and QuoteStyle.Csv. Here’s what the built-in documentation has to say about the QuoteStyle type:
While the value for QuoteStyle is set automatically when you connect to a file, if you edit a step in the Query Editor that uses Csv.Document you can change this value in the UI in the Line Breaks dropdown shown here:
As the screenshot above suggests this field controls how line breaks inside text values are handled. For both QuoteStyle.None and QuoteStyle.Csv, if you wrap a text value inside double quotes those double quotes are used to indicate the start and the end of the text value and are not shown in the output; if you want a double quote to appear, you have to double it up. However if QuoteStyle.None is set then a line break always marks the end of a row, even if it appears inside double quotes; if QuoteStyle.Csv is set, then a line break inside double quotes is treated as part of the text value and does not start a new row. Take the following CSV file for example:
The following M query, using QuoteStyle.None:
let
Source = File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
ToCSV = Csv.Document(Source,[QuoteStyle=QuoteStyle.None])
in
ToCSV
…returns the following table with two rows in it:
Whereas the following M query, using QuoteStyle.Csv:
let
Source = File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
ToCSV = Csv.Document(Source,[QuoteStyle=QuoteStyle.Csv])
in
ToCSV
…returns a table with just one row, and a line break present in the text value in the first column:
The CsvStyle field
The final field that can be used, CsvStyle, is also related to quotes. It can take one of two values of type CsvStyle: CsvStyle.QuoteAfterDelimiter and CsvStyle.QuoteAlways.
If the CsvStyle field is not set, the default is CsvStyle.QuoteAlways. Consider the following CSV file:
Notice that on the second line there is a space after the comma. The following M query:
let
Source =
File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
ToCSV =
Csv.Document(
Source,
[CsvStyle=CsvStyle.QuoteAlways])
in
ToCSV
Returns this, because the space after the comma is not treated as significant:
Whereas the following M query:
let
Source =
File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
ToCSV =
Csv.Document(
Source,
[CsvStyle=CsvStyle.QuoteAfterDelimiter])
in
ToCSV
Returns the text “four” in double quotes on the second line, because the space after the comma on the second line changes how the double quotes are treated:
Using a list or a table type in the second parameter
If the first line of your CSV file contains column headers and you connect to the file using the Query Editor user interface, in most cases this will be detected and an extra step will be added to your query that uses Table.PromoteHeaders to use these values as the column headers. However if you don’t have column headers inside your CSV file, instead of a record it is also possible to supply a list of column names or even better a table type to define the columns present in your CSV file. When you do this Csv.Document has three other parameters that can be used to do some of the same things that are possible if you use a record in the second parameter – Delimiter, ExtraValues and Encoding – and they are described below.
For example, in the following CSV file there are three columns: Month, Product and Sales.
Using a list of text values containing these column names in the second parameter of Csv.Document, as in the following M query:
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV = Csv.Document(Source,{"Month","Product","Sales"})
in
ToCSV
Returns the following table:
This has set the column names correctly but the data types of the three columns are set to text. What if I know that only the Month and Product columns contain text and the Sales column should be a number? Instead of a list of column names, using a table type allows you to set names and data types for each column:
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV = Csv.Document(
Source,
type table
[#"Month"=text, #"Product"=text, #"Sales"=number])
in
ToCSV
Notice how now the Sales column has its data type set to number.
The Delimiter parameter
If you have used a list of column names or a table type in the second parameter of Csv.Document, you can use the third parameter to control how each row of data is split up into columns. There are two ways you can do this.
First of all, you can pass any piece of text to the third parameter to specify a delimiter. Unlike the Delimiter field of the record in the second parameter described above, this can be a single character or multiple characters. For example, the following M query:
let
Source = "abcdefg",
ToCSV = Csv.Document(Source,{"first","second"},"c")
in
ToCSV
Returns:
And the following M query:
let
Source = "abcdefg",
ToCSV = Csv.Document(Source,{"first","second"},"cd")
in
ToCSV
Returns:
Instead of text, the Delimiter parameter can also take a list of integer values to allow you to handle fixed-width files. This functionality is available from the UI when you choose the Fixed Width option from the Delimiter dropdown box when you connect to a CSV file for the first time:
Each integer in the list represents the number of characters from the start of the row that marks the start of each column; as a result, each integer in the list has to be larger than the preceding integer. The values are 0-based so 0 marks the start of a row. For example, the M query:
let
Source = "abcdefg",
ToCSV = Csv.Document(Source,{"first","second","third"},{0,3,5})
in
ToCSV
Returns:
The ExtraValues parameter
The ExtraValues parameter allows you to handle scenarios where there are extra columns on the end of lines. This isn’t quite as useful as it sounds though: most of the time when the number of columns varies in a CSV file it’s because there are unquoted line breaks in text columns, in which case you should make sure your source data always wraps text in double quotes and use the QuoteStyle option described above, or if you can’t fix your data source, see this post.
The ExtraValues parameter can take one of three values of type ExtraValues: ExtraValues.List, ExtraValues.Ignore and ExtraValues.Error.
Consider the following CSV file with two extra columns on the second row:
The following query reads data from this file:
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV = Csv.Document(Source,{"Month","Product","Sales"})
in
ToCSV
As you can see from the screenshot below, because we have specified that there are three columns in the table, the error “There were more columns in the result than expected” is returned for each cell on the second line:
The same thing happens when ExtraValues.Error is explicitly specified in the fourth parameter, like so:
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV =
Csv.Document(
Source,
{"Month","Product","Sales"},
",",
ExtraValues.Error
)
in
ToCSV
If you set ExtraValues.Ignore instead, though:
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV =
Csv.Document(
Source,
{"Month","Product","Sales"},
",",
ExtraValues.Ignore
)
in
ToCSV
The extra columns are ignored and no errors are returned:
Setting ExtraValues.List allows you to capture any extra column values in a list; however, if you want to do this you will need to add an extra column to your table to hold these values. For example, notice in this query that four columns rather than three have been defined:
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV =
Csv.Document(
Source,
{"Month","Product","Sales","Extra Columns"},
",",
ExtraValues.List)
in
ToCSV
The output looks like this:
On the first and third rows the Extra Columns column contains an empty list. On the second row, however, the Extra Columns column contains a list containing two values – the two values from the two extra columns on that line.
The Encoding parameter
The Encoding parameter corresponds directly to the Encoding field used when you pass a record to the second parameter, as described above. The one difference is that it can take an integer or a value of type TextEncoding, although the TextEncoding data type only contains values for some of the more common code pages so the only reason to use it would be for readability:
As a result, the following two M queries:
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV = Csv.Document(
Source,
{"Month","Product","Sales"},
",",
ExtraValues.Ignore,
TextEncoding.Windows
)
in
ToCSV
let
Source = File.Contents("C:\CSVTests\SourceFile.csv"),
ToCSV = Csv.Document(
Source,
{"Month","Product","Sales"},
",",
ExtraValues.Ignore,
1252
)
in
ToCSV
…return the same result.
What about CsvStyle and QuoteStyle?
If you specify a list of column names or a table type in the second parameter of Csv.Document there’s no way to set CsvStyle or QuoteStyle – these options are only available when you use a record in the second parameter. The behaviour you get is the same as CsvStyle.QuoteAlways and QuoteStyle.Csv, so with the following source data:
This M query:
let
Source = File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
ToCSV = Csv.Document(
Source,
{"Month","Sales"},
",",
ExtraValues.Ignore,
1252)
in
ToCSV
When you’re working with M code in Power BI or Power Query/Get&Transform you may want to include special characters such as carriage returns, line feeds or tabs inside a text value. To do this you’ll need to use character escape sequences inside your text. Information on this topic is buried in the Power Query Formula Language Specification document that can be downloaded here, but since that’s not an easy read I thought I’d write a quick post repeating the information in there with a few examples.
The escape sequence for a tab character is #(tab), the escape sequence for a carriage return is #(cr) and the escape sequence for a line feed is #(lf). You can combine multiple special characters inside the same escape sequence too, so #(cr,lf) is equivalent to #(cr)#(lf). So, for example, the following M code:
"A tab:#(tab), a carriage return and a line feed #(cr,lf) and stop"
…used as a query and entered as follows in the Advanced Editor:
…returns this output:
[Interestingly, in the formula bar, the value #(cr,lf) is displayed as #(cr)#(lf) although that’s not what I wrote in the Advanced Editor]
To use the two character sequence #( in your text you’ll need to escape those values; if you don’t, you’ll get an error. For example the following M code:
"This causes an error #("
Gives an “Invalid literal” error:
Whereas this works:
"This does not error #(#)("
Finally, this can be used to return any Unicode character in your text by putting either the four-digit or eight-digit hex code of the Unicode character in between the brackets. For example this expression returns the Unicode character with the four-digit hex code 2605, which is a star:
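A minimal sketch of such an expression, with the eight-digit form of the same code point added for comparison (the step names are my own):

```
let
    // Four-digit form: the Unicode character with hex code 2605, a star
    FourDigit = "#(2605)",
    // Eight-digit form of the same character
    EightDigit = "#(00002605)",
    // Both escape sequences produce the same single-character text value
    SameCharacter = (FourDigit = EightDigit)
in
    SameCharacter
```

Return FourDigit instead of SameCharacter if you just want to see the star itself.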