An In-Depth Look At The Csv.Document M Function

CSV files are one of the most commonly used data sources in Power BI and Power Query/Get&Transform, and yet the documentation for the Csv.Document M function is very limited and in some cases incorrect. In this rather long post I’ll show you as many of the capabilities of this useful function as I’ve been able to discover.

The Source parameter

The Csv.Document function returns a table, and the first (and only non-optional) parameter of this function is the source data in CSV format. Normally this is a binary value returned by the File.Contents function. For example take the this simple CSV file with no column headers and one row of data:

image

The following M code uses File.Contents to read the contents of the file, and then passes the contents to Csv.Document to be interpreted as a CSV file:

let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = Csv.Document(Source)
in
    ToCSV

The output is this:

image

However it is also possible to pass text to the first parameter of Csv.Document too, for example:

let
    SourceText = "February,Oranges,2",
    ToCSV = Csv.Document(SourceText)
in
    ToCSV

The output of this query is:

image

In both of these examples I’m relying on the default behaviour of the Csv.Document function with regard to delimiters and other properties, which I’ll explain in more detail below.

Using a record in the second parameter

The second parameter of Csv.Document can be used in several different ways. In code generated by the Query Editor UI it usually takes the form of a record, and the different fields in the record specify how the function behaves in different scenarios. For example, if you connect to the CSV file shown above by selecting the Text/CSV source in the Query Editor UI, you’ll see the following window appear showing a preview of the data and three options:

image

This results in the following M query:

let
    Source = 
	Csv.Document(
		File.Contents("C:\CSVTests\SourceFile.csv"),
		[
		 Delimiter=",", 
		 Columns=3, 
		 Encoding=1252, 
		 QuoteStyle=QuoteStyle.None
		]),
    #"Changed Type" = 
		Table.TransformColumnTypes(
			Source,
			{
			{"Column1", type text}, 
			{"Column2", type text}, 
			{"Column3", Int64.Type}
			})
in
    #"Changed Type"

The query above shows the Csv.Document function with a record in its second parameter containing four fields: Delimiter, Columns, Encoding and QuoteStyle. There is also a fifth field that can be added to the record, CsvStyle, but this cannot be set anywhere in the UI.

The Data Type Detection option shown in the screenshot gives you three options for detecting the data types in each column of your file: by default it looks at the first 200 rows in the dataset, but you can also ask it to look at the entire dataset (which may be slower) or not to detect data types at all, in which case it will treat all columns as text. In this case data types are not set in the Csv.Document function but in the #”Changed Type” step with the Table.TransformColumnTypes function, but as we will see later it is possible to set column names and data types in a single step with Csv.Document instead.

The Encoding field

The File Origin dropdown menu shown above corresponds to the Encoding field in the Csv.Document function. This integer value specifies the code page used to encode the contents of the file:

image

In the M query in the previous section the 1252 code page is set explicitly. The following M query sets the (incorrect) 1200 code page for the CSV file shown above:

let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = Csv.Document(Source,[Encoding=1200])
in
    ToCSV

…with the following result:

image

The Delimiter field

The Delimiter dropdown allows you to specify the delimiter used to separate the columns in each row of data. There are a number of options available through the UI, including commas and tabs, and the Custom option allows you to enter your own delimiter:

image

If you specify a single character delimiter at this point then the Delimiter field of the record in the second parameter of Csv.Document is set; the Custom and Fixed Width options shown here use a different form of the Csv.Document function described below. If the Delimiter record is not set then a comma is used as the delimiter. If you want to use a special character like a tab then you need to use an M escape sequence; for example to use a tab character as delimiter you need to use the text “#(tab)” which returns a text value containing just a single tab character.

For example, the following query:

let
    Source = "123a456a789",
    ToCSV = Csv.Document(Source, [Delimiter="a"])
in
    ToCSV

Returns:

image

And this query:

let
    Source = "789#(tab)456#(tab)123",
    ToCSV = Csv.Document(Source, [Delimiter="#(tab)"])
in
    ToCSV

Returns:

image

The Columns field

The Columns field specifies the number of columns in the table returned by Csv.Document, regardless of how many columns are actually present in the source data. For example, the following query:

let
    Source = "a,b,c",
    ToCSV = Csv.Document(Source, [Delimiter=",", Columns=3])
in
    ToCSV

…returns a table with three columns:

image

While the following query returns a table with four columns, even though only three columns are present in the data:

let
    Source = "a,b,c",
    ToCSV = Csv.Document(Source, [Delimiter=",", Columns=4])
in
    ToCSV

image

And the following query returns a table with only two columns, discarding the third column of data present in the data:

let
    Source = "a,b,c",
    ToCSV = Csv.Document(Source, [Delimiter=",", Columns=2])
in
    ToCSV

image

The Columns field is not explicitly set by the user when you first connect to a CSV file via the UI, but the UI infers its value from the number of columns it finds in the CSV file. This can be a problem if the number of columns in the source data changes in the future because, as shown above, when the Columns field is set the table returned always has a fixed number of columns. As a result if the number of columns in the data source increases in the future you will find columns on the right-hand side of the table are not returned; similarly if the number of columns decreases you’ll see unwanted empty columns. Several people such as Prathy Kamasani have blogged about this problem and it may be better to delete the Columns field from the record, or not set the Columns field in the first place, in order to avoid it. If you do not set the Columns field then Csv.Document returns a table with the number of columns that are present in the first row of your source data.

The QuoteStyle field

The QuoteStyle field can take two possible values of type QuoteStyle: QuoteStyle.None and QuoteStyle.Csv. Here’s what the built-in documentation has to say about the QuoteStyle type:

image

While the value for QuoteStyle is set automatically when you connect to a file, if you edit a step in the Query Editor that uses Csv.Document you can change this value in the UI in the Line Breaks dropdown shown here:

image

As the screenshot above suggests this field controls whether line breaks inside text values are respected. For both QuoteStyle.None and QuoteStyle.Csv, if you wrap a text value inside double quotes those double quotes are used to indicate the start and the end of the text value and are not shown in the output; if you want a double quote to appear, you have to double it up. However if QuoteStyle.None is set then line breaks are always respected, even if they appear inside double quotes; if QuoteStyle.Csv is set, then line breaks inside double quotes are ignored. Take the following CSV file for example:

image

The following M query, using QuoteStyle.None:

let
    Source = File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
    ToCSV = Csv.Document(Source,[QuoteStyle=QuoteStyle.None])
in
    ToCSV

…returns the following table with two rows in it:

image

Whereas the following M query, using QuoteStyle.Csv:

let
    Source = File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
    ToCSV = Csv.Document(Source,[QuoteStyle=QuoteStyle.Csv])
in
    ToCSV

…returns a table with just one row, and a line break present in the text value in the first column:

image

The CsvStyle field

The final field that can be used, CsvStyle, is also related to quotes. It can take one of two values of type CsvStyle: Csv.QuoteAfterDelimiter and CsvStyle.QuoteAlways.

image

If the CsvStyle field is not set, the default is CsvStyle.QuoteAlways. Consider the following CSV file:

image

Notice that on the second line there is a space after the comma. The following M query:

let
    Source = 
	File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
    ToCSV = 
	Csv.Document(
		Source,
		[CsvStyle=CsvStyle.QuoteAlways])
in
    ToCSV

Returns this, because the space before the comma is not treated as significant:

image

Whereas the following M query:

let
    Source = 
	File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
    ToCSV = 
	Csv.Document(
		Source,
		[CsvStyle=CsvStyle.QuoteAfterDelimiter])
in
    ToCSV

Returns the text “four” in double quotes on the second line, because the space before the comma on the second line changes how the double quotes are treated:

image

 

Using a list or a table type in the second parameter

If the first line of your CSV file contains column headers and you connect to the file using the Query Editor user interface, in most cases this will be detected and an extra step will be added to your query that uses Table.PromoteHeaders to use these values as the column headers. However if you don’t have column headers inside your CSV file, instead of a record it is also possible to supply a list of column names or even better a table type to define the columns present in your CSV file. When you do this Csv.Document has three other parameters that can be used to do some of the same things that are possible if you use a record in the second parameter – Delimiter, ExtraValues and Encoding – and they are described below.

For example, in the following CSV file there are three columns: Month, Product and Sales.

image

Using a list of text values containing these column names in the second parameter of Csv.Document, as in the following M query:

let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = Csv.Document(Source,{"Month","Product","Sales"})
in
    ToCSV

Returns the following table:

image

This has set the column names correctly but the data types of the three columns are set to text. What if I know that only the Month and Product columns contain text and the Sales column should be a number? Instead of a list of column names, using a table type allows you to set names and data types for each column:

let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = Csv.Document(
	Source,
	type table 
		[#"Month"=text, #"Product"=text, #"Sales"=number])
in
    ToCSV

image

Notice how now the Sales column has its data type set to number.

The Delimiter parameter

If you have used a list of column names or a table type in the second parameter of Csv.Document, you can use the third parameter to control how each row of data is split up into columns. There are two ways you can do this.

First of all, you can pass any piece of text to the third parameter to specify a delimiter. Unlike the delimiter field of the second parameter described above, this can be a single character or multiple characters. For example, the following M query:

let
    Source = "abcdefg",
    ToCSV = Csv.Document(Source,{"first","second"},"c")
in
    ToCSV

Returns:

image

And the following M query:

let
    Source = "abcdefg",
    ToCSV = Csv.Document(Source,{"first","second"},"cd")
in
    ToCSV

Returns:

image

Instead of text, the Delimiter parameter can also take a list of integer values to allow you to handle fixed-width files. This functionality is available from the UI when you choose the Fixed Width option from the Delimiter dropdown box when you connect to a CSV file for the first time:

image

Each integer in the list represents the number of characters from the start of the row that marks the start of each column; as a result, each integer in the list has to be a larger than the preceding integer. The values are 0-based so 0 marks the start of a row. For example, the M query:

let
    Source = "abcdefg",
    ToCSV = Csv.Document(Source,{"first","second","third"},{0,3,5})
in
    ToCSV

Returns:

image

 

The ExtraValues parameter

The ExtraValues parameter allows you to handle scenarios where there are extra columns on the end of lines. This isn’t quite as useful as it sounds though: most of the time when the number of columns varies in a CSV file it’s because there are unquoted line breaks in text columns, in which case you should make sure your source data always wraps text in double quotes and use the QuoteStyle option described above, or if you can’t fix your data source, see this post.

The ExtraValues parameter can take one of three values of type ExtraValues: ExtraValues.List, ExtraValues.Ignore and ExtraValues.Error.

image

Consider the following CSV file with two extra columns on the second row:

image

The following query reads data from this file:

let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = Csv.Document(Source,{"Month","Product","Sales"})
in
    ToCSV

As you can see from the screenshot below, because we have specified that there are three columns in the table, the error “There were more columns in the result than expected” is returned for each cell on the second line:

image

The same thing happens when ExtraValues.Error is explicitly specified in the fourth parameter, like so:

let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = 
	Csv.Document(
		Source,
		{"Month","Product","Sales"},
		",",
		ExtraValues.Error
	)
in
    ToCSV

If you set ExtraValues.Ignore instead, though:

let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = 
	Csv.Document(
		Source,
		{"Month","Product","Sales"},
		",",
		ExtraValues.Ignore
	)
in
    ToCSV

The extra columns are ignored and no errors are returned:

image

Setting ExtraValues.List allows you to capture any extra column values in a list; however, if you want to do this you will need to add an extra column to your table to hold these values. For example, notice in this query that four columns rather than three have been defined:

let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = 
	Csv.Document(
		Source,
		{"Month","Product","Sales","Extra Columns"},
		",",
		ExtraValues.List)
in
    ToCSV

The output looks like this:

image

On the first and third rows the Extra Columns column contains an empty list. On the second row, however, the Extra Columns column contains a list containing two values – the two values from the two extra columns on that line.

The Encoding parameter

The Encoding parameter corresponds directly to the Encoding field used when you pass a record to the second parameter, as described above. The one difference is that it can take an integer or a value of type TextEncoding, although the TextEncoding data type only contains values for some of the more common code pages so the only reason to use it would be for readability:

image

As a result, the following two M queries:

let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = Csv.Document(
	Source, 
	{"Month","Product","Sales"},
	",",
	ExtraValues.Ignore,
	TextEncoding.Windows
	)
in
    ToCSV
let
    Source = File.Contents("C:\CSVTests\SourceFile.csv"),
    ToCSV = Csv.Document(
	Source, 
	{"Month","Product","Sales"},
	",",
	ExtraValues.Ignore,
	1252
	)
in
    ToCSV

…return the same result.

What about CsvStyle and QuoteStyle?

If you specify a list of column names or a table type in the second parameter of Csv.Document there’s no way to set CsvStyle or QuoteStyle – these options are only available when you use a record in the second parameter. The behaviour you get is the same as CsvStyle.QuoteAlways and QuoteStyle.Csv, so with the following source data:

image

This M query:

let
    Source = File.Contents("C:\CSVTests\SourceFileWithQuotes.csv"),
    ToCSV = Csv.Document(
		Source, 
		{"Month","Sales"},
		",",
		ExtraValues.Ignore,
		1252)
in
    ToCSV

returns:

image

5 responses

  1. Dear Sir,
    Your help is needed.
    Whenever I try to manage a relationship in two files between two columns I get the message
    “You can’t create a relationship between these two columns because one of the columns must have unique values.”
    Please note that the values are truly unique in the columns but somehow the above message is displayed the moment I select even one column.
    Then a black error screen appears and the Power BI application powers off.
    regards

  2. Pingback: The Binary.InferContentType M Function – Chris Webb's BI Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: