Working With CSV Files That Contain Rogue Line Breaks In Power Query And Power BI

CSV files often contain text fields, and text fields often contain line breaks that shouldn’t be there (especially if the text is entered by end users). For example, here are the contents of a simple CSV file that should contain three fields (Product, Comment and Sales) and six rows:

[sourcecode language="text" padlinenumbers="true"]
Product,Comment,Sales
Apples,This is a comment,10
Oranges,Another comment,20
Pears,A comment with a
line break in,30
Grapes,A normal comment,40
Bananas,A comment
with
two line breaks,50
Pineapples,A normal comment,60
[/sourcecode]

However, as you can see, the Comment field contains line breaks in some cases. If you try to load this file into Power Query/Power BI, you’ll see that it doesn’t handle things well:

[Screenshot: the Power Query Editor preview, with the rows split in the wrong places]

The easiest way of solving this problem is to make sure you have double quotes around the text in the Comment field, so that Power Query/Power BI knows to ignore the line breaks:

[sourcecode language="text" padlinenumbers="true"]
Product,Comment,Sales
Apples,"This is a commment",10
Oranges,"Another comment",20
Pears,"A comment with a
line break in",30
Grapes,"A normal comment",40
Bananas,"A comment
with
two line breaks",50
Pineapples,"A normal comment",60
[/sourcecode]

[Screenshot: the Power Query Editor preview, showing the file loaded correctly into three columns and six rows]

[By the way, no text in the screenshot above has been truncated. The line breaks are still there in the Comment field; you can see the full text in the bottom pane of the Query Editor by clicking on a cell in the table.]
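Incidentally, if you’re writing the M code to connect to a properly quoted file yourself rather than going through the UI, it’s the QuoteStyle option of the Csv.Document() function that controls this behaviour. Here’s a minimal sketch (the file path is hypothetical):

[sourcecode language="text" padlinenumbers="true"]
let
    // QuoteStyle.Csv tells Power Query to treat line breaks inside
    // double quotes as part of the field value rather than as row breaks
    Source = Csv.Document(
        File.Contents("C:\MyFile.csv"),
        [Delimiter = ",", QuoteStyle = QuoteStyle.Csv]),
    // Use the first row as column headers
    PromotedHeaders = Table.PromoteHeaders(Source)
in
    PromotedHeaders
[/sourcecode]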

But what if you can’t fix the source data? I was asked this question the other week, and since I had been asked about it before and not come up with a good answer, I decided to spend some time researching the problem.

What I found was that it was relatively easy to write some M code that gave me the correct results, but very hard to write code that performed acceptably well on a large data set (I was testing on a CSV file containing almost half a million rows). Here’s the code for the function I ended up with:

[sourcecode language="text" wraplines="false" light="true"]
(FileName as text, DelimiterCharacter as text, DelimitersPerLine as number, UseFirstRowAsHeaders as logical) =>
let
    // Load the file and buffer it in memory
    BufferedBinary = Binary.Buffer(File.Contents(FileName)),
    // Convert the binary to a single block of text
    Source = Text.FromBinary(BufferedBinary),
    // Find the positions of all the line breaks
    LineBreaks = List.Buffer(Text.PositionOf(Source, "#(cr,lf)", Occurrence.All)),
    // Get a list of numbers from 0 to the number of line breaks minus one
    LinePositions = List.Positions(LineBreaks),
    // Count the number of line breaks
    TotalLines = List.Count(LineBreaks),
    // Split the text by line breaks
    FirstSplit = Text.Split(Source, "#(cr,lf)"),
    // On each line, count the number of column delimiter characters
    DelimiterCounts = List.Buffer(
        List.Transform(
            LinePositions,
            each List.Count(
                Text.PositionOf(
                    FirstSplit{_},
                    DelimiterCharacter,
                    Occurrence.All)))),
    // Loop through each line and
    // a) calculate a running total of the number of column delimiter characters seen so far
    // b) divide the running total by the expected number of delimiters per line,
    //    rounding down to the nearest integer
    RunningSums = List.Buffer(
        List.Generate(
            () => [Counter = 0, RunningTotal = 0],
            each [Counter] < TotalLines,
            each [Counter = [Counter] + 1, RunningTotal = [RunningTotal] + DelimiterCounts{[Counter]}],
            each Number.IntegerDivide([RunningTotal] + DelimiterCounts{[Counter]}, DelimitersPerLine))),
    // If the result of the division calculated in the previous step is
    // the same as on the previous line, the line break falls inside a
    // field value and can be ignored, so return null;
    // otherwise return the position of the line break
    FindFullLines = List.Transform(
        LinePositions,
        each if _ = 0 or RunningSums{_} <> RunningSums{_ - 1} then LineBreaks{_} else null),
    // Remove the nulls from the list
    RemoveNulls = List.RemoveNulls(FindFullLines),
    // Split the original text at the positions of the remaining line breaks
    LineSplitterFunction = Splitter.SplitTextByPositions(List.Combine({{0}, RemoveNulls})),
    ActualLines = LineSplitterFunction(Source),
    // Convert the result to a table, splitting each line by the delimiter
    ConvertToTable = Table.FromList(
        ActualLines,
        Splitter.SplitTextByDelimiter(DelimiterCharacter),
        null,
        null,
        ExtraValues.Ignore),
    // Use the first row as headers if desired
    PromoteHeaders = if UseFirstRowAsHeaders then Table.PromoteHeaders(ConvertToTable) else ConvertToTable
in
    PromoteHeaders
[/sourcecode]

Assuming that you called this function LoadFile(), here’s an example of how to call it:

[sourcecode language="text" padlinenumbers="true"]
LoadFile(
    // File name
    "C:\MyFile.csv",
    // Column delimiter character, in this case a comma
    ",",
    // The expected number of commas to be found per line,
    // ie one less than the number of columns
    2,
    // If true then the first row contains header values
    true)
[/sourcecode]
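Note that the function doesn’t set any column types, so every column comes back as text. If you want properly typed columns you’ll need to convert them yourself afterwards; something like this (a sketch based on the sample file above) turns the Sales column into whole numbers:

[sourcecode language="text" padlinenumbers="true"]
let
    // Call the function on the sample file shown above
    Source = LoadFile("C:\MyFile.csv", ",", 2, true),
    // Everything comes back as text, so set the column types explicitly
    ChangedTypes = Table.TransformColumnTypes(
        Source,
        {{"Product", type text}, {"Comment", type text}, {"Sales", Int64.Type}})
in
    ChangedTypes
[/sourcecode]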

The comments in the code should give you an idea of how it all works. A few other things to note:

  • If any of your lines contain extra columns, this approach won’t work
  • I’ve noticed a difference between how Power BI Desktop and Excel 2016 behave with this code: the current Excel 2016 Query Editor keeps the line breaks after splitting on them, whereas Power BI Desktop does not. It’s easy to fix the Excel output, though: just use the Clean transform to remove the line breaks, as shown in the sketch after this list.
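
For example, something like this (a sketch; the column name comes from the sample file above) strips the leftover line breaks from the Comment column in Excel:

[sourcecode language="text" padlinenumbers="true"]
let
    Source = LoadFile("C:\MyFile.csv", ",", 2, true),
    // Text.Clean removes control characters, including the
    // carriage return/line feed pairs left inside the values
    CleanedComments = Table.TransformColumns(
        Source,
        {{"Comment", Text.Clean, type text}})
in
    CleanedComments
[/sourcecode]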

I’m not really sure why it was so hard to come up with code that performed well, even after liberal use of List.Buffer(). It seemed like any time I referenced the original text (as returned by the Source step) inside the List.Generate() function, performance died, so you’ll notice in the code above that I only reference the list that contains the number of delimiters found per line.
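To illustrate the pattern that did perform well, here’s a stripped-down sketch of just the running-total step: the small list of per-line delimiter counts is buffered once, and List.Generate() only ever indexes into that buffered list, never into the original text. The numbers below are the per-line comma counts from the sample file above, with two delimiters expected per line:

[sourcecode language="text" padlinenumbers="true"]
let
    // Per-line comma counts for the ten physical lines of the sample file,
    // buffered so that List.Generate indexes an in-memory list
    DelimiterCounts = List.Buffer({2, 2, 2, 1, 1, 2, 1, 0, 1, 2}),
    TotalLines = List.Count(DelimiterCounts),
    // Running total of delimiters seen so far, integer-divided by the
    // expected two delimiters per line; wherever consecutive values are
    // equal, the line break between those lines is a rogue one
    RunningSums = List.Generate(
        () => [Counter = 0, RunningTotal = 0],
        each [Counter] < TotalLines,
        each [Counter = [Counter] + 1,
            RunningTotal = [RunningTotal] + DelimiterCounts{[Counter]}],
        each Number.IntegerDivide([RunningTotal] + DelimiterCounts{[Counter]}, 2))
in
    // Returns {1, 2, 3, 3, 4, 5, 5, 5, 6, 7}
    RunningSums
[/sourcecode]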

15 thoughts on “Working With CSV Files That Contain Rogue Line Breaks In Power Query And Power BI”

  1. Can I perform this after filtering out some other data? My text file has a header and footer that need to be taken out prior to checking for the rogue line breaks. Can I load the file into Power Query and then buffer the result of my initial modifications? Then I could jump into this function.

  2. For some more exotic delimiters you might need to specify the codepage in Text.FromBinary. The line "Source = Text.FromBinary(BufferedBinary)," then has to be "Source = Text.FromBinary(BufferedBinary, 1252),".

  3. Thanks so much! I am new to Power BI.
    The file I am working on has 27 million rows. The error I get is "Buffered lists can support up to 2147483647 items and streamed lists can support up to 4503599627370496 items".
    Could you suggest a way to modify the function to use a streamed list?

  4. Thanks for this.
    I am using Power BI, so as per your comments this indeed didn’t work for me.
    In my case I edited the `Csv.Document` function and replaced `QuoteStyle=QuoteStyle.None` with `QuoteStyle=QuoteStyle.Csv`. Not sure it’s a solution for everybody but it worked for me.

    1. This works perfectly.
      Instead of
      = Csv.Document(Parameter2,[Delimiter=",", Columns=33, Encoding=1252, QuoteStyle=QuoteStyle.None])

      Use
      = Csv.Document(Parameter2,[Delimiter=",", Columns=33, Encoding=1252, QuoteStyle=QuoteStyle.Csv])

  5. Terrific! Worked “out of the box” for a problem we’ve had for months; now we can fully automate what had to be done manually every week. Thank you!

  6. I may have stumbled across a simpler solution if you are combining and transforming CSV files with rogue line breaks in text cells: in Power Query, select the ‘Transform Sample File’ query and, on its first step, click the “Settings” icon; the dialog that appears gives you the option (for CSV imports) to “ignore quoted line breaks”. This worked for a large-ish CSV import I’m doing.

  7. Good info! Is there a way to ignore quoted line breaks when building a dataflow in Service?
