Working With CSV Files That Contain Rogue Line Breaks In Power Query And Power BI

CSV files often contain text fields, and text fields often contain line breaks that shouldn’t be there (especially if the text is entered by end users). For example, here is are the contents of a simple CSV file that should contain three fields(Product, Comment and Sales) and six rows:

Apples,This is a commment,10
Oranges,Another comment,20
Pears,A comment with a
line break in,30
Grapes,A normal comment,40
Bananas,A comment
two line breaks,50
Pineapples,A normal comment,60

However, as you can see, the Comment field contains line breaks in some cases. If you try to load this file into Power Query/Power BI, you’ll see that it doesn’t handle things well:


The easiest way of solving this problem is to make sure you have double quotes around the text in the Comment field, so that Power Query/Power BI knows to ignore the line breaks:

Apples,"This is a commment",10
Oranges,"Another comment",20
Pears,"A comment with a
line break in",30
Grapes,"A normal comment",40
Bananas,"A comment
two line breaks",50
Pineapples,"A normal comment",60


[By the way, no text in the screenshot above has been truncated. The line breaks are still there in the Comment field; you can see the full text in the bottom pane of the Query Editor by clicking on a cell in the table.]

But what if you can’t fix the source data? I was asked this question the other week, and since I had been asked about it before and not come up with a good answer, I decided to spend some time researching the problem.

What I found was that it was relatively easy to write some M code that gave me the correct results, but very hard to write code that performed acceptably well on a large data set (I was testing on a CSV file containing almost half a million rows). Here’s the code for the function I ended up with:

(FileName as text, DelimiterCharacter as text, DelimitersPerLine as number, UseFirstRowAsHeaders as logical) =>
    // Load file and buffer in memory
    BufferedBinary = Binary.Buffer(File.Contents(FileName)),
    // Convert to a single block of text
    Source = Text.FromBinary(BufferedBinary),
    // Find the position of all line breaks
    LineBreaks = List.Buffer(Text.PositionOf(Source, "#(cr,lf)", Occurrence.All)),
    // Get a list of numbers from 0 to the number of line breaks-1
    LinePositions = List.Positions(LineBreaks),
    // Count the number of line breaks
    TotalLines = List.Count(LineBreaks),
    // Split the text by line breaks
    FirstSplit = Text.Split(Source, "#(cr,lf)"),
    // On each line, count the number of column delimiter characters
    DelimiterCounts = List.Buffer(
                            each List.Count(Text.PositionOf(FirstSplit{_}, 
    // Loop through each line and
    // a) Calculate a running total of the number of column delimiter characters on each line
    // b) Divide the running total by the expected number of delimiters per line, and round down to the nearest integer
    RunningSums = List.Buffer(
                            ()=>[Counter=0, RunningTotal=0], 
                            each [Counter]<TotalLines, 
                            each [Counter=[Counter]+1, RunningTotal=[RunningTotal]+DelimiterCounts{[Counter]}],
                            each Number.IntegerDivide([RunningTotal]+DelimiterCounts{[Counter]},DelimitersPerLine))),
    // If the result of the division calculated in the previous step is
    // the same as on the previous line, then the line break can be ignored
    // so return null, otherwise return the position of the line break
    FindFullLines = List.Transform(LinePositions, each if _=0 or RunningSums{_}<>RunningSums{_-1} then LineBreaks{_} else null),
    // Remove nulls from the list
    RemoveNulls = List.RemoveNulls(FindFullLines),
    // Split the original text by the positions of the line breaks left
    LineSplitterFunction = Splitter.SplitTextByPositions(List.Combine({{0},RemoveNulls})),
    ActualLines = LineSplitterFunction(Source),
    // Convert the result to a table
    ConvertToTable = Table.FromList(ActualLines, Splitter.SplitTextByDelimiter(DelimiterCharacter), null, null, ExtraValues.Ignore),
    // Use the first rows as headers if desired
    PromoteHeaders = if UseFirstRowAsHeaders then Table.PromoteHeaders(ConvertToTable) else ConvertToTable

Assuming that you called this function LoadFile(), here’s an example of how to call it:

//File name
//Column delimiter character - in this case a comma
//The expected number of commas to be found per line
//ie one less than the number of columns
//If true then first row contains header values

The comments in the code should give you an idea of how it all works. A few other things to note:

  • If any of your lines contain extra columns, this approach won’t work
  • I’ve noticed a difference between how the Power BI Desktop and Excel 2016 behave with this code: the current Excel 2016 Query Editor will keep any line breaks after splitting on them, whereas Power BI Desktop does not. It’s easy to fix the Excel output though: just use the Clean transform to remove the line breaks.

I’m not really sure why it was so hard to come up with code that performed well, even after liberal use of List.Buffer(). It seemed like any time I referenced the original text (as returned by the Source step) in the List.Generate() function then performance died – so you’ll notice in the code above that I’m only referencing the list that contains the number of delimiters found per line.

14 responses

  1. Pingback: DAX equivalent of Excel CHAR() function?

  2. Can I perform this after filtering out some other data? My text file has a header and footer that need taken out prior to checking for the rogue line breaks. Can I load the file into power query and then buffer the result of my initial modifications? Then I could jump into this function.

  3. For some more exotic delimiters you might need to specify the codepage in Text.FromBinary. The line “Source = Text.FromBinary(BufferedBinary),” then has to be “Source = Text.FromBinary(BufferedBinary, 1252),”.

  4. Pingback: An In-Depth Look At The Csv.Document M Function – Chris Webb's BI Blog

  5. Thanks so much! I am new to Power BI.
    The file I am working on has 27 Millions rows. The error I get is “Buffered lists can support up to 2147483647 items and streamed lists can support up to 4503599627370496 items”.
    Could you suggest a way to modify the function to use streamed list?

  6. Thanks for this.
    I am using Powe BI so this indeed didn’t work for me as per your comments.
    In my case I edited the `Csv.Document` function and replaced `QuoteStyle=QuoteStyle.None` with `QuoteStyle=QuoteStyle.Csv`. Not sure it’s a solution for everybody but it worked for me

    • This works perfectly.
      Instead of
      = Csv.Document(Parameter2,[Delimiter=”,”, Columns=33, Encoding=1252, QuoteStyle=QuoteStyle.None])

      = Csv.Document(Parameter2,[Delimiter=”,”, Columns=33, Encoding=1252, QuoteStyle=QuoteStyle.Csv])

  7. Terrific! worked “out of the box” for a problem we’ve had for months – now we can fully-automate what had to be manually done every week – thank you!

  8. I may have stumbled across a simpler solution if you are combining and transforming csv files with rogue line breaks in text cells – from PowerQuery select the ‘Transform Sample file’ query, and the first step, select the “Settings” icon, and from the requester you have the option (for csv imports) to “ignore quoted line breaks”. This worked for a large-ish csv import I’m doing

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: