Removing HTML Tags From Text In Power Query/Power BI

Recently I found I needed to remove all the HTML tags from some text in Power Query. I searched and found a great – if complex – function from Marcel Beug here, but I realised that since that post was written the Html.Table M function has been added to the language and that it makes the task very easy. In fact it’s basically the same as the solution I blogged about here for extracting URLs from a web page.

Here’s an example of how to do it:

[sourcecode language=”text” padlinenumbers=”true”]
let
Source =
"<!DOCTYPE html>
<html>
<body>
<h1>A Heading</h1>
<p>A paragraph</p>
</body>
</html>",
RemoveTags = Html.Table(Source, {{"text",":root"}}),
GetText = RemoveTags[text]{0}
in
GetText
[/sourcecode]

Here’s the output of this query:

image

The important part is the call to Html.Table using the “:root” CSS selector:

[sourcecode language=”text”]
Html.Table(Source, {{"text",":root"}})
[/sourcecode]

This returns a table with one column and one row, containing the text; the GetText step in the query above just gets the contents of the only cell in that table.

14 thoughts on “Removing HTML Tags From Text In Power Query/Power BI

  1. Chris-
    Thank you for this post. Does this or anything else like it address ascii characters and symbols effectively? I’ve inherited text with both tags and lots of commas, quotes, colons, etc. that I’m trying to replace as well. I’m currently using a solution with List.Generate and iterating over lists with codes and symbols to replace, but I’m finding it’s not scaling well.

    Thank you!

    https://www.ascii.cl/htmlcodes.htm

      1. Thanks for the quick reply. Text.Select looks like a very handy function!

        From your article, I thought that Html.Table was scoped to tags only. Once I was able to test, it turns out symbols shown as “ ” for non-breaking space, “"” for double quotes, etc. are also converted by Html.Table. That’s what I was chasing. Thank you, Chris and good luck at Microsoft.

      2. The site converted my codes (rookie move in my part) so let’s try this to clarify.

        Was referring to things like “&_nbsp_;” and “&_#_34;” for non-breaking spaces and double quotes, respectively. Hopefully that works.

      3. Chris,
        The function you recommended, Html.Table, seems to have taken care of it. The way I originally interpreted your article was that it only applied to the tags. I think it’s settled.

  2. I’ve been using this for a while and it’s been working great in Power BI Desktop, but we just deployed our first report with this trick into Power BI Server and we ran into a really strange situation – it seems to return blank after a scheduled data refresh. It doesn’t fail or cause the data refresh to fail, but instead the refresh succeeds and then all results are blank.

    Any idea what’s going on here? I was able to create a test to isolate this function and it causes the issue – just deploy it to either the May or September 2019 release of Power BI Server, and verify that the data works – schedule a data refresh, and boom, the results of the function for all rows are blank. The attached example uses an Excel file, though it happened to us with SQL Server data, so the source doesn’t seem to matter.

    Thoughts? I don’t have access to Power BI Web, but given the lack of questions here, I’m guessing that works perfectly 🙂

    http://sqlryan.com/Downloads/HtmlRemovalTest.zip

  3. Thank you very much! this is very helpful.
    I am running into one problem, using your formula – Power BI is showing the value as a link named “Record” – I can click on that link to see the text, but is there a way to just return the text? instead of a link for Record or Table?

  4. If people are interested on how to apply this technique for a table full of html text (or text that contains html tags), you can use the following method (my apologies for the bad identation):

    Table.TransformColumns(Source,
    {
    {“ColumnName”, each Table.FirstValue(HTML.Table(_, {{“text”, “:root”}})}
    }

    )

    This applies the HTML.Table function to each record in the table.

Leave a Reply to Darin NicholsCancel reply