Removing HTML Tags From Text In Power Query/Power BI

Recently I found I needed to remove all the HTML tags from some text in Power Query. I searched and found a great – if complex – function from Marcel Beug here, but I realised that since that post was written the Html.Table M function has been added to the language and that it makes the task very easy. In fact it’s basically the same as the solution I blogged about here for extracting URLs from a web page.

Here’s an example of how to do it:

let
    Source = 
        "<!DOCTYPE html>
        <html>
        <body>
        <h1>A Heading</h1>
        <p>A paragraph</p>
        </body>
        </html>",
    RemoveTags = Html.Table(Source, {{"text",":root"}}),
    GetText = RemoveTags[text]{0}
in
    GetText

Here’s the output of this query:

image

The important part is the call to Html.Table using the “:root” CSS selector:

Html.Table(Source, {{"text",":root"}})

This returns a table with one column and one row, containing the text; the GetText step in the query above just gets the contents of the only cell in that table.

11 responses

  1. Chris-
    Thank you for this post. Does this or anything else like it address ascii characters and symbols effectively? I’ve inherited text with both tags and lots of commas, quotes, colons, etc. that I’m trying to replace as well. I’m currently using a solution with List.Generate and iterating over lists with codes and symbols to replace, but I’m finding it’s not scaling well.

    Thank you!

    https://www.ascii.cl/htmlcodes.htm

  2. Pingback: Removing HTML Tags From Text In Power Query/Power BI | Pardaan.com

  3. I’ve been using this for a while and it’s been working great in Power BI Desktop, but we just deployed our first report with this trick into Power BI Server and we ran into a really strange situation – it seems to return blank after a scheduled data refresh. It doesn’t fail or cause the data refresh to fail, but instead the refresh succeeds and then all results are blank.

    Any idea what’s going on here? I was able to create a test to isolate this function and it causes the issue – just deploy it to either the May or September 2019 release of Power BI Server, and verify that the data works – schedule a data refresh, and boom, the results of the function for all rows are blank. The attached example uses an Excel file, though it happened to us with SQL Server data, so the source doesn’t seem to matter.

    Thoughts? I don’t have access to Power BI Web, but given the lack of questions here, I’m guessing that works perfectly 🙂

    http://sqlryan.com/Downloads/HtmlRemovalTest.zip

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: