Removing HTML Tags From Text In Power Query/Power BI

Recently I found I needed to remove all the HTML tags from some text in Power Query. I searched and found a great – if complex – function from Marcel Beug here, but I realised that since that post was written the Html.Table M function has been added to the language and that it makes the task very easy. In fact it’s basically the same as the solution I blogged about here for extracting URLs from a web page.

Here’s an example of how to do it:

    Source = 
        "<!DOCTYPE html>
        <h1>A Heading</h1>
        <p>A paragraph</p>
    RemoveTags = Html.Table(Source, {{"text",":root"}}),
    GetText = RemoveTags[text]{0}

Here’s the output of this query:


The important part is the call to Html.Table using the “:root” CSS selector:

Html.Table(Source, {{"text",":root"}})

This returns a table with one column and one row, containing the text; the GetText step in the query above just gets the contents of the only cell in that table.

9 responses

  1. Chris-
    Thank you for this post. Does this or anything else like it address ascii characters and symbols effectively? I’ve inherited text with both tags and lots of commas, quotes, colons, etc. that I’m trying to replace as well. I’m currently using a solution with List.Generate and iterating over lists with codes and symbols to replace, but I’m finding it’s not scaling well.

    Thank you!

  2. Pingback: Removing HTML Tags From Text In Power Query/Power BI |

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: