Using Html.Table() To Extract URLs From A Web Page In Power BI/Power Query M

Last year I blogged about how to use the Text.BetweenDelimiters() function to extract all the links from the href attributes in the source of a web page. The code was reasonably simple but there’s now an even easier way to solve the same problem using the new Html.Table() function. This function doesn’t seem to be documented online yet, but the built-in documentation for the function available in the Query Editor is up-to-date:

[Image: built-in documentation for Html.Table() in the Query Editor]

Miguel Escobar also has a great post showing how to use it and the new Web.BrowserContents function here.

Here’s an example M query that extracts all the links that start with the letters “http” from my company homepage:

let
    Source = Web.BrowserContents("https://www.crossjoin.co.uk/"),
    Links = Html.Table(
        Source,
        {{"Link", "a[href^=""http""]", each [Attributes][href]}}
    )
in
    Links

[Image: output of the query]

To explain what’s going on here:

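  • Web.BrowserContents returns the HTML of the web page's DOM as text
  • In the second step, Html.Table takes that text and searches for all <a> elements whose href attribute starts with the letters “http”, using the CSS selector a[href^="http"]. I found this CSS selector here. The third item in the column definition, each [Attributes][href], returns the value of the href attribute for each matching element.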
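
To see the selector in isolation, here is a minimal sketch that passes a small literal HTML string to Html.Table instead of a live page. The sample HTML and the example.com/example.org addresses are made up for illustration; the column definition is the same one used in the query above.

```powerquery
let
    // Hypothetical sample HTML, used in place of Web.BrowserContents
    // so the query does not depend on a live web page
    SampleHtml =
        "<div>"
        & "<a href=""https://example.com/a"">A</a>"
        & "<a href=""/local-page"">Local</a>"
        & "<a href=""http://example.org/b"">B</a>"
        & "</div>",
    // One output column called Link: the CSS selector keeps only <a>
    // elements whose href starts with "http", and the function returns
    // the href attribute value rather than the element's text
    Links =
        Html.Table(
            SampleHtml,
            {{"Link", "a[href^=""http""]", each [Attributes][href]}}
        )
in
    Links
```

If the selector behaves as described, the relative link /local-page should be left out of the result, leaving only the two absolute URLs.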

17 responses

  1. Chris,

    This apparently doesn’t work in the current version of Excel 365. Data, Get & Transform Data doesn’t recognize either of these functions:
    …Web.BrowserContents
    …Html.Table

    Charley

  2. Pingback: Dew Drop - August 31, 2018 (#2794) - Morning Dew

  3. Pingback: Web Scraping with Html.Table in Power Query – BI Polar

  4. Hi Chris. Great Tutorial!
    I have an issue with the refresh when I use Power BI online with Web.BrowserContents. I see this error message:
    Query contains unknown function name: Web.BrowserContents.
    Query contains unknown function name: Html.Table.
    Do you have any idea to fix it?
    Thanks!

      • I’m using the latest version of Power BI Desktop. With the desktop I can refresh the web content; however, when I publish to Power BI online I can’t refresh. The error message shows that the query contains an unknown function related to HTML and web content. I’ll keep trying. Thanks!

  5. Pingback: Power BI/Power Query – Web Scraping (a full page with images) – Excel and BI

  6. Pingback: How to get the most out of Web.BrowserContents() - Foster BI

  7. Very interesting tutorial Chris, I tried to apply this but I’m still struggling to get data from Yahoo Finance – Income Statement Table. I need the Income Statement (Quarterly data) for Apple from this link: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&.tsrc=fin-srch

    Since the page address does not change, I have not managed to extract Quarterly data instead of Annual data which appears in the initial page.

    Do you have a suggestion on how to manipulate the table selector (Annual/Quarterly)?

    Regards,
    Mark

  8. Pingback: Removing HTML Tags From Text In Power Query/Power BI « Chris Webb's BI Blog
