M · Power BI · Power Query

Using Html.Table() To Extract URLs From A Web Page In Power BI/Power Query M

Last year I blogged about how to use the Text.BetweenDelimiters() function to extract all the links from the href attributes in the source of a web page. The code was reasonably simple but there’s now an even easier way to solve the same problem using the new Html.Table() function. This function doesn’t seem to be documented online yet, but the built-in documentation for the function available in the Query Editor is up-to-date:

image

Miguel Escobar also has a great post showing how to use it and the new Web.BrowserContents function here.

Here’s an example M query that extracts all the links that start with the letters “http” from my company homepage:

[sourcecode language=’text’ padlinenumbers=’true’]
let
Source =
Web.BrowserContents(“https://www.crossjoin.co.uk/”),
Links =
Html.Table(
Source,
{{
“Link”,
“a[href^=””http””]”,
each [Attributes][href]}})
in
Links
[/sourcecode]

To explain what’s going on here:

  • Web.BrowserContents returns the text of the html DOM for the web page
  • In the second step Html.Table takes that text and searches for all <a> elements whose href attribute starts with the letters “http”. I found this CSS selector here.

29 thoughts on “Using Html.Table() To Extract URLs From A Web Page In Power BI/Power Query M

  1. Chris,

    This apparently doesn’t work the current version of Excel 365. Data, Get & Transform Data doesn’t recognize either of these functions:
    …Web.BrowserContents
    …Html.Table

    Charley

    1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
      Chris Webb says:

      Those functions obviously haven’t made their way to Excel yet; I guess they’ll be added in the next few months.

  2. Hi Chris. Great Tutorial!
    I have a issue with the refresh when I use Power BI online with the Web.BrowserContents. I see this error message:
    Query contains unknown funtion name: Web.BrowserContents.
    Query contains unknown funtion name: Html.Table.
    Do you have any idea to fix it?
    Thanks!

    1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
      Chris Webb says:

      Are you trying to use this in Excel or an old version of Power BI Desktop? Not all versions of the Power Query engine include this function.

      1. I’m using the lastest version of Power BI desktop. With the desktop i can refresh te webcontent, however when I publish to Power BI online I can’t refresh. The error message show that the query contains a unknown funcion related to HTML and Web Content. I’ll keep trying. Thanks!

  3. Very interesting tutorial Chris, I tried to apply this but I’m still struggling to get data from Yahoo Finance – Income Statement Table. I need the Income Statement (Quarterly data) for Apple from this link: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&amp;.tsrc=fin-srch

    Since the page address does not change, I have not managed to extract Quarterly data instead of Annual data which appears in the initial page.

    Do you you have a suggestion on how to manipulate the table selector (Annual /Quarterly)?

    Regards,
    Mark

    1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
      Chris Webb says:

      No, sorry, I’m not good enough with CSS selectors to help here

  4. Hi Chris, if I’m trying to get a connection to a report that is intranet based (i.e. internal organizational http website) I’m having issues to connect. Do you have any suggestion on how to get that type of connection?

  5. Hi Chris
    Is there any function in Power Query that can do the task .

    .

    i have used this
    ————————————————————————–
    (URL) as table =>
    let
    Source = Web.Page(Web.Contents(URL)),
    Data0 = Source{0}[Data],
    #”Changed Type” = Table.TransformColumnTypes(Data0,{{“Column1”, Int64.Type}, {“Column2”, type text}, {“Column3″, type text}}),
    #”Filtered Rows” = Table.SelectRows(#”Changed Type”, each ([Column1] null)),
    #”Removed Columns” = Table.RemoveColumns(#”Filtered Rows”,{“Column2″})
    in
    #”Removed Columns”
    —————————————————————————-

    let
    Source = Excel.CurrentWorkbook(){[Name=”LINKS”]}[Content],
    #”Changed Type” = Table.TransformColumnTypes(Source,{{“URL”, type text}}),
    #”Added Custom” = Table.AddColumn(#”Changed Type”, “Fetchdata”, each Companies([URL])),
    #”Expanded Fetchdata” = Table.ExpandTableColumn(#”Added Custom”, “Fetchdata”, {“Column1”, “Column3”}, {“Column1”, “Column3″})
    in
    #”Expanded Fetchdata”

    ———————————————————————————

    to import data from the web page but it imports companies names only not link address
    want to insert a new column and get the link address to get the profile of each company .

    Thank You

  6. Hello Chris!

    My example its in a Space Missions website (https://nextspaceflight.com/).
    In each page https://nextspaceflight.com/launches/past/?page=1 there are 30 launches (max) and each has 1 button called “details” that leads to another page, for example:

    Page 1, the 1st launch (at the moment) Antares 230+ | CRS NG-14 on the details button leads to https://nextspaceflight.com/launches/details/2614 with alot of details about the mission.

    How can I get this link to PowerQuery? It doesn’t work with

    {
    “Link”,
    “a[href^=””http””]”,
    each [Attributes][href]}

    because it is a buttoon? I tried replacing “href” with “onclick” or “location.href” but it is not working.

    What I want to do, is it possible? And to get all the 30 links from the 30 launches that are displayed in each of the pages? I already know how to get the bulk of the 152 pages but I am missing getting the details of the missions :(((((((

    Thank you! 😀

  7. let
    Source = Web.BrowserContents(“https://www.law360.com/compliance/news?page=2″),
    #”Extracted Table From Html” = Html.Table(Source, {{“Column1”, “.updated”}, {“Column2”, “H3 *”}, {“Column3”, “.col-md-8 .entry-content”}, {“Column4”, “:not(*)”}}, [RowSelector=”.col-md-8 .hnews”]),
    #”Changed Type” = Table.TransformColumnTypes(#”Extracted Table From Html”,{{“Column1”, type date}, {“Column2”, type text}, {“Column3”, type text}, {“Column4″, type text}})
    in
    #”Changed Type”

    I am trying to scrape data from the above website but the given instruction by Chris didn’t work for me.
    I don’t know if I’m doing it right but would anyone help me on to how get the related URL link in front of every line of the news title ?
    Also, I may want to get more than one page news, so I would be very grateful as well, if you help me on how to get or iterate to the next page of the web in Power bi.

    Thank you

    Mohamed

  8. I am trying to extract all the Video links (pop-up box, url etc) from my company site efi.com. Due to some reason, I am unable to do so with this query. Could you please advise?
    Thank you!

  9. Can we use web url link without using http or www in Power BI
    because i have one web page and i want to redirect to that page directly according to ID number selected in Power BI Table Visual
    Now that web page link has given without https or www
    How can i redirect to that page

Leave a ReplyCancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.