Using Text.BetweenDelimiters() To Extract URLs From A Web Page In Power BI/Power Query M

The Add Column By Example functionality that appeared in the April 2017 release of Power BI Desktop is, I have to say, very cool: I’ve used it quite a bit in the last few weeks and it really does work well. One thing I noticed while using it is that three new functions have been added to the M language to support this feature:

  • Text.BetweenDelimiters() which takes a text value and extracts the part that appears between two given delimiter values
  • Text.BeforeDelimiter() which takes a text value and extracts the part that appears before a given delimiter value
  • Text.AfterDelimiter() which takes a text value and extracts the part that appears after a given delimiter value

The functions themselves are quite straightforward and the online documentation has some good examples of how to use them. To save you the click here’s an extra example – the expression:

[sourcecode language=”text” padlinenumbers=”true”]
Text.BetweenDelimiters("Hello *world!??", "*", "!")
[/sourcecode]

…returns the text “world”:

image

As it happens last week I received an email from a reader who wanted to know if it was possible to extract all the links from the href attributes in the source of a web page using M, and I realised that Text.BetweenDelimiters() would be very useful for doing this. I wrote the following M function to demonstrate:

[sourcecode language=”text”]
(SourceURL as text, AttributeDelimiter as text) =>
let
//Get HTML source
Source = Text.FromBinary(Web.Contents(SourceURL)),
//Function to find each link
GetLink = (Counter as number) =>
let
CurrentLink =
Text.BetweenDelimiters(
Source,
"href=" & AttributeDelimiter,
AttributeDelimiter,
Counter
)
in
if CurrentLink=""
then
{}
else
List.Combine({
{CurrentLink},
@GetLink(Counter+1)}
),
//Call function
Output = GetLink(0)
in
Output
[/sourcecode]

A few things to note:

  • I’m using a combination of Text.FromBinary() and Web.Contents() to get the HTML source for the web page whose links we’re extracting
  • Since HTML allows the use of single and double quotes for attributes, I’ve added a parameter to my function called AttributeDelimiter to allow either to be passed in
  • Text.BetweenDelimiters only extracts one piece of text at a time, but you can specify which occurrence of the start delimiter it uses. I therefore used recursion to extract the contents of every href attribute in the HTML: I declare a function called GetLink, and from within that function I can make a recursive call by putting an @ before the function name as in line 22 above. It would probably be better to use List.Generate() instead of recursion here though.

Assuming the query that returns this function is called GetAllLinks

image

…then it can be called in a new query like so:

[sourcecode language=”text”]
GetAllLinks(
"https://msdn.microsoft.com/en-us/library/mt798303.aspx",
""""
)
[/sourcecode]

One other thing to point out is how, in order to pass a double quote character to the function as text, since text has itself to be enclosed in double quotes I need to use four double quotes: “”””

The output of this query is a list containing all of the links from the href attributes on the page that are enclosed in double quotes:

image

I guess this could be taken even further to create a function that crawls a series of web pages and returns the links in all of them, then uses the Force Directed Graph custom visual or better still NodeXL in Excel to show which pages link to each other. I’ll leave that to someone else to do though…!

You can download a pbix file with all of the examples in this post here.

13 thoughts on “Using Text.BetweenDelimiters() To Extract URLs From A Web Page In Power BI/Power Query M

    1. Extracting email addresses is going to be a hard problem to solve. If you want to just search for any complete word that contains an @ symbol, you could write a query that does something similar to the one in this blog post https://blog.crossjoin.co.uk/2017/12/14/removing-punctuation-from-text-with-the-text-select-m-function-in-power-bi-power-query-excel-gettransform/ that gets all the words from the Complete Works Of Shakespeare and instead of counting each word, filter them so you only get the words that contain an @. However if you read articles like this one https://www.regular-expressions.info/email.html on using regular expressions to extract email addresses you’ll see it’s almost impossible to work out what the rules for identifying email addresses should be, and even implementing the rules described in this article in M will be pretty complex.

  1. Hello Chris,

    Do mind explaining how exactly your recursion logic works here. I know you you mentioned that PQ team suggest using List.Generate() instead but it will be still very interesting to know how you got it to work with recursion and why is GetLink(0)

    Thanks

    1. GetLink(0) is the first call to the GetLink() function, defined in the previous step. The first call GetLink(0) finds the first link in the text; inside the function, when the function calls itself, you’ll see that the call @GetLink(Counter+1) is to find the link that is after the link for the current call.

  2. Is it possible to iterate through many pages on a website using this method?
    eg:
    = GetAllLinks(
    {“https://www.formula1.com/en/results.html/1992/races.html”,”https://www.formula1.com/en/results.html/1993/races.html”,”https://www.formula1.com/en/results.html/1994/races.html”} ,
    “”””
    )

    or somthing like that?

  3. Hello Chris, I used your function in a personal project and it worked great, now I’m making an entry in my blog talking about it. My question is, can I quote your function?, giving you the credits and linking it to this page?.

  4. Chris, although I find your experience invaluable in helping me understand the workings of Power Query and M, I don’t share your confidence in the use of List.BetweenDelimiters.
    In principle, most people would not need to deal with nested delimiters, and I found that List.BetweenDelimiters does not work as desirable.

    API: Text.BetweenDelimiters(text as nullable text, startDelimiter as text, endDelimiter as text, optional startIndex as any, optional endIndex as any) as any

    Basic usage:
    Text.BetweenDelimiters(Text, startDelimiter, endDelimiter)

    This captures the text between the first startDelimiter and the immediate following endDelimiter. This would be equivalent to:

    Text.BetweenDelimiters(Text, startDelimiter, endDelimiter, 0, 0)
    OR
    Text.BetweenDelimiters(Text, startDelimiter, endDelimiter, {0, RelativePosition.FromStart}, {0, RelativePosition.FromStart})

    An example would transform for:
    TestInput = “AAAA (BBBB) (CCCC) (DDDD) (EEEE) (FFFF) GGGG”,
    TestOutput = “BBBB”,

    This works as expected for all 3 variations of API.

    In essence, I would consider Text.BetweenDelimiters to be similar to the application of Text.AfterDelimiter and then Text.BeforeDelimiter (or vice-versa), i.e.:
    Text.BeforeDelimiter(Text.AfterDelimiters(Text, startDelimiter, {0, RelativePosition.FromStart}), endDelimiter, {0, RelativePosition.FromStart})

    And this works as desired.

    However, if I wanted used RelativePosition.FromEnd for the endDelimiter position:
    Text.BetweenDelimiters(Text, startDelimiter, endDelimiter, {0, RelativePosition.FromStart}, {0, RelativePosition.FromEnd})

    I would naturaly expect the output:
    BBBB) (CCCC) (DDDD) (EEEE) (FFFF
    Howerver, I get:
    AAAA

    But using the extended variation I suggested:
    Text.BeforeDelimiter(Text.AfterDelimiters(Text, startDelimiter, {0, RelativePosition.FromStart}), endDelimiter, {0, RelativePosition.FromEnd})
    We actually get the desired value:
    BBBB) (CCCC) (DDDD) (EEEE) (FFFF

    The best use for demonstrating why anybody would use RelativePosition.FromEnd is for nested usage of delimiters, e.g:
    TestInput = “AAAA (BBBB (CCCC (DDDD) EEEE) FFFF) GGGG”,
    TestOutput = “BBBB (CCCC (DDDD) EEEE) FFFF”,

    Using Text.BetweenDelimiters with RelativePosition.FromEnd for endDelimiter, we get, again:
    AAAA
    But using the expanded variation with Before and After Delimiter methods we get the desired out.

    Hence, the usage of the RelativePosition has not been comprehensive.

    Below are the queries to test.

    // Test_Text_BetweenDelimiters_Basic
    let
    TestInput = “AAAA (BBBB) (CCCC) (DDDD) (EEEE) (FFFF) GGGG”,
    TestOutput1 = “BBBB (CCCC) (DDDD) (EEEE) FFFF”,
    TestOutput2 = “BBBB”,
    Result = Test_Text_BetweenDelimiters_Fn(TestInput, TestOutput1, TestOutput2)
    in
    Result

    // Test_Text_BetweenDelimiters_Nested
    let
    TestInput = “AAAA (BBBB (CCCC (DDDD) EEEE) FFFF) GGGG”,
    TestOutput1 = “BBBB (CCCC (DDDD) EEEE) FFFF”,
    TestOutput2 = “BBBB (CCCC (DDDD”,
    Result = Test_Text_BetweenDelimiters_Fn(TestInput, TestOutput1, TestOutput2)
    in
    Result

    // Test_Text_BetweenDelimiters_Fn
    (TestInput as text, TestOutput1 as text, TestOutput2 as text) as text =>
    let
    //TestInput = “AAAA (BBBB (CCCC (DDDD) EEEE) FFFF) GGGG”,
    //TestOutput1 = “BBBB (CCCC (DDDD) EEEE) FFFF”,
    //TestOutput2 = “BBBB (CCCC (DDDD”,

    //TestInput = “AAAA (BBBB) (CCCC) (DDDD) (EEEE) (FFFF) GGGG”,
    //TestOutput1 = “BBBB) (CCCC) (DDDD) (EEEE) (FFFF”,
    //TestOutput2 = “BBBB”,

    #”Extract Prefix & Postfix – Workaround” = Text.BeforeDelimiter(Text.AfterDelimiter(TestInput, “(“, {0, RelativePosition.FromStart}), “)”, {0, RelativePosition.FromEnd}),
    TestResultValid1 = #”Extract Prefix & Postfix – Workaround” = TestOutput1,
    #”Extract Prefix & Postfix – Preferred” = Text.BetweenDelimiters(TestInput, “(“, “)”, {0, RelativePosition.FromStart}, {0, RelativePosition.FromEnd}),
    TestResultValid2 = #”Extract Prefix & Postfix – Preferred” = TestOutput1,

    #”Extract Prefix & Postfix – XXXX” = Text.BetweenDelimiters(TestInput, “(“, “)”, 0, 0),
    TestResultValid3 = #”Extract Prefix & Postfix – XXXX” = TestOutput2,
    #”Extract Prefix & Postfix – YYYY” = Text.BetweenDelimiters(TestInput, “(“, “)”, {0, RelativePosition.FromStart}, {0, RelativePosition.FromStart}),
    TestResultValid4 = #”Extract Prefix & Postfix – YYYY” = TestOutput2,

    Output = if TestResultValid2 then “Preferred Solution” else if TestResultValid1 then “Workaround Solution” else “Broken”
    in
    Output

Leave a Reply