Using Text.BetweenDelimiters() To Extract URLs From A Web Page In Power BI/Power Query M

Chris Webb M, Power BI, Power BI Desktop, Power Query April 25, 2017January 18, 2020 2 Minutes

The Add Column By Example functionality that appeared in the April 2017 release of Power BI Desktop is, I have to say, very cool: I’ve used it quite a bit in the last few weeks and it really does work well. One thing I noticed while using it is that three new functions have been added to the M language to support this feature:

Text.BetweenDelimiters() which takes a text value and extracts the part that appears between two given delimiter values
Text.BeforeDelimiter() which takes a text value and extracts the part that appears before a given delimiter value
Text.AfterDelimiter() which takes a text value and extracts the part that appears after a given delimiter value

The functions themselves are quite straightforward and the online documentation has some good examples of how to use them. To save you the click here’s an extra example – the expression:

[sourcecode language=”text” padlinenumbers=”true”]
Text.BetweenDelimiters("Hello *world!??", "*", "!")
[/sourcecode]

…returns the text “world”:

As it happens last week I received an email from a reader who wanted to know if it was possible to extract all the links from the href attributes in the source of a web page using M, and I realised that Text.BetweenDelimiters() would be very useful for doing this. I wrote the following M function to demonstrate:

[sourcecode language=”text”]
(SourceURL as text, AttributeDelimiter as text) =>
let
//Get HTML source
Source = Text.FromBinary(Web.Contents(SourceURL)),
//Function to find each link
GetLink = (Counter as number) =>
let
CurrentLink =
Text.BetweenDelimiters(
Source,
"href=" & AttributeDelimiter,
AttributeDelimiter,
Counter
)
in
if CurrentLink=""
then
{}
else
List.Combine({
{CurrentLink},
@GetLink(Counter+1)}
),
//Call function
Output = GetLink(0)
in
Output
[/sourcecode]

A few things to note:

I’m using a combination of Text.FromBinary() and Web.Contents() to get the HTML source for the web page whose links we’re extracting
Since HTML allows the use of single and double quotes for attributes, I’ve added a parameter to my function called AttributeDelimiter to allow either to be passed in
Text.BetweenDelimiters only extracts one piece of text at a time, but you can specify which occurrence of the start delimiter it uses. I therefore used recursion to extract the contents of every href attribute in the HTML: I declare a function called GetLink, and from within that function I can make a recursive call by putting an @ before the function name as in line 22 above. It would probably be better to use List.Generate() instead of recursion here though.

Assuming the query that returns this function is called GetAllLinks

…then it can be called in a new query like so:

[sourcecode language=”text”]
GetAllLinks(
"https://msdn.microsoft.com/en-us/library/mt798303.aspx",
""""
)
[/sourcecode]

One other thing to point out is how, in order to pass a double quote character to the function as text, since text has itself to be enclosed in double quotes I need to use four double quotes: “”””

The output of this query is a list containing all of the links from the href attributes on the page that are enclosed in double quotes:

I guess this could be taken even further to create a function that crawls a series of web pages and returns the links in all of them, then uses the Force Directed Graph custom visual or better still NodeXL in Excel to show which pages link to each other. I’ll leave that to someone else to do though…!

You can download a pbix file with all of the examples in this post here.

Published by Chris Webb

My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel. View all posts by Chris Webb

Published April 25, 2017January 18, 2020

13 thoughts on “Using Text.BetweenDelimiters() To Extract URLs From A Web Page In Power BI/Power Query M”

Pingback: daily 04/25/2017 | Cshonea's Blog
Davi Martins says:

January 6, 2018 at 10:40 pm

Congratulations as always.

How do I extract an email from a phrase using Power Query? Because all email has the “@” delimiter.

Example: “Hello, my name is David and this is my email: davi@hotmail.com to send the blog files.”

Loading...

Reply
1. Chris Webb says:
  
  January 8, 2018 at 3:44 pm
  
  Extracting email addresses is going to be a hard problem to solve. If you want to just search for any complete word that contains an @ symbol, you could write a query that does something similar to the one in this blog post https://blog.crossjoin.co.uk/2017/12/14/removing-punctuation-from-text-with-the-text-select-m-function-in-power-bi-power-query-excel-gettransform/ that gets all the words from the Complete Works Of Shakespeare and instead of counting each word, filter them so you only get the words that contain an @. However if you read articles like this one https://www.regular-expressions.info/email.html on using regular expressions to extract email addresses you’ll see it’s almost impossible to work out what the rules for identifying email addresses should be, and even implementing the rules described in this article in M will be pretty complex.
  
  Loading...
  
  Reply
Mike says:

January 26, 2018 at 10:06 pm

Hello Chris,

Do mind explaining how exactly your recursion logic works here. I know you you mentioned that PQ team suggest using List.Generate() instead but it will be still very interesting to know how you got it to work with recursion and why is GetLink(0)

Thanks

Loading...

Reply
1. Chris Webb says:
  
  January 27, 2018 at 5:56 pm
  
  GetLink(0) is the first call to the GetLink() function, defined in the previous step. The first call GetLink(0) finds the first link in the text; inside the function, when the function calls itself, you’ll see that the call @GetLink(Counter+1) is to find the link that is after the link for the current call.
  
  Loading...
  
  Reply
Pingback: Using Html.Table() To Extract URLs From A Web Page In Power BI/Power Query M « Chris Webb's BI Blog
Chris Mc says:

April 18, 2020 at 11:20 am

Is it possible to iterate through many pages on a website using this method?
eg:
= GetAllLinks(
{“https://www.formula1.com/en/results.html/1992/races.html”,”https://www.formula1.com/en/results.html/1993/races.html”,”https://www.formula1.com/en/results.html/1994/races.html”} ,
“”””
)

or somthing like that?

Loading...

Reply
1. Chris Webb says:
  
  April 20, 2020 at 4:12 pm
  
  You need to create a table with one row for each url, then call the function once for each row. There are lots of blog posts out there showing how to do this, for example https://adatis.co.uk/loop-through-multiple-web-pages-using-power-query/
  
  Loading...
  
  Reply
Pingback: Chris Webb's BI Blog: Using Html.Table() To Extract URLs From A Web Page In Power BI/Power Query M Chris Webb's BI Blog
Guillermo Leon Rodriguez says:

April 27, 2022 at 3:50 pm

Hello Chris, I used your function in a personal project and it worked great, now I’m making an entry in my blog talking about it. My question is, can I quote your function?, giving you the credits and linking it to this page?.

Loading...

Reply
1. Chris Webb says:
  
  April 27, 2022 at 5:32 pm
  
  Yes, go ahead!
  
  Loading...
  
  Reply
Pingback: Power Query: una herramienta potente - Savingl
Wally says:

February 13, 2023 at 6:38 am

Chris, although I find your experience invaluable in helping me understand the workings of Power Query and M, I don’t share your confidence in the use of List.BetweenDelimiters.
In principle, most people would not need to deal with nested delimiters, and I found that List.BetweenDelimiters does not work as desirable.

API: Text.BetweenDelimiters(text as nullable text, startDelimiter as text, endDelimiter as text, optional startIndex as any, optional endIndex as any) as any

Basic usage:
Text.BetweenDelimiters(Text, startDelimiter, endDelimiter)

This captures the text between the first startDelimiter and the immediate following endDelimiter. This would be equivalent to:

Text.BetweenDelimiters(Text, startDelimiter, endDelimiter, 0, 0)
OR
Text.BetweenDelimiters(Text, startDelimiter, endDelimiter, {0, RelativePosition.FromStart}, {0, RelativePosition.FromStart})

An example would transform for:
TestInput = “AAAA (BBBB) (CCCC) (DDDD) (EEEE) (FFFF) GGGG”,
TestOutput = “BBBB”,

This works as expected for all 3 variations of API.

In essence, I would consider Text.BetweenDelimiters to be similar to the application of Text.AfterDelimiter and then Text.BeforeDelimiter (or vice-versa), i.e.:
Text.BeforeDelimiter(Text.AfterDelimiters(Text, startDelimiter, {0, RelativePosition.FromStart}), endDelimiter, {0, RelativePosition.FromStart})

And this works as desired.

However, if I wanted used RelativePosition.FromEnd for the endDelimiter position:
Text.BetweenDelimiters(Text, startDelimiter, endDelimiter, {0, RelativePosition.FromStart}, {0, RelativePosition.FromEnd})

I would naturaly expect the output:
BBBB) (CCCC) (DDDD) (EEEE) (FFFF
Howerver, I get:
AAAA

But using the extended variation I suggested:
Text.BeforeDelimiter(Text.AfterDelimiters(Text, startDelimiter, {0, RelativePosition.FromStart}), endDelimiter, {0, RelativePosition.FromEnd})
We actually get the desired value:
BBBB) (CCCC) (DDDD) (EEEE) (FFFF

The best use for demonstrating why anybody would use RelativePosition.FromEnd is for nested usage of delimiters, e.g:
TestInput = “AAAA (BBBB (CCCC (DDDD) EEEE) FFFF) GGGG”,
TestOutput = “BBBB (CCCC (DDDD) EEEE) FFFF”,

Using Text.BetweenDelimiters with RelativePosition.FromEnd for endDelimiter, we get, again:
AAAA
But using the expanded variation with Before and After Delimiter methods we get the desired out.

Hence, the usage of the RelativePosition has not been comprehensive.

Below are the queries to test.

// Test_Text_BetweenDelimiters_Basic
let
TestInput = “AAAA (BBBB) (CCCC) (DDDD) (EEEE) (FFFF) GGGG”,
TestOutput1 = “BBBB (CCCC) (DDDD) (EEEE) FFFF”,
TestOutput2 = “BBBB”,
Result = Test_Text_BetweenDelimiters_Fn(TestInput, TestOutput1, TestOutput2)
in
Result

// Test_Text_BetweenDelimiters_Nested
let
TestInput = “AAAA (BBBB (CCCC (DDDD) EEEE) FFFF) GGGG”,
TestOutput1 = “BBBB (CCCC (DDDD) EEEE) FFFF”,
TestOutput2 = “BBBB (CCCC (DDDD”,
Result = Test_Text_BetweenDelimiters_Fn(TestInput, TestOutput1, TestOutput2)
in
Result

// Test_Text_BetweenDelimiters_Fn
(TestInput as text, TestOutput1 as text, TestOutput2 as text) as text =>
let
//TestInput = “AAAA (BBBB (CCCC (DDDD) EEEE) FFFF) GGGG”,
//TestOutput1 = “BBBB (CCCC (DDDD) EEEE) FFFF”,
//TestOutput2 = “BBBB (CCCC (DDDD”,

//TestInput = “AAAA (BBBB) (CCCC) (DDDD) (EEEE) (FFFF) GGGG”,
//TestOutput1 = “BBBB) (CCCC) (DDDD) (EEEE) (FFFF”,
//TestOutput2 = “BBBB”,

#”Extract Prefix & Postfix – Workaround” = Text.BeforeDelimiter(Text.AfterDelimiter(TestInput, “(“, {0, RelativePosition.FromStart}), “)”, {0, RelativePosition.FromEnd}),
TestResultValid1 = #”Extract Prefix & Postfix – Workaround” = TestOutput1,
#”Extract Prefix & Postfix – Preferred” = Text.BetweenDelimiters(TestInput, “(“, “)”, {0, RelativePosition.FromStart}, {0, RelativePosition.FromEnd}),
TestResultValid2 = #”Extract Prefix & Postfix – Preferred” = TestOutput1,

#”Extract Prefix & Postfix – XXXX” = Text.BetweenDelimiters(TestInput, “(“, “)”, 0, 0),
TestResultValid3 = #”Extract Prefix & Postfix – XXXX” = TestOutput2,
#”Extract Prefix & Postfix – YYYY” = Text.BetweenDelimiters(TestInput, “(“, “)”, {0, RelativePosition.FromStart}, {0, RelativePosition.FromStart}),
TestResultValid4 = #”Extract Prefix & Postfix – YYYY” = TestOutput2,

Output = if TestResultValid2 then “Preferred Solution” else if TestResultValid1 then “Workaround Solution” else “Broken”
in
Output

Loading...

Reply