The Add Column By Example functionality that appeared in the April 2017 release of Power BI Desktop is, I have to say, very cool: I’ve used it quite a bit in the last few weeks and it really does work well. One thing I noticed while using it is that three new functions have been added to the M language to support this feature:
- Text.BetweenDelimiters() which takes a text value and extracts the part that appears between two given delimiter values
- Text.BeforeDelimiter() which takes a text value and extracts the part that appears before a given delimiter value
- Text.AfterDelimiter() which takes a text value and extracts the part that appears after a given delimiter value
The functions themselves are quite straightforward and the online documentation has some good examples of how to use them. To save you the click here’s an extra example – the expression:
Text.BetweenDelimiters("Hello *world!??", "*", "!")
…returns the text “world”:
As it happens last week I received an email from a reader who wanted to know if it was possible to extract all the links from the href attributes in the source of a web page using M, and I realised that Text.BetweenDelimiters() would be very useful for doing this. I wrote the following M function to demonstrate:
(SourceURL as text, AttributeDelimiter as text) => let //Get HTML source Source = Text.FromBinary(Web.Contents(SourceURL)), //Function to find each link GetLink = (Counter as number) => let CurrentLink = Text.BetweenDelimiters( Source, "href=" & AttributeDelimiter, AttributeDelimiter, Counter ) in if CurrentLink="" then {} else List.Combine({ {CurrentLink}, @GetLink(Counter+1)} ), //Call function Output = GetLink(0) in Output
A few things to note:
- I’m using a combination of Text.FromBinary() and Web.Contents() to get the HTML source for the web page whose links we’re extracting
- Since HTML allows the use of single and double quotes for attributes, I’ve added a parameter to my function called AttributeDelimiter to allow either to be passed in
- Text.BetweenDelimiters only extracts one piece of text at a time, but you can specify which occurrence of the start delimiter it uses. I therefore used recursion to extract the contents of every href attribute in the HTML: I declare a function called GetLink, and from within that function I can make a recursive call by putting an @ before the function name as in line 22 above. It would probably be better to use List.Generate() instead of recursion here though.
Assuming the query that returns this function is called GetAllLinks
…then it can be called in a new query like so:
GetAllLinks( "https://msdn.microsoft.com/en-us/library/mt798303.aspx", """" )
One other thing to point out is how, in order to pass a double quote character to the function as text, since text has itself to be enclosed in double quotes I need to use four double quotes: “”””
The output of this query is a list containing all of the links from the href attributes on the page that are enclosed in double quotes:
I guess this could be taken even further to create a function that crawls a series of web pages and returns the links in all of them, then uses the Force Directed Graph custom visual or better still NodeXL in Excel to show which pages link to each other. I’ll leave that to someone else to do though…!
You can download a pbix file with all of the examples in this post here.
Pingback: daily 04/25/2017 | Cshonea's Blog
Congratulations as always.
How do I extract an email from a phrase using Power Query? Because all email has the “@” delimiter.
Example: “Hello, my name is David and this is my email: davi@hotmail.com to send the blog files.”
Extracting email addresses is going to be a hard problem to solve. If you want to just search for any complete word that contains an @ symbol, you could write a query that does something similar to the one in this blog post https://blog.crossjoin.co.uk/2017/12/14/removing-punctuation-from-text-with-the-text-select-m-function-in-power-bi-power-query-excel-gettransform/ that gets all the words from the Complete Works Of Shakespeare and instead of counting each word, filter them so you only get the words that contain an @. However if you read articles like this one https://www.regular-expressions.info/email.html on using regular expressions to extract email addresses you’ll see it’s almost impossible to work out what the rules for identifying email addresses should be, and even implementing the rules described in this article in M will be pretty complex.
Hello Chris,
Do mind explaining how exactly your recursion logic works here. I know you you mentioned that PQ team suggest using List.Generate() instead but it will be still very interesting to know how you got it to work with recursion and why is GetLink(0)
Thanks
GetLink(0) is the first call to the GetLink() function, defined in the previous step. The first call GetLink(0) finds the first link in the text; inside the function, when the function calls itself, you’ll see that the call @GetLink(Counter+1) is to find the link that is after the link for the current call.
Pingback: Using Html.Table() To Extract URLs From A Web Page In Power BI/Power Query M « Chris Webb's BI Blog
Is it possible to iterate through many pages on a website using this method?
eg:
= GetAllLinks(
{“https://www.formula1.com/en/results.html/1992/races.html”,”https://www.formula1.com/en/results.html/1993/races.html”,”https://www.formula1.com/en/results.html/1994/races.html”} ,
“”””
)
or somthing like that?
You need to create a table with one row for each url, then call the function once for each row. There are lots of blog posts out there showing how to do this, for example https://adatis.co.uk/loop-through-multiple-web-pages-using-power-query/
Pingback: Chris Webb's BI Blog: Using Html.Table() To Extract URLs From A Web Page In Power BI/Power Query M Chris Webb's BI Blog