Removing Punctuation From Text With The Text.Select M Function In Power BI/Power Query/Excel Get&Transform

A new, as-yet undocumented, M function appeared in the December 2017 release of Power BI Desktop (I assume it will appear in Excel soon): Text.Select. Here’s the documentation from the Query Editor:

 TextSelect

It’s very easy to use: the first parameter takes a text value, the second parameter takes either a text value containing a single text value or a list of single characters, and it returns the text from the first parameter minus all characters that are not in the second parameter. For example, the expression:

Text.Select("Hello", "l")

…returns the text value “ll”:

image

…and the expression:

Text.Select("Hello", {"H","e","o"})

…returns the text value “Heo”:

image

There are a lot of scenarios where Text.Select will be useful, and the one that I immediately thought of was to remove punctuation from text. In one of my earliest M posts on this blog I used Text.Remove to do this while trying to find Shakespeare’s favourite words, but the problem with this approach is that you have to explicitly specify all the characters you want to remove from your text – and there could be a lot of characters that need to be excluded. Text.Select is a much better option here because it allows you to specify the characters you want to keep.

The first step to doing this is to understand how to construct the list of the characters you do want to keep. You can do this very easily in M when declaring a list using the range technique I blogged about here, so you should read that post before carrying on. The following expression returns a list containing all 26 uppercase and lowercase letters in the alphabet plus a space:

List.Combine({{"A".."Z"},{"a".."z"},{" "}})

image

Of course depending on the scenario or language you’re working with you may want to include other characters, for example apostrophes or letters with accents, too. Here’s a slightly more complex example of how this list can be used with Text.Select:

let
    SourceText = "Hi! Stop, please. What is your name?",
    CharsToInclude = List.Combine({{"A".."Z"},{"a".."z"},{" "}}),
    RemovePunc = Text.Select(SourceText, CharsToInclude)
in
    RemovePunc

The query above takes the text “Hi! Stop, please. What is your name?” and returns the text “Hi Stop please What is your name”.

image

Finally, because I couldn’t read my old M code without cringing a little bit, here’s an updated version of my query that gets the top 100 words from the Complete Works Of Shakespeare (direct from the Project Gutenberg website):

let
  URL = "http://www.gutenberg.org/cache/epub/100/pg100.txt",
  Source = Text.FromBinary(Web.Contents(URL)),
  Lowercase = Text.Lower(Source),
  RemovePunctuation = Text.Select(Lowercase, 
	List.Combine({{"a".."z"},{" "}})),
  SplitText = Splitter.SplitTextByWhitespace(QuoteStyle.None),
  SplitIntoWords = SplitText(RemovePunctuation),
  RemoveBlanks = List.Select(SplitIntoWords, each _<>" "),
  TableFromList = Table.FromColumns({RemoveBlanks},
	type table [Word=text]),
  FindWordCounts = Table.Group(
	TableFromList, 
		{"Word"}, 
		{{"Count", each Table.RowCount(_), type number}}),
  SortedRows = Table.Sort(
	FindWordCounts,
	{{"Count", Order.Descending}}),
  KeptFirstRows = Table.FirstN(SortedRows,100)
in 
  KeptFirstRows

Here they are as a word cloud (yes I know it’s not good dataviz practice, but it’s for fun):

image

You can download the .pbix file with this example in here.

BONUS FACT: another new M function appeared recently too: Function.From. You can read all about it on this thread on the Power Query forum.

4 responses

  1. Pingback: Selecting Specific Characters In M – Curated SQL

  2. Thanks. It sounds very useful, but how do you use Text.Select on text in each row of a table? I get the error “Expression.Error: We cannot convert a value of type Table to type Text.”

  3. Pingback: Removing Punctuation From Text With The Text.Select M Function In Power BI/Power Query/Excel Get&Transform - SSWUG.ORG

  4. Pingback: Extraction of number (or text) from a column with both text and number – #PowerQuery #PowerBI | Erik Svensen

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: