Finding Shakespeare’s Favourite Words With Data Explorer (Power Query)

NOTE: This post was written before Data Explorer was renamed Power Query. All of the content is still relevant to Power Query.

The more I play with Data Explorer, the more I think my initial assessment of it as a self-service ETL tool was wrong. As Jamie pointed out recently, it's really the M language with a GUI on top of it, and the GUI itself, while good, doesn't begin to expose the power of the underlying language: I'd urge you to take a look at the Formula Language Specification and Library Specification documents, which can be downloaded from here, to see for yourself. So while it can certainly be used for self-service ETL, it can do much, much more than that…

In this post I’ll show you an example of what Data Explorer can do once you go beyond the UI. Starting off with a text file containing the complete works of William Shakespeare (which can be downloaded from here – it’s strange to think that it’s just a 5.3 MB text file) I’m going to find the top 100 most frequently used words and display them in a table in Excel.

Before I do that, though, some things to point out. First, there's a new update of Data Explorer that appeared a few days ago – the Data Explorer team blog has the details. One of the new bits of functionality is a button that allows you to edit all of the expressions in your query at once.


Second, when you're building a query and you want to add a new step manually that refers to the previous step but doesn't apply any calculations or transforms, you need to click on the fx button next to the expression on an existing step.

I’ve used this quite extensively to write custom steps that aren’t possible with the UI.
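To illustrate, here's a minimal sketch of what such a manually-added step looks like in the underlying M (the step names and file path here are just placeholders, not taken from the query below): the new step is simply a new name whose expression refers to the previous step, and you can then edit that expression into whatever you need.

let
    Source = Lines.FromBinary(File.Contents("C:\SomeFolder\SomeFile.txt")),
    //clicking the fx button adds a step like this, which initially just returns the previous step...
    Custom1 = Source
    //...and you can then edit its formula into any custom expression
in
    Custom1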

Here is the full code for my query to find the top 100 words:

let
    Source = Lines.FromBinary(File.Contents("C:\Users\Chris\Documents\CompleteShakespeare.txt")),
    RemoveLineBreaks = Lines.ToText(Source, " "),
    RemovePunctuation = Text.Remove(RemoveLineBreaks,{"," ,"." ,"?" ,";" ,":" ,";" ,"'" ,"@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*"}),
    Lowercase = Text.Lower(RemovePunctuation),
    TurnIntoTable = Table.FromValue(Lowercase),
    SplitIntoWords = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()}),
    ListOfWords = SplitIntoWords{0}[Value],
    TableFromList = Table.FromList(ListOfWords, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    RenameColumnToWord = Table.RenameColumns(TableFromList,{{"Column1", "Word"}}),
    RemoveBlanks = Table.SelectRows(RenameColumnToWord, each ([Word] <> "")),
    FindWordCounts = Table.Group(RemoveBlanks, {"Word"}, {{"Count", each Table.RowCount(_), type number}}),
    SortedRows = Table.Sort(FindWordCounts,{{"Count", Order.Descending}}),
    KeptFirstRows = Table.FirstN(SortedRows,100)
in
    KeptFirstRows

Broken down step-by-step:

Source = Lines.FromBinary(File.Contents("C:\Users\Chris\Documents\CompleteShakespeare.txt"))

This loads the contents of CompleteShakespeare.txt into Data Explorer and breaks it up into lines of text:

RemoveLineBreaks = Lines.ToText(Source, " ")

This removes all the line breaks and creates a single chunk of text:
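To illustrate what these two steps do with a tiny made-up example (the strings below aren't from the real file): Lines.FromBinary(File.Contents(...)) returns a list with one text value per line of the file, and Lines.ToText then concatenates that list into a single piece of text using the separator you give it.

//a made-up two-line example
Lines.ToText({"To be, or not", "to be"}, " ")
//returns the single text value "To be, or not to be"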

RemovePunctuation = Text.Remove(RemoveLineBreaks,{"," ,"." ,"?" ,";" ,":" ,";" ,"'" ,"@" ,"#" ,"~" ,"{" ,"[" ,"}" ,"]" ,"(" ,")", "*"})

Removes any punctuation, and

Lowercase = Text.Lower(RemovePunctuation)

Turns all the words to lowercase:

TurnIntoTable = Table.FromValue(Lowercase)

Takes the text and creates a table with one column and one row containing the text:
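For example (again using a made-up value rather than the full 5.3 MB of text), Table.FromValue returns a table with a single column called Value:

Table.FromValue("to be or not to be")
//returns a one-column table (column name: Value) containing a single row with the text "to be or not to be"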

SplitIntoWords = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()})

Applies a splitter function to the Value column of this table, so that its single cell now contains a List object with one entry for every word in the text; the next step extracts that list.
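The slightly unintuitive part is that Splitter.SplitTextByWhitespace() doesn't split anything by itself: it returns a function that takes a piece of text and returns a list, and Table.TransformColumns applies that function to the Value column. A minimal sketch, using made-up text:

let
    SplitFunction = Splitter.SplitTextByWhitespace(),
    //apply the splitter function directly to a piece of text
    Words = SplitFunction("to be or not to be")
in
    Words
//returns the list {"to", "be", "or", "not", "to", "be"}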

ListOfWords = SplitIntoWords{0}[Value]

TableFromList = Table.FromList(ListOfWords, Splitter.SplitByNothing(), null, null, ExtraValues.Error)

RenameColumnToWord = Table.RenameColumns(TableFromList,{{"Column1", "Word"}})

Takes the list and turns it into a table with one column and one row for each word:
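As a small illustration with a made-up list: Splitter.SplitByNothing() tells Table.FromList not to split the list items any further, so each word becomes one row in a column called Column1, which is then renamed to Word.

let
    Words = {"to", "be", "or"},
    AsTable = Table.FromList(Words, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    Renamed = Table.RenameColumns(AsTable, {{"Column1", "Word"}})
in
    Renamed
//returns a table with one column, Word, and three rows: to, be, or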

RemoveBlanks = Table.SelectRows(RenameColumnToWord, each ([Word] <> ""))

Removes any rows containing blanks.

FindWordCounts = Table.Group(RemoveBlanks, {"Word"}, {{"Count", each Table.RowCount(_), type number}})

Does a ‘group by’ to return a table with one row for each distinct word, and a column containing the number of occurrences of that word:
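To make the 'group by' concrete, here's a minimal sketch on a made-up three-row table:

let
    Words = Table.FromRecords({[Word = "the"], [Word = "king"], [Word = "the"]}),
    Counts = Table.Group(Words, {"Word"}, {{"Count", each Table.RowCount(_), type number}})
in
    Counts
//returns two rows: Word "the" with Count 2, and Word "king" with Count 1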

SortedRows = Table.Sort(FindWordCounts,{{"Count", Order.Descending}})

Sorts this table in descending order by word count.

KeptFirstRows = Table.FirstN(SortedRows,100)

Only returns the first 100 rows from this table.

And at this point, we have the top 100 words used by Shakespeare (admittedly including the text at the beginning and end of the file inserted by Project Gutenberg that I couldn’t be bothered to remove):

No surprises here really – ‘the’ comes in at #1 with 18444 occurrences, “lord” comes in at #45, “king” at #69 and so on.

While you probably aren't interested in Shakespeare for your own business purposes, this example shows what would be possible if you wanted to analyse what was being said in Facebook updates, tweets, product reviews and so on. It wasn't all that hard to write either, especially considering that I'm fairly new to Data Explorer, so more sophisticated variations on this would certainly be possible without much extra effort. It's also worth pointing out that it's fast: the query takes a couple of seconds to refresh, and I'm not even sure I've done everything in the most efficient way possible.

You can download the sample workbook (but not the Complete Works of Shakespeare) here. If you want to point it at a different text file, just modify the first line of the code.
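For example, assuming you had some other plain text file (the path below is just a placeholder), only the Source step needs to change; Frank's version in the comments below also shows extra optional arguments being passed to Lines.FromBinary (he uses code page 1252) for files that aren't read correctly with the defaults.

Source = Lines.FromBinary(File.Contents("C:\SomeFolder\SomeOtherBook.txt")),
//or, as in Frank's version in the comments, with a code page specified explicitly:
//Source = Lines.FromBinary(File.Contents("C:\SomeFolder\SomeOtherBook.txt"), null, null, 1252),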

UPDATE: it turns out there is either a bug in PQ or I’m hitting one of its limits, and the counts here aren’t all correct. See the comments from Frank below for a version that works.

33 thoughts on “Finding Shakespeare’s Favourite Words With Data Explorer (Power Query)”

  1. Very nice! I did something similar with Ulysses using full-text search, but this looks a lot more powerful. I guess I gotta install Excel 2013.

    1. Chris Webb says:

      No, it works with Excel 2010 too I think

  2. What is Data Explorer targeted to solve and where does it fit in the mix of Microsoft BI tools? It looks like it overlaps some with PowerPivot on the loading of data, and SSIS or Access on the ETL/transformation of data, and VBA or Visual Studio for coding with M(?). I am confused.

    1. Chris Webb says:

      I don't think it overlaps with PowerPivot on ETL, since PowerPivot doesn't really have any ETL functionality. I don't think it overlaps with SSIS either, since SSIS isn't a desktop, self-service tool. However, I think it does overlap a bit with PowerPivot and VBA for certain types of calculation and reporting; in some scenarios I think users will prefer to work with Data Explorer rather than PowerPivot because it's easier to build up calculations with its step-based approach.

      1. Is it targeted for loading data only into Excel? Does it load into PowerPivot directly or are you limited to the Excel row limits? I really like that PowerPivot upgrades to SSAS Tabular when needed. Data Explorer seems like an Excel-only thing. Do the ETL or M(?) steps upgrade to SSIS? What are the realistic data set size and performance limits for a desktop user – are they similar to those in PowerPivot? I think I am missing the bigger picture of when/where Data Explorer makes sense and when I would not want customers to use it but suggest PowerPivot, SSIS or something else. It would be nice to see a diagram with it in the mix of MSBI tools, target use cases, and the path from self-service to enterprise usage when one of these self-service Data Explorer projects becomes mission critical and needs to be upgraded from a personal to an enterprise-ready solution.

      2. Chris Webb says:

        To answer some of your questions:

        You can load data directly into PowerPivot and don’t have to load it into the worksheet.

        No, it doesn’t upgrade to SSIS.

        I don’t know what its scalability is like; PowerPivot is likely to be faster for some things, like aggregating and filtering, but PowerPivot is a different type of tool.

        Really, Data Explorer is being positioned as a self-service ETL tool and this positioning means it complements PowerPivot very well. The point I’ve been making here is that actually Data Explorer is capable of much more than ETL, but its more advanced capabilities are unlikely to be interesting to most users.

  3. Nice blog post Chris, very interesting.
    There seems to be a small error in the copy-pastable code:
    RemoveBlanks = Table.SelectRows(RenameColumnToWord, each ([Word] <;> "")),

    The semicolon shouldn’t be there.

    I recreated the same set-up in SSIS using the Term Extraction component (which is an Enterprise edition component) and it outperforms Data Explorer as expected (so I don't have to fear for my job :). Nonetheless I'm very impressed with Data Explorer and its capabilities, and I have very much enjoyed your blog posts about it over the past few weeks.

    1. Chris Webb says:

      Is it the final comma you mean (there isn’t a semi-colon)? Yes, that’s a copy/paste error – I was getting the code from the full script.

  4. Great job Chris. I did the same with some books of Emile Zola and top words are Nana and Madame 🙂

  5. Hi, this query is perfect, but I'm quite bad with code. I would like to know if it's possible to delete or remove some words listed in a .txt file or in a sheet of Excel. Actually, I have created a new step in this code with more than 100 words (filter text, does not contain) = Table.SelectRows(KeptFirstRows, each ([Word] <> "04" and [Word] <> "11" and [Word] <> "12" and [Word] <> "20"…. But I would like the words to be listed in a file or a sheet. Thank you!

  6. Hi Chris, I tried to simplify your code using the function ‘Text.Split’.
    Your count of 'the' and 'and' was 18,444 and 17,811, respectively.
    My count of 'the' and 'and' is 27,816 and 26,777, respectively.
    I have no idea why there is such a huge difference.
    Have I missed something? Do you have an explanation?

    let
    Source = Lines.FromBinary(File.Contents("C:\Users\Frank\Documents\Excel\PowerQuery\The Complete Works of William Shakespeare.txt"),null,null,1252),
    RemoveLineBreaks = Lines.ToText(Source, " "),
    RemovePunctuation = Text.Remove(RemoveLineBreaks,{",", ".", "?", ";", ":", ";", "'", "@", "#", "~", "{", "[", "}", "]", "(", ")", "*", "•", "–"}),
    Lowercase = Text.Lower(RemovePunctuation),
    SplitTextToList = Text.Split(Lowercase," "),
    TableFromList = Table.FromList(SplitTextToList, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    RenameColumnToWord = Table.RenameColumns(TableFromList,{{"Column1", "Word"}}),
    RemoveBlanks = Table.SelectRows(RenameColumnToWord, each ([Word] <> "")),
    FindWordCounts = Table.Group(RemoveBlanks, {"Word"}, {{"Count", each Table.RowCount(_), type number}}),
    SortedRows = Table.Sort(FindWordCounts,{{"Count", Order.Descending}}),
    KeptFirstRows = Table.FirstN(SortedRows,100)
    in
    KeptFirstRows

  7. Hi Chris, your count of ‘cottages’ is 1, my count is 2. After searching the document in the editor, the latter is correct. There must be a leak somewhere …

    1. Chris Webb says:

      Strange – I’ll take a look. Maybe there’s a bug in my code somewhere.

    2. Chris Webb says:

      The data in my example is being truncated somewhere very early on. It looks like a bug in PQ – I don’t think I am running into any of the published limits. Thanks for letting me know about this! I’ll see what the PQ team say about this problem.

      1. I also think it is a bug in PQ.
        Table.FromValue(Lowercase) still has the correct length (5.3MB).
        But, Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()}) only returns about 600k items, instead of about 900k.
        Thanks for reporting this issue to the PQ team.
        Hopefully, you will get an answer.

    3. Chris Webb says:

      I now have an answer from the nice people on the PQ team. It seems like the problem is with the Splitter function which, by default, ignores any delimiters inside quotes. Using QuoteStyle.None in
      = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace(QuoteStyle.None)})
      should stop this happening.

      1. Thanks Chris, with QuoteStyle.None there is no difference anymore.
        Unfortunately, the nice people on the PQ team don’t have the time to improve the incomplete documentation.

      2. Chris Webb says:

        To be fair, I think their documentation is better than most – and they only found out what was going on here after I asked them and they took the trouble to investigate.

  8. Thank you so much for sharing this Chris.

    In the uploaded workbook I noticed that the count of “cottages” is now 2. I see you deleted:

    TurnIntoTable = Table.FromValue(Lowercase),
    SplitIntoWords = Table.TransformColumns(TurnIntoTable, {"Value", Splitter.SplitTextByWhitespace()}),
    ListOfWords = SplitIntoWords{0}[Value],

    And replaced it with:

    SplitIntoWords = Text.Split(Lowercase, " "),

    How do you use the Splitter.SplitTextByWhitespace function in this example then?

    PS: I'm eager to get your book from Apress, 5 more days!

    1. Chris Webb says:

      Did you read the comments above? There was/is a bug in Power Query which was the reason I changed the code.

  9. Hi. Thank you for this fine word-counting code. I have almost got it to work, but I still need to remove " (the quote sign) from the text.
    I have made this line to remove characters, but Power Query will not accept it when I add """ to the list to remove " from the text:

    RemovePunctuation = Text.Remove(RemoveLineBreaks,{",", ".", "?", ";", ":", "'", "@", "#", "~", "{", "[", "}", "]", "(", ")", "_", "!", "", "/", "=", "$", "&", "1", "2", "3", "4", "5", "6", "7", "8", "9", "0", "©"}),

    Have you got any advice on how to remove all "?

    Regards
    Bjarke

      1. Chris Webb says:

        Glad you solved it!

  10. I ran this on my text history as an interesting test. It works for the most part, but I found that there were a lot of sections where it seemed to break down and fail to separate the words. There were hundreds of blocks of text with anywhere from 10 to >500 words that failed to split. Can you imagine why that would happen? There isn't consistency to the blocks: some large, some small, no certain words in common, some start with a space, others don't…

    1. Chris Webb says:

      Strange – I guess it’s a problem with the splitter function, but it’s hard to say what’s going on without seeing the source data.
