NodeXL · Power BI · Power Query

Loading Twitter Archive Data In Power Query

If you’re a Twitter user (I’m @Technitrain if you aren’t following me already) you may be aware that you can download your entire Twitter history – all of the data from all of your tweets – as a series of .js files. All the details on how to do this are here:

https://support.twitter.com/articles/20170160-downloading-your-twitter-archive

Since Power Query can work with JSON data I thought it would be easy to use it to analyse my Twitter archive… but while it’s possible, it’s not entirely straightforward. The problem seems to be that Power Query doesn’t like the first line in each .js file that Twitter gives you. Removing that first line isn’t too difficult but it requires some M code, so here’s a function (I’ve called it GetTwitterArchiveFile) that handles that problem. You give it the binary data from the file and it returns a table containing all the data from that file:

[sourcecode language=”text”]
(TwitterFile as binary)=>
let
//Read data from file and interpret as Lines
Source = Lines.FromBinary(TwitterFile),
//Remove the first line
RemoveFirstRow = List.Skip(Source,1),
//Convert back to Binary
ConvertBackToBinary = Lines.ToBinary(RemoveFirstRow),
//Now convert to JSON
ConvertToJSON = Json.Document(ConvertBackToBinary),
//Flatten to a table
ConvertToTable = Table.FromList(ConvertToJSON, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
//Expand the first set of columns
ExpandColumns = Table.ExpandRecordColumn(ConvertToTable, "Column1", {"source", "entities", "geo", "id_str", "text", "id", "created_at", "user", "retweeted_status", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_status_id", "in_reply_to_screen_name", "in_reply_to_user_id_str"}, {"source", "entities", "geo", "id_str", "text", "id", "created_at", "user", "retweeted_status", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_status_id", "in_reply_to_screen_name", "in_reply_to_user_id_str"})
in
ExpandColumns
[/sourcecode]

Here’s an example of how to use the above function: it’s another function (called GetTwitterFullArchive) which, when you pass it the path of your tweets folder (this will be wherever you unzipped the download that you get from Twitter) returns the combined contents of all of the .js files in that format by calling GetTwitterArchiveFile() for each one:

[sourcecode language=”text”]
(TweetsFolderPath as text) =>
let
//Connect to Tweets folder
Source = Folder.Files(TweetsFolderPath),
//Remove everything but Content column
#"Removed Other Columns" = Table.SelectColumns(Source,{"Content"}),
//Use Custom Column to call GetTwitterArchiveFile for each .js file in the folder
#"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Custom", each GetTwitterArchiveFile([Content])),
//Remove the Content columns
#"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Content"}),
//Expand all columns in the Custom column
#"Expanded Custom" = Table.ExpandTableColumn(#"Removed Columns", "Custom", {"source", "entities", "geo", "id_str", "text", "id", "created_at", "user", "retweeted_status", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_status_id", "in_reply_to_screen_name", "in_reply_to_user_id_str"}, {"source", "entities", "geo", "id_str", "text", "id", "created_at", "user", "retweeted_status", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_status_id", "in_reply_to_screen_name", "in_reply_to_user_id_str"})
in
#"Expanded Custom"
[/sourcecode]

Invoking this function in a query, for example like this:

[sourcecode language=”text”]
let
Source = GetFullTwitterArchive("C:\Users\Chris\Downloads\TwitterArchive\data\js\tweets")
in
Source
[/sourcecode]

Gives you the following output:

image

As you can see, there are plenty of other columns that can themselves be expanded, but this is a good starting point for any analysis.

There’s nothing really ground-breaking in what I’ve done here – it’s a fairly standard example of how you can use Power Query functions to combine data from multiple files, very similar to this example of combining data from multiple Excel files.

There’s absolutely loads of interesting data that you can play with here, but to start with here’s a query that finds the top 10 people I have replied to on Twitter:

[sourcecode language=”text”]
let
Source = GetFullTwitterArchive("C:\Users\Chris\Downloads\TwitterArchive\data\js\tweets"),
#"Removed Other Columns" = Table.SelectColumns(Source,{"in_reply_to_screen_name"}),
#"Grouped Rows" = Table.Group(#"Removed Other Columns", {"in_reply_to_screen_name"}, {{"Count", each Table.RowCount(_), type number}}),
#"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([in_reply_to_screen_name] <> null)),
#"Sorted Rows" = Table.Sort(#"Filtered Rows",{{"Count", Order.Descending}}),
#"Kept First Rows" = Table.FirstN(#"Sorted Rows",10)
in
#"Kept First Rows"
[/sourcecode]

Here’s the output as a table and chart:

Looks like Jamie Thomson is the winner by a wide margin!

Here’s an example of a  NodeXL graph I built from this data, showing the relationships between users who I have mentioned together in a single tweet:

You can download the sample workbook for this post, containing all the functions (but not my data) here.

Leave a ReplyCancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.