Loading Twitter Archive Data In Power Query

If you’re a Twitter user (I’m @Technitrain if you aren’t following me already) you may be aware that you can download your entire Twitter history – all of the data from all of your tweets – as a series of .js files. All the details on how to do this are here:


Since Power Query can work with JSON data I thought it would be easy to use it to analyse my Twitter archive… but while it’s possible, it’s not entirely straightforward. The problem seems to be that Power Query doesn’t like the first line in each .js file that Twitter gives you. Removing that first line isn’t too difficult but it requires some M code, so here’s a function (I’ve called it GetTwitterArchiveFile) that handles that problem. You give it the binary data from the file and it returns a table containing all the data from that file:

(TwitterFile as binary)=>
    //Read data from file and interpret as Lines
    Source = Lines.FromBinary(TwitterFile),
    //Remove the first line
    RemoveFirstRow = List.Skip(Source,1),
    //Convert back to Binary
    ConvertBackToBinary = Lines.ToBinary(RemoveFirstRow),
    //Now convert to JSON
    ConvertToJSON = Json.Document(ConvertBackToBinary),
    //Flatten to a table
    ConvertToTable = Table.FromList(ConvertToJSON, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    //Expand the first set of columns
    ExpandColumns = Table.ExpandRecordColumn(ConvertToTable, "Column1", {"source", "entities", "geo", "id_str", "text", "id", "created_at", "user", "retweeted_status", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_status_id", "in_reply_to_screen_name", "in_reply_to_user_id_str"}, {"source", "entities", "geo", "id_str", "text", "id", "created_at", "user", "retweeted_status", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_status_id", "in_reply_to_screen_name", "in_reply_to_user_id_str"})

Here’s an example of how to use the above function: it’s another function (called GetTwitterFullArchive) which, when you pass it the path of your tweets folder (this will be wherever you unzipped the download that you get from Twitter) returns the combined contents of all of the .js files in that format by calling GetTwitterArchiveFile() for each one:

(TweetsFolderPath as text) =>
    //Connect to Tweets folder
    Source = Folder.Files(TweetsFolderPath),
    //Remove everything but Content column
    #"Removed Other Columns" = Table.SelectColumns(Source,{"Content"}),
    //Use Custom Column to call GetTwitterArchiveFile for each .js file in the folder
    #"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Custom", each GetTwitterArchiveFile([Content])),
    //Remove the Content columns
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Content"}),
    //Expand all columns in the Custom column
    #"Expanded Custom" = Table.ExpandTableColumn(#"Removed Columns", "Custom", {"source", "entities", "geo", "id_str", "text", "id", "created_at", "user", "retweeted_status", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_status_id", "in_reply_to_screen_name", "in_reply_to_user_id_str"}, {"source", "entities", "geo", "id_str", "text", "id", "created_at", "user", "retweeted_status", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_status_id", "in_reply_to_screen_name", "in_reply_to_user_id_str"})
    #"Expanded Custom"

Invoking this function in a query, for example like this:

    Source = GetFullTwitterArchive("C:\Users\Chris\Downloads\TwitterArchive\data\js\tweets")

Gives you the following output:


As you can see, there are plenty of other columns that can themselves be expanded, but this is a good starting point for any analysis.

There’s nothing really ground-breaking in what I’ve done here – it’s a fairly standard example of how you can use Power Query functions to combine data from multiple files, very similar to this example of combining data from multiple Excel files.

There’s absolutely loads of interesting data that you can play with here, but to start with here’s a query that finds the top 10 people I have replied to on Twitter:

    Source = GetFullTwitterArchive("C:\Users\Chris\Downloads\TwitterArchive\data\js\tweets"),
    #"Removed Other Columns" = Table.SelectColumns(Source,{"in_reply_to_screen_name"}),
    #"Grouped Rows" = Table.Group(#"Removed Other Columns", {"in_reply_to_screen_name"}, {{"Count", each Table.RowCount(_), type number}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([in_reply_to_screen_name] <> null)),
    #"Sorted Rows" = Table.Sort(#"Filtered Rows",{{"Count", Order.Descending}}),
    #"Kept First Rows" = Table.FirstN(#"Sorted Rows",10)
    #"Kept First Rows"

Here’s the output as a table and chart:



Looks like Jamie Thomson is the winner by a wide margin!

Here’s an example of a  NodeXL graph I built from this data, showing the relationships between users who I have mentioned together in a single tweet:


You can download the sample workbook for this post, containing all the functions (but not my data) here.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s