Things to do with Twitter Archive

December 16, 2013

Twitter, Inc provides a backup of all your Twitter archive. In 5-years I have tweeted #1,811 times (my firt date from 2008 and I didn’t tweet a lot).

After download the content, checking the raw content (.csv) I decided do play with data. Ideas come into my mind:

generate a tag (word) cloud
which my average tweets per year?
which hour of the day I tweet more? (this particularly info is only available from late-2010)
all urls listed in tweets (and open it in browser)

Also if you are a social media manager that like programming you can study how to improve your campaign. Said that it’s time to write some code (all examples uses Python 2.x). All code are avaliable on gist, link at end of post.

First thing to do is read content from CSV file, this is straightforward with python’s csv module.

    csv.reader(open("tweets.csv"), delimiter=',', quotechar='"')

It returns an iterable _csv.reader object. Besides the content itself, each tweet contains the following metadata.

    ['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp', 
    'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id', 
    'retweeted_status_timestamp', 'expanded_urls']

With this information you can use your creativity to better “interpret” your tweets. Starting with data visualization I choose an example using tag cloud (tags are usually single words, and the importance of each tag is shown with font size or color). Using PyTagCloud it’s easy generate the word cloud. In order to consider the relevant data I excluded articles, pronouns, interjections and other common words like “www”, “bi.ly”, etc; an example on how to do this.

    re.sub("(www|the|a|we|ours|this|some|who|bit.ly)", " ", buffer)

The image below represent the 25 most important words from my tweets.

With this visualization it’s easy to conclude that most of my tweets are tech related, especially linux and android :)

Which my average tweets per year?

To get stats from date we can data from [‘timestamp’] value. An example of data: 2013-12-13 06:33:25 +0000 but it’s type is str and not datetime, the first thing to do is convert and then filter relevant information.

    tweets_per_year = { 2008:0, 2009:0, 2010:0, 2011:0, 2012:0, 2013:0 }

    format = '%Y-%m-%d %H:%M:%S' #example 2013-12-13 06:33:25

    for tweet in tweets:
        date = tweet[3]
        if date != "timestamp": 
            date_object = datetime.strptime(date[:-6],format) # -6 to remove " +0000"
            year = date_object.year
            for y in tweets_per_year:
                if y == year:
                    tweets_per_year[year] += 1


    print tweets_per_year

Here my stats:

— #223
— #380
— #675
— #368
— #77
— #87

I need more years like 2010.

Which hour of the day I tweet more?

Remember that feature was incorporated late (to me appeared on tweets from late-2010). Using the same snippet above we can add it:

    tweets_per_hour  = { 0:0, 1:0, 2:0, 3:0, 4:0, 5:0, 6:0, 
                         7:0, 8:0, 9:0, 10:0, 11:0, 12:0,
                         13:0, 14:0, 15:0, 16:0, 17:0, 18:0, 
                         19:0, 20:0, 21:0, 22:0, 23:0 }
    	
    # (...)

    hour = date_object.hour
    minute = date_object.minute
    second = date_object.second

    if hour == 0 and minute == 0 and second == 0:
        pass # feature not implemeted
    else:
         print date_object
         for t in tweets_per_hour:
             if t == hour:
                 tweets_per_hour[t] += 1

It produces: {0: 25, 1: 43, 2: 30, 3: 15, 4: 8, 5: 4, 6: 4, 7: 1, 8: 0, 9: 2, 10: 3, 11: 24, 12: 50, 13: 36, 14: 42, 15: 34, 16: 42, 17: 49, 18: 36, 19: 36, 20:56, 21: 39, 22: 23, 23: 28}

    from 12:00 a.m to 2 a.m — 98
    from 12:00 p.m to 2 p.m — #128
    from 20:00 p.m to 22 p.m — #118

This data can be particularly interesting if you associate with Twitter API to evaluate the “rate” of interactions (retweets, mentions).

Finally, let’s get the URLs in this case the can use two approaches: filter the [‘text’] content or directly from [‘expanded_urls’] (this was introduced propably after t.co feature). Since I have links from 2008 when this feature wasn’t enable I’ll use the first one.

    exp = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    for tweet in tweets:
        urls = re.findall(exp, tweet)
        for url in urls:
            if url != "":
            print url # do something

You can use Python’s webbrowser module to open URLs.

    import webbrowser
    webbrowser.open_new_tab("www.google.com")

CODE ready to go

some gists read to go

tagcloud (raw)
stats like average tweets per year and hour (raw)
listed urls (raw)

If you like this post you can follow me