Real-time Twitter heat map with MongoDB

Over the last few weeks I have been exploring the fascinating field of data visualisation, which offers great ways to play with how information is perceived.

More formally, data visualisation denotes “the representation and presentation of data that exploits our visual perception abilities in order to amplify cognition”.

Nowadays a huge flood of information hits us every day. Enormous amounts of data collected from various sources are freely available on the internet. One of these data gargoyles is Twitter, producing around 400 million (400 000 000!) tweets per day!

Tweets basically offer two “layers” of information: the obvious, direct information within the text of the Tweet itself, and a second layer that is not directly perceived — the Tweet’s metadata. Here Twitter provides a large amount of additional information such as user data, retweet count, hashtags, etc. This metadata can be leveraged to experience data from Twitter in a lot of exciting new ways!
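To make the two layers tangible, here is a made-up tweet payload (field names follow Twitter’s classic JSON format, the values are purely illustrative) and how the metadata layer can be pulled out in Python:

```python
# A hypothetical tweet payload; field names follow Twitter's classic JSON format.
tweet = {
    "text": "Playing around with #BigData heat maps!",
    "retweet_count": 3,
    "entities": {"hashtags": [{"text": "BigData"}]},
    "user": {"screen_name": "johannes", "followers_count": 42},
    # GeoJSON order: [longitude, latitude]
    "coordinates": {"type": "Point", "coordinates": [16.37, 48.21]},
}

# First layer: the visible message.
print(tweet["text"])

# Second layer: the metadata we can mine for visualisations.
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
lon, lat = tweet["coordinates"]["coordinates"]
print(hashtags, lat, lon, tweet["retweet_count"])
```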

So as a little weekend project I have decided to build a small piece of software that generates real-time heat maps of certain keywords from Twitter data.

This is a first static peek at what it’s gonna look like (apparently the friendly floatees use Twitter, too):

See a screencast here: screencast.com

To get this working I have used lots of shiny things: Python, tweepy, Flask, MongoDB (capped collections), Redis, server-sent events, Google Maps and heatmap-gmaps.js.

The app is written in Python and consists of mainly three components:

tstream.py:
A small service based on tweepy that implements a StreamListener and inserts incoming data into a MongoDB capped collection. This is also where the filter terms are set; this example mostly uses terms related to “Big Data”.
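A condensed sketch of what such a listener can look like — class and variable names here are illustrative, not necessarily those in the repository, and the tweepy import is made optional so the snippet also runs where tweepy is not installed:

```python
import json

try:
    from tweepy import StreamListener  # pip install tweepy
except ImportError:  # keep the sketch runnable without tweepy installed
    StreamListener = object

FILTER_TERMS = ["big data", "hadoop", "nosql"]  # illustrative filter terms


class TweetListener(StreamListener):
    """Insert every geotagged incoming tweet into a (capped) collection."""

    def __init__(self, collection):
        super(TweetListener, self).__init__()
        self.collection = collection

    def on_data(self, raw_data):
        tweet = json.loads(raw_data)
        # The heat map needs coordinates, so skip tweets without them.
        if tweet.get("coordinates"):
            self.collection.insert(tweet)
        return True  # returning True keeps the stream alive

    def on_error(self, status_code):
        print("stream error: %s" % status_code)
        return True
```

In the real service the listener would be handed the capped collection and wired up with something like `tweepy.Stream(auth, listener).filter(track=FILTER_TERMS)`.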

tweet_service.py:
A Flask based web app which gets new data from MongoDB and makes use of the publish-subscribe pattern. MongoDB’s capped collections come in handy here because they are “tailable”: there is no need to remember which messages a client has already received, as the cursor itself yields new documents on arrival. Capped collections also have a fixed size, which is appropriate for this use case, but your mileage may vary. Incoming Tweets are published to a redis channel, and a listener serves them to connecting clients with a “Content-Type” header of “text/event-stream”.
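The interesting bit is turning a stored tweet into a server-sent-event frame and streaming it out. A minimal sketch of that idea — route, database and channel names are illustrative, and the wiring assumes a modern pymongo and redis-py, so the real tweet_service.py will differ in detail:

```python
import json


def sse_pack(tweet):
    """Format a stored tweet as a server-sent-event frame with lat/lon.
    Twitter stores coordinates in GeoJSON order: [longitude, latitude]."""
    lon, lat = tweet["coordinates"]["coordinates"]
    return "data: %s\n\n" % json.dumps({"lat": lat, "lon": lon})


if __name__ == "__main__":
    # Illustrative wiring: tail the capped collection, publish each new
    # tweet to a Redis channel, and stream that channel to the browser.
    import threading

    import redis
    from flask import Flask, Response
    from pymongo import MongoClient
    from pymongo.cursor import CursorType

    app = Flask(__name__)
    red = redis.StrictRedis()
    coll = MongoClient().tstream.tweets_tail  # must be a capped collection

    def tail_and_publish():
        # A tailable cursor keeps yielding documents as they arrive, so
        # there is no bookkeeping about what a client has already seen.
        for tweet in coll.find(cursor_type=CursorType.TAILABLE_AWAIT):
            red.publish("tweets", sse_pack(tweet))

    @app.route("/tweets")
    def stream():
        def events():
            pubsub = red.pubsub()
            pubsub.subscribe("tweets")
            for message in pubsub.listen():
                if message["type"] == "message":
                    yield message["data"]
        return Response(events(), content_type="text/event-stream")

    threading.Thread(target=tail_and_publish, daemon=True).start()
    app.run(threaded=True)
```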

map.html:
A few lines of HTML and JavaScript which bring up a Google Maps canvas and a listener for server-sent events. When new data, basically consisting of Lat/Lon tuples, arrives, the point is added in real-time to a heat map overlay based on heatmap-gmaps.js.

Rough overview:

Conclusion

With a relatively small amount of code it is possible to turn text data into astonishing visualizations. With only a little more effort, different hashtags could be illustrated by different colors. Or one could count tweets on a topic in certain regions and then compare the activity relative to the population. There are lots of interpretations possible through the underlying data set.
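The per-region idea could start from something as simple as snapping coordinates to a grid and counting per cell — a rough sketch with made-up sample points:

```python
from collections import Counter


def grid_cell(lat, lon, size=1.0):
    """Snap a coordinate to a lat/lon grid cell of `size` degrees."""
    return (int(lat // size), int(lon // size))


# Hypothetical sample points as (lat, lon) tuples.
points = [(48.2, 16.4), (48.7, 16.1), (40.7, -74.0)]
counts = Counter(grid_cell(lat, lon) for lat, lon in points)
# `counts` maps each cell to its tweet count; dividing by a region's
# population would give the per-capita activity mentioned above.
```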

You can find the code on github, feel free to fork and play around!


18 thoughts on “Real-time Twitter heat map with MongoDB”

  1. Big Data Doctor

    Awesome post, Johannes! Adding a semantic engine to the real-time Twitter feed is going to result in the killer application to show the potential of big data and the semantic web working together… Instead of a geospatial heatmap you could create a conceptual network heatmap and use it to monitor trends and exploit relationships between semantic categories! Pure gold!
    I’m really keen on seeing the way it is going to develop!

    Reply
  2. Pingback: Real-time Twitter heat map with MongoDB « Another Word For It

  3. Kalil

    So I’m trying to run the HeatMap, and it gives me an error. I don’t doubt that it’s PEBCAK (problem exists between chair and keyboard), but I was curious as to whether or not you had any insight:

    ConnectionError: Error 61 connecting localhost:6379. Connection refused.
    127.0.0.1 - - [23/Jul/2012 13:20:18] "GET /tweets?callback=loomit HTTP/1.1" 200 -
    Debugging middleware caught exception in streamed response at a point where response headers were already sent.
    Traceback (most recent call last):
    File "/Library/Python/2.6/site-packages/werkzeug/wsgi.py", line 513, in next
    return self._next()
    File "/Library/Python/2.6/site-packages/werkzeug/wrappers.py", line 829, in iter_encoded
    for item in self.response:
    File "/Users/kalil/Desktop/Austin/Heatmap/tweet_service.py", line 54, in event_stream
    pubsub.subscribe('chat')
    File "/Library/Python/2.6/site-packages/redis/client.py", line 1316, in subscribe
    return self.execute_command('SUBSCRIBE', *channels)
    File "/Library/Python/2.6/site-packages/redis/client.py", line 1268, in execute_command
    connection.connect()
    File "/Library/Python/2.6/site-packages/redis/connection.py", line 204, in connect
    raise ConnectionError(self._error_message(e))

    Reply
    1. jbrandstettercs Post author

      Hi,

      it seems as if you haven’t started Redis, which should be listening on port 6379!

      Cheers,

      Johannes

      Reply
      1. Kalil

        Ah, I was unfamiliar with Redis – it is now listening as intended.
        I truly wish I had somewhere else to ask questions, haha – could I pester you with another one?
        I set up a collection called tweets_tail under a database named "test" in MongoDB…
        Yet when I run tweet_service.py, it gives me an "OperationFailure: database error: tailable cursor requested on non capped collection".

        Looking at tweet_service.py, it seems as though this shouldn’t cause a problem dependent on the name of the database, as it explicitly assigns the collection as a member of the database.

        I ran a query on "db.tweets_tail.isCapped" and it returned:
        function () {
        var e = this.exists();
        return e && e.options && e.options.capped ? true : false;
        }
        Which is the correct format.

        Any ideas?

  4. jbrandstettercs Post author

    Hi,

    did you configure the collection to be capped? Just run "db.tweets_tail.isCapped()" (mind the parentheses) and it should tell you!

    Cheers,

    Johannes

    Reply
  5. Kalil

    I’m still working on it, and I can’t determine whether it’s an issue with the database itself or perhaps a bug.
    I’m leaning more towards the idea that I’m doing something incorrectly.

    Reply
  6. Kalil

    Actually, I hate to spam the comment section with questions, but now that I have created a database named tstream with a capped collection named tweets_tail, redis and the program itself no longer throw me errors.

    However, when I run it, I get:

    avon:heatmap kalil$ python tweet_service.py
    * Running on http://0.0.0.0:5000/
    * Restarting with reloader
    beginning to tail…
    1
    127.0.0.1 - - [24/Jul/2012 14:57:54] "GET /tweets?callback=loomit HTTP/1.1" 200 -

    And the map never gains any new tweets. I’m under the impression that no tweets are actually being saved to the database; upon querying db.tweets_tail.count(), it has 0 documents, and, logically, db.tweets_tail.find({}) returns no documents.

    Redis appears to be fine:
    "[227] 24 Jul 14:57:54 - Accepted 127.0.0.1:52186
    [227] 24 Jul 14:57:58 - 1 clients connected (0 slaves), 931152 bytes in use"

    When tstream.py runs, none of the functions in the StreamListener class appear to ever be called. I placed print statements before and after the class, and they print. The only thing that does not print is the statement after "streamer.filter(track = setTerms)".

    I know that Tweepy has a notoriously broken Streaming API – is that where the problem lies?
    I’d really like to try this out, and I appreciate your help.

    Reply
  7. jpramirezangeloni

    Hey Kalil, I have been playing around with the code and I think your issue is the size of the capped collection.
    I had the same problem: although I had more than a million tweets in a CouchDB database, I tried everything in my simple import script to get them into Mongo, but it simply didn’t work at all.

    Until I increased the size of the capped collection with this:
    db.runCommand({"convertToCapped": "tweets_tail", size: 10000000});

    It seems that the required size depends on the tweets, so if your capped collection is too small MongoDB will simply ignore the inserts.

    Hope that helps!!

    Reply
    1. zara

      Hi, if you increase the size of the capped collection, you end up with a disconnection from the client. The exception thrown is:

      —————————————-
      Exception happened during processing of request from ('127.0.0.1', 52411)
      Traceback (most recent call last):
      File "/usr/lib/python2.7/SocketServer.py", line 284, in _handle_request_noblock
      self.process_request(request, client_address)
      File "/usr/lib/python2.7/SocketServer.py", line 310, in process_request
      self.finish_request(request, client_address)
      File "/usr/lib/python2.7/SocketServer.py", line 323, in finish_request
      self.RequestHandlerClass(request, client_address, self)
      File "/usr/lib/python2.7/SocketServer.py", line 640, in __init__
      self.finish()
      File "/usr/lib/python2.7/SocketServer.py", line 693, in finish
      self.wfile.flush()
      File "/usr/lib/python2.7/socket.py", line 303, in flush
      self._sock.sendall(view[write_offset:write_offset+buffer_size])
      error: [Errno 32] Broken pipe
      —————————————-

      Reply
  8. jpramirezangeloni

    Hey guys, on top of that, Redis’s listen method blocks the entire app. So to make it work “as is”
    you need to:
    1. launch the app,
    2. load the URL and wait until the entire map is loaded (no heatmap showing yet),
    3. restart the app.

    That way you can watch the heatmap filling up.
    Between the JSON callback and the tail-and-listen loop, the app only listens once before it stops, so any new request will not work because it never makes the GET call to Redis again; Redis is stuck on its last row, so newly connecting clients will never receive the full set unless the app is restarted.

    jbrandstettercs: if you don’t mind I would like to add to and modify your code. :)

    Reply
    1. jbrandstettercs Post author

      Yes, of course you can tinker with the code! It’s on github so feel free to fork!

      Reply
  9. zara

    I’ve followed the example exactly and quite literally copied and pasted the code, but the heat map isn’t generated. Is there a problem with the code? I get no errors when I run tstream.py and tweet_service.py and then open map.html on localhost as per the instructions. I’ve also created a capped Mongo collection.

    Reply
  10. zara

    Furthermore I get this error:

    Is it because the capped collection’s size is set too high?

    Exception happened during processing of request from ('127.0.0.1', 60681)
    Traceback (most recent call last):
    File "/usr/lib/python2.7/SocketServer.py", line 284, in _handle_request_noblock
    self.process_request(request, client_address)
    File "/usr/lib/python2.7/SocketServer.py", line 310, in process_request
    self.finish_request(request, client_address)
    File "/usr/lib/python2.7/SocketServer.py", line 323, in finish_request
    self.RequestHandlerClass(request, client_address, self)
    File "/usr/lib/python2.7/SocketServer.py", line 640, in __init__
    self.finish()
    File "/usr/lib/python2.7/SocketServer.py", line 693, in finish
    self.wfile.flush()
    File "/usr/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
    error: [Errno 32] Broken pipe
    —————————————-

    Reply
