DataWeave

Sep 04

Building a Twitter Sentiment Analysis App using R

[This post was written by Dipanjan. Dipanjan is a Data Engineer at DataWeave who works with Mandar addressing some of the semantics related problems like product clustering and data normalization.]

Twitter, as we know, is a highly popular social networking and micro-blogging service used by millions worldwide. Each status, or tweet as we call it, is a 140-character text message. Registered users can read and post tweets, but unregistered users can only view them. Text mining and sentiment analysis are some of the hottest topics in the analytics domain these days. Analysts are always looking to crunch thousands of tweets to gain insights on different topics, be it a popular sporting event such as the FIFA World Cup or the launch of Apple’s next product.

Today, we are going to see how we can build a web app for doing sentiment analysis of tweets using R, the most popular statistical language. For building the front end, we are going to use the ‘shiny’ package to make our life easier, and we will run R code in the backend to fetch tweets from Twitter and analyze their sentiment.

The first step would be to establish an authorized connection with Twitter for getting tweets based on different search parameters. For doing that, you can follow the steps mentioned in this document which includes the R code necessary to achieve that: http://rpubs.com/dipanzan/twitterAccess

After obtaining a connection, the next step would be to use the ‘shiny’ package to develop our app. This is a web framework for R, developed by RStudio. Each app contains a server file (server.R) for the backend computation and a user interface file (ui.R) for the frontend. You can get the code for the app from my GitHub repository here; it is fairly well documented, but I will explain the main features anyway.

The first part of the app to develop is the UI; take a look at the ui.R file. We have a left sidebar where we take input from the user in two text fields, each accepting either a Twitter hashtag or a handle, so that the sentiments of the two can be compared. We also create a slider for selecting the number of tweets we want to retrieve from Twitter. The right panel consists of four tabs, where we display the sentiment plots, word clouds, and raw tweets for both entities, as shown below.

Coming to the backend, remember to also copy the two dictionary files, ‘negative_words.txt’ and ‘positive_words.txt’, from the repository, because we will be using them to analyze and score terms from tweets. On taking a close look at the server.R file, you will notice the following operations taking place.

- The ‘TweetFrame’ function sends the request query to Twitter, retrieves the tweets, and aggregates them into a data frame.
- The ‘CleanTweets’ function runs a series of regexes to clean tweets and extract proper words from them.
- The ‘numoftweets’ function calculates the number of tweets.
- The ‘wordcloudentity’ function creates the word clouds from the tweets.
- The ‘sentimentalanalysis’ and ‘score.sentiment’ functions perform the sentiment analysis for the tweets.

These functions are called in reactive code segments to enable the app to react instantly to changes in user input. The functions are documented extensively, but I’ll explain the underlying concepts behind the sentiment analysis and the word clouds that are generated.

For the word clouds, we take the text from all the tweets, remove punctuation and stop words, form a term-document frequency matrix, and sort it in decreasing order to get the terms that occur most frequently across all the tweets; the word cloud figure is then drawn from those terms. An example obtained from the app is shown below for the hashtags ‘#thrilled’ and ‘#frustrated’.
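
The term-frequency idea itself is independent of R. The snippet below is only a small Python illustration of that step (the app does all of this in R, and the stop word list here is a tiny stand-in):

from collections import Counter
import re

# Tiny stand-in stop word list; the app uses a proper stop word set.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}

def top_terms(tweets, n=50):
    """Count how often each term occurs across all tweets,
    after stripping punctuation and dropping stop words."""
    counts = Counter()
    for tweet in tweets:
        words = re.findall(r"[a-z']+", tweet.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)  # the most frequent terms drive the word cloud

print(top_terms(["I am so #thrilled today!", "thrilled to be here"], n=5))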

For sentiment analysis, we use Jeffrey Breen’s sentiment analysis algorithm, cited here: we clean the tweets, split them into terms, compare the terms with our positive and negative dictionaries, and compute an overall score for each tweet from its terms. A positive score denotes positive sentiment, a score of 0 denotes neutral sentiment, and a negative score denotes negative sentiment. A more extensive and advanced n-gram analysis can also be done, but that story is for another day. An example obtained from the app is shown below for the hashtags ‘#thrilled’ and ‘#frustrated’.
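
The scoring logic is simple enough to sketch outside of R as well. Below is a minimal Python illustration of the same dictionary-lookup idea, with tiny stand-in word lists in place of positive_words.txt and negative_words.txt (the app implements this in R in ‘score.sentiment’):

import re

# Tiny stand-in dictionaries; the app loads the full lists from
# positive_words.txt and negative_words.txt.
POSITIVE = {"good", "great", "thrilled", "love", "awesome"}
NEGATIVE = {"bad", "terrible", "frustrated", "hate", "awful"}

def score_sentiment(tweet):
    """Score = (# positive terms) - (# negative terms).
    > 0 is positive, 0 is neutral, < 0 is negative."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(score_sentiment("So thrilled, what a great day"))      #  2 -> positive
print(score_sentiment("Frustrated with this terrible app"))  # -2 -> negative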

After getting the server and UI code ready, the next step is to deploy the app to a server. We will be using shinyapps.io (https://www.shinyapps.io/), which lets you host your R web apps free of charge. If you already have the code loaded up in RStudio, you can deploy it from there using the ‘deployApp()’ command. For more details on app building and hosting, check out the official tutorial here.

You can check out a live demo of the app here: https://dipanjan.shinyapps.io/twitter-analysis/

It’s still under development so suggestions are always welcome.

Aug 09

A Peek into GNU Parallel

[This post was written by Jyotiska. Jyotiska is a Data Engineer at DataWeave with keen interest in Python and experimentation. He works with Sanket in the Products and Strategy team.]

GNU Parallel is a tool that can be run from a shell to parallelize job execution. A job can be anything from a simple shell script to complex, interdependent Python/Ruby/Perl scripts. The simplicity of the ‘Parallel’ tool lies in its usage. A modern-day computer with a multicore processor should be enough to run your jobs in parallel. A single-core computer can also run the tool, but the user won’t see any difference, as the jobs will simply be context-switched by the underlying OS.

At DataWeave, we use Parallel for automating and parallelizing a number of resource-intensive processes, ranging from crawling to data extraction. All our servers have 8 cores, each capable of executing 4 threads, so we experienced a huge performance gain after deploying Parallel. Our in-house image processing algorithms used to take more than a day to process 200,000 high-resolution images. After using Parallel, we have brought that time down to a little over 40 minutes!

GNU Parallel can be installed on any Linux box and does not require sudo access. The following command will install the tool:
(wget -O - pi.dk/3 || curl pi.dk/3/) | bash
GNU Parallel can read inputs from a number of sources — a file, the command line, or stdin. The following simple example takes the input from the command line and executes it in parallel:
parallel echo ::: A B C
The following takes the input from a file:
parallel -a somefile.txt echo
Or STDIN:
cat somefile.txt | parallel echo
The inputs can be from multiple files too:
parallel -a somefile.txt -a anotherfile.txt echo
The number of simultaneous jobs can be controlled using the --jobs or -j switch. The following command will run 5 jobs at once:
parallel --jobs 5 echo ::: A B C D E F G H I J
By default, the number of jobs will be equal to the number of CPU cores. However, this can be overridden using percentages. The following will run 2 jobs per CPU core:
parallel --jobs 200% echo ::: A B C D
If you do not want to set any limit, then the following will use all the available CPU cores in the machine. However, this is NOT recommended in a production environment, as other jobs running on the machine will be slowed down considerably.
parallel --jobs 0 echo ::: A B C
Enough with the toy examples. The following shows how to bulk-insert JSON documents in parallel into a MongoDB cluster. We frequently need to insert millions of documents quickly into our MongoDB cluster, and inserting them serially doesn’t cut it. Moreover, MongoDB can handle parallel inserts.

The following is a snippet of a file of JSON documents. Let’s assume that there are a million similar records in the file, with one JSON document per line.

{"name": "John", "city": "Boston", "age": 23}
{"name": "Alice", "city": "Seattle", "age": 31}
{"name": "Patrick", "city": "LA", "age": 27}
...
...
The following Python script takes a JSON document as an argument and inserts it into the “people” collection under the “dw” database.

import json
import pymongo
import sys

# Each JSON document arrives as a single command line argument.
document = json.loads(sys.argv[1])
client = pymongo.MongoClient("mongodb://somehost:someport")
db = client["dw"]
collection = db["people"]

try:
    collection.insert(document)
except Exception as e:
    print "Could not insert document in db", repr(e)
Now, to run this in parallel, the following command should do the magic:
cat people.json | parallel 'python insertDB.py {}'
That’s it! There are many more switches and options available for advanced processing; they can be explored by running man parallel in the shell. Also, the following page has a set of tutorials: GNU Parallel Tutorials.

Jun 02

JSON vs simpleJSON vs ultraJSON

[This post was written by Jyotiska. Jyotiska is a Data Engineer at DataWeave with keen interest in Python and experimentation. He works with Sanket in the Products and Strategy team.]

Without argument, the most used file format at DataWeave is JSON. We use JSON everywhere — managing product lists, storing data extracted from crawl dumps, generating assortment reports, and almost everywhere else. So it is very important to make sure that the libraries and packages we are using are fast enough to handle large data sets. There are two popular packages for handling JSON: the stock json package that comes with the default installation of Python, and simplejson, an optimized and actively maintained JSON package for Python. The goal of this blog post is to introduce ultrajson, or UltraJSON, a JSON library written mostly in C and built to be extremely fast.

We benchmarked three popular operations: load, loads, and dumps. We have a dictionary with 3 keys (id, name, and address); we dump this dictionary using json.dumps() and store it in a file, and then use json.loads() and json.load() separately to load the dictionaries back from the file. We performed this experiment with 10000, 50000, 100000, 200000, 500000, and 1000000 dictionaries and measured how long each library takes to perform each operation.
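
The exact harness we used is not shown here, but a stripped-down sketch of the timing logic (shown for the per-dictionary dumps case, assuming simplejson and ujson are installed) looks roughly like this:

import time
import json
import simplejson
import ujson

def benchmark_dumps(records, module):
    """Dump each dictionary individually and return elapsed milliseconds."""
    start = time.time()
    for record in records:
        module.dumps(record)
    return (time.time() - start) * 1000

records = [{"id": i, "name": "user%d" % i, "address": "city %d" % i}
           for i in range(100000)]

for module in (json, simplejson, ujson):
    print("%s: %.3f ms" % (module.__name__, benchmark_dumps(records, module)))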

dumps operation line by line

Here are the results we got using the dumps() operation, dumping the content dictionary by dictionary.

Number of dicts json (in ms) simplejson (in ms) ujson (in ms)
10000 84.106 115.876 25.891
50000 395.163 579.576 122.626
100000 820.280 1147.962 246.721
200000 1620.239 2277.786 487.402
500000 3998.736 5682.641 1218.653
1000000 7847.999 11501.038 2530.791

We notice that json performs better than simplejson, but ultrajson wins the game, running roughly 3x faster than stock json.

dumps operation (all dictionaries at once)

In this experiment, we have stored all the dictionaries in a list and dumped the list using json.dumps().

Number of dicts json (in ms) simplejson (in ms) ujson (in ms)
10000 21.674 21.941 14.484
50000 102.739 118.851 67.215
100000 199.454 240.849 138.830
200000 401.376 476.392 270.667
500000 1210.664 1376.511 834.013
1000000 2729.538 2983.563 1830.707

simplejson is almost as good as stock json here, but again ultrajson outperforms both, running about 1.5x faster. Now let’s see how they perform for the load and loads operations.

load operation on a list of dictionaries

Now we do the load operation on a list of dictionaries and compare the results.

Number of dicts json (in ms) simplejson (in ms) ujson (in ms)
10000 47.040 8.932 10.706
50000 165.877 44.065 45.629
100000 356.405 97.277 105.948
200000 718.873 185.917 205.120
500000 1699.623 461.605 503.203
1000000 3441.075 949.905 1055.966

Surprisingly, simplejson beats the other two, with ultrajson coming in a close second. Here, simplejson is roughly 3.5-5x faster than stock json, and ultrajson is nearly as fast.

loads operation on dictionaries

In this experiment, we read the dictionaries from the file one line at a time and pass each line to the loads() function.

Number of dicts json (in ms) simplejson (in ms) ujson (in ms)
10000 99.501 69.517 15.917
50000 407.873 246.794 76.369
100000 893.989 526.257 157.272
200000 1746.961 1025.824 318.306
500000 8446.053 2497.081 791.374
1000000 8281.018 4939.498 1642.416

Again ultrajson steals the show, running about 5-6x faster than stock json and 3-4x faster than simplejson.

That is all for the benchmarks. The verdict is pretty clear: use simplejson instead of stock json in any case, since simplejson is a well-maintained package. If you really want something extremely fast, go for ultrajson; just keep in mind that ultrajson only works with well-defined, serializable collections and will not handle objects that cannot be serialized. But if you are dealing with text data, this should not be a problem.
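
Since all three libraries expose the same dumps()/loads() interface for basic usage, switching between them is usually a one-line change:

# Swap the import and the rest of the code stays the same
# (for basic dumps/loads usage; exotic keyword arguments may differ).
import ujson as json

record = {"id": 1, "name": "John", "address": "Boston"}
encoded = json.dumps(record)
assert json.loads(encoded) == record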

May 01

James Gleick’s The Information — The Beginnings

[The author of this post is Sanket. Sanket heads Products and Strategy at DataWeave. When he manages to find time, he plays with his daughter or reads two more pages of a book he’s been trying to finish for several months.]

James Gleick’s 2011 book The Information (ISBN: 0375423729, Pages: 527, Publisher: Knopf Doubleday Publishing Group) is an important book. It is essential reading for our time: the fabled Information Age. The subtitle of the book is very apt: A History, a Theory, a Flood. As it suggests, the book traces humanity’s quest to understand the nature of “information”, to represent it (in various forms: the spoken word, phonemes, written symbols, electric pulses, and so on), and to communicate it.

The rest of this blog post is not a review of the book (just go read the book instead of this!), but a series of thoughts and ideas formed upon reading the book. Just like the book, I am going to write them in three parts.

Talking Drums

While there are no distinct sub-parts to the book, we can treat the initial chapters as The History. Gleick starts with a fascinating account of the “talking drums of Africa”. Drums speaking the numerous African tongues and dialects, developed by peoples with no written language, were used to communicate information over hundreds of miles within minutes, a feat unachievable by the fastest runner or the best horse in Europe.

Europeans used smoke signals that typically communicated binary (or, at best, a very limited number of) status messages. The drums, in contrast, communicated complex messages: not just warnings and quick alerts, but “songs, jokes, etc.” A drummer added sufficient contextual redundancy to ensure error correction (even though such concepts, as we know them today, did not exist at the time!). African languages being tonal, and the drums being limited to conveying just the tonal element, this redundancy was essential to resolve ambiguity. For instance, if a drummer wanted to say something about a chicken (koko), he would drum: "koko, the one that does the sound kiokio".

The Written Word

While the oral culture was wondrous in itself (Homer’s epics possibly evolved from oral transfer across generations), the invention of writing presented possibilities and difficulties hitherto unheard of. Firstly, it had its detractors, among them influential figures such as Plato: "Writing diminishes the faculty of memory among men. […]" Compare this to the qualm of our age: “Is Google making us stupid?”

Gleick argues, on the basis of the dialectics of earlier historians, that writing was essential for the possibility of conscious thought. Plato and other Greek philosophers, while skeptical of writing, developed logic through the faculty of writing. Oral cultures lacked a comprehension of syllogisms, as sociological experiments have shown.

The Babylonians developed mathematics advanced for its time, as their cuneiform tablets indicate. Regardless of the initial resistance to writing, it marched on to produce sciences, arts, and philosophy of the highest quality. The notion of “history”, let alone a “history of information”, was inconceivable without first inventing writing.

The Telegraph

Information also presented the problem of communication. Beacons, smoke signals, and the rest were all limited in their range, both of vocabulary and of distance. Then along came the telegraph. Of course, the first telegraph systems were mechanical: an intricate system of wooden arms was used to relay messages from one station to the next, with different positions of the arms signifying different messages. The number of states the telegraph could express grew considerably, but the efficacy of a mechanical system remained quite limited.

When Morse developed his electrical telegraph system, he put a lot of thought into one of the primary problems of communication: that of “encoding”. That is to say, representing spoken words as written symbols at one level, and then representing the written symbols in a completely new form, that of electric pulses, the so-called “dots and dashes”.

Morse and his student Vail wanted to improve the efficiency of the encoding. Vail visited the local newspaper office and found that “there were 15000 E’s, followed by 12000 T’s, and only 200 Z’s”. He had, in effect, discovered a version of Zipf’s law. It has since been shown that Morse code is within 15% of the optimal representation of the English language.

It took a while to arrive at the right intuition for the encoding. People first thought of using as many wires as there were letters in the alphabet. Then came the idea of using needles deflected by electromagnetism, again one per letter. The number of needles was reduced by Baron Schilling in Russia, first to five and then to just one, using combinations of left and right deflections to denote letters. Gauss and Weber came up with a binary code using the direction and number of needle deflections.

Alfred Vail devised a simple spring-loaded lever whose only function was to close and open a circuit. He called it a key. With the key, an operator could send signals at the rate of hundreds per minute. Once the relay was invented (which happened almost simultaneously), long-distance electric telegraphy became a reality.

"A net-work of nerves of iron wire"

Even the electric telegraph was not beyond the scorn of some of the thinkers of the day. Apparently, someone assigned to assess the technology—he was a physician and scientist—was not impressed: "What can we expect of a few wretched wires?"

Regardless, newspapers immediately saw the benefits of the new technology. Numerous newspapers across the globe called themselves The Telegraph or a variant of it. News items were described as “Communicated by Electric Telegraph” (a la “Breaking News” or “Exclusive Story”). Very soon, fire brigades and police stations started linking up their communications. Shopkeepers started advertising their ability to take orders by telegraph. That was perhaps the beginning of online shopping!

People, especially writers, need an analogue: a way of describing new things by comparison with what they see in nature. The telegraph reminded them of spider webs, labyrinths, and mazes. But perhaps for the first time the word network was used for something that connects and encompasses, and that can serve as a means of communication. "The whole net-work of wires, all quivering from end to end with signals of human intelligence."

Apr 22

Extract dominant colors from an image with ColorWeave

[This post was written by Jyotiska with contributions from Sanket. Jyotiska is a Data Engineer at DataWeave with keen interest in Python and experimentation. He works with Sanket in the Products and Strategy team.]


We have taken a special interest in colors in recent times. Some of us can even identify and name a couple of dozen different colors! The genesis for this project was PriceWeave’s Color Analytics offering. With Color Analytics, we provide detailed analysis of colors and other attributes for retailers and brands in the Apparel and Lifestyle products space.

The Idea

The initial idea was simply to extract the dominant colors from an image and generate a color palette. Fashion blogs and Pinterest pages are updated regularly by popular fashion brands and often feature their latest offerings for the current season and their newly released products. So we thought that if we crawled these blogs periodically, every few days or weeks, we could plot trends over time using the extracted colors. Such a timeline is very helpful for any online or offline merchant who wants to visualize the current trend in the market and plan their own product offerings.

We expanded this to include Apparel and Lifestyle products from eCommerce websites like Jabong, Myntra, Flipkart, and Yebhi, and stores of popular brands like Nike, Puma, and Reebok. We also used their Pinterest pages.

Color Extraction

The core of this work was to build a robust color extraction algorithm. We developed a couple of algorithms by extending some well-known techniques. One approach was to use standard unsupervised machine learning: we ran k-means clustering on the image data, where k is the number of colors we are trying to extract from the image.
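
A bare-bones sketch of this k-means approach, using Pillow and scikit-learn, might look like the following (our production code differs, and the file name is just a placeholder):

from PIL import Image
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(path, k=5):
    """Cluster the image's pixels in RGB space; the k cluster
    centers are taken as the dominant colors."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((200, 200))               # downsample to keep clustering fast
    pixels = np.array(img).reshape(-1, 3)
    kmeans = KMeans(n_clusters=k, n_init=10).fit(pixels)
    return ["#%02x%02x%02x" % tuple(c) for c in kmeans.cluster_centers_.astype(int)]

print(dominant_colors("product.jpg"))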

In another algorithm, we extracted all the possible color points from the image and used heuristics to come up with a final set of colors as a palette.

Another of our algorithms was built on top of the Python Image Library (PIL) and the Colorific package to extract and produce the color palette from the image.

Regardless of the approach, we soon found that both speed and accuracy were a problem. Our k-means implementation produced decent results, but it took 3-4 seconds to process a single image! That might not seem like much for a small set of images, but the script took 2 days to process 40,000 products from Myntra.

After this, we did a lot of tweaking of our algorithms and came up with a faster and more accurate model, which we are using currently.

ColorWeave API

We have open-sourced an early version of our implementation. It is available on GitHub here. You can also download the Python package from the Python Package Index here. The examples below show its usage.

Retrieve dominant colors from an image URL


from colorweave import palette
print palette(url="image_url")

Retrieve the n most dominant colors from an image and print them as JSON:

print palette(url="image_url", n=6, output="json")

Print a dictionary with each dominant color mapped to its CSS3 color name

print palette(url="image_url", n=6, format="css3")

Print the list of dominant colors using the k-means clustering algorithm

print palette(url="image_url", n=6, mode="kmeans")

Data Storage

The next challenge was to come up with an ideal data model that would let us both store the data and query it. Initially, all the processed data was indexed by Solr and we used its REST API for all our querying. Soon we realized that we had to come up with a better data model to store, index, and query the data.

We looked at a few NoSQL databases, especially column-oriented stores like Cassandra and HBase and document stores like MongoDB. Since the details of a single product can be represented as a JSON object, and key-value style storage proves quite useful for querying, we settled on MongoDB. We imported our entire dataset (~160,000 product details) into MongoDB, with each product stored as a single document.

Color Mapping

We still had one major problem to resolve. Our color extraction algorithm produces the color palette in hexadecimal format, but in order to build a useful query interface we had to translate the hex codes into human-readable color names. We had two options: the CSS 2.0 web color names, consisting of 16 basic colors (White, Silver, Gray, Black, Red, Maroon, Yellow, Olive, Lime, Green, Aqua, Teal, Blue, Navy, Fuchsia, Purple), or the CSS 3.0 web color names, consisting of 140 colors. We used both to map colors and stored the resulting names along with each image.
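
The mapping itself boils down to a nearest-neighbour lookup in RGB space. Here is a minimal sketch using only a handful of the 16 CSS 2.0 colors listed above (the real mapping uses the full CSS 2.0 and CSS 3.0 lists):

# A few of the CSS 2.0 web colors; the real mapping covers all 16
# basic colors (and the 140 CSS 3.0 names).
CSS_COLORS = {
    "white": (255, 255, 255),
    "black": (0, 0, 0),
    "red":   (255, 0, 0),
    "navy":  (0, 0, 128),
    "olive": (128, 128, 0),
}

def hex_to_rgb(hexcode):
    hexcode = hexcode.lstrip("#")
    return tuple(int(hexcode[i:i + 2], 16) for i in (0, 2, 4))

def nearest_color_name(hexcode):
    """Return the CSS color whose RGB value is closest (Euclidean distance)."""
    r, g, b = hex_to_rgb(hexcode)
    def distance(name):
        cr, cg, cb = CSS_COLORS[name]
        return (cr - r) ** 2 + (cg - g) ** 2 + (cb - b) ** 2
    return min(CSS_COLORS, key=distance)

print(nearest_color_name("#8b0000"))  # dark red -> "red"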

Color Hierarchy

We mapped the hex codes to CSS 3.1, which has shades for every basic color. Then we assigned a parent basic color to every shade and stored them separately. We also created two fields, one for the primary colors and one for the extended colors, which help us with indexing and querying. In the end, each product had 24 properties associated with it! MongoDB made it easy to query the data using the aggregation framework.
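
As an illustration, counting products per primary color for a brand with the aggregation framework looks roughly like the following. The connection string, collection name, and field names here are hypothetical, not our actual schema:

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder host
collection = client["dw"]["products"]                      # hypothetical collection

# Count products per primary color for a given brand.
# 'brand' and 'primary_colors' are hypothetical field names.
pipeline = [
    {"$match": {"brand": "Nike"}},
    {"$unwind": "$primary_colors"},
    {"$group": {"_id": "$primary_colors", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in collection.aggregate(pipeline):
    print(row["_id"], row["count"])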

What next?

A few things. An advanced version of color extraction (with a number of other exciting features) is being integrated into PriceWeave. We are also working on building a small consumer-facing product where users will be able to query and find products based on color and other attributes. There are many other possibilities, some of which we will discuss when the time is ripe. Signing off for now!