DataWeave

Aug 09

A Peek into GNU Parallel

[This post was written by Jyotiska. Jyotiska is a Data Engineer at DataWeave with keen interest in Python and experimentation. He works with Sanket in the Products and Strategy team.]

GNU Parallel is a tool that can be deployed from a shell to parallelize job execution. A job can be anything from a simple shell script to complex interdependent Python/Ruby/Perl scripts. The simplicity of the ‘Parallel’ tool lies in its usage. A modern-day computer with a multicore processor is enough to run your jobs in parallel. A single-core machine can also run the tool, but you won’t see any difference, as the jobs will simply be context-switched by the underlying OS.

At DataWeave, we use Parallel for automating and parallelizing a number of resource-intensive processes, ranging from crawling to data extraction. All our servers have 8 cores, each capable of executing 4 threads, so we saw a huge performance gain after deploying Parallel. Our in-house image processing algorithms used to take more than a day to process 200,000 high resolution images. After using Parallel, we have brought that time down to a little over 40 minutes!

GNU Parallel can be installed on any Linux box and does not require sudo access. The following command will install the tool:
(wget -O - pi.dk/3 || curl pi.dk/3/) | bash
GNU Parallel can read inputs from a number of sources — a file, the command line, or stdin. The following simple example takes its input from the command line and executes the jobs in parallel:
parallel echo ::: A B C
The following takes the input from a file:
parallel -a somefile.txt echo
Or STDIN:
cat somefile.txt | parallel echo
The inputs can be from multiple files too:
parallel -a somefile.txt -a anotherfile.txt echo
The number of simultaneous jobs can be controlled using the --jobs or -j switch. The following command will run 5 jobs at once:
parallel --jobs 5 echo ::: A B C D E F G H I J
By default, the number of jobs will be equal to the number of CPU cores. However, this can be overridden using percentages. The following will run 2 jobs per CPU core:
parallel --jobs 200% echo ::: A B C D
If you do not want to set any limit, the following will use all the available CPU cores on the machine. However, this is NOT recommended in a production environment, as other jobs running on the machine will be vastly slowed down.
parallel --jobs 0 echo ::: A B C
Enough with the toy examples. The following shows how to bulk insert JSON documents into a MongoDB cluster in parallel. We almost always need to insert millions of documents quickly into our MongoDB cluster, and inserting them serially doesn’t cut it. Moreover, MongoDB can handle parallel inserts.

The following is a snippet of a file with JSON documents. Let’s assume that there are a million similar records in the file, with one JSON document per line.

{"name": "John", "city": "Boston", "age": 23}
{"name": "Alice", "city": "Seattle", "age": 31}
{"name": "Patrick", "city": "LA", "age": 27}
...
...
The following Python script takes a JSON document as its command-line argument and inserts it into the “people” collection in the “dw” database.

import json
import pymongo
import sys

# parse the JSON document passed as the first command-line argument
document = json.loads(sys.argv[1])

# connect to the cluster and pick the target database and collection
client = pymongo.MongoClient("mongodb://somehost:someport")
db = client["dw"]
collection = db["people"]

try:
    collection.insert(document)
except Exception as e:
    print "Could not insert document in db", repr(e)
Now, to run this in parallel, the following command should do the magic:
cat people.json | parallel 'python insertDB.py {}'
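
This spawns one Python process (and one database connection) per document. If that overhead becomes noticeable, Parallel's --pipe switch can feed batches of lines to each worker's stdin instead. Below is a hedged sketch; the script name, batch size, and job count are illustrative, not part of our actual setup:

cat people.json | parallel --pipe -N1000 --jobs 8 python insert_batch.py

where insert_batch.py could look roughly like this:

# insert_batch.py -- read one JSON document per line from stdin and bulk-insert the batch
import json
import sys
import pymongo

collection = pymongo.MongoClient("mongodb://somehost:someport")["dw"]["people"]
docs = [json.loads(line) for line in sys.stdin if line.strip()]
if docs:
    # insert() accepts a list (bulk insert) in older pymongo; newer releases use insert_many()
    collection.insert(docs)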
That’s it! There are many switches and options available for advanced processing. They can be accessed by running man parallel in the shell. The following page also has a set of tutorials: GNU Parallel Tutorials.

Jun 02

JSON vs simpleJSON vs ultraJSON

[This post was written by Jyotiska. Jyotiska is a Data Engineer at DataWeave with keen interest in Python and experimentation. He works with Sanket in the Products and Strategy team.]

Without argument, the most used file format at DataWeave is JSON. We use JSON everywhere — managing product lists, storing data extracted from crawl dumps, generating assortment reports, and almost everywhere else. So it is very important to make sure that the libraries and packages we are using are fast enough to handle large data sets. There are two popular packages for handling JSON — the first is the stock json package that comes with the default installation of Python, the other is simplejson, an optimized and maintained JSON package for Python. The goal of this blog post is to introduce ultrajson, or UltraJSON, a JSON library written mostly in C and built to be extremely fast.

We have benchmarked three popular operations — load, loads and dumps. We have a dictionary with 3 keys - id, name and address. We dump this dictionary using json.dumps() and store it in a file, then use json.loads() and json.load() separately to load the dictionaries back from the file. We performed this experiment on 10000, 50000, 100000, 200000, 500000, and 1000000 dictionaries, and observed how much time each library takes to perform the operation.
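
For reference, the line-by-line dumps benchmark can be reproduced with a harness roughly like the one below. This is a minimal sketch, not the exact script we used; the document contents, file names, and number of dictionaries are assumptions:

import time
import json
import simplejson
import ujson

# build the test dictionaries (3 keys each, as described above)
docs = [{"id": i, "name": "user%d" % i, "address": "city %d" % i}
        for i in range(100000)]

for lib in (json, simplejson, ujson):
    start = time.time()
    # dump the content dictionary by dictionary, one per line
    with open("dump_%s.txt" % lib.__name__, "w") as f:
        for d in docs:
            f.write(lib.dumps(d) + "\n")
    print("%s: %.3f ms" % (lib.__name__, (time.time() - start) * 1000))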

dumps operation line by line

Here are the results we received using the json.dumps() operation. We have dumped the content dictionary by dictionary.

Number of dicts    json (in ms)    simplejson (in ms)    ujson (in ms)
10000              84.106          115.876               25.891
50000              395.163         579.576               122.626
100000             820.280         1147.962              246.721
200000             1620.239        2277.786              487.402
500000             3998.736        5682.641              1218.653
1000000            7847.999        11501.038             2530.791

We notice that json performs better than simplejson, but ultrajson wins the game, coming in roughly 3x faster than stock json.

dumps operation (all dictionaries at once)

In this experiment, we have stored all the dictionaries in a list and dumped the list using json.dumps().

Number of dicts    json (in ms)    simplejson (in ms)    ujson (in ms)
10000              21.674          21.941                14.484
50000              102.739         118.851               67.215
100000             199.454         240.849               138.830
200000             401.376         476.392               270.667
500000             1210.664        1376.511              834.013
1000000            2729.538        2983.563              1830.707

simplejson is almost as good as stock json here, but again ultrajson outperforms both, running roughly 1.5x faster. Now let's see how they perform for the load and loads operations.

load operation on a list of dictionaries

Now we do the load operation on a list of dictionaries and compare the results.

Number of dicts    json (in ms)    simplejson (in ms)    ujson (in ms)
10000              47.040          8.932                 10.706
50000              165.877         44.065                45.629
100000             356.405         97.277                105.948
200000             718.873         185.917               205.120
500000             1699.623        461.605               503.203
1000000            3441.075        949.905               1055.966

Surprisingly, simplejson beats the other two here, with ultrajson a close second. simplejson is roughly 3.5x faster than stock json, and ultrajson is not far behind.

loads operation on dictionaries

In this experiment, we load dictionaries from the file one by one and pass them to the json.loads() function.

Number of dicts    json (in ms)    simplejson (in ms)    ujson (in ms)
10000              99.501          69.517                15.917
50000              407.873         246.794               76.369
100000             893.989         526.257               157.272
200000             1746.961        1025.824              318.306
500000             8446.053        2497.081              791.374
1000000            8281.018        4939.498              1642.416

Again ultrajson steals the show, coming in roughly 5x faster than stock json and about 3x faster than simplejson.

That is all the benchmarks we have here. The verdict is pretty clear. Use simplejson instead of stock json in any case, since simplejson is a well-maintained package. If you really want something extremely fast, then go for ultrajson. In that case, keep in mind that ultrajson only works with well-defined collections and will not work for un-serializable ones. But if you are dealing with text, this should not be a problem.
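
Since all three libraries expose the same dumps/loads interface for common cases, a drop-in import pattern like the following is one easy way to act on this verdict. This is a sketch, and note that ujson does not support every keyword argument of the stdlib json module:

# prefer the fastest available JSON library, falling back to the stdlib
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json

print(json.dumps({"name": "John", "city": "Boston", "age": 23}))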

May 01

James Gleick’s The Information — The Beginnings

[The author of this post is Sanket. Sanket heads Products and Strategy at DataWeave. When he manages to find time, he plays with his daughter or reads two more pages of a book he has been trying to finish for several months.]

James Gleick’s 2011 book The Information (ISBN: 0375423729, Pages: 527, Publisher: Knopf Doubleday Publishing Group) is an important book. It is essential reading for our time: the fabled Information Age. The subtitle of the book is very apt: A History, a Theory, a Flood. As it suggests, the book traces humanity’s quest to understand the nature of “information”, to represent it (in various forms: spoken word, phonemes, written symbols, electric pulses, etc.), and to communicate it.

The rest of this blog post is not a review of the book (just go read the book instead of this!), but a series of thoughts and ideas formed upon reading the book. Just like the book, I am going to write them in three parts.

Talking Drums

While there are no distinct sub-parts to the book, we can deem the initial chapters The History. Gleick starts with a fascinating account of the “talking drums of Africa”. Drums speaking the numerous African tongues and dialects, developed by peoples with no written script, were used to communicate information over hundreds of miles within minutes, a feat unachievable by the fastest runner or the best horse in Europe.

Europeans used smoke signals that typically communicated binary (or at best a very limited number of) status messages. The drums, in contrast, communicated complex messages: not just warnings and quick alerts, but “songs, jokes, etc”. A drummer added sufficient contextual redundancy to ensure error correction (even though such concepts, as we know them today, did not exist at that time!). African languages being tonal, and the drums being limited to conveying just the tonal element, the redundancy was essential to resolve context. For instance, if a drummer wanted to say something about a chicken (koko), he would drum: "koko, the one that does the sound kiokio".

The Written Word

While the oral culture was wondrous in itself (Homer’s epics possibly evolved from oral transfer across generations), the invention of writing presented possibilities and difficulties hitherto unheard of. Firstly, it had its detractors, among them influential people such as Plato: "Writing diminishes the faculty of memory among men. […]" Compare this to the qualm of our age: “Is Google making us stupid?”

Gleick argues, on the basis of the dialectics of earlier historians, that writing was essential for the possibility of conscious thought. Plato and other Greek philosophers, while skeptical about writing, developed logic through the faculty of writing. Purely oral cultures lacked the comprehension of syllogisms, as sociological experiments have shown.

Babylonians developed advanced mathematics for their time, as indicated by their cuneiform tablets. Regardless of the initial resistance to writing, it marched on to produce the sciences, arts, and philosophy of the highest quality. The notion of “history”, let alone “history of information”, was inconceivable without first inventing writing.

The Telegraph

Information also presented the problem of communication. Beacons, smoke signals, and the rest were all limited in their range, both of vocabulary and of distance. Then came along the telegraph. Of course, the first telegraph systems were mechanical, wherein an intricate system of wooden arms was used to relay messages from one station to the next. The positions of the arms of the telegraph implied different messages. The number of states in which the telegraph could exist increased quite a bit. However, the efficacy of a mechanical system was quite limited.

When Morse developed his electrical telegraph system, he put a lot of thought into one of the primary problems of communication: that of “encoding”. That is to say, representing spoken words as written symbols at one level; then representing the written symbols in a completely new form, that of electric pulses, the so-called “dots and dashes”.

Morse and his student Vail wanted to improve the efficiency of the encoding. Vail visited the local newspaper office and found that “there were 15000 E’s, followed by 12000 T’s, and only 200 Z’s”. He had indeed discovered a version of Zipf’s law. It is shown today that the Morse code is within 15% of the optimal representation of the English language.

It took a while to get the right intuition to develop the encoding. People thought about as many wires as there were letters in the alphabet. Then came the idea of using needles deflected by electromagnetism—again, one per letter. The number of needles was reduced by Baron Schilling in Russia, first to five and then to just one. He used combinations of left and right signals to denote letters. Gauss and Weber came up with a binary code using the direction and the number of needle deflections.

Alfred Vail devised a simple spring-loaded lever which had the minimum functionality of closing and opening of a circuit. He called it a key. With the key an operator could send signals at the rate of hundreds per minute. Once the relay was invented (which occurred almost simultaneously), long distance electric telegraphy became a reality.

"A net-work of nerves of iron wire"

Even the electric telegraph was not beyond the scorn of some of the thinkers of the day. Apparently, someone assigned to assess the technology—he was a physician and scientist—was not impressed: "What can we expect of a few wretched wires?"

Regardless, newspapers immediately saw the benefits of the new technology. Numerous newspapers across the globe called themselves The Telegraph or a variant of it. News items were described as “Communicated by Electric Telegraph” (a la “Breaking News” or “Exclusive Story”). Very soon fire brigades and police stations started linking up their communications. Shopkeepers started advertising their ability to take orders by telegraph. That was perhaps the beginning of online shopping!

People, especially writers, need an analogue, a way of describing new things in comparison to what they see in nature. The telegraph reminded them of spider webs, labyrinths, and mazes. But perhaps for the first time the word network was used, as something that connects and encompasses, and can be used as a means of communication. "The whole net-work of wires, all quivering from end to end with signals of human intelligence."

Apr 22

Extract dominant colors from an image with ColorWeave

[This post was written by Jyotiska with contributions from Sanket. Jyotiska is a Data Engineer at DataWeave with keen interest in Python and experimentation. He works with Sanket in the Products and Strategy team.]


We have taken a special interest in colors in recent times. Some of us can even identify and name a couple of dozen different colors! The genesis of this project was PriceWeave’s Color Analytics offering. With Color Analytics, we provide detailed analysis of colors and other attributes for retailers and brands in the Apparel and Lifestyle products space.

The Idea

The initial idea was simply to extract the dominant colors from an image and generate a color palette. Fashion blogs and Pinterest pages are updated regularly by popular fashion brands and often feature their latest offerings for the current season and their newly released products. So we thought that if we crawled these blogs periodically, every few days or weeks, we could plot the trends using the extracted colors. Such a timeline is very helpful for any online/offline merchant who wants to visualize the current trend in the market and plan their own product offerings.

We expanded this to include Apparel and Lifestyle products from eCommerce websites like Jabong, Myntra, Flipkart, and Yebhi, and stores of popular brands like Nike, Puma, and Reebok. We also used their Pinterest pages.

Color Extraction

The core of this work was to build a robust color extraction algorithm. We developed a couple of algorithms by extending some well-known techniques. One approach we followed was to use standard unsupervised machine learning: we ran k-means clustering against our image data, where k refers to the number of colors we are trying to extract from the image.
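
For illustration, a bare-bones version of the k-means approach might look like the sketch below. It assumes Pillow, NumPy, and scikit-learn are available, and is not the actual ColorWeave implementation:

from PIL import Image
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(path, k=5):
    # downscale to keep clustering fast, then flatten to an (N, 3) array of RGB points
    img = Image.open(path).convert("RGB").resize((100, 100))
    pixels = np.array(img).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    # each cluster centre is one dominant color; return them as hex codes
    return ["#%02x%02x%02x" % tuple(c) for c in km.cluster_centers_.astype(int)]

print(dominant_colors("some_image.jpg"))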

In another algorithm, we extracted all the possible color points from the image and used heuristics to come up with a final set of colors as a palette.

Another of our algorithms was built on top of the Python Image Library (PIL) and the Colorific package to extract and produce the color palette from the image.

Regardless of the approach we used, we soon found out that both speed and accuracy were a problem. Our k-means implementation produced decent results, but it took 3-4 seconds to process a single image! This might not seem like much for a small set of images, but the script took 2 days to process 40,000 products from Myntra.

After this, we did a lot of tweaking of our algorithms and came up with a faster and more accurate model, which we are using currently.

ColorWeave API

We have open sourced an early version of our implementation. It is available on GitHub here. You can also download the Python package from the Python Package Index here. The examples below illustrate its usage.

Retrieve dominant colors from an image URL


from colorweave import palette
print palette(url="image_url")

Retrieve n dominant colors from an image and print them as JSON:

print palette(url="image_url", n=6, output="json")

Print a dictionary with each dominant color mapped to its CSS3 color name

print palette(url="image_url", n=6, format="css3")

Print the list of dominant colors using k-means clustering algorithm

print palette(url="image_url", n=6, mode="kmeans")

Data Storage

The next challenge was to come up with an ideal data model to store the data, one which would also let us query it. Initially, all the processed data was indexed by Solr and we used its REST API for all our querying. Soon we realized that we had to come up with a better data model to store, index, and query the data.

We looked at a few NoSQL databases, especially column-oriented stores like Cassandra and HBase and document stores like MongoDB. Since the details of a single product can be represented as a JSON object, and key-value storage can prove quite useful in querying, we settled on MongoDB. We imported our entire data set (~160,000 product details) into MongoDB, with each product represented as a single document.
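
As an illustration, a single product document might look roughly like this (the field names here are hypothetical, not the actual PriceWeave schema):

# one product = one MongoDB document
product = {
    "product_id": "ABC123",
    "title": "Printed Cotton T-Shirt",
    "brand": "SomeBrand",
    "source": "myntra",
    "image_url": "http://example.com/image.jpg",
    "color_palette": ["#1f2a44", "#d94f30", "#f2e9dc"],   # hex codes from the extractor
    "primary_colors": ["navy", "red"],                    # mapped basic color names
    "extended_colors": ["midnightblue", "tomato"],        # mapped CSS3 color names
}
# collection.insert(product)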

Color Mapping

We still had one major problem to resolve. Our color extraction algorithm produces the color palette in hexadecimal format, but in order to build a useful query interface, we had to translate the hex codes into human-readable color names. We had two options: we could use the CSS 2.0 web color names, consisting of 16 basic colors (White, Silver, Gray, Black, Red, Maroon, Yellow, Olive, Lime, Green, Aqua, Teal, Blue, Navy, Fuchsia, Purple), or the CSS 3.0 web color names, consisting of about 140 colors. We used both to map colors and stored those names along with each image.
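
A minimal sketch of the mapping idea, using only the 16 CSS 2.0 basic colors listed above and picking the nearest one by Euclidean distance in RGB space (the actual mapping also covers the CSS 3.0 names, as described):

def nearest_basic_color(hex_code):
    # the 16 CSS 2.0 basic colors with their standard RGB values
    basic = {
        "white": (255, 255, 255), "silver": (192, 192, 192), "gray": (128, 128, 128),
        "black": (0, 0, 0), "red": (255, 0, 0), "maroon": (128, 0, 0),
        "yellow": (255, 255, 0), "olive": (128, 128, 0), "lime": (0, 255, 0),
        "green": (0, 128, 0), "aqua": (0, 255, 255), "teal": (0, 128, 128),
        "blue": (0, 0, 255), "navy": (0, 0, 128), "fuchsia": (255, 0, 255),
        "purple": (128, 0, 128),
    }
    h = hex_code.lstrip("#")
    r, g, b = int(h[0:2], 16), int(h[2:4], 16), int(h[4:6], 16)
    # pick the named color closest in RGB space (squared Euclidean distance)
    return min(basic, key=lambda n: (r - basic[n][0]) ** 2 +
                                    (g - basic[n][1]) ** 2 +
                                    (b - basic[n][2]) ** 2)

print(nearest_basic_color("#f4a460"))   # a sandy shade snaps to one of the 16 basics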

Color Hierarchy

We mapped the hex codes to the CSS 3.0 names, which cover many shades of each basic color. Then we assigned a parent basic color to every shade and stored them separately. We also created two fields, one for the primary colors and the other for the extended colors, which helps us in indexing and querying. In the end, each product had 24 properties associated with it! MongoDB made it easy to query the data using the aggregation framework.
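
For example, with a recent pymongo, a simple aggregation over the color fields might look like this. The collection and field names ("products", "primary_colors") are illustrative, not our actual schema:

import pymongo

client = pymongo.MongoClient()
db = client["dw"]

# count how many products carry each primary color, most common first
pipeline = [
    {"$unwind": "$primary_colors"},
    {"$group": {"_id": "$primary_colors", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in db["products"].aggregate(pipeline):
    print(row)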

What next?

A few things. An advanced version of color extraction (with a number of other exciting features) is being integrated into PriceWeave. We are also working on building a small consumer-facing product where users will be able to query and find products based on color and other attributes. There are many other possibilities, some of which we will discuss when the time is ripe. Signing off for now!

Apr 11

Implementing DataWeave’s Social API for Social Data Analysis

[The author of this post is Apoorv. Apoorv did his internship at DataWeave during January and February, 2014. Here he shares his experiences with Twitter API, MongoDB, and implementing REST APIs.]

In today’s world, the analysis of any social media stream can yield invaluable information about, well, pretty much everything. If you are a business catering to a large number of consumers, it is a very important tool for understanding and analyzing the market’s perception of you, and how your audience reacts to whatever you present before them.

At DataWeave, we sat down to create a setup that would do this for some e-commerce stores and retail brands. And the first social network we decided to track was the micro-blogging giant, Twitter. Twitter is a great medium for engaging with your audience. It’s also a very efficient marketing channel to reach out to a large number of people.

Data Collection

The very first issue that needs to be tackled is collecting the data itself. Now, quite understandably, Twitter protects its data vigorously. However, it does have a pretty solid REST API for data distribution purposes. The API is simple, nothing too complex, and returns data in the easy-to-use JSON format. Take a look at the timeline API, for example. It is quite straightforward and has a lot of detailed information.

The issue with the Twitter API, however, is that it is heavily rate limited. Each endpoint can be called between 15 and 180 times in a 15-minute window, depending on the endpoint. While this is good enough for small projects that don’t need much data, for any real-world application these rate limits can be really frustrating. To get around this, we used the Streaming API, which creates a long-lived HTTP GET request that continuously streams tweets from the public timeline.

Also, Twitter seems to suddenly return null values in the middle of the stream, which can make the streamer crash if we don’t take care. As for us, we simply threw away all null data before it reached the analysis phase, and as an added precaution, designed a simple e-mail alert for when the streamer crashed.
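
One way to implement such a streamer is with the tweepy library. Below is a hedged sketch using its 3.x-era API; the credentials, tracked terms, and database/collection names are placeholders, and the null filtering described above is shown as a simple check:

import json
import pymongo
import tweepy

collection = pymongo.MongoClient()["dw_social"]["tweets"]

class TweetListener(tweepy.StreamListener):
    def on_data(self, raw):
        tweet = json.loads(raw)
        # throw away null / empty payloads before they reach the analysis phase
        if tweet and tweet.get("text"):
            collection.insert(tweet)
        return True

    def on_error(self, status_code):
        # returning False disconnects the stream (e.g. on 420: rate limited)
        return False

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
tweepy.Stream(auth, TweetListener()).filter(track=["flipkart", "myntra", "jabong"])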

Data Storage

Next is data storage. Data is traditionally stored in tables, using an RDBMS. But for this, we decided to use MongoDB, as a document store seemed quite suitable for our needs. While I didn’t have much of a clue about MongoDB or what purpose it was going to serve at first, I realized that it is a seriously good alternative to MySQL, PostgreSQL, and other relational schema-based data stores for a lot of applications.

Some of the advantages I found out very soon were: a document-based data model that is very easy to handle (documents are analogous to Python dictionaries), and support for expressive queries. I recommend using it for some of your DB projects. You can play around with it here.

Data Processing

Next comes data processing. While data processing in MongoDB is simple, it can also be a hard thing to learn, especially for someone like me, who had no experience anywhere outside SQL. But MongoDB queries are simple to learn once the basics are clear.

For example, in a DB DWSocial with a collection tweets, the syntax for getting all tweets would be something like this in a Python environment:

rt = list(db.tweets.find())
The list type-cast here is necessary because, without it, the result is just a MongoDB cursor object rather than the documents themselves. Now, to find all tweets where user_id is 1234, we have:
rt = list(db.tweets.find({ 'user_id': 1234 }))

Apart from this, we used regexes to detect specific types of tweets, for example “offers”, “discounts”, and “deals”. For this, we used the Python re library, which deals with regexes. Suffice it to say, my reaction to regexes for the first two days was mostly one of confusion.

Once again, it’s just initial stumbles. After some (okay, quite some) help from Thothadri, Murthy and Jyotiska, I finally managed a basic parser that could detect which tweets were offers, discounts and deals. A small code snippet for this purpose is below.


def deal(id):
    # regex for offer/discount/deal related terms and price patterns
    re_offers = re.compile(r'''
        \b
        (?:
          deals?
          |
          offers?
          |
          discount
          |
          promotion
          |
          sale
          |
          rs?
          |
          rs\?
          |
          inr\s*([\d\.,])+
          |
          ([\d\.,])+\s*inr
        )
        \b
        |
        \b\d+%
        |
        \$\d+\b
        ''',
        re.I | re.X)
    # tweets, db, track and fourteen_days_ago are defined elsewhere in the module
    x = list(tweets.find({'user_id': id, 'created_at': {'$gte': fourteen_days_ago}}))
    mylist = []
    for a in x:
        b = re_offers.findall(a.get('text'))
        if b:
            print a.get('id')
            mylist.append(a.get('id'))
            w = list(db.retweets.find({'id': a.get('id')}))
            if w:
                mydict = {'id': a.get('id'), 'rt_count': w[0].get('rt_count'),
                          'text': a.get('text'), 'terms': b}
            else:
                mydict = {'id': a.get('id'), 'rt_count': 0,
                          'text': a.get('text'), 'terms': b}
            track.insert(mydict)

This is much less complicated than it seems. And it also brings us to our final step—integrating all our queries into a REST-ful API.

Data Serving

For this, multiple web frameworks are available. The ones we considered were Flask, Django, and Bottle.

Weighing the pros and cons of every framework can be tedious. I did find this awesome presentation on slideshare though, that succinctly summarizes each framework. You can go through it here.

We finally settled on Bottle as our framework of choice. The reasons are simple. Bottle is monolithic, i.e., it uses the one-file approach. For small applications, this makes for code that is easier to read and maintain.

Some sample web address routes are shown here:


import json
from bottle import route, request

# tweets and fourteen_days_ago are defined elsewhere in the module

# show all tracked accounts
id_legend = {57947109: 'Flipkart', 183093247: 'HomeShop18', 89443197: 'Myntra', 431336956: 'Jabong'}

@route('/ids')
def get_ids():
    result = json.dumps(id_legend)
    return result

#show all user mentions for a particular account
@route('/user_mentions')
def user_mention():
    m = request.query.id
    ac_id = int(m)
    t = list(tweets.find({'created_at': { '$gte': fourteen_days_ago }, 'retweeted': 'no', 'user_id': { '$ne': ac_id} }))
    a = len(t)
    mylist = []
    for i in t:
        mylist.append({i.get('user_id'): i.get('id')})
    x = { 'num_of_mentions': a, 'mentions_details': mylist }
    result = json.dumps(x)
    return result
This is how the DataWeave Social API came into being. I had a great time doing this, with special credits to Sanket, Mandar and Murthy for all the help they gave me. That’s all for now, folks!