Jun 02

JSON vs simpleJSON vs ultraJSON

[This post was written by Jyotiska. Jyotiska is a Data Engineer at DataWeave with keen interest in Python and experimentation. He works with Sanket in the Products and Strategy team.]

Without argument, the most used file format used in DataWeave is JSON. We use JSON everywhere — managing product lists, storing data extracted from crawl dumps, generating assortment reports, and almost everywhere else. So it is very important to make sure that the libraries and packages we are using are fast enough to handle large data sets. There are two popular packages used for handling json — first is the stock json package that comes with default installation of Python, the other one is simplejson which is an optimized and maintained json package for Python. The goal of this blog post is to introduce ultrajson or Ultra JSON, a JSON library written mostly in C and built to be extremely fast.

We have done the benchmark on three popular operations — load, loads and dumps. We have a dictionary with 3 keys - id, name and address. We will dump this dictionary using json.dumps() and store it in a file. Then we will use json.loads() and json.load() separately to load the dictionaries from the file. We have performed this experiment on 10000, 50000, 100000, 200000, 1000000 dictionaries and observed how much time it takes to perform the operation by each library.

dumps operation line by line

Here is the result we received using the json.dumps() operations. We have dumped the content dictionary by dictionary.

Number of dicts json (in ms) simplejson (in ms) ujson (in ms)
10000 84.106 115.876 25.891
50000 395.163 579.576 122.626
100000 820.280 1147.962 246.721
200000 1620.239 2277.786 487.402
500000 3998.736 5682.641 1218.653
1000000 7847.999 11501.038 2530.791

We notice that json performs better than simplejson but ultrajson wins the game with almost 400% speedup than stock json.

dumps operation (all dictionaries at once)

In this experiment, we have stored all the dictionaries in a list and dumped the list using json.dumps().

Number of dicts json (in ms) simplejson (in ms) ujson (in ms)
10000 21.674 21.941 14.484
50000 102.739 118.851 67.215
100000 199.454 240.849 138.830
200000 401.376 476.392 270.667
500000 1210.664 1376.511 834.013
1000000 2729.538 2983.563 1830.707

simplejson is almost as good as stock json, but again ultrajson outperforms them by 150% speedup. Now lets see how they perform for load and loads operation.

load operation on a list of dictionaries

Now we do the load operation on a list of dictionaries and compare the results.

Number of dicts json (in ms) simplejson (in ms) ujson (in ms)
10000 47.040 8.932 10.706
50000 165.877 44.065 45.629
100000 356.405 97.277 105.948
200000 718.873 185.917 205.120
500000 1699.623 461.605 503.203
1000000 3441.075 949.905 1055.966

Surprisingly, simplejson beats other two, with ultrajson being almost close to simplejson. Here, we observe that simplejson is almost 400% faster than stock json, same with ultrajson.

loads operation on dictionaries

In this experiment, we load dictionaries from the file one by one and pass them to the json.loads() function.

Number of dicts json (in ms) simplejson (in ms) ujson (in ms)
10000 99.501 69.517 15.917
50000 407.873 246.794 76.369
100000 893.989 526.257 157.272
200000 1746.961 1025.824 318.306
500000 8446.053 2497.081 791.374
1000000 8281.018 4939.498 1642.416

Again ultrajson steals the show, being almost 600% faster than stock json and 400% faster than simplejson.

That is all the benchmarks we have here. The verdict is pretty clear. Use simplejson instead of stock json in any case, since simplejson is well maintained repository. If you really want something extremely fast, then go for ultrajson. In that case, keep in mind that ultrajson only works with well defined collections and will not work for un-serializable collections. But if you are dealing with texts, this should not be a problem.

May 01

James Gleick’s The Information — The Beginnings

[The author of this post is Sanket. Sanket heads Products and Strategy at DataWeave. When he manages to find time he plays with his daughter or reads two more pages of a book he’s been trying to finish since several months.]

James Gleick’s 2011 book The Information (ISBN: 0375423729, Pages: 527, Publisher: Knopf Doubleday Publishing Group) is an important book. It is an essential reading of our time—-the fabled Information Age. The subtitle of the book is very apt: A History, a Theory, a Flood. As it suggests, the book traces the quest of human being towards understanding the nature of “information”, representing it (in various forms spoken word, phoneme, written symbols, electric pulses, etc.), and communicating it.

The rest of this blog post is not a review of the book (just go read the book instead of this!), but a series of thoughts and ideas formed upon reading the book. Just like the book, I am going to write them in three parts.

Talking Drums

While there are no distinct sub parts to the book, we can deem the initial chapters as The History. Gleick starts with a fascinating account of the “talking drums of Africa”. Drums speaking the numerous African tongues and dialects developed by illiterate primitives were used to communicate information over hundreds of miles within minutes, a feat unachievable by the fastest runner or the best horse in Europe.

Europeans used smoke signals that typically communicated binary (or at best a very limited number of) status messages. Whereas the drums communicated complex messages not just warnings and quick messages, but “songs, jokes, etc”. A drummer added sufficient contextual redundancy to ensure error correction (while such concepts as we know of them today did not exist at that time!). African languages being tonal, and the drums limited by conveying just the tonal element, the redundancy was essential to resolve context. For instance, if a drummer wanted to say something about a chicken (koko), he would drum: "koko, the one that does the sound kiokio".

The Written Word

While the oral culture was wondrous in itself (Homer’s epics possibly evolved from oral transfer across generations), the invention of writing presented possibilities and difficulties, hitherto unheard of. Firstly, it had its detractors among which were influential people such as Plato. "Writing diminishes the faculty of memory among men. […]" Compare this to the qualm of our age: “Is Google making us stupid?

Gleick argues on the basis of dialectics of earlier historians that writing was essential for the possibility of conscious thought. Plato and other Greek philosophers, while being skeptical about writing, developed logic through the faculty of writing. The oral culture lacked the comprehension of syllogisms as shown by sociological experiments.

Babylonians developed advanced mathematics for their time as indicated by their cuneiform tablets. Regardless of the initial resistance to writing, it marched on to produce the sciences, arts, and philosophy of the highest quality. The notion of “history”, leave aside “history of information”, was inconceivable without first inventing writing.

The Telegraph

Information also presented the problem of communication. Beacons, smoke signals, and the rest were all but limited by their range, both of vocabulary and distance. Then came along telegraph. Of course, the first telegraph systems were mechanical, wherein an intricate system of wooden arms was used to relay messages from one station to the next. The positions of the arms of the telegraph implied different messages. The number of states in which the telegraph could exist increased quite a bit. However, the efficacy of a mechanical system was quite limited.

When Morse developed his electrical telegraph system, he put a lot of thought into one of the primary problems of communication—-that of “encoding”. That is to say, representing the spoken words into written symbols at one level; then representing the written symbol into a completely new form, that of electric pulses, the so called “dots and dashes”.

Morse and his student Veil wanted to improve the efficiency of the encoding. Veil visited the local newspaper office and found that, “there were 15000 E’s, followed by 12000 T’s, and only 200 Z’s”. He had indeed discovered a version of Zipf’s law. It is shown today that the Morse code is within 15% of the optimal representation of the English language.

It took a while to get the right intuition to develop the encoding. People thought about as many wires as there were letters in the alphabet. Then came the idea of using needles deflected by electromagnetism—again one per letter. The number of needles were reduced by Baron Schilling in Russia, first to five and then to just one. He used combinations of left and right signals to denote letters. Gauss and Weber came up with a binary code using the direction and the number of needle deflections.

Alfred Vail devised a simple spring-loaded lever which had the minimum functionality of closing and opening of a circuit. He called it a key. With the key an operator could send signals at the rate of hundreds per minute. Once the relay was invented (which occurred almost simultaneously), long distance electric telegraphy became a reality.

"A net-work of nerves of iron wire"

Even the electric telegraph was not beyond the scorn of some of the thinkers of the day. Apparently, someone assigned to assess the technology—he was a physician and scientist—was not impressed: "What can we expect of a few wretched wires?"

Regardless, newspapers immediately saw the benefits of the new technology. Numerous newspapers across the globe called themselves The Telegraph or a variant of it. News items were described as “Communicated by Electric Telegraph” (a la “Breaking News” or “Exclusive Story”). Very soon fire brigades and police stations started linking their communications. Shopkeepers started advertising their ability to take orders by telegraphs. That was perhaps the beginning of online shopping!

People, especially writers need an analogue, a way of describing new things in comparison to what they see in nature. The telegraph reminded them of spider webs, labyrinths and mazes. But perhaps for the first time the word network was used, as something that connects and encompasses, and can be used as a means of communication. "The whole net-work of wires, all quivering from end to end with signals of human intelligence."

Apr 22

Extract dominant colors from an image with ColorWeave

[This post was written by Jyotiska with contributions from Sanket. Jyotiska is a Data Engineer at DataWeave with keen interest in Python and experimentation. He works with Sanket in the Products and Strategy team.]

We have taken a special interest in colors in recent times. Some of us can even identify and name a couple of dozen different colors! The genesis for this project was PriceWeave’s Color Analytics offering. With Color Analytics, we provide detailed analysis in colors and other attributes related to retailers and brands in Apparel and Lifestyle products space.

The Idea

The initial idea was to simply extract the dominating colors from an image and generate a color palette. Fashion blogs and Pinterest pages are updated regularly by popular fashion brands and often feature their latest offerings for the current season and their newly released products. So, we thought if we can crawl these blogs periodically after every few days/weeks, we can plot the trends in graphs using the extracted colors. This timeline is very helpful for any online/offline merchant to visualize the current trend in the market and plan out their own product offerings.

We expanded this to include Apparel and Lifestyle products from eCommerce websites like Jabong, Myntra, Flipkart, and Yebhi, and stores of popular brands like Nike, Puma, and Reebok. We also used their Pinterest pages.

Color Extraction

The core of this work was to build a robust color extraction algorithm. We developed a couple of algorithms by extending some well known techniques. One approach we followed was to use standard unsupervised machine learning techniques. We ran k-means clustering against our images data. Here k refers to the number of colors we are trying to extract from the image.

In another algorithm, we extracted all the possible color points from the image and used heuristics to come up with a final set of colors as a palette.

Another of our algorithms was built on top of the Python Image Library (PIL) and the Colorific package to extract and produce the color palette from the image.

Regardless of the approach we used, we soon found out that both speed and accuracy were a problem. Our k-means implementation produced decent results but it took 3-4 seconds to process an entire image! This might not seem much for a small set of images, but the script took 2 days to process 40,000 products from Myntra.

Post this, we did a lot of tweaking in our algorithms and came up with a faster and more accurate model which we are using currently.

ColorWeave API

We have open sourced an early version of our implementation. It is available of github here. You can also download the Python package from the Python Package Index here. Find below examples to understand its usage.

Retrieve dominant colors from an image URL

from colorweave import palette
print palette(url="image_url")

Retrive n dominant colors from a local image and print as json:

print palette(url="image_url", n=6, output="json")

Print a dictionary with each dominant color mapped to its CSS3 color name

print palette(url="image_url", n=6, format="css3")

Print the list of dominant colors using k-means clustering algorithm

print palette(url="image_url", n=6, mode="kmeans")

Data Storage

The next challenge was to come up with an ideal data model to store the data which will also let us query on it. Initially, all the processed data was indexed by Solr and we used its REST API for all our querying. Soon we realized that we have to come up with better data model to store, index and query the data.

We looked at a few NoSQL databases, especially column oriented stores like Cassandra and HBase and document stores like MongoDB. Since the details of a single product can be represented as a JSON object, and key-value storage can prove to be quite useful in querying, we settled on MongoDB. We imported our entire data (~ 160,000 product details) to MongoDB, where each product represents a single document.

Color Mapping

We still had one major problem we needed to resolve. Our color extraction algorithm produces the color palette in hexadecimal format. But in order to build a useful query interface, we had to translate the hexcodes to human readable color names. We had two options. Either we could use a CSS 2.0 web color names consisting on 16 basic colors (White, Silver, Gray, Black, Red, Maroon, Yellow, Olive, Lime, Green, Aqua, Teal, Blue, Navy, Fuchsia, Purple) or we could use CSS 3.0 web color names consisting of 140 colors. We used both to map colors and stored those colors along with each image.

Color Hierarchy

We mapped the hexcodes to CSS 3.1 which has every possible shades for the basic colors. Then we assigned a parent basic color for every shades and stored them separately. Also, we created two fields - one for the primary colors and the other one for the extended colors which will help us in indexing and querying. At the end, each product had 24 properties associated with it! MongoDB made it easier to query on the data using the aggregation framework.

What next?

A few things. An advanced version of color extraction (with a number of other exciting features) is being integrated into PriceWeave. We are also working on building a small consumer facing product where users will be able to query and find products based on color and other attributes. There are many other possibilities some of which we will discuss when the time is ripe. Signing off for now!

Apr 11

Implementing DataWeave’s Social API for Social Data Analysis

[The author of this post is Apoorv. Apoorv did his internship at DataWeave during January and February, 2014. Here he shares his experiences with Twitter API, MongoDB, and implementing REST APIs.]

In today’s world, the analysis of any social media stream can reap invaluable information about, well, pretty much everything. If you are a business catering to a large number of consumers, it is a very important tool for understanding and analyzing the market’s perception about you, and how your audience reacts to whatever you present before them.

At DataWeave, we sat down to create a setup that would do this for some e-commerce stores and retail brands. And the first social network we decided to track was the micro-blogging giant, Twitter. Twitter is a great medium for engaging with your audience. It’s also a very efficient marketing channel to reach out to a large number of people.

Data Collection

The very first issue that needs to be tackled is collecting the data itself. Now quite understandably, Twitter protects its data vigorously. However, it does have a pretty solid REST API for data distribution purposes too. The API is simple, nothing too complex, and returns data in the easy to use JSON format. Take a look at the timeline API, for example. That’s quite straightforward and has a lot of detailed information.

The issue with the Twitter API however, is that it is seriously rate limited. Every function can be called in a range of 15-180 times in a 15-minute window. While this is good enough for small projects not needing much data, for any real-world application however, these rate limits can be really frustrating. To avoid this, we used the Streaming API, which creates a long-lived HTTP GET request that continuously streams tweets from the public timeline.

Also, Twitter seems to suddenly return null values in the middle of the stream, which can make the streamer crash if we don’t take care. As for us, we simply threw away all null data before it reached the analysis phase, and as an added precaution, designed a simple e-mail alert for when the streamer crashed.

Data Storage

Next is data storage. Data is traditionally stored in tables, using RDBMS. But for this, we decided to use MongoDB, as a document store seemed quite suitable for our needs. While I didn’t have much clue about MongoDB or what purpose it’s going to serve at first, I realized that is a seriously good alternative to MySQL, PostgreSQL and other relational schema-based data stores for a lot of applications.

Some of its advantages that I very soon found out were: documents-based data model that are very easy to handle analogous to Python dictionaries, and support for expressive queries. I recommend using this for some of your DB projects. You can play about with it here.

Data Processing

Next comes data processing. While data processing in MongoDB is simple, it can also be a hard thing to learn, especially for someone like me, who had no experience anywhere outside SQL. But MongoDB queries are simple to learn once the basics are clear.

For example, in a DB DWSocial with a collection tweets, the syntax for getting all tweets would be something like this in a Python environment:

rt = list(db.tweets.find())
The list type-cast here is necessary, because without it, the output is simply a MongoDB reference, with no value. Now, to find all tweets where user_id is 1234, we have
rt = list(db.retweets.find({ 'user_id': 1234 })

Apart from this, we used regexes to detect specific types of tweets, if they were, for example, “offers”, “discounts”, and “deals”. For this, we used the Python re library, that deals with regexes. Suffice is to say, my reaction to regexes for the first two days was much like

Once again, its just initial stumbles. After some (okay, quite some) help from Thothadri, Murthy and Jyotiska, I finally managed a basic parser that could detect which tweets were offers, discounts and deals. A small code snippet is here for this purpose.

def deal(id):
     re_offers = re.compile(r'''
   x = list(tweets.find({'user_id' : id,'created_at': { '$gte': fourteen_days_ago }}))
   mylist = []
   newlist = []
   for a in x:
       b = re_offers.findall(a.get('text'))
       if b:
       print a.get('id')
       w = list(db.retweets.find( { 'id' : a.get('id') } ))
       if w:
       mydict = {'id' : a.get('id'), 'rt_count' : w[0].get('rt_count'), 'text' : a.get('text'), 'terms' : b}
           mydict = {'id' : a.get('id'), 'rt_count' : 0, 'text' : a.get('text'), 'terms' : b}

This is much less complicated than it seems. And it also brings us to our final step—integrating all our queries into a REST-ful API.

Data Serving

For this, mulitple web-frameworks are available. The ones we did consider were Flask, Django and Bottle.

Weighing the pros and cons of every framework can be tedious. I did find this awesome presentation on slideshare though, that succinctly summarizes each framework. You can go through it here.

We finally settled on Bottle as our choice of framework. The reasons are simple. Bottle is monolithic, i.e., it uses the one-file approach. For small applications, this makes for code that is easier to read and maintainable.

Some sample web address routes are shown here:

#show all tracked accounts
id_legend = {57947109 : 'Flipkart', 183093247: 'HomeShop18', 89443197: 'Myntra', 431336956: 'Jabong'}

def get_ids():
    result = json.dumps(id_legend)
    return result

#show all user mentions for a particular account
def user_mention():
    m =
    ac_id = int(m)
    t = list(tweets.find({'created_at': { '$gte': fourteen_days_ago }, 'retweeted': 'no', 'user_id': { '$ne': ac_id} }))
    a = len(t)
    mylist = []
    for i in t:
        mylist.append({i.get('user_id'): i.get('id')})
    x = { 'num_of_mentions': a, 'mentions_details': mylist }
    result = json.dumps(x)
    return result
This is how the DataWeave Social API came into being. I had a great time doing this, with special credits to Sanket, Mandar and Murthy for all the help that they gave me for this. That’s all for now, folks!

Apr 05

Web Scraping at Scale: Python vs Go

[The author of this post is Jyotiska. Jyotiska is a Data Engineer at DataWeave. He is a Python charmer and claims to have measured the size of big data, with a mere ruler. He posted this post originally on his personal blog. He has reposted it here with minor edits.]

Scaling is a common challenge when you have to deal with a lot of data everyday. This is one thing we face at DataWeave a lot and have managed to tackle it fairly successfully. Often there is no one best solution. You have to keep looking and testing to find out what suits you the best. One thing we do a lot in DW, is crawl and extract data from HTML pages to get meaningful information out of them. And by a lot, I mean in millions. Speed becomes an important factor here. For instance, if your script is taking one tenth of a second to process a single HTML page, it is not good enough as this will turn out to be a huge bottleneck when you are trying to scale your system.

We use Python to take care of all our day to day operations, and things are going smooth. But as I mentioned, there is no best solution, and we need to keep exploring. Thanks to Murthy, I became curious in Go. I knew about Go for a long time, probably since it came out but I never paid much attention to it. Last weekend, I started playing around with Go and liked it immediately. It seemed like a perfect marriage between C and Python (probably C++ too). Then I decided to do a small experiment. I wanted to see how Go performs against Python for scraping webpages at scale. The Regular Expression package of Go is as good as Python, so I didn’t face much problem building a basic parser. The task of the parser is: I will feed it an apparel product webpage from an Indian online shopping website and it will give me the ‘Title’, ‘MRP’ (price), and ‘Fabric’ information of the apparel and the product ‘Thumbnail URL’.The parser function looked something like this:

func parse(filename string) {
    bs, err := ioutil.ReadFile(filename)
    if err != nil {
    data := string(bs)

    var title = regexp.MustCompile(`(?i)(?s)<div class="[^<]*?prd-brand-detail">(.*?)<div class="mid-row mb5 full-width"`)
    var mrp = regexp.MustCompile(`(?i)(?s)\{"simple_price":(.*?),"simple_special_price":(.*?)\}`)
    var thumbnail = regexp.MustCompile(`(?i)(?s)<meta property="og:image" content="(.*?)"`)
    var fabric = regexp.MustCompile(`(?i)(?s)<td>Fabric</td>\s*<td[^>]*?>(.*?)</td>`)

    titleString := HTML(title.FindStringSubmatch(data)[1])
    fabricString := fabric.FindStringSubmatch(data)[1]
    mrpString := mrp.FindStringSubmatch(data)[1]
    thumbnailString := thumbnail.FindStringSubmatch(data)[1]

    fmt.Println(filename, titleString, fabricString, mrpString, thumbnailString)
def parse(filename):
    htmlfile = open(filename).read()

    titleregex = re.compile('''<div class="[^<]*?prd-brand-detail">(.*?)<div class="mid-row mb5 full-width"''', re.S|re.I)
    fabricregex = re.compile('''\{"simple_price":(.*?),"simple_special_price":(.*?)\}''', re.S|re.I)
    mrpregex = re.compile('''<meta property="og:image" content="(.*?)"''', re.S|re.I)
    thumbnailregex = re.compile('''Fabric</td>\s*<td[^>]*?>(.*?)</td>''', re.S|re.I)

    result =, htmlfile)
    title =

    result =, htmlfile)
    fabric =

    result =, htmlfile)
    mrp =

    result =, htmlfile)
    thumbnail =

    print filename, title, fabric, mrp, thumbnail

Nothing fancy, just 4 simple regexes. My testbed is a Mid-2012 MacBook Pro with 2.5 GHz i5 processor and 4 gigs of RAM. Now, I had two options. First, run the script with 100 HTML files sequentially and monitor the running time. Or, I could create 100 goroutines and give each of them a single HTML file and let them handle its own parsing using Go’s built in concurrency feature. I decided to do both. Also, I made a Python script which does a similar thing, to compare its performance against the Go script. Each experiment was performed 3-4 times and the best running time was taken into account. Following is the result I received:

Number of HTML Files Go Concurrent (in sec) Go Sequential (in sec) Python (in sec)
100 0.39 0.40 0.95
200 0.83 0.80 1.93
500 2.22 2.18 4.78
1000 4.68 10.46 14.58
2000 10.62 20.87 30.51
5000 21.47 43 71.55
10000 42 79 157.78
20000 84 174 313

It is obvious that using Go beats Python in all the experiments and more than 2x speed in every cases. What is interesting, using concurrency results in much faster performance than processing the HTML files sequentially. Possibly, deploying 500-1000 goroutines together actually speeds up the execution process as each goroutine can work without coming in each other’s way. However, I was able to deploy maximum 5000 goroutines at once. My machine could not handle more than that. I believe using powerful a CPU will let you process the files faster than what I have done. This begs for more experiments and benchmarking, something that I absolutely love to do! The following is a graphical view of the same experiment: image If the differences are not visible properly, perhaps this one will be helpful: image 

If you are not satisfied yet, keep reading!

I was able to get Murthy intrigued into this. He started playing with the Python code and kept optimizing it. He felt that compiling regexes every time the loop is executed adds overheads and takes unnecessary time. Globally compiled and initialized regexes will be a better option. He tried out two experiments, one with plain CPython and the other with PyPy which has JIT compiler for Python. Following is the result he observed. He tested this on a CentOS server with Python 2.6 running, 8 cores and 16 gigs of RAM.

Test Case (on 1000 HTML files) Execution Time (in secs)
CPython (without function calls, regexes compiled globally) 16.091
CPython (with function calls, regexes compiled globally) 20.483
CPython (with function calls, globally compiled regexes passed as function parameters) 16.898
CPython (with function calls, regexes compiled locally inside function) 17.582
PyPy (with function calls, globally compiled regexes passed as function parameters) 4.510
PyPy (without function calls) 3.567
CPython + re2 (with function calls, globally compiled regexes passed as function parameters) 1.020
CPython + re2 (without function calls) 0.946
Not bad at all! PyPy does it under 4 seconds. Although the CPython result is same as mine. Go provides a runtime parameter GOMAXPROCS which can be passed as a parameter while running the go code. The variable sets and limits the number of OS threads that can execute user-level Go code simultaneously. I believe that by default the limit is set as 1. But to parallelize the process using multiple threads I set it as 4. This is the result I got.

Number of HTML Files Go Concurrent Single Thread Go Concurrent 4 Threads
100 0.397 0.196
200 0.827 0.396
500 2.22 0.988
1000 4.68 1.98
2000 8.27 4.03
So, 1000 HTML files under 2 seconds and 2000 files in around 4 seconds. As you can observe from the results, using multiple threads gives you 2x more gain than the previous result, more than 4x gain over sequential processing result and more than 7x gain from CPython result. However, I was not able to run 5000 goroutines concurrently as my OS did not permit. Still, if you can run your job in batches, the advantages will be good enough. Google if you are reading this, can you invite me over to your Mountain View office so that I can run this on your 16000 core machine?