DataWeave

Apr 22

Extract dominant colors from an image with ColorWeave

[This post was written by Jyotiska with contributions from Sanket. Jyotiska is a Data Engineer at DataWeave with keen interest in Python and experimentation. He works with Sanket in the Products and Strategy team.]


We have taken a special interest in colors in recent times. Some of us can even identify and name a couple of dozen different colors! The genesis for this project was PriceWeave’s Color Analytics offering. With Color Analytics, we provide detailed analysis of colors and other attributes for retailers and brands in the Apparel and Lifestyle products space.

The Idea

The initial idea was simply to extract the dominant colors from an image and generate a color palette. Popular fashion brands update their blogs and Pinterest pages regularly, often featuring their latest offerings for the current season and their newly released products. So we thought that if we crawled these blogs periodically, every few days or weeks, we could plot the trends over time using the extracted colors. Such a timeline is very helpful for any online or offline merchant to visualize the current trend in the market and plan their own product offerings.

We expanded this to include Apparel and Lifestyle products from eCommerce websites like Jabong, Myntra, Flipkart, and Yebhi, and stores of popular brands like Nike, Puma, and Reebok. We also used their Pinterest pages.

Color Extraction

The core of this work was to build a robust color extraction algorithm. We developed a couple of algorithms by extending some well-known techniques. One approach was to use standard unsupervised machine learning: we ran k-means clustering on our image data, where k refers to the number of colors we are trying to extract from the image.
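
As an illustration of this approach (a minimal sketch, not ColorWeave's exact implementation), k-means can be run directly on an image's pixels using PIL and scikit-learn, with the cluster centers becoming the palette. The file name here is a placeholder:

from PIL import Image
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(path, k=5):
    img = Image.open(path).convert("RGB")
    img.thumbnail((200, 200))  # downscale so clustering stays fast
    pixels = np.array(img).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10).fit(pixels)
    # each cluster center is a dominant color; return them as hex codes
    return ["#%02x%02x%02x" % tuple(c) for c in km.cluster_centers_.astype(int)]

print dominant_colors("product.jpg")  # "product.jpg" is a placeholder image path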

In another algorithm, we extracted all the possible color points from the image and used heuristics to come up with a final set of colors as a palette.

Another of our algorithms was built on top of the Python Image Library (PIL) and the Colorific package to extract and produce the color palette from the image.

Regardless of the approach, we soon found that both speed and accuracy were a problem. Our k-means implementation produced decent results, but it took 3-4 seconds to process a single image! That might not seem like much for a small set of images, but the script took 2 days to process 40,000 products from Myntra.

After this, we did a lot of tweaking to our algorithms and came up with the faster and more accurate model we are using currently.

ColorWeave API

We have open sourced an early version of our implementation. It is available on GitHub here. You can also download the Python package from the Python Package Index here. The examples below illustrate its usage.

Retrieve dominant colors from an image URL


from colorweave import palette
print palette(url="image_url")

Retrieve n dominant colors from an image and print them as JSON:

print palette(url="image_url", n=6, output="json")

Print a dictionary with each dominant color mapped to its CSS3 color name

print palette(url="image_url", n=6, format="css3")

Print the list of dominant colors using the k-means clustering algorithm

print palette(url="image_url", n=6, mode="kmeans")

Data Storage

The next challenge was to come up with an ideal data model to store the data that would also let us query it. Initially, all the processed data was indexed in Solr and we used its REST API for all our querying. We soon realized we needed a better data model to store, index, and query the data.

We looked at a few NoSQL databases, especially column-oriented stores like Cassandra and HBase, and document stores like MongoDB. Since the details of a single product can be represented as a JSON object, and key-value style storage proves quite useful when querying, we settled on MongoDB. We imported our entire dataset (~160,000 product details) into MongoDB, where each product is a single document.
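
For illustration, storing one product as a document with pymongo looks roughly like this (the database, collection, and field names here are hypothetical, not our actual schema):

from pymongo import MongoClient

db = MongoClient()["products_db"]        # assumes a local MongoDB instance
db.products.insert_one({
    "product_id": "EXAMPLE-12345",       # hypothetical product
    "title": "Printed Cotton Kurta",
    "brand": "ExampleBrand",
    "palette": ["#1a2b3c", "#d4a017", "#ffffff"],
    "primary_colors": ["Navy", "Yellow", "White"],
})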

Color Mapping

We still had one major problem to resolve. Our color extraction algorithm produces the color palette in hexadecimal format, but in order to build a useful query interface, we had to translate the hex codes into human-readable color names. We had two options: use the CSS 2.0 web color names, consisting of 16 basic colors (White, Silver, Gray, Black, Red, Maroon, Yellow, Olive, Lime, Green, Aqua, Teal, Blue, Navy, Fuchsia, Purple), or use the CSS 3.0 web color names, consisting of about 140 colors. We used both to map colors and stored the resulting names along with each image.
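
One simple way to do such a mapping (my illustration, not ColorWeave's exact logic) is to pick the named color with the smallest Euclidean distance in RGB space, here against the 16 CSS 2.0 basic colors:

CSS2_COLORS = {
    "White": "#ffffff", "Silver": "#c0c0c0", "Gray": "#808080", "Black": "#000000",
    "Red": "#ff0000", "Maroon": "#800000", "Yellow": "#ffff00", "Olive": "#808000",
    "Lime": "#00ff00", "Green": "#008000", "Aqua": "#00ffff", "Teal": "#008080",
    "Blue": "#0000ff", "Navy": "#000080", "Fuchsia": "#ff00ff", "Purple": "#800080",
}

def hex_to_rgb(h):
    h = h.lstrip("#")
    return tuple(int(h[i:i+2], 16) for i in (0, 2, 4))

def nearest_color_name(hexcode):
    r, g, b = hex_to_rgb(hexcode)
    # squared Euclidean distance in RGB space to each named color
    return min(CSS2_COLORS,
               key=lambda name: sum((x - y) ** 2 for x, y
                                    in zip((r, g, b), hex_to_rgb(CSS2_COLORS[name]))))

print nearest_color_name("#1a2b8c")   # -> Navy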

Color Hierarchy

We mapped the hex codes to the CSS 3 color names, which cover many shades of each basic color. We then assigned a parent basic color to every shade and stored the two separately. We also created two fields, one for the primary colors and one for the extended colors, to help with indexing and querying. In the end, each product had 24 properties associated with it! MongoDB made it easy to query this data using the aggregation framework.
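
For example, a pipeline along these lines (a sketch with the same hypothetical database and field names as above) counts products per primary color with the aggregation framework:

from pymongo import MongoClient

db = MongoClient()["products_db"]
pipeline = [
    {"$unwind": "$primary_colors"},                               # one row per (product, color)
    {"$group": {"_id": "$primary_colors", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in db.products.aggregate(pipeline):
    print row["_id"], row["count"]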

What next?

A few things. An advanced version of color extraction (with a number of other exciting features) is being integrated into PriceWeave. We are also working on a small consumer-facing product that will let users query and find products based on color and other attributes. There are many other possibilities, some of which we will discuss when the time is ripe. Signing off for now!

Apr 11

Implementing DataWeave’s Social API for Social Data Analysis

[The author of this post is Apoorv. Apoorv did his internship at DataWeave during January and February, 2014. Here he shares his experiences with Twitter API, MongoDB, and implementing REST APIs.]

In today’s world, analyzing any social media stream can yield invaluable information about, well, pretty much everything. If you are a business catering to a large number of consumers, it is a very important tool for understanding and analyzing the market’s perception of you, and how your audience reacts to whatever you present to them.

At DataWeave, we sat down to create a setup that would do this for some e-commerce stores and retail brands. And the first social network we decided to track was the micro-blogging giant, Twitter. Twitter is a great medium for engaging with your audience. It’s also a very efficient marketing channel to reach out to a large number of people.

Data Collection

The very first issue that needs to be tackled is collecting the data itself. Quite understandably, Twitter protects its data vigorously. However, it also has a pretty solid REST API for data distribution. The API is simple, nothing too complex, and returns data in the easy-to-use JSON format. Take a look at the timeline API, for example: it is quite straightforward and returns a lot of detailed information.

The issue with the Twitter REST API, however, is that it is seriously rate limited. Each endpoint can be called between 15 and 180 times in a 15-minute window, depending on the endpoint. While this is good enough for small projects that don't need much data, for any real-world application these rate limits can be really frustrating. To avoid this, we used the Streaming API, which keeps a long-lived HTTP connection open and continuously streams tweets from the public timeline.
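
The post doesn't name the client library we used, but as an illustrative sketch (tweepy 3.x-style streaming classes, placeholder credentials, hypothetical track keywords, and a local MongoDB), a streamer that stores tweets and skips null payloads might look like this:

import json
import tweepy
from pymongo import MongoClient

tweets = MongoClient()["DWSocial"]["tweets"]       # assumes a local MongoDB instance

class BrandListener(tweepy.StreamListener):        # tweepy 3.x-style listener
    def on_data(self, raw):
        tweet = json.loads(raw)
        if tweet.get("text") is None:              # skip null/keep-alive payloads
            return True
        tweets.insert_one(tweet)                   # store the raw tweet
        return True

    def on_error(self, status_code):
        return status_code != 420                  # disconnect if rate limited

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")      # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
stream = tweepy.Stream(auth=auth, listener=BrandListener())
stream.filter(track=["flipkart", "myntra", "jabong"])              # keywords to follow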

Also, Twitter sometimes returns null values in the middle of the stream, which can crash the streamer if not handled. We simply threw away all null data before it reached the analysis phase and, as an added precaution, set up a simple e-mail alert for whenever the streamer crashed.

Data Storage

Next is data storage. Data is traditionally stored in tables using an RDBMS, but for this project we decided to use MongoDB, as a document store seemed well suited to our needs. While I didn’t have much of a clue about MongoDB or the purpose it would serve at first, I realized that it is a seriously good alternative to MySQL, PostgreSQL, and other relational, schema-based data stores for a lot of applications.

Some of the advantages I quickly discovered: a document-based data model whose records are as easy to handle as Python dictionaries, and support for expressive queries. I recommend trying it for some of your DB projects. You can play around with it here.

Data Processing

Next comes data processing. While data processing in MongoDB is simple, it can also be a hard thing to learn, especially for someone like me who had no experience outside SQL. But MongoDB queries are simple to pick up once the basics are clear.

For example, in a DB DWSocial with a collection tweets, the syntax for getting all tweets would be something like this in a Python environment:

rt = list(db.tweets.find())

The list type-cast here is necessary because, without it, the result is simply a MongoDB cursor reference rather than the documents themselves. Now, to find all tweets where user_id is 1234, we have:

rt = list(db.tweets.find({'user_id': 1234}))
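
For completeness, these snippets assume a connection along these lines (a sketch using pymongo and a local MongoDB instance):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # assumes MongoDB running locally
db = client["DWSocial"]
tweets = db["tweets"]                      # collections referenced in these snippets
retweets = db["retweets"]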

Apart from this, we used regexes to detect specific types of tweets, for example those mentioning “offers”, “discounts”, and “deals”. For this we used Python’s re library. Suffice it to say, my reaction to regexes for the first two days was mostly bewilderment.

Once again, it was just an initial stumble. After some (okay, quite some) help from Thothadri, Murthy and Jyotiska, I finally managed a basic parser that could detect which tweets were offers, discounts, and deals. Here is a small code snippet for this purpose.


def deal(id):
    # pattern for deal/offer/discount keywords, percentages and prices
    re_offers = re.compile(r'''
        \b
        (?:
          deals?
          |
          offers?
          |
          discount
          |
          promotion
          |
          sale
          |
          rs?
          |
          rs\?
          |
          inr\s*([\d\.,])+
          |
          ([\d\.,])+\s*inr
        )
        \b
        |
        \b\d+%
        |
        \$\d+\b
        ''', re.I | re.X)
    # tweets, db, track and fourteen_days_ago are defined globally (see the setup above)
    x = list(tweets.find({'user_id': id, 'created_at': {'$gte': fourteen_days_ago}}))
    mylist = []
    for a in x:
        b = re_offers.findall(a.get('text'))
        if b:
            print a.get('id')
            mylist.append(a.get('id'))
            w = list(db.retweets.find({'id': a.get('id')}))
            if w:
                mydict = {'id': a.get('id'), 'rt_count': w[0].get('rt_count'),
                          'text': a.get('text'), 'terms': b}
            else:
                mydict = {'id': a.get('id'), 'rt_count': 0,
                          'text': a.get('text'), 'terms': b}
            track.insert(mydict)

This is much less complicated than it seems. And it also brings us to our final step: integrating all our queries into a RESTful API.

Data Serving

For this, multiple web frameworks are available. The ones we considered were Flask, Django, and Bottle.

Weighing the pros and cons of every framework can be tedious. I did find an excellent presentation on SlideShare, though, that succinctly summarizes each framework. You can go through it here.

We finally settled on Bottle as our framework. The reasons are simple: Bottle is monolithic, i.e., it uses the one-file approach. For small applications, this makes for code that is easier to read and maintain.

Some sample URL routes are shown here:


from bottle import route, request, run
import json

# show all tracked accounts
id_legend = {57947109: 'Flipkart', 183093247: 'HomeShop18', 89443197: 'Myntra', 431336956: 'Jabong'}

@route('/ids')
def get_ids():
    result = json.dumps(id_legend)
    return result

# show all user mentions for a particular account over the last two weeks
@route('/user_mentions')
def user_mention():
    m = request.query.id
    ac_id = int(m)
    t = list(tweets.find({'created_at': {'$gte': fourteen_days_ago},
                          'retweeted': 'no',
                          'user_id': {'$ne': ac_id}}))
    a = len(t)
    mylist = []
    for i in t:
        mylist.append({i.get('user_id'): i.get('id')})
    x = {'num_of_mentions': a, 'mentions_details': mylist}
    result = json.dumps(x)
    return result

run(host='localhost', port=8080)   # start Bottle's built-in development server

This is how the DataWeave Social API came into being. I had a great time building it, with special credit to Sanket, Mandar, and Murthy for all the help they gave me. That’s all for now, folks!

Apr 05

Web Scraping at Scale: Python vs Go

[The author of this post is Jyotiska. Jyotiska is a Data Engineer at DataWeave. He is a Python charmer and claims to have measured the size of big data, with a mere ruler. He posted this post originally on his personal blog. He has reposted it here with minor edits.]

Scaling is a common challenge when you have to deal with a lot of data every day. It is something we face at DataWeave all the time, and we have managed to tackle it fairly successfully. Often there is no single best solution; you have to keep looking and testing to find what suits you best. One thing we do a lot at DW is crawl HTML pages and extract data from them to get meaningful information, and by a lot, I mean millions of pages. Speed becomes an important factor here. For instance, if your script takes a tenth of a second to process a single HTML page, it is not good enough: that will turn into a huge bottleneck when you try to scale the system.

We use Python for all our day-to-day operations, and things have been going smoothly. But as I mentioned, there is no single best solution, and we need to keep exploring. Thanks to Murthy, I became curious about Go. I had known about Go for a long time, probably since it came out, but had never paid much attention to it. Last weekend I started playing around with Go and liked it immediately. It seemed like a perfect marriage between C and Python (probably C++ too). Then I decided to do a small experiment: I wanted to see how Go performs against Python for scraping webpages at scale. Go's regexp package is as good as Python's, so I didn't face much of a problem building a basic parser. The parser's task: I feed it an apparel product webpage from an Indian online shopping website, and it gives me the 'Title', 'MRP' (price), and 'Fabric' information of the apparel, along with the product 'Thumbnail URL'. The parser function looked something like this:

// Assumes the usual imports ("fmt", "io/ioutil", "regexp") in the surrounding file;
// HTML() is a small helper, defined elsewhere, that cleans up the matched title fragment.
func parse(filename string) {
    bs, err := ioutil.ReadFile(filename)
    if err != nil {
        fmt.Println(err)
        return
    }
    data := string(bs)

    // one regex per field we want to extract
    var title = regexp.MustCompile(`(?i)(?s)<div class="[^<]*?prd-brand-detail">(.*?)<div class="mid-row mb5 full-width"`)
    var mrp = regexp.MustCompile(`(?i)(?s)\{"simple_price":(.*?),"simple_special_price":(.*?)\}`)
    var thumbnail = regexp.MustCompile(`(?i)(?s)<meta property="og:image" content="(.*?)"`)
    var fabric = regexp.MustCompile(`(?i)(?s)<td>Fabric</td>\s*<td[^>]*?>(.*?)</td>`)

    titleString := HTML(title.FindStringSubmatch(data)[1])
    fabricString := fabric.FindStringSubmatch(data)[1]
    mrpString := mrp.FindStringSubmatch(data)[1]
    thumbnailString := thumbnail.FindStringSubmatch(data)[1]

    fmt.Println(filename, titleString, fabricString, mrpString, thumbnailString)
}

The equivalent Python parser:

import re

def parse(filename):
    htmlfile = open(filename).read()

    # same four patterns as the Go version, compiled with DOTALL and IGNORECASE
    titleregex = re.compile('''<div class="[^<]*?prd-brand-detail">(.*?)<div class="mid-row mb5 full-width"''', re.S|re.I)
    mrpregex = re.compile('''\{"simple_price":(.*?),"simple_special_price":(.*?)\}''', re.S|re.I)
    thumbnailregex = re.compile('''<meta property="og:image" content="(.*?)"''', re.S|re.I)
    fabricregex = re.compile('''<td>Fabric</td>\s*<td[^>]*?>(.*?)</td>''', re.S|re.I)

    result = re.search(titleregex, htmlfile)
    title = result.group(1)

    result = re.search(fabricregex, htmlfile)
    fabric = result.group(1)

    result = re.search(mrpregex, htmlfile)
    mrp = result.group(1)

    result = re.search(thumbnailregex, htmlfile)
    thumbnail = result.group(1)

    print filename, title, fabric, mrp, thumbnail

Nothing fancy, just 4 simple regexes. My testbed was a mid-2012 MacBook Pro with a 2.5 GHz i5 processor and 4 GB of RAM. I had two options: run the script on 100 HTML files sequentially and measure the running time, or create 100 goroutines, give each a single HTML file, and let each handle its own parsing using Go's built-in concurrency. I decided to do both. I also wrote a Python script that does the same thing, to compare its performance against the Go version. Each experiment was run 3-4 times and the best running time was taken. The following are the results I got:
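
(For reference, the sequential Python run can be driven by a loop along these lines; the directory name is hypothetical and this timing harness is my sketch, not the exact benchmark script.)

import glob
import time

files = glob.glob("pages/*.html")          # locally saved product pages
start = time.time()
for f in files:
    parse(f)
print "parsed %d files in %.2f seconds" % (len(files), time.time() - start)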

Number of HTML Files | Go Concurrent (sec) | Go Sequential (sec) | Python (sec)
100                  | 0.39                | 0.40                | 0.95
200                  | 0.83                | 0.80                | 1.93
500                  | 2.22                | 2.18                | 4.78
1000                 | 4.68                | 10.46               | 14.58
2000                 | 10.62               | 20.87               | 30.51
5000                 | 21.47               | 43                  | 71.55
10000                | 42                  | 79                  | 157.78
20000                | 84                  | 174                 | 313

Go beats Python in every experiment, with more than a 2x speedup in each case. What is interesting is that concurrency gives much better performance than processing the HTML files sequentially: deploying 500-1000 goroutines together speeds up execution because each goroutine can work without getting in the others' way. However, I was able to deploy at most 5000 goroutines at once; my machine could not handle more than that. I believe a more powerful CPU would let you process the files even faster, which begs for more experiments and benchmarking, something I absolutely love to do! [Charts: running time of Go (concurrent and sequential) vs Python against the number of HTML files.]

If you are not satisfied yet, keep reading!

I managed to get Murthy intrigued by this. He started playing with the Python code and kept optimizing it. He felt that compiling the regexes on every iteration of the loop adds unnecessary overhead, and that globally compiled and initialized regexes would be a better option. He tried two sets of experiments, one with plain CPython and the other with PyPy, which has a JIT compiler. The following are the results he observed, tested on a CentOS server running Python 2.6, with 8 cores and 16 GB of RAM.

Test Case (on 1000 HTML files)                                                                 | Execution Time (sec)
CPython (without function calls, regexes compiled globally)                                    | 16.091
CPython (with function calls, regexes compiled globally)                                       | 20.483
CPython (with function calls, globally compiled regexes passed as function parameters)         | 16.898
CPython (with function calls, regexes compiled locally inside function)                        | 17.582
PyPy (with function calls, globally compiled regexes passed as function parameters)            | 4.510
PyPy (without function calls)                                                                  | 3.567
CPython + re2 (with function calls, globally compiled regexes passed as function parameters)   | 1.020
CPython + re2 (without function calls)                                                         | 0.946

Not bad at all! PyPy does it in under 4 seconds, although the plain CPython result is about the same as mine. Go provides a runtime variable, GOMAXPROCS, which can be set when running Go code; it limits the number of OS threads that can execute user-level Go code simultaneously. I believe the default at the time was 1, so to parallelize the work across multiple threads I set it to 4. This is the result I got:

Number of HTML Files | Go Concurrent, Single Thread (sec) | Go Concurrent, 4 Threads (sec)
100                  | 0.397                              | 0.196
200                  | 0.827                              | 0.396
500                  | 2.22                               | 0.988
1000                 | 4.68                               | 1.98
2000                 | 8.27                               | 4.03

So, 1000 HTML files in under 2 seconds, and 2000 files in around 4 seconds. As you can see from the results, using multiple threads gives roughly a 2x gain over the single-threaded concurrent result, more than a 4x gain over sequential processing, and more than a 7x gain over the CPython result. However, I was still not able to run 5000 goroutines concurrently, as my OS did not permit it. Still, if you can run your job in batches, the advantage is well worth it. Google, if you are reading this, can you invite me over to your Mountain View office so that I can run this on your 16,000-core machine?

Mar 14

India’s Air Quality in numbers and how it affects us all

India has the highest rate of deaths from chronic respiratory diseases per 100,000 people. Yes, you read that right. Worse than Zambia, Uganda, Pakistan, Mozambique, Ethiopia, or Somalia.

A common measure of air pollution is the mean annual concentration of fine suspended particles less than 10 microns in diameter. 10 microns is several times thinner than a human hair. The size of such particles is directly linked to their potential for causing health problems: particles less than 10 micrometers in diameter pose the greatest risk because they can get deep into our lungs, and may even enter the bloodstream.

The world average PM10 (particulate matter up to 10 micrometers in size) concentration hovers around 71 µg/m3, though regional averages range from 21 to 142 µg/m3. Here is an image from an 8-year WHO study of air quality in 1,100 urban areas.


[Image: WHO map of annual mean PM10 levels across 1,100 urban areas]

The contrast in air quality between the developing and the developed world is visibly stark. People living in urban areas in and around India, China, Mexico, and large parts of the Middle East, Africa, and a few South American countries are considerably worse off economically than their counterparts in the US and Europe, who also enjoy cleaner air.

According to WHO numbers from 2008, the number of premature deaths attributable to urban outdoor air pollution is estimated at 1.34 million worldwide. It’s no surprise where those deaths are coming from.

[Map: age-standardised deaths from chronic respiratory diseases per 100,000, by country]

Source:http://gamapserver.who.int/gho/interactive_charts/ncd/mortality/chronic_respiratory_diseases/atlas.html

India tops this list with 175 (male) and 125 (female) deaths per 100,000 (age standardised). Other badly affected regions include large parts of China, Africa, and South East Asia.

The WHO estimates that of the 1.34 million deaths worldwide, 1.09 million could be avoided if its mean annual Air Quality Guideline values of PM10 = 20 µg/m3 and PM2.5 = 10 µg/m3 (for the smaller, 2.5 micrometer particulate matter) were met.

The DPCC (Delhi Pollution Control Committee) publishes near real-time air pollution level data on its site. The clear break-up of the data, along with the prescribed limits, gives us a clear sense of the pollution in the air. It also lets us monitor this data by locality.

Numbers might lack the intimacy and urgency of images, but they help us see things clearly and take measurable action. The availability of data on the DPCC site is a step in the right direction, and it can help focus efforts on improving air quality in India.

Feb 21

10 Free Data Visualization Tools

“Visualizations act as a campfire around which we gather to tell stories.” Al Shalloway (2011)

Data is most effective when visualized. Why suffer the dull monotony of a sheet of numbers when the same data can be visualized in colorful graphs brimming with vitality? Over the past few years a number of useful and free visualization tools have made it easier and easier for us to consume data in meaningful and insightful ways. Here is a list of 10 such free data visualization tools, with examples:

1. D3

D3.js is a JavaScript library for visualizing simple and complex data. D3 helps visualize data using HTML, SVG and CSS. D3’s strength is its ability to utilize the full capabilities of modern browsers without tying itself to a proprietary framework. D3 combines powerful visualization components and a data-driven approach to DOM manipulation. We’re huge fans of D3 at DataWeave and rely on the extensive list of resources on its website: http://d3js.org/


[ A dashboard on PriceWeave made using D3]

2. Tableau Public

Tableau aims to make databases and spreadsheets understandable to ordinary people. It has hundreds of visualization types, such as maps, bar and line charts, lists, heat maps and more. Two important features of Tableau are its built-in mapping which automatically geocodes down to the city and zip code level and its built-in date functionality, which lets readers filter to a time period or drill down from months to days to hours. You can even combine views into a dashboard to show different sides of the same story.

Some features on Tableau Public are reserved for paid users. More about Tableau Public at:  https://www.tableausoftware.com/products/public

3. Many Eyes:

Many Eyes is probably the simplest tool to start using. It combines graphical analysis with community, encouraging users to upload, share and discuss information. It is very well documented and includes valuable suggestions on when to use what kind of visual data representation.

Many Eyes includes more than a dozen output options — from charts, graphics and word clouds to treemaps, plots, network diagrams and some limited geographic maps. The only seeming drawback of Many Eyes is that both visualizations and data sets are public on the Many Eyes site and can be easily downloaded, shared, reposted and commented upon by others. Access Many Eyes here

 


4. Processing:

Processing is a standout tool for interactive visualizations. It lets you write much simpler code which is, in turn, compiled into Java. There is also a Processing.js project that makes it easier for websites to use Processing without Java applets, plus a port to Objective-C so you can use it on iOS. It is a desktop application and runs on all platforms.

Ex: http://www.openprocessing.org/sketch/77118

5. Google Chart Tools:

Google Chart Tools offers an API for creating web graphics from data. Google offers a comparison of data size, page load, skills needed, and other factors to help you decide which option to use. The visualization API includes various types of charts, maps, tables and other options, and can pull data in from a Google spreadsheet.

6. Weave

Weave is a web-based visualization platform designed to enable visualization of any available data by anyone for any purpose. It is an application development platform supporting multiple levels of users – novice to advanced – as well as the ability to integrate, analyze and visualize data at “nested” levels of geography, and to disseminate the results in a web page. Learn more about Weave at http://oicweave.org/

Here is a visualization on trees in Boston made using Weave: http://demo.oicweave.org/weave.html?file=demo_Boston_Trees_meters_2011.weave

7. Datawrapper:

DataWrapper is an open source tool that covers the entire cycle of cleaning, visualizing and publishing data.  Visualizations using Datawrapper have appeared in high profile publications such as Washington Post, Der Spiegel and The Guardian. You can find more details on the DataWrapper site: http://datawrapper.de/

8. Chartbuilder

Chartbuilder began as an in-house project of the design-loving team at Quartz, who were frustrated with the substandard charts they found online. Chartbuilder works best when you have to turn out beautiful and aesthetically pleasing graphs in a short time. It is open source and the full code can be found here: https://github.com/Quartz/Chartbuilder/


9. Jolicharts

Jolicharts helps users build and share dashboards and export data. You can upload your data in JSON, CSV or XLS to Jolicharts and your data will be automatically prepared for analysis. It is free to use for up to 50 MB of calculation power and 5 datasources per dashboard. Learn more about it here: https://jolicharts.com/


10. Plotly

Plotly is a collaborative data analysis and graphing tool. It provides online graphing, analytics, a Python command line, and stats tools for individuals and collaboration, as well as scientific graphing libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST. Plotly claims to streamline one’s workflow, all in one place and comes very close to doing it. Take a look at the plotly gallery.
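
As a quick taste of the Python side (a minimal sketch using the current plotly library, which now also supports rendering to a local file; the data here is made up), a chart can be built and saved to a standalone HTML page like this:

import plotly.graph_objects as go

fig = go.Figure(data=go.Scatter(x=[1, 2, 3, 4], y=[10, 15, 13, 17], mode="lines+markers"))
fig.update_layout(title="A minimal Plotly example")
fig.write_html("plot.html")   # open in a browser for the interactive chart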


Source: Plotly Visual

What visualization tools do you use on your projects? How has your experience been with them? Let us know in the comments below.