We at DataWeave had a fast-paced 2012, with a large (at times mind-bogglingly large) number of threads running in parallel. As we draw close to the New Year, we want to pause a moment and look back at the year that was.
We launched DataWeave and PriceWeave in February 2012.
The number of data points that we have served as of 28th December is 44,64,310 (about 4.46 million).
This approximates to about 14,000 data points every day on average, or about 10 every minute. We are proud of these stats, but we are not ones to be complacent. We instead want to keep pushing at those numbers constantly, and keep treading deeper waters.
Hoopos, Rediff, Indiatimes Shopping, CrowdAnalytix, Snapdeal, Puma, UrbanJourney, Jabong, OfficeYes, EasyRation, Zigwheels are some of our esteemed customers.
Applications
CostShaker, LowPrice, GaadiWala, Localbazaar.
Comparison Shopping Engines, eCommerce Portals, Brands, Developers, Travel Companies and Media Houses.
While most of our initial focus has been on developing offerings for businesses, we also want to provide an active space for the developer community. There is a clear need for mobile apps powered by high quality datasets, which we aim to fulfill. A couple of Apps have already been powered by us, lending teeth to our capabilities.
LowPrice was the first Android app powered by our commodity prices datasets. It lets you scan a book and find the lowest priced seller. Pranay Airan, who developed it, shares his thoughts here.
GaadiWala was developed by Vamsi Krishna to keep users informed about the car and bike prices across cities in India.
We graced every event where DataWeave’s presence was needed:
HackNight, The Fifth Elephant, Pycon, Droidcon, Techsparks 2012, Nasscom Product Conclave (2012) and GSFIndia.
We were named as one of the 3 best Tech Product Startups in India for 2012 by TechSparks.
The media coverage we have received this year has been heartening and validates many of our own beliefs. We were written about on TechCircle, featured twice on YourStory, and featured once on CNBC.
http://techcircle.vccircle.com/500/tlabs-startup-accelerator-report-card-1-dead-2-up-for-funding-and-4-evolving/
http://yourstory.in/2012/04/do-you-find-data-on-the-web-too-cumbersome-dataweave-in-simplifiesthe-problem/
http://yourstory.in/2012/09/techsparks-2012-unveils-the-top-30-tech-product-startups-from-india/
We added two awesome team members towards the end of the year: Sachin and Abhishek.
We would like to thank everybody at Headstart for their support and organizational competence, and for helping us secure two bright talents.
More people coming on board
A few more exciting things … some of which we’ll only be able to let you in on officially at the beginning of 2013!
Tlabs – for bootstrapping and continuing to support us. A shout out to Abhishek and Pankaj (now at 500 startups)!
Thanks a lot to our advisors: Gautam Sinha and Miten Sampat.
Special thanks to the numerous people who gave us time to meet them and discuss our ideas.
Happy New Year! Let us look forward to an exciting 2013!
[Guest post from Pranay Airan who’s an early adopter of DataWeave. Pranay is enthusiastic about app development. Below he shares his experiences of developing LowPrice, an Android app powered by DataWeave’s pricing API. This is the second and final part of his post.]
In part 1 of this post, I discussed the BookSearch API provided by DataWeave. In this part, I will discuss the Scan & Search functionality and how to integrate the two to complete our Android app.
SCAN:
The ISBN number of a book can be obtained by scanning its barcode directly with the camera on your Android phone. LowPrice by itself is not bundled with this capability. To achieve this functionality, we use a native Android concept called Intents. Intents allow us to delegate required functionality to already existing apps, thus eliminating the need to reinvent the wheel. Google provides an open source barcode scanner app, ZXING. We will now use Intents to delegate the barcode scanning functionality to ZXING.
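// Ask the ZXING Barcode Scanner app to perform the scan.
Intent intent = new Intent("com.google.zxing.client.android.SCAN");
// PRODUCT_MODE reads the UPC/EAN barcodes printed on books; QR_CODE_MODE would accept only QR codes.
intent.putExtra("SCAN_MODE", "PRODUCT_MODE");
startActivityForResult(intent, 0);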
We create an intent to use the barcode scanning functionality of ZXING in the first line. Then we define the mode in which the scanning has to be started. The different modes and their uses are defined on the ZXING website. Once the intent is created, we start it and listen for results from the barcode scanner. Starting an intent launches the corresponding application.
To get data from the barcode scanner:
public void onActivityResult(int requestCode, int resultCode, Intent intent) {
    // 0 is the request code we passed to startActivityForResult.
    if (requestCode == 0) {
        if (resultCode == RESULT_OK) {
            String contents = intent.getStringExtra("SCAN_RESULT"); // barcode text, i.e. the ISBN
            String format = intent.getStringExtra("SCAN_RESULT_FORMAT"); // barcode type
        }
    }
}
Once ZXING detects a valid barcode, the function defined above is invoked. It reads the barcode contents as a string, which is the form the BookSearch API expects. This small piece of code gives our app all the functionality of a barcode scanner.
Integrating both
Now that we have the ISBN number from the Scan module, we can use the searchByIsbn method from DataWeave’s BookSearch API to fetch the price of the book across various vendors in India. This is described in detail in part 1 of this post.
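To make the hand-off concrete, here is a minimal sketch of turning the scanned string into a searchByIsbn request URL. The class and method names (IsbnQuery, isValidIsbn13, buildSearchUrl) are made up for this illustration and are not part of LowPrice or the DataWeave API. The checksum test is the standard ISBN-13 rule, added here as an optional sanity check before calling the API.

```java
// Hypothetical helper: names are illustrative, not from LowPrice or DataWeave.
final class IsbnQuery {
    // Endpoint as shown in part 1 of this post.
    static final String BASE_URL =
            "http://api.dataweave.in/v1/book_search/searchByIsbn/?isbn=";

    // Standard ISBN-13 check: digits are weighted 1,3,1,3,... and the
    // weighted sum of all 13 digits must be a multiple of 10.
    static boolean isValidIsbn13(String isbn) {
        if (isbn == null || !isbn.matches("\\d{13}")) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 13; i++) {
            int digit = isbn.charAt(i) - '0';
            sum += (i % 2 == 0) ? digit : 3 * digit;
        }
        return sum % 10 == 0;
    }

    // Builds the request URL, or returns null if the scan is not a valid ISBN-13.
    static String buildSearchUrl(String scanned) {
        String isbn = scanned == null ? "" : scanned.trim();
        return isValidIsbn13(isbn) ? BASE_URL + isbn : null;
    }
}
```

In onActivityResult, the contents string would be passed to buildSearchUrl, and a null result could prompt the user to scan again.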
This is the story of LowPrice for now. Creating an Android App made easy. Do expect more in the future.
I can be reached at pranay.airan@iiitb.net . Feel free to provide your feedback at contact@dataweave.in .
Download the LowPrice Android app now: https://play.google.com/store/apps/details?id=com.binarybricks.lowprice
[Guest post from Pranay Airan who’s an early adopter of DataWeave. Pranay is enthusiastic about app development. Below he shares his experiences of developing LowPrice, an Android app powered by DataWeave’s pricing API.]
In this two part blog post, I will discuss how I created LowPrice using DataWeave’s APIs and a custom backend. LowPrice (https://play.google.com/store/apps/details?id=com.binarybricks.lowprice) is an Android app that gets you the lowest price for books across 9 online stores in India simply by scanning the book’s barcode. LowPrice also lets you search books by title, author, or publisher, so you don’t even have to go to a store really!
In Part 1, let’s look at the relevant APIs provided by DataWeave, understand how to query data, and how to consume this data through an Android app. In Part 2, we will see how I created the Scan and Search features and how you can achieve the same. Finally, we will wire the code from these two posts to create a complete app.
DataWeave API and JSON
DataWeave provides APIs around public data available on the web. One of their APIs is around providing prices of books across various ecommerce stores in India. LowPrice uses this API http://www.dataweave.in/apis/dataset-Book-Price-Search-By-ISBN—19.html. This API covers a lot of different sources including some of the top ecommerce stores in India like Landmark, Infibeam, and HomeShop18.
To access their APIs, register on http://dataweave.in/ and get an access key. This gives you access to all the data APIs provided by DataWeave.
Let’s look at the API for book prices:
1. Search by ISBN
This feature enables you to query for the price of a book across various stores by its ISBN number. The API call is http://api.dataweave.in/v1/book_search/searchByIsbn/?isbn=9788174369109 . The API returns data as JSON. The response format is as follows:
{
  "source": "Landmark",
  "url": "http://www.landmarkonthenet.com/beyond-the-lines-an-autobiography-by-kuldip-nayar-books-9788174369109-22248864/",
  "listprice": "595",
  "price": "399",
  "thumbnail": "static.landmarkonthenet.com/9788174369109/m800/m800/?dept=books&shot=0",
  "publisher": "Roli Books Pvt Ltd",
  "author": "Kuldip Nayar"
}
This response contains the URL of the book, source, MRP of the book, price at which the book is being sold, publisher, and author. The Scan and Search feature of LowPrice uses this API to get the prices of books. How does the App get the ISBN number? That will be explored in Part 2.
2. Search by Title/Author/Publisher
Not everybody has the ISBN number of the book they want. Most of the time, we know the title, author, or publisher of a book. To enable free text search, the API also provides methods to search by these attributes. LowPrice uses this API to power book search. To search by the name of a book, e.g. Harry Potter, the call is
http://api.dataweave.in/v1/book_search/searchByTitle/?title=harry%20potter . The response from the API is as follows:
{
  "title": "HARRY POTTER AND THE CHAMBER OF SECRETS",
  "isbn": "9781408810552",
  "author": " J. K. Rowling",
  "mrp": " 399",
  "available_price": "263",
  "thumbnail": "http://img8.flixcart.com/image/book/5/5/2/harry-potter-and-the-chamber-of-secrets-100x100-imad8qthkqxzw5q5.jpeg",
  "publisher": "BLOOMSBURY"
}
Now that we have the data in an easy to consume format, let us now focus on building the Android app itself.
How to do a REST call in Android
All of DataWeave’s APIs are REST based. Let us see how we can consume a REST API in Android.
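// URL-encode the query so that titles with spaces or symbols form a valid URL.
String query = URLEncoder.encode(bookname, "UTF-8");
DefaultHttpClient httpClient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://api.dataweave.in/v1/book_search/searchByTitle/?title=" + query);
HttpResponse httpResponse = httpClient.execute(httpGet);
HttpEntity httpEntity = httpResponse.getEntity();
// Read the response body as text so that it can be parsed as JSON.
String jsonString = EntityUtils.toString(httpEntity);
With these few lines of code we are able to fetch the data in our app. Since the API returns the data as JSON, let’s see how to parse it.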
How to parse JSON in Android
The Android SDK lets us parse JSON natively. Alternatively, we could use Jackson or GSON.
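// Parse the response text; the constructor throws JSONException on malformed input.
JSONArray json = new JSONArray(jsonString);
List<JSONObject> mElementList = new ArrayList<JSONObject>();
for (int i = 0; i < json.length(); i++) {
    mElementList.add(json.getJSONObject(i)); // one entry per result
}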
This simple code parses the JSON response that we receive from the REST API.
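Once the response is parsed, picking the best deal is a small comparison loop. The sketch below is only an illustration: it assumes each parsed JSON object has been copied into a simple Offer holder (a hypothetical class, not part of the API) and that prices arrive as strings, sometimes with stray whitespace, as in the sample responses above.

```java
import java.util.List;

// Hypothetical holder for one store's offer; field names mirror the sample response.
final class Offer {
    final String source;
    final String price; // kept as a string, as the API returns it

    Offer(String source, String price) {
        this.source = source;
        this.price = price;
    }
}

final class LowestPrice {
    // Returns the offer with the smallest numeric price, or null if none can be parsed.
    static Offer cheapest(List<Offer> offers) {
        Offer best = null;
        double bestPrice = Double.MAX_VALUE;
        for (Offer offer : offers) {
            double value;
            try {
                value = Double.parseDouble(offer.price.trim());
            } catch (NumberFormatException | NullPointerException e) {
                continue; // skip offers with a missing or malformed price
            }
            if (value < bestPrice) {
                bestPrice = value;
                best = offer;
            }
        }
        return best;
    }
}
```

Skipping unparseable prices rather than failing is a design choice for this sketch; an app might prefer to show such offers separately.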
Summary
In this post we took a brief look at DataWeave’s API, which powers the LowPrice backend, and saw how to consume this API in Android and use the data the way we want. In the next post, I will explain how I built the Scan and Search feature in LowPrice.
Building data apps has never been easier!
Feel free to download and check out the LowPrice Android app from here: https://play.google.com/store/apps/details?id=com.binarybricks.lowprice
Please share your feedback.
Data is ubiquitous on the web. Data is available to anybody who wants to make use of it. Businesses and developers can greatly benefit from data. All the above are given. But the problem is that data is spread across a large number of sources; it is not clean; there exists duplicate data; and it keeps getting updated all the time. No single solution exists today that can make this data consumable without jumping through hoops around crawling, cleansing and then consuming data per source.
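DataWeave is all about providing easy access to high quality datasets. The primary mode of access to our datasets is through our Data APIs. In this post, we’d like to share the story of a data API: our process from understanding a widespread, often latent, need for a certain kind of data to delivering the data through APIs. In brief, this process involves the following steps: identifying the need for data, identifying data sources, crawling and scraping, pre-processing and storing, cleaning and deduplicating, developing and publishing the API, and customizing it for specific requests.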
What is Commodity Prices API
The commodity prices dataset comprises daily arrival and price data for different agricultural commodities, as received from the Agricultural Produce Market Committees (APMCs) of different states in India. Each of these APMCs publishes information independently. The structure of a sample commodity record represented as a JSON object looks like this:
"data": [
{
"date": "2011-12-02",
"commodity": "Ajwan",
"state": "Andhra Pradesh",
"market": "Kurnool",
"Arrivals_Tonnes": "3.9",
"origin": "",
"variety": "Other",
"Minimum_Price": "8590",
"Maximum_Price": "13069",
"Modal_Price": "10800",
"unit": "Rs./Quintal"
}
],
The API provides access to the above dataset in multiple ways, such as listing all commodities and finding volumes and prices by commodity.
How we built Commodity Prices API
Identifying Need for Data
Commodity prices are needed by farmers who want to sell their produce; by wholesale dealers who want to stock up their supply; and by market analysts who want to analyze and predict trends. This data can be an end in itself or it might be the prerequisite for something more. But whatever the need, the amount of effort expended in getting this data is enormous. We know for a fact that so far this has involved dispatching someone to get the list of prices from the APMC, or cajoling the personnel there to call over the phone at the close of a day.
Well, the commodity prices data is published online by the respective state agricultural and marketing boards as well as the national agricultural market association! So, all that is required is an automated way of extracting and organizing this data. However, the way in which this data is published varies across all these sites. The differences range from the name of a particular field to the unit of representation. When trying to collate data from these sources, the amount of effort demanded of the developer/journalist/user is quite high. The end might not seem so bright in comparison to this arduous journey!
Identifying Data Sources
Source identification is one of the biggest challenges we face every day. How do we determine the validity of a source? When will a source publish? How often do we have to poll the source to check for information? How do we ensure we are not taxing the resources of our sources? Is it possible that something we did in good faith inadvertently turns out to be unethical? These are some of the questions that we need to answer. Even if we know which government department is responsible for publishing the data we want, the department itself may be spread across multiple websites, each holding part of the information we need to put together. Or it might turn out that the same data is published on all these sites, but the rate at which the data is published varies. Some sources might stop publishing data without any notification while we are still waiting for a fresh set of data points to be populated. Identifying sources for a data API is not a one time effort. It requires us to keep ourselves abreast of all the alternative sources that might publish information.
Crawling and Scraping
It might appear that once we have identified the data sources and understood (with some pain, we admit!) the various formats and patterns in which they publish data, the process of crawling these sources is very simple. Well, not really! There awaits a goblin at the turn of the street. One thing about web programming is that there is no single standard to write a website in (plain old PHP, .NET, JSP, Ajax-based, and so on). This is a boon as well as a bane. It’s a boon to web developers, who might find it easier to develop a website in a certain framework. But it’s a bane for people trying to consume this data in a machine readable form. There is no one-size-fits-all solution. You have to understand how each of these frameworks functions. Sometimes, websites implement a combination of techniques to make navigation and usage better. Trying to generalize solutions based on these factors is quite difficult.
Pre-processing/storing/post-processing
Once the raw data is available with us, we need to figure out an appropriate schema for storage. Is the data clean enough to be consumed, or is a data sanitization phase needed? This determines the additional steps before the data API can be published. The data sanitization phase could entail something as trivial as formatting the date properly, or something as complex as breaking a phrase into constituent terms, identifying metadata for these terms, and normalizing the metadata. The data thus accumulated should be stored efficiently so that it can be served effectively. Some datasets might be more suited to traditional RDBMS storage, while others might be suited to alternative storage models. If the data has to be served from an RDBMS, a careful decision has to be taken on whether to de-normalize the data or store it as is.
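As an illustration of the "trivial" end of sanitization, here is a small sketch that normalizes dates from several source formats into the yyyy-MM-dd form used in the sample record. The list of input formats and the names DateNormalizer and toIsoDate are assumptions made for this example; real sources would need their own list.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class DateNormalizer {
    // Assumed input formats, tried in order; not taken from any actual source.
    private static final List<DateTimeFormatter> KNOWN_FORMATS = Arrays.asList(
            DateTimeFormatter.ofPattern("uuuu-MM-dd", Locale.ENGLISH),
            DateTimeFormatter.ofPattern("dd/MM/uuuu", Locale.ENGLISH),
            DateTimeFormatter.ofPattern("dd-MMM-uuuu", Locale.ENGLISH));

    // Returns the date as yyyy-MM-dd, or null if no known format matches.
    static String toIsoDate(String raw) {
        if (raw == null) {
            return null;
        }
        String text = raw.trim();
        for (DateTimeFormatter format : KNOWN_FORMATS) {
            try {
                return LocalDate.parse(text, format).toString();
            } catch (DateTimeParseException e) {
                // not this format; try the next one
            }
        }
        return null;
    }
}
```

Note that a day-first pattern such as dd/MM/uuuu is itself an assumption about the source: the same text read month-first would yield a different date, so the format list has to be decided per source.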
Data cleaning/deduplication
As mentioned previously, a data API can be powered by multiple sources. There may or may not be overlap in the data published by these sources. How to deduplicate data once we have crawled it is a separate problem in its own right. The deduplication process could mean just deleting duplicate records from the database. But the way each source publishes data could be very different. For example, the price of a certain commodity may be published by multiple sources for the same market/location, depending on whose jurisdiction the market/location lies in. They might not represent the same location in the same way, e.g., Bangalore, Bengaluru or Bengalooru. Identifying these as variants of the same word or concept is a non-trivial problem.
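The simplest possible approach to the variant problem is a hand-maintained alias table, sketched below. This is only an illustration, not a description of our pipeline: the class PlaceNames, its methods, and the choice of "Bengaluru" as the canonical spelling are all assumptions for the example, and a lookup table only handles variants someone has already seen.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class PlaceNames {
    // Hand-maintained alias table (assumed); keys are lower-cased variants.
    private static final Map<String, String> ALIASES = new HashMap<String, String>();
    static {
        ALIASES.put("bangalore", "Bengaluru");
        ALIASES.put("bengalooru", "Bengaluru");
        ALIASES.put("bengaluru", "Bengaluru");
    }

    // Maps a raw location name to its canonical form; unknown names are
    // returned trimmed but otherwise unchanged.
    static String canonical(String raw) {
        String trimmed = raw.trim();
        String known = ALIASES.get(trimmed.toLowerCase(Locale.ROOT));
        return known != null ? known : trimmed;
    }

    // Key used to spot the same record arriving from two different sources.
    static String dedupKey(String date, String commodity, String market) {
        return date + "|" + commodity.trim().toLowerCase(Locale.ROOT) + "|" + canonical(market);
    }
}
```

Two records whose dedupKey values match can then be treated as candidates for merging. Spellings that are not in the table pass through untouched, which is exactly where the non-trivial part of the problem begins.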
Developing and publishing API
Publishing an API requires us to understand how a developer or user might use the API. We cannot always expect users to download the complete dataset and compute aggregates on their end. Users might not always be running their applications on a beefy computer with all the requisite software packages.
This requires us to look at the various possible views of each dataset and come up with methods that meet these needs.
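To give a flavour of what such a view might look like, the sketch below computes the average modal price per commodity, the kind of aggregate a user would otherwise have to compute after downloading everything. The CommodityRecord and CommodityViews classes are hypothetical and exist only for this example; only the field names follow the sample record shown earlier.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type; fields follow the sample commodity record.
final class CommodityRecord {
    final String commodity;
    final double modalPrice; // in Rs./Quintal

    CommodityRecord(String commodity, double modalPrice) {
        this.commodity = commodity;
        this.modalPrice = modalPrice;
    }
}

final class CommodityViews {
    // Average modal price per commodity, in first-seen order.
    static Map<String, Double> averageModalPrice(List<CommodityRecord> records) {
        // For each commodity, keep a running [sum, count] pair.
        Map<String, double[]> totals = new LinkedHashMap<String, double[]>();
        for (CommodityRecord record : records) {
            double[] total = totals.get(record.commodity);
            if (total == null) {
                total = new double[2];
                totals.put(record.commodity, total);
            }
            total[0] += record.modalPrice;
            total[1] += 1;
        }
        Map<String, Double> averages = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, double[]> entry : totals.entrySet()) {
            averages.put(entry.getKey(), entry.getValue()[0] / entry.getValue()[1]);
        }
        return averages;
    }
}
```

Computing this on the server means the client receives one number per commodity instead of every underlying record.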
Customizing API for specific requests
As discussed above, there might be a large number of views on each dataset. It is not always easy to comprehend or anticipate consumers’ usage patterns. There might be cases where a particular business application’s needs are difficult to cater to with the APIs we have available. In such cases, our users place requests for specific APIs (pertaining to their access patterns) to be made available on this data. DataWeave does take up and implement such customization requests on a case-by-case basis.
Please do sign up for an API key and try out our APIs. If you have any suggestions, ideas, or questions, please send them to contact@dataweave.in.
We talked about our Data APIs at the Startup Saturday held recently in Bangalore, India. You can find the slides here. We presented our worldview of data and the unique challenges in dealing with different “kinds” of data.

The image above shows the two fundamental axes that help us classify data. The horizontal axis signifies temporality, while the vertical axis represents the presence or absence of structure underlying the data. Any “kind” of data that we might think of falls into one of the resulting quadrants.
We classify data into one of these quadrants because the underlying challenges of dealing with data from each quadrant are inherently different. For example, unstructured data requires sophisticated text mining techniques to derive value from it, while mining data based on freshness becomes important when dealing with data that is temporal in nature.
Most of the datasets that we deal with at DataWeave are primarily unstructured and temporal.