Story of a Data API
Data is ubiquitous on the web. It is available to anybody who wants to make use of it, and businesses and developers can benefit greatly from it. All of this is a given. The problem is that the data is spread across a large number of sources; it is not clean; it is duplicated; and it keeps getting updated all the time. No single solution exists today that makes this data consumable without jumping through hoops: crawling, cleansing, and then consuming the data source by source.
DataWeave is all about providing easy access to high quality datasets. The primary mode of access to our datasets is through our Data APIs. In this post, we’d like to share the story of a data API: our process from understanding a widespread need—often latent—for a certain kind of data to delivering the data through APIs. In brief, this process involves the following steps:
- identifying/understanding a need (often latent) for data
- identifying the right data sources
- crawling/scraping the relevant information
- data cleaning/deduplication
- developing/publishing the API
- customizing the API based on specific requests
Below the fold, we illustrate the process through one of our popular datasets: the Commodity Prices dataset.
What is the Commodity Prices API
The commodity prices dataset comprises daily arrival and price data for different agricultural commodities, as received from the Agricultural Produce Market Committees (APMCs) of different states in India. Each of these APMCs publishes its information independently. A sample commodity record is represented as a JSON object, with fields such as "state" (e.g., "Andhra Pradesh") alongside the commodity, the market, and the day's arrivals and prices.
The API provides access to the above dataset in multiple ways, such as listing all commodities and finding arrival volumes and prices for a given commodity.
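As a rough illustration, here is how a client might call such an API over HTTP. The base URL, endpoint paths, and parameter names below are hypothetical stand-ins, not the actual API surface.

```python
import requests

BASE_URL = "https://api.example.com/v1/commodities"  # hypothetical base URL
API_KEY = "YOUR_API_KEY"

# List all commodities covered by the dataset (hypothetical endpoint).
resp = requests.get(BASE_URL, params={"api_key": API_KEY})
resp.raise_for_status()
print(resp.json())

# Fetch arrival volume and prices for one commodity, filtered by state
# (hypothetical endpoint and parameters).
resp = requests.get(
    f"{BASE_URL}/tomato/prices",
    params={"api_key": API_KEY, "state": "Andhra Pradesh"},
)
resp.raise_for_status()
for record in resp.json():
    print(record["market"], record["modal_price"])
```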
How we built the Commodity Prices API
Identifying Need for Data
Commodity prices are needed by farmers who want to sell their produce; by wholesale dealers who want to stock up their supply; and by market analysts who want to analyze and predict trends. This data can be an end in itself, or it might be the prerequisite for something more. Whatever the need, the effort expended in getting this data is enormous. We know for a fact that, so far, this has involved dispatching someone to fetch the list of prices from the APMC, or cajoling the personnel there to read them out over the phone at the close of day.
Well, the commodity prices data is published online by the respective state agricultural marketing boards as well as the national agricultural market association! So, all that is required is an automated way of extracting and organizing this data. However, the way in which this data is published varies across these sites: the differences range from the name given to a particular field to the unit in which values are reported. When trying to collate data from these sources, the effort demanded of the developer/journalist/user is quite high. The end might not seem so bright compared to this arduous journey!
Identifying Data Sources
Source identification is one of the biggest challenges we face every day. How do we determine the validity of a source? When will a source publish? How often do we have to poll the source to check for new information? How do we ensure we are not taxing the resources of our sources? Is it possible that something we did in good faith turns out to be inadvertently unethical? These are some of the questions we need to answer. Even if we know which government department is responsible for publishing the data we want, the department itself may be spread across multiple websites, each holding part of the information we need to put together. Or it might turn out that the same data is published on all these sites, but the rate at which it is published varies. Some sources might stop publishing data without any notification while we are still waiting for a fresh set of data points. Identifying sources for a data API is not a one-time effort: it requires us to keep abreast of all the alternative sources that might publish the information.
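One way to poll a source without taxing it is to combine a minimum fetch interval with conditional HTTP requests. The sketch below assumes a hypothetical source URL and a source that honours the Last-Modified header; real sources vary, and the polling interval shown is only an example.

```python
import time
import requests

SOURCE_URL = "https://example.gov.in/daily-prices"  # hypothetical source page
MIN_INTERVAL = 6 * 60 * 60  # poll at most once every six hours


def process(html):
    """Placeholder for the scraping step described in the next section."""
    print(f"fetched {len(html)} bytes of fresh data")


last_modified = None
while True:
    headers = {"If-Modified-Since": last_modified} if last_modified else {}
    resp = requests.get(SOURCE_URL, headers=headers, timeout=30)
    if resp.status_code == 200:
        # The page changed (or the source ignores conditional requests).
        last_modified = resp.headers.get("Last-Modified", last_modified)
        process(resp.text)
    # A 304 response means nothing new was published since the last fetch.
    time.sleep(MIN_INTERVAL)
```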
Crawling and Scraping
It might appear that, once we have identified the data sources and understood (with some pain, we admit!) the various formats and patterns in which they publish data, crawling these sources is simple. Well, not really! A goblin awaits at the turn of the street. One thing about web programming is that there is no single standard for building a website (plain old PHP, .NET, JSP, Ajax-based, and so on). This is a boon as well as a bane. It is a boon for web developers, who can pick whichever framework makes development easier. But it is a bane for people trying to consume this data in a machine-readable form. There is no one-size-fits-all solution: you have to understand how each of these frameworks works. Sometimes websites combine several techniques to make navigation and usage better, and generalizing a solution across these variations is quite difficult.
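As a minimal sketch of the scraping step, the snippet below pulls a price table out of an HTML page with BeautifulSoup. The URL and the table structure are assumptions; in practice every source needs its own variant of this.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example-apmc.gov.in/daily-prices"  # hypothetical APMC page

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

records = []
# Assume the page carries a single <table> whose header row uses <th> cells.
table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
for row in table.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == len(headers):
        records.append(dict(zip(headers, cells)))

print(records[:3])
```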
Once the raw data is available with us, we need to figure out an appropriate schema for storage. Is the data clean enough to be consumed, or is there a data sanitization phase? This determines the additional steps before the data API can be published. The sanitization phase could entail something as trivial as formatting dates properly, or something as involved as breaking a phrase into its constituent terms, identifying metadata for those terms, and normalizing that metadata. The data thus accumulated should be stored efficiently so that it can be served effectively. Some datasets are better suited to a traditional RDBMS, while others fit an alternative storage model. If the data has to be served from an RDBMS, a careful decision has to be made on whether to denormalize the data or store it as is.
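A sanitization pass for this kind of data might look like the sketch below: parsing dates that arrive in several formats and normalizing prices to a common unit. The field names, the date formats tried, and the assumption that some sources quote prices per kg while others quote per quintal are all illustrative.

```python
from datetime import datetime

DATE_FORMATS = ("%d/%m/%Y", "%d-%b-%Y", "%Y-%m-%d")


def parse_date(raw):
    """Try a handful of date formats seen across sources."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None  # flag for manual review rather than guessing


def normalize_price(value, unit):
    """Convert prices to rupees per quintal (1 quintal = 100 kg)."""
    value = float(str(value).replace(",", ""))
    return value * 100 if unit.lower() in ("kg", "per kg") else value


record = {"date": "03-Jun-2015", "modal_price": "11.0", "unit": "kg"}
clean = {
    "date": parse_date(record["date"]),
    "modal_price_per_quintal": normalize_price(record["modal_price"], record["unit"]),
}
print(clean)
```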
As mentioned previously, a data API can be powered by multiple sources, and there may or may not be overlap in the data they publish. How to deduplicate data once we have crawled it is a problem in its own right. Deduplication could mean simply deleting duplicate records from the database, but the way each source publishes data can differ considerably. For example, the price of a certain commodity may be published by multiple sources for the same market/location, depending on whose jurisdiction the market falls under, and they might not represent the same location in the same way, e.g., Bangalore, Bengaluru or Bengalooru. Identifying these as variants of the same word or concept is a non-trivial problem.
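One simple approach to the variant-name problem is to normalize each record into a canonical key before it is written, using an alias map for known spellings and fuzzy matching as a fallback. The alias table, market list, and similarity threshold below are illustrative, not our production logic.

```python
import difflib

# Known spelling variants mapped to a canonical market name (illustrative).
ALIASES = {"bangalore": "bengaluru", "bengalooru": "bengaluru"}
CANONICAL_MARKETS = ["bengaluru", "mysuru", "hubballi"]


def canonical_market(name):
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    if key in CANONICAL_MARKETS:
        return key
    # Fall back to fuzzy matching for unseen variants.
    match = difflib.get_close_matches(key, CANONICAL_MARKETS, n=1, cutoff=0.8)
    return match[0] if match else key


def dedup_key(record):
    """Two records with the same key are treated as duplicates."""
    return (record["date"], canonical_market(record["market"]), record["commodity"].lower())


seen = {}
for rec in [
    {"date": "2015-06-03", "market": "Bangalore", "commodity": "Tomato", "modal_price": 1100},
    {"date": "2015-06-03", "market": "Bengalooru", "commodity": "Tomato", "modal_price": 1100},
]:
    seen.setdefault(dedup_key(rec), rec)

print(list(seen.values()))  # only one of the two records survives
```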
Developing and publishing API
Publishing an API requires us to understand how a developer or user might use it. We cannot always expect users to download the complete dataset and run aggregates on their end; they might not be running their application on a beefy machine with all the requisite software packages. This requires us to look at the various possible views on each dataset and come up with API methods that meet those needs, as in the sketch below.
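To make this concrete, the sketch below exposes one such pre-aggregated view over HTTP using Flask. The route, query parameters, and in-memory data are illustrative stand-ins for the real service and its datastore.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the real datastore described above.
PRICES = [
    {"commodity": "tomato", "state": "Andhra Pradesh", "market": "Guntur",
     "date": "2015-06-03", "modal_price": 1100, "arrival_quantity": 120},
]


@app.route("/v1/commodities/<commodity>/prices")
def prices_by_commodity(commodity):
    """Return price records for one commodity, optionally filtered by state."""
    state = request.args.get("state")
    rows = [r for r in PRICES
            if r["commodity"] == commodity.lower()
            and (state is None or r["state"] == state)]
    return jsonify(rows)


if __name__ == "__main__":
    app.run()
```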
Customizing API for specific requests
As discussed above, there can be a large number of views on each dataset, and it is not always easy to anticipate consumers' usage patterns. There are cases where a particular business application's needs are difficult to cater to with the APIs we already expose. In such cases, our users place requests for specific APIs (pertaining to their access patterns) to be made available on this data, and DataWeave takes up and implements such customization requests on a case-by-case basis.
Please do sign up for an API key and try out our APIs. If you have any suggestions, ideas, or questions, please send them to firstname.lastname@example.org.