Without question, the most widely used file format at DataWeave is JSON. We use JSON everywhere: managing product lists, storing data extracted from crawl dumps, generating assortment reports, and almost everywhere else. So it is very important to make sure that the libraries and packages we use are fast enough to handle large data sets. There are two popular packages for handling JSON: the stock json package that ships with the default Python installation, and simplejson, an optimized and actively maintained JSON package for Python. The goal of this blog post is to introduce ultrajson, or Ultra JSON, a JSON library written mostly in C and built to be extremely fast.
We benchmarked three common operations: load, loads and dumps. We have a dictionary with three keys: id, name and address. We dump this dictionary using json.dumps() and store it in a file. Then we use json.load() separately to load the dictionaries from the file. We ran this experiment on 10,000, 50,000, 100,000, 200,000 and 1,000,000 dictionaries and measured how long each library takes to perform each operation.
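The setup described above can be sketched roughly as follows. This is a minimal illustration and not the original benchmark script: the helper names (make_records, timed, write_records), the field values, and the one-dictionary-per-line file layout are all assumptions made for the example.

```python
import json
import os
import tempfile
import time


def make_records(n):
    """Build n dictionaries with the three keys used in the benchmark."""
    # The values are invented for illustration; only the key names come from the post.
    return [
        {"id": i, "name": f"name-{i}", "address": f"{i} Example Street"}
        for i in range(n)
    ]


def timed(func, *args):
    """Call func(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def write_records(records, path):
    """Dump each dictionary with json.dumps() and store one per line in a file."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    # Only the two smallest sizes from the post, to keep the sketch quick to run.
    for size in (10_000, 50_000):
        path = os.path.join(tmp, f"records_{size}.jsonl")
        _, elapsed = timed(write_records, make_records(size), path)
        print(f"{size} records written in {elapsed:.3f}s")
```

A single timed run like this is noisy; repeating each measurement and keeping the best time gives steadier numbers.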
Here are the results for the json.dumps() operation. We dumped the content dictionary by dictionary.
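A dictionary-by-dictionary comparison could look something like the sketch below. It assumes simplejson and ultrajson are importable under the module names simplejson and ujson, and it skips any library that is not installed; the helper names and sample values are hypothetical.

```python
import importlib
import json
import time


def available_encoders():
    """Map library name -> dumps function for every library that is installed."""
    encoders = {"json": json.dumps}
    for name in ("simplejson", "ujson"):  # third-party; skipped when missing
        try:
            encoders[name] = importlib.import_module(name).dumps
        except ImportError:
            pass
    return encoders


def time_dumps_one_by_one(dumps, records):
    """Return the seconds spent calling dumps() once per dictionary."""
    start = time.perf_counter()
    for record in records:
        dumps(record)
    return time.perf_counter() - start


# Sample data with the three keys from the post; the values are made up.
records = [
    {"id": i, "name": f"name-{i}", "address": f"{i} Example Street"}
    for i in range(10_000)
]

for name, dumps in available_encoders().items():
    print(f"{name:<10} {time_dumps_one_by_one(dumps, records):.4f}s")
```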
We notice that json performs better than simplejson, but ultrajson wins the game with almost a 400% speedup over stock json.
dumps operation (all dictionaries at once)
In this experiment, we stored all the dictionaries in a list and dumped the list using json.dumps().
simplejson is almost as good as stock json, but again ultrajson outperforms both with a 150% speedup. Now let's see how they perform for the load operation.
Here we run the load operation on a list of dictionaries and compare the results.
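For the list-based experiments, the round trip amounts to one dumps call on the whole list followed by one load call on the file. Here is a minimal sketch using the stock json module; the function names and the temporary file are assumptions made for the example.

```python
import json
import os
import tempfile


def dump_list_to_file(records, path):
    """Serialize the whole list with a single json.dumps() call and write it out."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(json.dumps(records))


def load_list_from_file(path):
    """Parse the whole file with a single json.load() call."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Sample data with the three keys from the post; the values are made up.
records = [
    {"id": i, "name": f"name-{i}", "address": f"{i} Example Street"}
    for i in range(1_000)
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.json")
    dump_list_to_file(records, path)
    loaded = load_list_from_file(path)

print(loaded == records)  # prints: True
```

To compare libraries, the same two functions can be pointed at another module's dumps and load in place of the json ones.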
Surprisingly, simplejson beats the other two, with ultrajson close behind. Here, we observe that simplejson is almost 400% faster than stock json, and so is ultrajson.
loads operation on dictionaries
In this experiment, we load dictionaries from the file one by one and pass them to the json.loads() function.
Again ultrajson steals the show, being almost 600% faster than stock json and 400% faster than simplejson.
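The one-by-one loads experiment can be sketched as below, assuming the file holds one JSON dictionary per line. The function name loads_line_by_line and the in-memory sample are hypothetical; the loads parameter is there so a different library's loads function can be passed in.

```python
import io
import json


def loads_line_by_line(lines, loads=json.loads):
    """Parse one JSON document per line and return the resulting dictionaries."""
    return [loads(line) for line in lines if line.strip()]


# An in-memory stand-in for the benchmark file, one dictionary per line.
sample = io.StringIO(
    '{"id": 1, "name": "a", "address": "x"}\n'
    '{"id": 2, "name": "b", "address": "y"}\n'
)

parsed = loads_line_by_line(sample)
print(parsed[1]["name"])  # prints: b
```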
Those are all the benchmarks we have. The verdict is pretty clear. Use simplejson instead of stock json in any case, since simplejson is a well-maintained package. If you really want something extremely fast, go for ultrajson. In that case, keep in mind that ultrajson only works with well-defined collections and will not work for un-serializable ones. But if you are dealing with text, this should not be a problem.
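One way to sidestep the un-serializable case is to convert the data to plain JSON types before handing it to whichever library you pick. The sketch below shows the failure and the workaround with the stock json module only. The helper to_serializable and the sample record are hypothetical, and how ultrajson reacts to such values may differ between versions.

```python
import json
from datetime import date, datetime


def to_serializable(value):
    """Recursively convert common non-JSON types into JSON-friendly ones."""
    if isinstance(value, dict):
        return {str(k): to_serializable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        return [to_serializable(v) for v in value]
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    return value


# A made-up record containing a set and a date, neither of which is a JSON type.
record = {"id": 7, "tags": {"sale"}, "crawled_on": date(2014, 1, 15)}

try:
    json.dumps(record)
except TypeError as exc:
    print("not serializable as-is:", exc)

print(json.dumps(to_serializable(record)))
# prints: {"id": 7, "tags": ["sale"], "crawled_on": "2014-01-15"}
```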