Create Nested Dictionaries

The problem

Suppose you have tabular data can be retrieve from rational database or CSV file:

The tabular data
 list = [
     ['fk1-A','fk2-A', 'key1', 10],
     ['fk1-A','fk2-A', 'key2', 20],
     ['fk1-B','fk2-B', 'key3', 30],
     ['fk1-B','fk2-B', 'key4', 40],
     ['fk1-B','fk2-A', 'key1', 20],
     ['fk1-B','fk2-A', 'key2', 40],
     ['fk1-B','fk2-B', 'key3', 40],
     ['fk1-B','fk2-B', 'key4', 40],
 ]

You want convert this data to a nested dictionary so you can access the data in efficient way.

The nested dictionary
 >>> nd = {
     'fk1-A': { 'fk2-A': {'key1': 10, 'key2': 20},
     'fk2-B': {'key3': 30, 'key4': 40}},
     'fk1-B': { 'fk2-A': {'key1': 20, 'key2': 40},
     'fk2-B': {'key3': 40, 'key4': 40}}
 }
 >>> print(nd['fk1-A']['fk2-B']['key3'])
 30
 >>> print(nd['fk1-B']['fk2-A']['key1'])
 20

using dict.setdefault

The dict has very useful method to creating nested dictionaries on the fly. dict.setdefault(key,default) will return the value of key if it exist in the dictionary, otherwise it will insert the key into the dictionary with the value default. Thus using this method we can write a straightforward function to convert the data to nested dictionary.

dict.setdefault
 def toNested1(data):
     nd = {}
     for a0,a1,a2,a3 in data:
         nd.setdefault(a0, {}).setdefault(a1, {})[a2] = a3
     return nd

However there is one limitation to this approach. Every time we process a row in the data, we create a new 2 dictionary even if the is already in dictionary. Let’s see how can we fix it:

dict.setdefault improvement
 def toNested2(data):
     def addKeyDict(map,key):
         if key not in map:
             item = map[key] = {}
             return item
         return map[key]
     nd = {}
     for a0,a1,a2,a3 in data :
         addKeyDict( addKeyDict( nd, a0) , a1 )[a2] = a3
         return nd

When benchmark those function we found that the second one is faster by ~10%.

dict.setdefault benchmark
 if __name__ == '__main__':
     def faster(z1,z2):
         return round(100 * (z1-z2)/z1, 1);
 import pprint
 import timeit
 pp = pprint.PrettyPrinter(indent=2)
 pp.pprint(toNested1(data))
 pp.pprint(toNested2(data))
 z1 = timeit.timeit("toNested1(data*10);", setup="from __main__ import toNested1,data")
 z2 = timeit.timeit("toNested2(data*10);", setup="from __main__ import toNested2,data")
 print(faster(z1,z2))

using collections.defaultdict

The collections module provides specialized container datatypes providing alternatives to Python’s general purpose built-in dict. One of them is collections.defaultdict.

The behavior of this class is similar to above approaches. When you evaluate the expression dictionary[key] the result will be the value of key if it exist in the dictionary, otherwise the function will call the function stored in default_factory attribute, store its return value, and insert the key into the dictionary with its value. The default_factory attribute can be initialized in the defaultdict constructor.

Using this method we can implement the following method:

collections.defaultdict
 def nested_dict():
     return defaultdict(nested_dict)
 def toNested3(data):
     nd = nested_dict()
     for a0,a1,a2,a3 in data:
         nd[a0][a1][a2] = a3
     return nd

When we benchmark toNested2 and this approach, we found that it faster by 55%.

This is more neat, readable and straightforward to create the nested dictionaries structure.

Conclusions

Using collections.defaultdict is neat, readable and straightforward way to nested dictionaries structure. Furthermore, it is faster by 55% of other methods