The problem

Suppose you have tabular data which was retrieved from rational database or CSV file:

The tabular data
list = [
  ['fk1-A','fk2-A', 'key1', 10],
  ['fk1-A','fk2-A', 'key2', 20],
  ['fk1-B','fk2-B', 'key3', 30],
  ['fk1-B','fk2-B', 'key4', 40],
  ['fk1-B','fk2-A', 'key1', 20],
  ['fk1-B','fk2-A', 'key2', 40],
  ['fk1-B','fk2-B', 'key3', 40],
  ['fk1-B','fk2-B', 'key4', 40],
]

You want convert this data to a nested dictionary so you can access the data in efficient way.

The nested dictionary
>>> nd = {
  'fk1-A': { 'fk2-A': {'key1': 10, 'key2': 20},
             'fk2-B': {'key3': 30, 'key4': 40}},
  'fk1-B': { 'fk2-A': {'key1': 20, 'key2': 40},
             'fk2-B': {'key3': 40, 'key4': 40}}
}
>>> print(nd['fk1-A']['fk2-B']['key3'])
30
>>> print(nd['fk1-B']['fk2-A']['key1'])
20

using dict.setdefault

The dict has very useful method for creating nested dictionaries on the fly. dict.setdefault(key,default) will return the value of key if it exists in the dictionary, otherwise it will insert the key into the dictionary with the value default. Thus using this method we can write a straightforward function to convert the data to nested dictionary.

dict.setdefault
def toNested1(data):
  nd = {}
  for a0,a1,a2,a3 in data:
      nd.setdefault(a0, {}).setdefault(a1, {})[a2] = a3
  return nd

However there is one limitation to this approach. Every time we process a row in the data, we create a new 2 dictionary even if the key is already in dictionary. Let’s see how can we fix it:

dict.setdefault improvement
def toNested2(data):
    def addKeyDict(map,key):
        if key not in map:
            item = map[key] = {}
            return item
        return map[key]
    nd = {}
    for a0,a1,a2,a3 in data :
        addKeyDict( addKeyDict( nd, a0) , a1 )[a2] = a3
    return nd

When benchmark those functions we found that the second one is faster by ~10%.

dict.setdefault
if __name__ == '__main__':
    def faster(z1,z2):
        return round(100 * (z1-z2)/z1, 1);
    import pprint
    import timeit
    pp = pprint.PrettyPrinter(indent=2)
    pp.pprint(toNested1(data))
    pp.pprint(toNested2(data))
    z1 = timeit.timeit("toNested1(data*10);", setup="from __main__ import toNested1,data")
    z2 = timeit.timeit("toNested2(data*10);", setup="from __main__ import toNested2,data")
    print(faster(z1,z2))

using collections.defaultdict

The collections module provides specialized container datatypes providing alternatives to Python’s general purpose built-in like dict. One of them is collections.defaultdict.

The behavior of this class is similar to above approaches. When you evaluate the expression dictionary[key] the result will be the value of key if it exists in the dictionary, otherwise the function will call the function stored in default_factory attribute and store the value, insert the key into the dictionary with this value and return it. The default_factory attribute can be initialized in the defaultdict constructor.

Using this method we can implement the following method:

collections.defaultdict
def nested_dict():
    return defaultdict(nested_dict)
def toNested3(data):
    nd = nested_dict()
    for a0,a1,a2,a3 in data:
        nd[a0][a1][a2] = a3
    return nd

When we benchmark toNested2 and this approach, we found that it faster by 55%.

conclusions

Using collections.defaultdict is neat, readable and straightforward way to create nested dictionaries structure. Furthermore, it is faster by 55% of other methods