Document Databases: Redundant data, references, etc. (MongoDB specifically)

Question

There are basically two scenario’s: fresh and stale.

Fresh data

Storing duplicate data is easy. Maintaining the duplicate data is the hard part. So the easiest thing to do is to avoid maintenance, by simply not storing any duplicate data to begin with. This is mainly useful if you need fresh data. Only store the references, and query the collections when you need to retrieve information.

In this scenario, you’ll have some overhead due to the extra queries. The alternative is to track all locations of duplicate data, and update all instances on each update. This also involves overhead, especially in N-to-M relations like the one you mentioned. So either way, you will have some overhead, if you require fresh data. You can’t have the best of both worlds.

Stale data

If you can afford to have stale data, things get a lot easier. To avoid query overhead, you can store duplicate data. To avoid having to maintain duplicate data, you’re not going to store duplicate data. At least not actively.

In this scenario you’ll also want to store only the references between documents. Then use a periodic map-reduce job to generate the duplicate data. You can then query the single map-reduce result, rather than separate collections. This way you avoid the query overhead, but you also don’t have to hunt down data changes.

Summary

Only store references to other documents. If you can afford stale data, use periodic map-reduce jobs to generate duplicate data. Avoid maintaining duplicate data; it’s complex and error-prone.

Fresh data

Stale data

Summary

Leave a Comment Cancel reply