Why is protobuf bad for large data structures?

Question

It is only general guidance, so it doesn’t apply to every case. For example, the OpenStreetMap project uses a Protocol Buffers-based file format for its maps, and the files are often 10-100 GB in size. Another example is Google’s own TensorFlow, which stores its graphs as protobuf, and those graphs are often up to 1 GB in size.

However, OpenStreetMap does not store the entire file as a single message. Instead, the file consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message encodes only a small unit, e.g. one node.
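As a rough illustration of the many-small-messages idea, here is a minimal Python sketch that frames each record with a varint length prefix, so a reader can process one record at a time instead of parsing one huge message. The function names write_delimited and read_delimited are made up for this example, the payloads are plain bytes standing in for serialized messages (in real code they would presumably come from something like msg.SerializeToString()), and this is not claimed to be the layout OpenStreetMap actually uses.

```python
import io


def write_delimited(stream, payload: bytes) -> None:
    """Write one record: a varint length prefix followed by the payload."""
    n = len(payload)
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            # More length bytes follow, so set the continuation bit.
            stream.write(bytes([byte | 0x80]))
        else:
            stream.write(bytes([byte]))
            break
    stream.write(payload)


def read_delimited(stream):
    """Yield payloads one at a time until the stream is exhausted."""
    while True:
        length = 0
        shift = 0
        first = True
        while True:
            b = stream.read(1)
            if not b:
                if first:
                    return  # clean end of stream between records
                raise EOFError("truncated length prefix")
            first = False
            length |= (b[0] & 0x7F) << shift
            if not b[0] & 0x80:
                break
            shift += 7
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("truncated record")
        yield payload


# Hypothetical payloads standing in for serialized per-node messages.
records = [b"node-1", b"x" * 300, b""]

buf = io.BytesIO()
for record in records:
    write_delimited(buf, record)

buf.seek(0)
decoded = list(read_delimited(buf))
```

Because read_delimited is a generator, only one record needs to be held in memory at a time, regardless of the total file size.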

The main problem with protobuf for large files is that it doesn’t support random access. A serialized message has no built-in index, so you’ll have to read the whole file even if you only want to access a specific item. If your application will be reading the whole file into memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.

If you need a random-access format that is compatible across many languages, I would suggest HDF5 or SQLite.
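To show what random access looks like in practice, here is a small sketch using Python's built-in sqlite3 module. The table name nodes, its columns, and the helper load_node are assumptions made for this example, and the payloads are placeholder bytes; storing one serialized message per row is just one possible way to combine the two approaches.

```python
import sqlite3

# In-memory database for the example; a real application would pass a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, payload BLOB NOT NULL)"
)

# Placeholder payloads; each could instead be one small serialized message.
conn.executemany(
    "INSERT INTO nodes (id, payload) VALUES (?, ?)",
    [(i, f"node-{i}".encode()) for i in range(1000)],
)
conn.commit()


def load_node(conn, node_id):
    """Fetch a single payload by primary key without scanning the other rows."""
    row = conn.execute(
        "SELECT payload FROM nodes WHERE id = ?", (node_id,)
    ).fetchone()
    return None if row is None else row[0]


payload = load_node(conn, 742)
missing = load_node(conn, 5000)
```

The lookup goes through the primary-key index, so reading one item does not require touching the rest of the data.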
