You’ve probably noticed that Python’s syntax for data structures is very similar to JSON’s syntax.
What’s happening is that Python’s json library encodes Python’s built-in datatypes directly into chunks of text, replacing ' with " and deleting a , here and there (to oversimplify a bit).
On the other hand, pyyaml has to construct a whole representation graph before serialising it into a string.
The same kind of work has to happen in reverse when loading.
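To make the difference concrete, here is a small sketch (assuming PyYAML is installed; the outline helper and the sample data are made up for illustration). It dumps a dict with json in one step, then uses yaml.compose to stop PyYAML after it has built its node graph, and prints that graph.

```python
import json

try:
    import yaml  # PyYAML, assumed to be installed
except ImportError:
    yaml = None  # the YAML half of the demo is skipped

data = {"name": "demo", "ports": [80, 443]}

# json: built-in types are written straight out as text.
json_text = json.dumps(data)
print(json_text)


def outline(node, depth=0):
    """Return one text line per node in a PyYAML representation graph."""
    label = "  " * depth + type(node).__name__ + " " + node.tag
    if isinstance(node, yaml.ScalarNode):
        return [label + " " + repr(node.value)]
    lines = [label]
    if isinstance(node, yaml.MappingNode):
        # A mapping stores its children as (key_node, value_node) pairs.
        children = [n for pair in node.value for n in pair]
    else:
        # Otherwise it is a SequenceNode, whose value is a list of nodes.
        children = node.value
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines


graph_lines = []
if yaml is not None:
    # compose() parses the text and returns the root node of the graph,
    # without going on to construct Python objects from it.
    root = yaml.compose(yaml.safe_dump(data), Loader=yaml.SafeLoader)
    graph_lines = outline(root)
    print("\n".join(graph_lines))
```

Every scalar, list and dict becomes its own node object with a tag before any Python value is built, which is the extra work json never does.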
The only way to speed up yaml.load() would be to write a new Loader, but I doubt it would be a huge leap in performance, unless you’re willing to write your own single-purpose, sort-of-YAML parser, taking the following comment into consideration:
YAML builds a graph because it is a general-purpose serialisation
format that is able to represent multiple references to the same
object. If you know no object is repeated and only basic types appear,
you can use a json serialiser, it will still be valid YAML.
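A rough illustration of that comment, assuming PyYAML is installed (the sample data is made up): the same list is referenced twice, json writes it out twice, while PyYAML keeps track of the shared object. The JSON text can then be read back by a YAML loader.

```python
import json

try:
    import yaml  # PyYAML, assumed to be installed
except ImportError:
    yaml = None  # the YAML half of the demo is skipped

shared = [1, 2, 3]
data = {"first": shared, "second": shared}  # the same list object, twice

# JSON has no notion of references: the list is simply written out twice.
json_text = json.dumps(data)
print(json_text)

if yaml is not None:
    # PyYAML notices the repeated object and emits an anchor (&id001)
    # on the first occurrence and an alias (*id001) on the second.
    yaml_text = yaml.safe_dump(data)
    print(yaml_text)

    # The JSON text is also valid YAML, so a YAML loader reads it back.
    from_json = yaml.safe_load(json_text)
    print(from_json == data)
```

So if your data has only basic types and you don't care about preserving shared references, dumping with json and loading with either library is an option.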
UPDATE
What I said before remains true, but if you’re running Linux there’s a way to speed up YAML parsing. By default, Python’s yaml module uses its pure-Python parser. You have to tell it explicitly that you want to use the PyYAML C parser.
You can do it this way:
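import yaml
from yaml import CLoader as Loader, CDumper as Dumper

# dummy_data is your own object; data.yaml is a placeholder path
with open('data.yaml', 'wb') as fh:
    yaml.dump(dummy_data, fh, encoding='utf-8', default_flow_style=False, Dumper=Dumper)
with open('data.yaml', 'rb') as fh:
    data = yaml.load(fh, Loader=Loader)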
In order to do so, you need the LibYAML development files installed (libyaml-dev on Debian and Ubuntu), for instance with apt-get:
$ apt-get install libyaml-dev
PyYAML itself also has to be built with LibYAML support, but based on your output that’s already the case.
I can’t test it right now because I’m running OS X and brew has some trouble installing the library, but the PyYAML documentation is pretty clear that performance will be much better.
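If you want to measure it yourself, something along these lines should do. It is only a sketch: it assumes PyYAML is installed, the records and the timed_load helper are made up, and it falls back to the pure-Python loader when the C one is missing (in which case both timings will be about the same).

```python
import time

try:
    import yaml  # PyYAML, assumed to be installed
except ImportError:
    yaml = None  # nothing to measure without PyYAML


def timed_load(text, loader):
    """Load YAML text with the given Loader class; return (data, seconds)."""
    start = time.perf_counter()
    data = yaml.load(text, Loader=loader)
    return data, time.perf_counter() - start


# Made-up sample data: 500 small records with basic types only.
records = [{"id": i, "tags": ["a", "b"], "score": i * 0.5} for i in range(500)]

if yaml is not None:
    # CSafeLoader only exists when PyYAML was built against LibYAML;
    # otherwise fall back to the pure-Python SafeLoader.
    fast_loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)

    text = yaml.safe_dump(records)
    slow_data, slow_s = timed_load(text, yaml.SafeLoader)
    fast_data, fast_s = timed_load(text, fast_loader)
    print(f"SafeLoader: {slow_s:.4f}s, {fast_loader.__name__}: {fast_s:.4f}s")
```

I used the safe loader variants here since the data is plain; CLoader works the same way if you need the full loader.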