As I understand it, this does not actually guarantee that kyro serialization is used; if a serializer is not available, kryo will fall back to Java serialization.
No. If you set spark.serializer
to org.apache.spark.serializer.
then Spark will use Kryo. If Kryo is not available, you will get an error. There is no fallback.
KryoSerializer
So what is this Kryo registration then?
When Kryo serializes an instance of an unregistered class it has to output the fully qualified class name. That’s a lot of characters. Instead, if a class has been pre-registered, Kryo can just output a numeric reference to this class, which is just 1-2 bytes.
This is especially crucial when each row of an RDD is serialized with Kryo. You don’t want to include the same class name for each of a billion rows. So you pre-register these classes. But it’s easy to forget to register a new class and then you’re wasting bytes again. The solution is to require every class to be registered:
conf.set("spark.kryo.registrationRequired", "true")
Now Kryo will never output full class names. If it encounters an unregistered class, that’s a runtime error.
Unfortunately it’s hard to enumerate all the classes that you are going to be serializing in advance. The idea is that Spark registers the Spark-specific classes, and you register everything else. You have an RDD[(X, Y, Z)]
? You have to register classOf[scala.Tuple3[_, _, _]]
.
The list of classes that Spark registers actually includes CompactBuffer
, so if you see an error for that, you’re doing something wrong. You are bypassing the Spark registration procedure. You have to use either spark.kryo.classesToRegister
or spark.kryo.registrator
to register your classes. (See the config options. If you use GraphX, your registrator should call GraphXUtils. registerKryoClasses.)