Why do we need ZooKeeper in the Hadoop stack?

Hadoop 1.x does not use ZooKeeper. HBase does use ZooKeeper, even in Hadoop 1.x installations.

Hadoop itself started using ZooKeeper in version 2.0, for example for automatic NameNode failover in HDFS high availability.

The purpose of ZooKeeper is cluster management, or more precisely coordination between the nodes of a cluster. This fits the general *nix philosophy of building on small, specialized components: Hadoop components that need clustering capabilities rely on ZooKeeper for that rather than developing their own.

ZooKeeper is a distributed, replicated store for small pieces of coordination data that provides the following guarantees (copied from the ZooKeeper overview page):

  • Sequential Consistency – Updates from a client will be applied in the order that they were sent.
  • Atomicity – Updates either succeed or fail. No partial results.
  • Single System Image – A client will see the same view of the service regardless of the server that it connects to.
  • Reliability – Once an update has been applied, it will persist from that time forward until a client overwrites the update.
  • Timeliness – The client's view of the system is guaranteed to be up-to-date within a certain time bound.

You can build on these guarantees to implement the “recipes” that cluster management requires, such as locks and leader election.
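
To make the leader election recipe more concrete, here is a minimal in-memory sketch of the idea behind it. It does not talk to ZooKeeper at all: ToyCoordinator, create_ephemeral_sequential, close_session and current_leader are hypothetical names made up for this illustration, and the path "/election/n_" is just an example. The assumption being modelled is the usual recipe: every participant creates an ephemeral sequential node, the owner of the lowest sequence number is the leader, and when the leader's session ends its node disappears so the next lowest takes over.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a ZooKeeper ensemble (NOT the real client API)."""

    def __init__(self):
        # Sequence numbers only ever increase, like ZooKeeper's sequential nodes.
        self._counter = itertools.count()
        # Maps node name -> session (participant) that created it.
        self._nodes = {}

    def create_ephemeral_sequential(self, prefix, session):
        # Append a zero-padded, monotonically increasing suffix to the prefix.
        name = f"{prefix}{next(self._counter):010d}"
        self._nodes[name] = session
        return name

    def close_session(self, session):
        # Ephemeral nodes vanish together with the session that created them.
        self._nodes = {n: s for n, s in self._nodes.items() if s != session}

    def children(self):
        # Zero-padded suffixes make lexicographic order match creation order.
        return sorted(self._nodes)

    def owner(self, name):
        return self._nodes[name]


def current_leader(coordinator):
    """Return the participant holding the lowest-numbered node, or None."""
    children = coordinator.children()
    if not children:
        return None
    return coordinator.owner(children[0])


zk = ToyCoordinator()
for worker in ("worker-a", "worker-b", "worker-c"):
    zk.create_ephemeral_sequential("/election/n_", worker)

leader_before = current_leader(zk)   # first node created wins
zk.close_session("worker-a")         # simulate the leader crashing
leader_after = current_leader(zk)    # next-lowest node takes over

print(leader_before, "->", leader_after)
```

In a real deployment the participants would also watch the node just ahead of their own, so they are notified when it disappears instead of polling. The sketch leaves that out.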

If you’re going to use ZooKeeper yourself, I recommend taking a look at Curator, originally from Netflix and now an Apache project, which makes ZooKeeper easier to use (e.g. it implements several recipes out of the box).
