Here @Easter-eggs[1], like others, we start playing with the awesome CEPH[2] distributed object storage. Our current use of it is the hosting of virtual machines disks.

Our first cluster was just installed this week on tuesday. Some non production virtual machines where installed on it and the whole cluster added to our monitoring systems.


On thursday evening, one of the cluster nodes went down due to cpu overhead (to be investigated, looks like a fan problem).

Monitoring systems send us alerts as usual, and we discovered that CEPH just did the job :) :

  • the server lost was detected by other nodes
  • CEPH started to replicate pgs between other nodes to maintain our replication level (this introduced a bit of load on virtual machines during the sync)
  • virtual machines that were running on the dead node were not alive anymore, but we just add to manually start them on another node (pacemaker is going to be setup on this cluster to manage this automagically)

On friday morning, we repaired the dead server, and boot it again:

  • the server automatically joined the CEPH cluster again
  • osd on this server were added automatically in the cluster
  • replication started to get an optimal replication state

Incident closed!

What to say else?

  • thanks to CEPH and the principle of server redundancy to let us sleep in our home instead of working a night in the datacenter
  • thanks to CEPH for being so magical
  • let's start the next step: configure pacemaker for automatic virtual machines failover on cluster nodes


[1] http://www.easter-eggs.com/

[2] http://ceph.com/