Posted by Craige McWhirter

On occasion, a RabbitMQ cluster may partition itself. In an OpenStack environment this can often first present itself as nova-compute services stopping with errors such as these:

ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager._sync_power_states: Timed out waiting for a reply to message ID 8fc8ea15c5d445f983fba98664b53d0c
...
TRACE nova.openstack.common.periodic_task self._raise_timeout_exception(msg_id)
TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 218, in _raise_timeout_exception
TRACE nova.openstack.common.periodic_task 'Timed out waiting for a reply to message ID %s' % msg_id)
TRACE nova.openstack.common.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID 8fc8ea15c5d445f983fba98664b53d0c

Merely restarting the stopped nova-compute services will not resolve this issue.
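
That is, on an affected compute node, something like the following (assuming a sysvinit/upstart-managed service, as in the Ubuntu Trusty environment shown here) will not get nova-compute working again while the cluster remains partitioned:

# service nova-compute restart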

You may also find that queries against the RabbitMQ service either do not return at all or take an awfully long time to return:

$ sudo rabbitmqctl -p openstack list_queues name messages consumers status

...and in an environment managed by Juju, you could also see Juju trying to correct RabbitMQ but failing:

$ juju stat --format tabular | grep rabbit
rabbitmq-server                       false local:trusty/rabbitmq-server-128
rabbitmq-server/0           idle   1.25.13.1 0/lxc/12 5672/tcp 192.168.7.148
rabbitmq-server/1   error   idle   1.25.13.1 1/lxc/8  5672/tcp 192.168.7.163   hook failed: "config-changed"
rabbitmq-server/2   error   idle   1.25.13.1 2/lxc/10 5672/tcp 192.168.7.174   hook failed: "config-changed"

You should now run rabbitmqctl cluster_status on each of your RabbitMQ instances and review the output. If the cluster is partitioned, you will see something like the following:

ubuntu@my_juju_lxc:~$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@192-168-7-148' ...
[{nodes,[{disc,['rabbit@192-168-7-148','rabbit@192-168-7-163',
                'rabbit@192-168-7-174']}]},
 {running_nodes,['rabbit@192-168-7-174','rabbit@192-168-7-148']},
 {partitions,[{'rabbit@192-168-7-174',['rabbit@192-168-7-163']},
               {'rabbit@192-168-7-148',['rabbit@192-168-7-163']}]}]
...done.
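
Rather than logging in to each instance individually, you can also gather the same output from all units at once via Juju (a convenience only, assuming the juju client can still reach the unit agents):

$ juju run --service rabbitmq-server "rabbitmqctl cluster_status"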

You can clearly see from the cluster_status output above that RabbitMQ has split into two partitions. We now need to identify which node is considered the leader:

maas-my_cloud:~$ juju run --service rabbitmq-server "is-leader"
- MachineId: 0/lxc/12
  Stderr: |
  Stdout: |
    True
  UnitId: rabbitmq-server/0
- MachineId: 1/lxc/8
  Stderr: |
  Stdout: |
    False
  UnitId: rabbitmq-server/1
- MachineId: 2/lxc/10
  Stderr: |
  Stdout: |
    False
  UnitId: rabbitmq-server/2

As you can see above, in this example machine 0/lxc/12 (unit rabbitmq-server/0) is the leader, indicated by its output of "True". Now we need to log in to the other two servers and shut down RabbitMQ:

# service rabbitmq-server stop
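
If you would rather do this from the Juju client than log in to each container, something like the following should also work (a sketch, assuming juju ssh access to the units and that rabbitmq-server/1 and rabbitmq-server/2 are the non-leaders, as in this example):

$ juju ssh rabbitmq-server/1 "sudo service rabbitmq-server stop"
$ juju ssh rabbitmq-server/2 "sudo service rabbitmq-server stop"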

Once both services have completed shutting down, we can resolve the partitioning by running:

$ juju resolved -r rabbitmq-server/<whichever is leader>

Substitute <whichever is leader> with the unit number of the leader identified earlier.
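In this example the leader is rabbitmq-server/0, so the command would be:

$ juju resolved -r rabbitmq-server/0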

Once that has completed, you can start the previously stopped services with the below on each host:

# service rabbitmq-server start

and verify the result with:

$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@192-168-7-148' ...
[{nodes,[{disc,['rabbit@192-168-7-148','rabbit@192-168-7-163',
                'rabbit@192-168-7-174']}]},
 {running_nodes,['rabbit@192-168-7-163','rabbit@192-168-7-174',
                 'rabbit@192-168-7-148']},
 {partitions,[]}]
...done.

No partitions \o/

The Juju errors for RabbitMQ should clear within a few minutes:

$ juju stat --format tabular | grep rabbit
rabbitmq-server                       false local:trusty/rabbitmq-server-128
rabbitmq-server/0             idle   1.25.13.1 0/lxc/12 5672/tcp 192.168.7.148
rabbitmq-server/1   unknown   idle   1.25.13.1 1/lxc/8  5672/tcp 192.168.7.163
rabbitmq-server/2   unknown   idle   1.25.13.1 2/lxc/10 5672/tcp 192.168.7.174

You should also find the stopped nova-compute services starting up fine.
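
If any of them are still down, restarting them now should succeed, and you can confirm they have checked back in with something like the following (assuming the classic sysvinit service name and the nova client on a control host):

# service nova-compute restart
$ nova service-list | grep nova-compute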