On occasion, a RabbitMQ cluster may partition itself. In an OpenStack environment this can often first present itself as nova-compute services stopping with errors such as these:
ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager._sync_power_states: Timed out waiting for a reply to message ID 8fc8ea15c5d445f983fba98664b53d0c
...
TRACE nova.openstack.common.periodic_task self._raise_timeout_exception(msg_id)
TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 218, in _raise_timeout_exception
TRACE nova.openstack.common.periodic_task 'Timed out waiting for a reply to message ID %s' % msg_id)
TRACE nova.openstack.common.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID 8fc8ea15c5d445f983fba98664b53d0c
Merely restarting the stopped nova-compute services will not resolve this issue.
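For example, on an Ubuntu compute node the restart would typically look like the following (the exact service name may vary by distribution and release), but on its own it will not clear the messaging timeouts while the cluster remains partitioned:
# service nova-compute restart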
You may also find that querying the RabbitMQ service either does not return at all or takes a very long time to return:
$ sudo rabbitmqctl -p openstack list_queues name messages consumers status
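If the query hangs, one option (a sketch that assumes GNU coreutils' timeout is available, as it is on Ubuntu) is to bound how long you are prepared to wait:
$ sudo timeout 30 rabbitmqctl -p openstack list_queues name messages consumers status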
In an environment managed by Juju, you may also see Juju trying to correct RabbitMQ but failing:
$ juju stat --format tabular | grep rabbit
rabbitmq-server false local:trusty/rabbitmq-server-128
rabbitmq-server/0 idle 1.25.13.1 0/lxc/12 5672/tcp 192.168.7.148
rabbitmq-server/1 error idle 1.25.13.1 1/lxc/8 5672/tcp 192.168.7.163 hook failed: "config-changed"
rabbitmq-server/2 error idle 1.25.13.1 2/lxc/10 5672/tcp 192.168.7.174 hook failed: "config-changed"
You should now run rabbitmqctl cluster_status on each of your RabbitMQ instances and review the output. If the cluster is partitioned, you will see something like the following:
ubuntu@my_juju_lxc:~$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@192-168-7-148' ...
[{nodes,[{disc,['rabbit@192-168-7-148','rabbit@192-168-7-163',
'rabbit@192-168-7-174']}]},
{running_nodes,['rabbit@192-168-7-174','rabbit@192-168-7-148']},
{partitions,[{'rabbit@192-168-7-174',['rabbit@192-168-7-163']},
{'rabbit@192-168-7-148',['rabbit@192-168-7-163']}]}]
...done.
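In a Juju-managed environment, one convenient way to gather this from every node at once is via juju run (a sketch; it assumes juju run executes with sufficient privileges on the units to invoke rabbitmqctl, which it normally does):
$ juju run --service rabbitmq-server "rabbitmqctl cluster_status"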
You can clearly see from the cluster_status output above that the RabbitMQ cluster has split into two partitions. We now need to identify which of the rabbitmq-server units is considered the leader:
maas-my_cloud:~$ juju run --service rabbitmq-server "is-leader"
- MachineId: 0/lxc/12
Stderr: |
Stdout: |
True
UnitId: rabbitmq-server/0
- MachineId: 1/lxc/8
Stderr: |
Stdout: |
False
UnitId: rabbitmq-server/1
- MachineId: 2/lxc/10
Stderr: |
Stdout: |
False
UnitId: rabbitmq-server/2
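Since the YAML output places the boolean immediately before the UnitId, a quick (if crude) way to pick out the leader is something like:
$ juju run --service rabbitmq-server "is-leader" | grep -A1 'True'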
As you can see above, in this example the unit on machine 0/lxc/12 (rabbitmq-server/0) is the leader, as indicated by its "True" response. Now we need to log in to the other two servers and shut down RabbitMQ:
# service rabbitmq-server stop
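Before moving on, it is worth confirming on each of those two hosts that the broker really is down, for example:
# service rabbitmq-server status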
Once both services have stopped, we can resolve the partitioning by running:
$ juju resolved -r rabbitmq-server/<whichever is leader>
Substitute <whichever is leader> with the unit number of the leader identified earlier.
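In this walkthrough the leader was rabbitmq-server/0, so the command would be:
$ juju resolved -r rabbitmq-server/0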
Once that has completed, you can start the previously stopped services by running the following on each host:
# service rabbitmq-server start
and verify the result with:
$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@192-168-7-148' ...
[{nodes,[{disc,['rabbit@192-168-7-148','rabbit@192-168-7-163',
'rabbit@192-168-7-174']}]},
{running_nodes,['rabbit@192-168-7-163','rabbit@192-168-7-174',
'rabbit@192-168-7-148']},
{partitions,[]}]
...done.
No partitions \o/
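The queue listing from earlier should also return promptly now:
$ sudo rabbitmqctl -p openstack list_queues name messages consumers status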
The Juju errors for RabbitMQ should clear within a few minutes:
$ juju stat --format tabular | grep rabbit
rabbitmq-server false local:trusty/rabbitmq-server-128
rabbitmq-server/0 idle 1.25.13.1 0/lxc/12 5672/tcp 192.168.7.148
rabbitmq-server/1 unknown idle 1.25.13.1 1/lxc/8 5672/tcp 192.168.7.163
rabbitmq-server/2 unknown idle 1.25.13.1 2/lxc/10 5672/tcp 192.168.7.174
You should also find that the nova-compute services now start up fine.
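If the nova CLI and admin credentials are available (an assumption; the exact tooling depends on your release), the recovered compute services can be confirmed with something like:
$ nova service-list | grep nova-compute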