Linux: Troubleshooting Red Hat Cluster Suite

Ref: http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Configuration_Example_-_NFS_Over_GFS/NFS_GFS_Troubleshoot.html

If you find that you are seeing error messages when you try to configure your system, or if after configuration your system does not behave as expected, you can perform the following checks and examine the following areas.

* Connect to one of the nodes in the cluster and execute the clustat(8) command. This utility displays the status of the cluster: membership information, the quorum view, and the state of all configured user services. The following example shows the output of the clustat(8) command.

      [root@clusternode4 ~]# clustat
      Cluster Status for nfsclust @ Wed Dec  3 12:37:22 2008
      Member Status: Quorate
 
       Member Name                              ID   Status
       ------ ----                              ---- ------
       clusternode5.example.com          1 Online, rgmanager
       clusternode4.example.com          2 Online, Local, rgmanager
       clusternode3.example.com          3 Online, rgmanager
       clusternode2.example.com          4 Online, rgmanager
       clusternode1.example.com          5 Online, rgmanager
 
       Service Name             Owner (Last)                     State
       ------- ---              ----- ------                     -----
       service:nfssvc           clusternode2.example.com         starting

In this example, clusternode4 is the local node since it is the host from which the command was run. If rgmanager did not appear in the Status category, it could indicate that cluster services are not running on the node.
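The two checks described above (quorum and the presence of rgmanager in the Status column) can be scripted against captured clustat output. This is only a sketch; it assumes the output format shown in the example above:

```shell
# Sketch: sanity-check captured clustat output.  On a live node you
# would run `clustat > /tmp/clustat.out` first; taking a file argument
# lets the parsing be demonstrated standalone.
check_clustat() {
    # $1 = file containing clustat output
    grep -q 'Member Status: Quorate' "$1" || { echo "cluster is not quorate"; return 1; }
    # A member that is Online but lacks rgmanager likely has cluster
    # services stopped on that node.
    awk '/Online/ && !/rgmanager/ { print $1 " is online without rgmanager"; bad = 1 }
         END { exit bad }' "$1"
}
```

A typical use would be `clustat > /tmp/clustat.out && check_clustat /tmp/clustat.out` on any cluster node.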
* Connect to one of the nodes in the cluster and execute the group_tool(8) command. This command provides information that you may find helpful in debugging your system. The following example shows the output of the group_tool(8) command.

      [root@clusternode1 ~]# group_tool
      type             level name       id       state
      fence            0     default    00010005 none
      [1 2 3 4 5]
      dlm              1     clvmd      00020005 none
      [1 2 3 4 5]
      dlm              1     rgmanager  00030005 none
      [3 4 5]
      dlm              1     mygfs      007f0005 none
      [5]
      gfs              2     mygfs      007e0005 none
      [5]

The state of each group should be none. The numbers in the brackets are the node ID numbers of the cluster nodes in the group. The clustat output shows which node IDs are associated with which nodes. If you do not see a node's number in a group, that node is not a member of the group. For example, if a node ID is not in the dlm/rgmanager group, the node is not using the rgmanager DLM lock space (and probably is not running rgmanager).
The level of a group indicates its recovery ordering: level 0 is recovered first, level 1 second, and so forth.
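The "state should be none" check lends itself to a quick script. The sketch below assumes the output layout shown above, where data rows have five columns (type, level, name, id, state) and the bracketed member lists sit on their own lines:

```shell
# Sketch: flag any group whose state is not "none" in captured
# group_tool output.  A group stuck in another state points at a
# recovery problem.
check_groups() {
    # $1 = file with `group_tool` output; only rows whose first field is
    # a known group type are inspected, so headers and the bracketed
    # member lists are skipped.
    awk '$1 ~ /^(fence|dlm|gfs)$/ && $5 != "none" {
             print "group " $3 " (" $1 ") is in state " $5; bad = 1
         }
         END { exit bad }' "$1"
}
```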
* Connect to one of the nodes in the cluster and execute the cman_tool nodes -f command. This command provides information about the cluster nodes that you may want to look at. The following example shows the output of the cman_tool nodes -f command.

      [root@clusternode1 ~]# cman_tool nodes -f
      Node  Sts   Inc   Joined               Name
         1   M    752   2008-10-27 11:17:15  clusternode5.example.com
         2   M    752   2008-10-27 11:17:15  clusternode4.example.com
         3   M    760   2008-12-03 11:28:44  clusternode3.example.com
         4   M    756   2008-12-03 11:28:26  clusternode2.example.com
         5   M    744   2008-10-27 11:17:15  clusternode1.example.com

The Sts heading indicates the status of a node: a status of M means the node is a member of the cluster, while a status of X means the node is dead. The Inc heading indicates the incarnation number of the node, which is for debugging purposes only.
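If you want to pick the dead members out of this output automatically, a small filter on the Sts column is enough. This sketch assumes the column layout shown in the example above:

```shell
# Sketch: list dead members (status X) from captured `cman_tool nodes`
# output, e.g. from `cman_tool nodes > /tmp/nodes.out`.
dead_nodes() {
    # $1 = capture file; data rows start with a numeric node ID, the
    # second field is the Sts column, and the last field is the name.
    awk '$1 ~ /^[0-9]+$/ && $2 == "X" { print $NF }' "$1"
}
```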
* Check whether the cluster.conf file is identical on each node of the cluster. If you configure your system with Conga, as in the example provided in this document, these files should be identical, but one of the files may have been accidentally deleted or altered.
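One way to check this is to compare checksums of each node's copy of the file. The sketch below works on local copies; gathering them first (for example with scp from each node's /etc/cluster/cluster.conf) is left as an assumption:

```shell
# Sketch: check that several local copies of cluster.conf carry the
# same checksum.  You would first collect the file from every node,
# e.g. scp nodename:/etc/cluster/cluster.conf nodename.conf
same_config() {
    # Args: two or more local copies of cluster.conf; succeeds only if
    # every copy has the same CRC.
    [ "$(cksum "$@" | awk '{ print $1 }' | sort -u | wc -l)" -eq 1 ]
}
```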
* In addition to using Conga to fence a node in order to test whether failover is working properly, as described in Chapter 6, Testing the NFS Cluster Service, you can disconnect the Ethernet connection between cluster members. You might try disconnecting one, two, or three nodes, for example; this can help isolate where the problem is.
* If you are having trouble mounting or modifying an NFS volume, check whether the cause is one of the following:
  o The network between server and client is down.
  o The storage devices are not connected to the system.
  o More than half of the nodes in the cluster have crashed, rendering the cluster inquorate. This stops the cluster.
  o The GFS file system is not mounted on the cluster nodes.
  o The GFS file system is not writable.
  o The IP address you defined in cluster.conf is not bound to the correct interface/NIC (sometimes the ip.sh script does not perform as expected).
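Two items on this list, the GFS mount and its writability, are easy to probe from a shell. This is a sketch: the mount point is an argument (something like /mnt/gfs in a setup such as this one), and the mounts table can be overridden with a second argument so the parsing can be demonstrated on a sample file:

```shell
# Sketch: check that a GFS file system is mounted and writable.
gfs_mounted() {
    # $1 = mount point, $2 = mounts table (defaults to /proc/mounts)
    awk -v mp="$1" '$2 == mp && $3 ~ /^gfs2?$/ { found = 1 }
                    END { exit !found }' "${2:-/proc/mounts}"
}
gfs_writable() {
    # Try to create and remove a scratch file on the mount point
    t="$1/.gfs_write_test.$$" && touch "$t" 2>/dev/null && rm -f "$t"
}
```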
* Execute a showmount -e command on the node running the cluster service. If it displays the right five exports, check your firewall configuration for all of the ports necessary for using NFS.
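Counting the exports can be automated once the output is captured, for example with `showmount -e > /tmp/exports.out`. The sketch assumes showmount's usual layout of a one-line header followed by one export per line:

```shell
# Sketch: count the export entries in captured `showmount -e` output.
count_exports() {
    # The first line is the "Export list for <host>:" header; every
    # non-empty line after it is one export.
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}
```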
* If SELinux is currently in enforcing mode on your system, check your /var/log/audit.log file for any relevant messages. If you are using NFS to serve home directories, check whether the correct SELinux boolean, nfs_home_dirs, has been set to 1; this is required if you want to use NFS-based home directories on a client that is running SELinux. If you do not set this boolean, you can mount the directories as root but cannot use them as home directories for your users.
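When scanning the audit log, the lines of interest are AVC denials that mention NFS. The filter below is only a rough first pass, not a definitive one, and takes the log path as an argument:

```shell
# Sketch: pull NFS-related AVC denials out of an audit log such as the
# /var/log/audit.log file mentioned above.  Matching on "nfs" is just a
# heuristic starting point.
nfs_denials() {
    grep 'avc: *denied' "$1" | grep -i 'nfs'
}
```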
* Check the /var/log/messages file for error messages from the NFS daemon.
* If you see the expected results locally at the cluster nodes and between the cluster nodes, but not at the defined clients, check the firewall configuration at the clients.

Troubleshooting Red Hat Cluster Suite Networking
Ref : http://people.redhat.com/ccaulfie/docs/CSNetworking.pdf
