Pacemaker Cluster on RHEL 9 by Vathsa

 

 

Implementing Pacemaker Cluster on RHEL Servers

 


Two RHEL 9 nodes in a cluster providing high availability for shared storage and the LAN network. The commands below summarize the setup; each step is detailed in the sections that follow:

sudo subscription-manager repos --enable=rhel-9-for-x86_64-highavailability-rpms
sudo dnf install pcs pacemaker fence-agents-all -y
sudo firewall-cmd --permanent --add-service=high-availability
sudo firewall-cmd --reload
sudo passwd hacluster
sudo systemctl enable --now pcsd


## Preparing cluster configuration ##



## Basic Information ##

*You can substitute the values below according to your environment:
Linux Node1: LNXSRV1
IP Node1: X.X.X.X
Linux Node2: LNXSRV2
IP Node2: Y.Y.Y.Y
Virtual Hostname: HOSTVT
Virtual IP: V.V.V.V
LAN Network Adapter: enpXsY
Shared Disk: /dev/sdA (the disk name on the node where you'll configure the cluster)
VG Name: nameVG
LV1 Name: lv1_name (ex: lv_www)
FS1 name: fs1name (ex: /var/www/)
LV2 Name: lv2_name (ex: lv_bkp)
FS2 name: fs2name (ex: /web/bkp/ for backup)
LV3 Name: lv3_name (ex: lv_mon)
FS3 name: fs3name (ex: /web/mon/ for monitoring)
LV4 Name: lv4_name (ex: lv_mir)
FS4 name: fs4name (ex: /var/mir/ for mirror)
Cluster name: name-cls
Resource Group name: name-rg
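*A minimal sketch (assuming a bash shell, purely for convenience and not part of the original procedure): export the placeholders as shell variables once per session, so later commands can reuse them (e.g. $VG instead of nameVG). The variable names are illustrative:
# export NODE1=LNXSRV1 NODE2=LNXSRV2        # cluster node hostnames
# export VHOST=HOSTVT VIP=V.V.V.V           # virtual hostname and virtual IP
# export NIC=enpXsY DISK=/dev/sdA           # LAN adapter and shared disk
# export VG=nameVG CLUSTER=name-cls RG=name-rg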




## If there is no IP address defined on your LAN network adapter (ex. enpXsY), you need to define it now ##

*On your first node, create a network adapter file in the /etc/sysconfig/network-scripts/ directory
# vi /etc/sysconfig/network-scripts/ifcfg-enpXsY
DEVICE=enpXsY
BOOTPROTO=none
ONBOOT=yes
PREFIX=24
IPADDR=X.X.X.X
*Then restart the network service on RHEL 9:
systemctl restart NetworkManager
*Check the IP address with "ip addr show" (or "ifconfig -a"). If the address didn't come up and you can't restart the machine, set it manually:
ip addr add X.X.X.X/24 dev enpXsY

*On your second node, create a network adapter file in the /etc/sysconfig/network-scripts/ directory
# vi /etc/sysconfig/network-scripts/ifcfg-enpXsY
DEVICE=enpXsY
BOOTPROTO=none
ONBOOT=yes
PREFIX=24
IPADDR=Y.Y.Y.Y
*Then restart the network service on RHEL 9:
systemctl restart NetworkManager
*Check the IP address with "ip addr show" (or "ifconfig -a"). If the address didn't come up and you can't restart the machine, set it manually:
ip addr add Y.Y.Y.Y/24 dev enpXsY
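*NOTE: on RHEL 9 the ifcfg-style files are deprecated in favor of NetworkManager keyfiles. A sketch of the nmcli equivalent (the connection name enpXsY below is an assumption; adjust the interface and IP per node):
# nmcli connection add type ethernet ifname enpXsY con-name enpXsY ipv4.method manual ipv4.addresses X.X.X.X/24
# nmcli connection up enpXsY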



## Update the /etc/hosts file with the Linux server IP addresses and the virtual IP ##
# echo "X.X.X.X LNXSRV1" >> /etc/hosts
# echo "Y.Y.Y.Y LNXSRV2" >> /etc/hosts
# echo "V.V.V.V HOSTVT" >> /etc/hosts
# cat /etc/hosts
*Do it on both nodes



## Run a ping test from each node ##
*from LNXSRV1:
# ping LNXSRV2
*from LNXSRV2
# ping LNXSRV1



## Install the pacemaker package on both nodes ##
*First you need to register a subscription, then enable the high availability repository
# sudo subscription-manager repos --enable=rhel-9-for-x86_64-highavailability-rpms
*Then you can install the pacemaker package
# sudo dnf install pcs pacemaker fence-agents-all
*Run the steps above on both Linux servers



## Allow high availability ports in the firewall ##
# sudo firewall-cmd --permanent --add-service=high-availability
# sudo firewall-cmd --reload
*Run the steps above on both Linux servers



## Set a password for the hacluster user (senha123 below is just an example), then start/enable the pcsd service ##
# echo senha123 | sudo passwd --stdin hacluster
Changing password for user hacluster.
passwd: all authentication tokens updated successfully.
# sudo systemctl start pcsd.service
# sudo systemctl enable pcsd.service
Created symlink /etc/systemd/system/multi-user.target.wants/pcsd.service → /usr/lib/systemd/system/pcsd.service.
*Do it on both nodes





## Creating and configuring pacemaker cluster ##



## Authorize both nodes to access the cluster ##
# sudo pcs host auth LNXSRV1 LNXSRV2 -u hacluster -p senha123
LNXSRV1: Authorized
LNXSRV2: Authorized
*Run this command from just one node; I ran it from LNXSRV1.



## Create a cluster with two nodes ##
# sudo pcs cluster setup name-cls --start LNXSRV1 LNXSRV2
No addresses specified for host 'LNXSRV1', using 'LNXSRV1'
No addresses specified for host 'LNXSRV2', using 'LNXSRV2'
Destroying cluster on hosts: 'LNXSRV1', 'LNXSRV2'...
LNXSRV2: Successfully destroyed cluster
LNXSRV1: Successfully destroyed cluster
...
Cluster has been successfully set up.
Starting cluster on hosts: 'LNXSRV1', 'LNXSRV2'...

*where:
"name-cls" is the cluster name;
"--start" starts the cluster immediately after creating it;
"LNXSRV1 and LNXSRV2" are the nodes of the cluster.
*Run this command from just one node; I ran it from LNXSRV1.



## Enable the cluster to start on all nodes after an OS boot ##
# sudo pcs cluster enable --all
LNXSRV1: Cluster Enabled
LNXSRV2: Cluster Enabled



## You can check the cluster status running the following command ##
# sudo pcs cluster status
Cluster Status:
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-11-01 14:16:38 -03:00)
Cluster Summary:
* Stack: corosync
* Current DC: LNXSRV1 (version 2.1.5-9.el9_2.3-a3f44794f94) - partition with quorum
* Last updated: Wed Nov 1 14:16:39 2023
* Last change: Wed Nov 1 14:08:59 2023 by hacluster via crmd on LNXSRV1
* 2 nodes configured
* 0 resource instances configured
Node List:
* Online: [ LNXSRV1 LNXSRV2 ]
PCSD Status:
LNXSRV1: Online
LNXSRV2: Online



## Disable fencing (STONITH) and relax the quorum policy ##
# sudo pcs property set stonith-enabled=false
# sudo pcs property set no-quorum-policy=ignore
*Run these commands from just one node; I ran them from LNXSRV1.
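*NOTE: stonith-enabled=false is fine for a lab but is not supported for production clusters. Since these nodes run on VMware (see the reactivation step later), here is a hedged sketch of a real fence device using the fence_vmware_rest agent; the vCenter address, credentials and VM names below are placeholders:
# sudo pcs stonith create vmfence fence_vmware_rest ip=vcenter.example.com username=fenceuser password=fencepass ssl=1 pcmk_host_map="LNXSRV1:vm-lnxsrv1;LNXSRV2:vm-lnxsrv2"
# sudo pcs property set stonith-enabled=true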



## Configure a shared disk on the cluster ##

NOTE: As mentioned at the beginning of this procedure, /dev/sdA is our shared disk.
*Change the system_id_source parameter in the /etc/lvm/lvm.conf file from "none" to "uname". With system_id_source set, the VG records the system ID of the owning node, so only that node can activate it:
# sudo sed -i 's/# system_id_source = "none"/ system_id_source = "uname"/g' /etc/lvm/lvm.conf
# cat /etc/lvm/lvm.conf |grep "system_id_source ="
NOTE: Run the commands above on both nodes.

*Run the following commands to create the LVM configuration (JUST FROM THE NODE LNXSRV1)
# sudo pvcreate /dev/sdA
# sudo vgcreate --setautoactivation n nameVG /dev/sdA
# sudo lvcreate -L1G -n lv1_name nameVG
# sudo lvs /dev/nameVG/lv1_name
# sudo mkfs.xfs /dev/nameVG/lv1_name
# sudo lvcreate -L1G -n lv2_name nameVG
# sudo lvs /dev/nameVG/lv2_name
# sudo mkfs.xfs /dev/nameVG/lv2_name
# sudo lvcreate -L1G -n lv3_name nameVG
# sudo lvs /dev/nameVG/lv3_name
# sudo mkfs.xfs /dev/nameVG/lv3_name
# sudo lvcreate -L1G -n lv4_name nameVG
# sudo lvs /dev/nameVG/lv4_name
# sudo mkfs.xfs /dev/nameVG/lv4_name

*Then, run the following command on the second node:
# sudo lvmdevices --adddev /dev/sdA
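*A quick sanity check (a suggestion, not part of the original procedure): confirm that both nodes now see the VG and that its system ID belongs to LNXSRV1:
# sudo vgs -o+systemid nameVG
# sudo lvs nameVG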





## Installing and configuring the Apache webserver ##



## Install the Apache webserver (httpd) on both servers with the dnf command ##
# sudo dnf install httpd wget
Updating Subscription Management repositories.
Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
...
Complete!
NOTE: Run the step above on both nodes



## Allow the Apache ports in the firewall by running the firewall-cmd command on both servers ##
# sudo firewall-cmd --permanent --zone=public --add-service=http
success
# sudo firewall-cmd --permanent --zone=public --add-service=https
success
# sudo firewall-cmd --reload
success
NOTE: Run the steps above on both nodes



## Then create a status.conf file so the Apache resource agent can get the status of Apache ##
# sudo vi /etc/httpd/conf.d/status.conf
<Location /server-status>
SetHandler server-status
Require local
</Location>
NOTE: Run the step above on both nodes
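*Once httpd is running on a node (the cluster will start it later), you can verify the handler from that same node; "Require local" means it only answers requests from the node itself:
# curl -s http://127.0.0.1/server-status | head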



## On both nodes, change the /etc/logrotate.d/httpd file following the instructions below ##
* Comment out the following line:
sudo vi /etc/logrotate.d/httpd
# /bin/systemctl reload httpd.service > /dev/null 2>/dev/null || true
* then include the three lines below in its place:
/usr/bin/test -f /run/httpd.pid >/dev/null 2>/dev/null &&
/usr/bin/ps -q $(/usr/bin/cat /run/httpd.pid) >/dev/null 2>/dev/null &&
/usr/sbin/httpd -f /etc/httpd/conf/httpd.conf -c "PidFile /run/httpd.pid" -k graceful > /dev/null 2>/dev/null || true



## After changing the /etc/logrotate.d/httpd file, check the result on both nodes ##
# cat /etc/logrotate.d/httpd
# Note that logs are not compressed unless "compress" is configured,
# which can be done either here or globally in /etc/logrotate.conf.
/var/log/httpd/*log {
missingok
notifempty
sharedscripts
delaycompress
postrotate
#/bin/systemctl reload httpd.service > /dev/null 2>/dev/null || true
/usr/bin/test -f /run/httpd.pid >/dev/null 2>/dev/null &&
/usr/bin/ps -q $(/usr/bin/cat /run/httpd.pid) >/dev/null 2>/dev/null &&
/usr/sbin/httpd -f /etc/httpd/conf/httpd.conf -c "PidFile /run/httpd.pid" -k graceful > /dev/null 2>/dev/null || true
endscript
}
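*Optionally, dry-run the edited configuration to catch syntax mistakes; logrotate's -d (debug) flag only prints what would happen, without rotating anything:
# sudo logrotate -d /etc/logrotate.d/httpd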



## Now you need to create a basic Apache webpage, running the following commands from the primary node ##

*Activate the logical volumes:
# sudo lvchange -ay nameVG/lv1_name
# sudo lvchange -ay nameVG/lv2_name
# sudo lvchange -ay nameVG/lv3_name
# sudo lvchange -ay nameVG/lv4_name

*Mount the LVs on the specific mount points:
# sudo mkdir -p fs1name
# sudo mount /dev/nameVG/lv1_name fs1name
# sudo mkdir -p fs2name
# sudo mount /dev/nameVG/lv2_name fs2name
# sudo mkdir -p fs3name
# sudo mount /dev/nameVG/lv3_name fs3name
# sudo mkdir -p fs4name
# sudo mount /dev/nameVG/lv4_name fs4name

*Create the following directories on the fs1name filesystem (i.e., the html, cgi-bin and error subdirectories under fs1name):
# sudo mkdir -p fs1namehtml
# sudo mkdir -p fs1namecgi-bin
# sudo mkdir -p fs1nameerror

*Check that all the filesystems created above are mounted on node1:
df -h fs1name
df -h fs2name
df -h fs3name
df -h fs4name

*ON THE SECOND NODE LNXSRV2, run the following commands
# sudo mkdir -p fs1name
# sudo mkdir -p fs2name
# sudo mkdir -p fs3name
# sudo mkdir -p fs4name

*Create the index.html file with the content below by running the following command (from node1):
# sudo bash -c ' cat <<-END >fs1namehtml/index.html
<html>
<body>Test Apache webpage High Availability with pacemaker cluster</body>
</html>
END'

*Run the chcon command on the html directory to set the SELinux context expected by httpd
# chcon -Rt httpd_sys_content_t fs1namehtml

# cat fs1namehtml/index.html
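*NOTE: chcon changes can be lost on a filesystem relabel. A persistent alternative (assuming the policycoreutils-python-utils package, which provides semanage, is installed):
# sudo semanage fcontext -a -t httpd_sys_content_t "fs1namehtml(/.*)?"
# sudo restorecon -R fs1namehtml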

*Unmount the filesystems so the cluster can manage them:
# sudo umount fs1name
# sudo umount fs2name
# sudo umount fs3name
# sudo umount fs4name



## If you have the SELinux enabled, run the following command on both nodes ##
*First, check if SELinux is enabled...
# sudo sestatus
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
...
*restorecon is used to reset the security context of the mount points:
# sudo restorecon -R fs1name
# sudo restorecon -R fs2name
# sudo restorecon -R fs3name
# sudo restorecon -R fs4name

NOTE: Run the steps above on both nodes




###########################################################
## Creating resources and resources group on the cluster ##
###########################################################

NOTE1: The name of the resource group that we are going to use is name-rg;
NOTE2: You can run the commands below from any node of the cluster;


## Create a resource called rlvm for the volume group nameVG that we created before ##
# sudo pcs resource create rlvm ocf:heartbeat:LVM-activate vgname=nameVG vg_access_mode=system_id --group name-rg
where:
ocf - Open Cluster Framework;
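*To review all the parameters the LVM-activate agent accepts before creating the resource, you can run:
# sudo pcs resource describe ocf:heartbeat:LVM-activate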



## Create Filesystem resources (named after the LVs) for the shared logical volumes we created before ##
# sudo pcs resource create lv1_name Filesystem device="/dev/nameVG/lv1_name" directory="fs1name" fstype="xfs" --group name-rg
# sudo pcs resource create lv2_name Filesystem device="/dev/nameVG/lv2_name" directory="fs2name" fstype="xfs" --group name-rg
# sudo pcs resource create lv3_name Filesystem device="/dev/nameVG/lv3_name" directory="fs3name" fstype="xfs" --group name-rg
# sudo pcs resource create lv4_name Filesystem device="/dev/nameVG/lv4_name" directory="fs4name" fstype="xfs" --group name-rg



## Create a resource called rip for the virtual IP address V.V.V.V that we defined at the beginning of this procedure ##
# sudo pcs resource create rip IPaddr2 ip=V.V.V.V cidr_netmask=24 nic=enpXsY --group name-rg
NOTE: enpXsY is the network interface that we are using on both nodes, with the IP addresses X.X.X.X and Y.Y.Y.Y respectively.



## Create a resource called rweb for the Apache webserver ##
# sudo pcs resource create rweb apache configfile="/etc/httpd/conf/httpd.conf" statusurl="http://127.0.0.1/server-status" --group name-rg



## Check the status and configuration of the cluster ##

# sudo pcs cluster config show
Cluster Name: name-cls
Cluster UUID: 4418d49cb04e40c589de05c30ab2df44
...
hash: sha256

# sudo pcs resource
* Resource Group: name-rg:
* rlvm (ocf:heartbeat:LVM-activate): Started LNXSRV1
* lv1_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* lv2_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* lv3_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* lv4_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* rip (ocf:heartbeat:IPaddr2): Started LNXSRV1
* rweb (ocf:heartbeat:apache): Started LNXSRV1

# sudo pcs cluster status
Cluster Status:
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-11-03 19:08:36 -03:00)
...
LNXSRV1: Online
LNXSRV2: Online

# sudo pcs status
Cluster name: name-cls
...
pcsd: active/enabled


## Checking resource group and resources ##
# sudo pcs resource group list
name-rg: rlvm lv1_name lv2_name lv3_name lv4_name rip rweb





## Testing Cluster Failover ##



## Check the web service that is running on node1 LNXSRV1 ##
*Open your web browser and enter the URL below:
http://V.V.V.V/index.html



## Test cluster takeover by putting node1 into standby ##
# sudo pcs node standby LNXSRV1
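*You can watch the resources migrate in real time with crm_mon (press Ctrl+C to exit), or take a one-shot snapshot with -1:
# sudo crm_mon
# sudo crm_mon -1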



## Then check if all resources became online on the second node ##

# sudo pcs status

NOTE: run the commands below on the second node:

# df -h fs1name
# df -h fs2name
# df -h fs3name
# df -h fs4name

# ip -f inet addr show enpXsY
3: enpXsY: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
inet Y.Y.Y.Y brd 192.168.2.255 scope global noprefixroute enpXsY
valid_lft forever preferred_lft forever
inet V.V.V.V brd 192.168.2.255 scope global secondary enpXsY
valid_lft forever preferred_lft forever

# sudo vgs nameVG
VG #PV #LV #SN Attr VSize VFree
nameVG 1 1 0 wz--n- <4.00g 4.00m

## Check if the web service is running on node2 LNXSRV2 ##
*Open your web browser and enter the URL below:
http://V.V.V.V/index.html



## Roll back cluster resources to the original node ##
# sudo pcs node unstandby LNXSRV1
# sudo pcs resource move name-rg LNXSRV1
Location constraint to move resource 'name-rg' has been created
Waiting for the cluster to apply configuration changes...
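*NOTE: pcs resource move works by creating a location constraint. If a constraint lingers after the move, you can list the constraints and clear the ones created by move/ban:
# sudo pcs constraint
# sudo pcs resource clear name-rg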



## Then check if all resources became online on the primary node ##

# sudo pcs status

NOTE: run the commands below on the primary node:
# df -h fs1name
# df -h fs2name
# df -h fs3name
# df -h fs4name
# ip -f inet addr show enpXsY
# sudo vgs nameVG
*Open your web browser and enter the URL below:
http://V.V.V.V/index.html



## Run a shutdown on node LNXSRV1, then check if the resources activate automatically on node LNXSRV2 ##

*On node LNXSRV1
# sudo shutdown -h now

*On node LNXSRV2
# sudo pcs status
Full List of Resources:
* Resource Group: name-rg:
* rlvm (ocf:heartbeat:LVM-activate): Started LNXSRV2
* lv1_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv2_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv3_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv4_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* rip (ocf:heartbeat:IPaddr2): Started LNXSRV2
* rweb (ocf:heartbeat:apache): Started LNXSRV2
# df -h fs1name
# df -h fs2name
# df -h fs3name
# df -h fs4name
# ip -f inet addr show enpXsY
# sudo vgs nameVG
*Check if the web service is running on node2 LNXSRV2. Open your web browser and enter the URL below*
http://V.V.V.V/index.html



## Reactivate node LNXSRV1 from VMware ##



## After activating the node LNXSRV1, check if resources are still on the secondary node ##
# sudo pcs status
Cluster name: name-cls
Cluster Summary:
...
Node List:
* Online: [ LNXSRV1 LNXSRV2 ]
Full List of Resources:
* Resource Group: name-rg:
* rlvm (ocf:heartbeat:LVM-activate): Started LNXSRV2
* lv1_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv2_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv3_name (ocf:heartbeat:Filesystem): Started LNXSRV2
....



## Roll back cluster resources to the original node ##
# sudo pcs resource move name-rg LNXSRV1
Location constraint to move resource 'name-rg' has been created
Waiting for the cluster to apply configuration changes...



## Now check if resources were moved to the main node ##

# sudo pcs status
Cluster name: name-cls
Cluster Summary:
....
Full List of Resources:
* Resource Group: name-rg:
* rlvm (ocf:heartbeat:LVM-activate): Started LNXSRV1
* lv1_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* lv2_name (ocf:heartbeat:Filesystem): Started LNXSRV1
....

# df -h fs1name
# df -h fs2name
# df -h fs3name
# df -h fs4name
# ip -f inet addr show enpXsY
# sudo vgs nameVG
*Check if the web service is running on node1 LNXSRV1. Open your web browser and enter the URL below*
http://V.V.V.V/index.html
 
 

Other PCS Commands:

Command                                   Operation
pcs --version                             Show the Pacemaker/pcs version
pcs status                                Pacemaker overview
pcs resource                              Pacemaker resource overview
lcmap                                     Determine path-ownership in a cluster
pcs resource enable resource_name         Enable (start) a resource
pcs resource debug-start resource_name    Start a pcs resource with debugging
pcs resource config resource_name         Review a pcs resource's configuration settings
pcs resource disable resource_name        Disable (stop) a resource
pcs resource cleanup resource_name        Clean up a failed resource so it can restart
pcs cluster stop [--force]                Stop pacemaker on a node
pcs cluster start [--all]                 Start pacemaker
pcs node standby node_name                Put a node in standby
pcs node unstandby node_name              Bring a node out of standby


TROUBLESHOOTING PACEMAKER

1. Check the cluster status:
# pcs status
Look for Offline or UNCLEAN nodes and for Stopped or failed resources.

2. Check the logs. Pacemaker logs to:
/var/log/pacemaker/pacemaker.log
/var/log/messages
# journalctl -xe

3. Check resource errors:
# pcs resource failcount show
# pcs resource status <resource-name>
Clear the fail counts with:
# pcs resource cleanup <resource-name>

4. Check the node status:
# pcs status nodes
# crm_node -l

5. Check the CIB (Cluster Information Base); this can help identify misconfigurations:
# pcs cluster cib

6. Force start or reset. If the cluster is stuck or misbehaving:
# pcs resource cleanup
# pcs cluster stop --all
# pcs cluster start --all
Or reboot one node at a time if needed.
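*A hedged one-liner to pull recent warnings and errors from the pacemaker and corosync units via the journal:
# sudo journalctl -u pacemaker -u corosync --since "10 minutes ago" -p warning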

Useful log and configuration paths:

/var/log/messages - Global system messages regarding system resources and services.
  Supplemental: grep 'pacemaker.*\(error\|warning\)' /var/log/messages
/var/log/pacemaker/pacemaker.log - Default Pacemaker log for resources and functions.
/var/log/pcsd/pcsd.log - Default pcsd (pacemaker service/daemon) log.
/var/log/cluster/corosync.log - Default Pacemaker node-communication (corosync) log.
/usr/sbin/nw_hae.log - NetWorker (nws) resource start log, as defined in /usr/lib/ocf/resource.d/EMC_NetWorker/Server.
/usr/lib/ocf/resource.d/EMC_NetWorker/Server - NetWorker Pacemaker configuration file; this is what operations are performed/managed by pcs.
