Pacemaker Cluster on RHEL 9 by Vathsa

 

 

Implementing Pacemaker Cluster on RHEL Servers

 


Two RHEL 9 nodes in a cluster providing high availability for shared storage and the LAN network. The commands below summarize the setup; each step is detailed in the sections that follow:

sudo subscription-manager repos --enable=rhel-9-for-x86_64-highavailability-rpms
sudo dnf install pcs pacemaker fence-agents-all -y
sudo firewall-cmd --permanent --add-service=high-availability
sudo firewall-cmd --reload
sudo passwd hacluster
sudo systemctl enable --now pcsd


## Preparing cluster configuration ##



## Basic Information ##

*You can substitute the values below according to your environment:
Linux Node1: LNXSRV1
IP Node1: X.X.X.X
Linux Node2: LNXSRV2
IP Node2: Y.Y.Y.Y
Virtual Hostname: HOSTVT
Virtual IP: V.V.V.V
LAN Network Adapter: enpXsY
Shared Disk: /dev/sdA (the disk name on the node where you'll configure the cluster)
VG Name: nameVG
LV1 Name: lv1_name (ex: lv_www)
FS1 name: fs1name (ex: /var/www/)
LV2 Name: lv2_name (ex: lv_bkp)
FS2 name: fs2name (ex: /web/bkp/ for backup)
LV3 Name: lv3_name (ex: lv_mon)
FS3 name: fs3name (ex: /web/mon/ for monitoring)
LV4 Name: lv4_name (ex: lv_mir)
FS4 name: fs4name (ex: /var/mir/ for mirror)
Cluster name: name-cls
Resource Group name: name-rg
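*A minimal sketch (assuming a bash shell, purely for convenience and not part of the original procedure): export the placeholders as shell variables once per session, so later commands can reuse them (e.g. $VG instead of nameVG). The variable names are illustrative:
# export NODE1=LNXSRV1 NODE2=LNXSRV2        # cluster node hostnames
# export VHOST=HOSTVT VIP=V.V.V.V           # virtual hostname and virtual IP
# export NIC=enpXsY DISK=/dev/sdA           # LAN adapter and shared disk
# export VG=nameVG CLUSTER=name-cls RG=name-rg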




## If there is no IP address defined on your LAN network adapter (ex. enpXsY), you need to define it now ##

*On your first node, create a network adapter file in the /etc/sysconfig/network-scripts/ directory
# vi /etc/sysconfig/network-scripts/ifcfg-enpXsY
DEVICE=enpXsY
BOOTPROTO=none
ONBOOT=yes
PREFIX=24
IPADDR=X.X.X.X
*Then restart the network service on RHEL 9:
systemctl restart NetworkManager
*Check the IP address with "ip addr show" (or "ifconfig -a"). If the address didn't come up and you can't restart the machine, set it manually:
ip addr add X.X.X.X/24 dev enpXsY

*On your second node, create a network adapter file in the /etc/sysconfig/network-scripts/ directory
# vi /etc/sysconfig/network-scripts/ifcfg-enpXsY
DEVICE=enpXsY
BOOTPROTO=none
ONBOOT=yes
PREFIX=24
IPADDR=Y.Y.Y.Y
*Then restart the network service on RHEL 9:
systemctl restart NetworkManager
*Check the IP address with "ip addr show" (or "ifconfig -a"). If the address didn't come up and you can't restart the machine, set it manually:
ip addr add Y.Y.Y.Y/24 dev enpXsY
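*NOTE: on RHEL 9 the ifcfg-style files are deprecated in favor of NetworkManager keyfiles. A sketch of the nmcli equivalent (the connection name enpXsY below is an assumption; adjust the interface and IP per node):
# nmcli connection add type ethernet ifname enpXsY con-name enpXsY ipv4.method manual ipv4.addresses X.X.X.X/24
# nmcli connection up enpXsY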



## Update the /etc/hosts file with the Linux server IP addresses and the virtual IP ##
# echo "X.X.X.X LNXSRV1" >> /etc/hosts
# echo "Y.Y.Y.Y LNXSRV2" >> /etc/hosts
# echo "V.V.V.V HOSTVT" >> /etc/hosts
# cat /etc/hosts
*Do it on both nodes



## Run a ping test from each node ##
*from LNXSRV1:
# ping LNXSRV2
*from LNXSRV2
# ping LNXSRV1



## Install the pacemaker package on both nodes ##
*First you need to register a subscription, then enable the high availability repository
# sudo subscription-manager repos --enable=rhel-9-for-x86_64-highavailability-rpms
*Then you can install the pacemaker package
# sudo dnf install pcs pacemaker fence-agents-all
*Run the steps above on both Linux servers



## Allow high availability ports in the firewall ##
# sudo firewall-cmd --permanent --add-service=high-availability
# sudo firewall-cmd --reload
*Run the steps above on both Linux servers



## Set a password for the hacluster user (senha123 below is just an example), then start/enable the pcsd service ##
# echo senha123 | sudo passwd --stdin hacluster
Changing password for user hacluster.
passwd: all authentication tokens updated successfully.
# sudo systemctl start pcsd.service
# sudo systemctl enable pcsd.service
Created symlink /etc/systemd/system/multi-user.target.wants/pcsd.service → /usr/lib/systemd/system/pcsd.service.
*Do it on both nodes





## Creating and configuring pacemaker cluster ##



## Authorize both nodes to access the cluster ##
# sudo pcs host auth LNXSRV1 LNXSRV2 -u hacluster -p senha123
LNXSRV1: Authorized
LNXSRV2: Authorized
*Run this command from just one node; I ran it from LNXSRV1.



## Create a cluster with two nodes ##
# sudo pcs cluster setup name-cls --start LNXSRV1 LNXSRV2
No addresses specified for host 'LNXSRV1', using 'LNXSRV1'
No addresses specified for host 'LNXSRV2', using 'LNXSRV2'
Destroying cluster on hosts: 'LNXSRV1', 'LNXSRV2'...
LNXSRV2: Successfully destroyed cluster
LNXSRV1: Successfully destroyed cluster
...
Cluster has been successfully set up.
Starting cluster on hosts: 'LNXSRV1', 'LNXSRV2'...

*where:
"name-cls" is the cluster name;
"--start" starts the cluster immediately after creating it;
"LNXSRV1 and LNXSRV2" are the nodes of the cluster.
*Run this command from just one node; I ran it from LNXSRV1.



## Enable the cluster to start on all nodes after an OS boot ##
# sudo pcs cluster enable --all
LNXSRV1: Cluster Enabled
LNXSRV2: Cluster Enabled



## You can check the cluster status running the following command ##
# sudo pcs cluster status
Cluster Status:
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-11-01 14:16:38 -03:00)
Cluster Summary:
* Stack: corosync
* Current DC: LNXSRV1 (version 2.1.5-9.el9_2.3-a3f44794f94) - partition with quorum
* Last updated: Wed Nov 1 14:16:39 2023
* Last change: Wed Nov 1 14:08:59 2023 by hacluster via crmd on LNXSRV1
* 2 nodes configured
* 0 resource instances configured
Node List:
* Online: [ LNXSRV1 LNXSRV2 ]
PCSD Status:
LNXSRV1: Online
LNXSRV2: Online



## Disable fencing (STONITH) and relax the quorum policy ##
# sudo pcs property set stonith-enabled=false
# sudo pcs property set no-quorum-policy=ignore
*Run these commands from just one node; I ran them from LNXSRV1.
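*NOTE: stonith-enabled=false is fine for a lab but is not supported for production clusters. Since these nodes run on VMware (see the reactivation step later), here is a hedged sketch of a real fence device using the fence_vmware_rest agent; the vCenter address, credentials and VM names below are placeholders:
# sudo pcs stonith create vmfence fence_vmware_rest ip=vcenter.example.com username=fenceuser password=fencepass ssl=1 pcmk_host_map="LNXSRV1:vm-lnxsrv1;LNXSRV2:vm-lnxsrv2"
# sudo pcs property set stonith-enabled=true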



## Configure a shared disk on the cluster ##

NOTE: As mentioned at the beginning of this procedure, /dev/sdA is our shared disk.
*Change the system_id_source parameter in the /etc/lvm/lvm.conf file from "none" to "uname". With system_id_source set, the VG records the system ID of the owning node, so only that node can activate it:
# sudo sed -i 's/# system_id_source = "none"/ system_id_source = "uname"/g' /etc/lvm/lvm.conf
# cat /etc/lvm/lvm.conf |grep "system_id_source ="
NOTE: Run the commands above on both nodes.

*Run the following commands to create the LVM configuration (JUST FROM THE NODE LNXSRV1)
# sudo pvcreate /dev/sdA
# sudo vgcreate --setautoactivation n nameVG /dev/sdA
# sudo lvcreate -L1G -n lv1_name nameVG
# sudo lvs /dev/nameVG/lv1_name
# sudo mkfs.xfs /dev/nameVG/lv1_name
# sudo lvcreate -L1G -n lv2_name nameVG
# sudo lvs /dev/nameVG/lv2_name
# sudo mkfs.xfs /dev/nameVG/lv2_name
# sudo lvcreate -L1G -n lv3_name nameVG
# sudo lvs /dev/nameVG/lv3_name
# sudo mkfs.xfs /dev/nameVG/lv3_name
# sudo lvcreate -L1G -n lv4_name nameVG
# sudo lvs /dev/nameVG/lv4_name
# sudo mkfs.xfs /dev/nameVG/lv4_name

*Then, run the following command on the second node:
# sudo lvmdevices --adddev /dev/sdA
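*A quick sanity check (a suggestion, not part of the original procedure): confirm that both nodes now see the VG and that its system ID belongs to LNXSRV1:
# sudo vgs -o+systemid nameVG
# sudo lvs nameVG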





## Installing and configuring the Apache webserver ##



## Install the Apache webserver (httpd) on both servers with the dnf command ##
# sudo dnf install httpd wget
Updating Subscription Management repositories.
Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
...
Complete!
NOTE: Run the step above on both nodes



## Allow the Apache ports in the firewall by running the firewall-cmd command on both servers ##
# sudo firewall-cmd --permanent --zone=public --add-service=http
success
# sudo firewall-cmd --permanent --zone=public --add-service=https
success
# sudo firewall-cmd --reload
success
NOTE: Run the steps above on both nodes



## Then create a status.conf file so the Apache resource agent can get the status of Apache ##
# sudo vi /etc/httpd/conf.d/status.conf
<Location /server-status>
SetHandler server-status
Require local
</Location>
NOTE: Run the step above on both nodes
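*Once httpd is running on a node (the cluster will start it later), you can verify the handler from that same node; "Require local" means it only answers requests from the node itself:
# curl -s http://127.0.0.1/server-status | head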



## On both nodes, change the /etc/logrotate.d/httpd file following the instructions below ##
* Comment out the following line:
sudo vi /etc/logrotate.d/httpd
# /bin/systemctl reload httpd.service > /dev/null 2>/dev/null || true
* then include the three lines below in its place:
/usr/bin/test -f /run/httpd.pid >/dev/null 2>/dev/null &&
/usr/bin/ps -q $(/usr/bin/cat /run/httpd.pid) >/dev/null 2>/dev/null &&
/usr/sbin/httpd -f /etc/httpd/conf/httpd.conf -c "PidFile /run/httpd.pid" -k graceful > /dev/null 2>/dev/null || true



## After changing the /etc/logrotate.d/httpd file, check the result on both nodes ##
# cat /etc/logrotate.d/httpd
# Note that logs are not compressed unless "compress" is configured,
# which can be done either here or globally in /etc/logrotate.conf.
/var/log/httpd/*log {
missingok
notifempty
sharedscripts
delaycompress
postrotate
#/bin/systemctl reload httpd.service > /dev/null 2>/dev/null || true
/usr/bin/test -f /run/httpd.pid >/dev/null 2>/dev/null &&
/usr/bin/ps -q $(/usr/bin/cat /run/httpd.pid) >/dev/null 2>/dev/null &&
/usr/sbin/httpd -f /etc/httpd/conf/httpd.conf -c "PidFile /run/httpd.pid" -k graceful > /dev/null 2>/dev/null || true
endscript
}
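*Optionally, dry-run the edited configuration to catch syntax mistakes; logrotate's -d (debug) flag only prints what would happen, without rotating anything:
# sudo logrotate -d /etc/logrotate.d/httpd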



## Now you need to create a basic Apache webpage, running the following commands from the primary node ##

*Activate the logical volumes:
# sudo lvchange -ay nameVG/lv1_name
# sudo lvchange -ay nameVG/lv2_name
# sudo lvchange -ay nameVG/lv3_name
# sudo lvchange -ay nameVG/lv4_name

*Mount the LVs on the specific mount points:
# sudo mkdir -p fs1name
# sudo mount /dev/nameVG/lv1_name fs1name
# sudo mkdir -p fs2name
# sudo mount /dev/nameVG/lv2_name fs2name
# sudo mkdir -p fs3name
# sudo mount /dev/nameVG/lv3_name fs3name
# sudo mkdir -p fs4name
# sudo mount /dev/nameVG/lv4_name fs4name

*Create the following directories on the fs1name filesystem (i.e., the html, cgi-bin and error subdirectories under fs1name):
# sudo mkdir -p fs1namehtml
# sudo mkdir -p fs1namecgi-bin
# sudo mkdir -p fs1nameerror

*Check that all the filesystems created above are mounted on node1:
df -h fs1name
df -h fs2name
df -h fs3name
df -h fs4name

*ON THE SECOND NODE LNXSRV2, run the following commands
# sudo mkdir -p fs1name
# sudo mkdir -p fs2name
# sudo mkdir -p fs3name
# sudo mkdir -p fs4name

*Create the index.html file with the content below by running the following command (from node1):
# sudo bash -c ' cat <<-END >fs1namehtml/index.html
<html>
<body>Test Apache webpage High Availability with pacemaker cluster</body>
</html>
END'

*Run the chcon command on the html directory to set the SELinux context expected by httpd
# chcon -Rt httpd_sys_content_t fs1namehtml

# cat fs1namehtml/index.html
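*NOTE: chcon changes can be lost on a filesystem relabel. A persistent alternative (assuming the policycoreutils-python-utils package, which provides semanage, is installed):
# sudo semanage fcontext -a -t httpd_sys_content_t "fs1namehtml(/.*)?"
# sudo restorecon -R fs1namehtml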

*Unmount the filesystems so the cluster can manage them:
# sudo umount fs1name
# sudo umount fs2name
# sudo umount fs3name
# sudo umount fs4name



## If you have the SELinux enabled, run the following command on both nodes ##
*First, check if SELinux is enabled...
# sudo sestatus
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
...
*restorecon is used to reset the security context of the mount points:
# sudo restorecon -R fs1name
# sudo restorecon -R fs2name
# sudo restorecon -R fs3name
# sudo restorecon -R fs4name

NOTE: Run the steps above on both nodes




###########################################################
## Creating resources and resources group on the cluster ##
###########################################################

NOTE1: The name of the resource group that we are going to use is name-rg;
NOTE2: You can run the commands below from any node of the cluster;


## Create a resource called rlvm for the volume group nameVG that we created before ##
# sudo pcs resource create rlvm ocf:heartbeat:LVM-activate vgname=nameVG vg_access_mode=system_id --group name-rg
where:
ocf - Open Cluster Framework;
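*To review all the parameters the LVM-activate agent accepts before creating the resource, you can run:
# sudo pcs resource describe ocf:heartbeat:LVM-activate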



## Create Filesystem resources (named after the LVs) for the shared logical volumes we created before ##
# sudo pcs resource create lv1_name Filesystem device="/dev/nameVG/lv1_name" directory="fs1name" fstype="xfs" --group name-rg
# sudo pcs resource create lv2_name Filesystem device="/dev/nameVG/lv2_name" directory="fs2name" fstype="xfs" --group name-rg
# sudo pcs resource create lv3_name Filesystem device="/dev/nameVG/lv3_name" directory="fs3name" fstype="xfs" --group name-rg
# sudo pcs resource create lv4_name Filesystem device="/dev/nameVG/lv4_name" directory="fs4name" fstype="xfs" --group name-rg



## Create a resource called rip for the virtual IP address V.V.V.V that we defined at the beginning of this procedure ##
# sudo pcs resource create rip IPaddr2 ip=V.V.V.V cidr_netmask=24 nic=enpXsY --group name-rg
NOTE: enpXsY is the network interface that we are using on both nodes, with the IP addresses X.X.X.X and Y.Y.Y.Y respectively.



## Create a resource called rweb for the Apache webserver ##
# sudo pcs resource create rweb apache configfile="/etc/httpd/conf/httpd.conf" statusurl="http://127.0.0.1/server-status" --group name-rg



## Check the status and configuration of the cluster ##

# sudo pcs cluster config show
Cluster Name: name-cls
Cluster UUID: 4418d49cb04e40c589de05c30ab2df44
...
hash: sha256

# sudo pcs resource
* Resource Group: name-rg:
* rlvm (ocf:heartbeat:LVM-activate): Started LNXSRV1
* lv1_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* lv2_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* lv3_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* lv4_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* rip (ocf:heartbeat:IPaddr2): Started LNXSRV1
* rweb (ocf:heartbeat:apache): Started LNXSRV1

# sudo pcs cluster status
Cluster Status:
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-11-03 19:08:36 -03:00)
...
LNXSRV1: Online
LNXSRV2: Online

# sudo pcs status
Cluster name: name-cls
...
pcsd: active/enabled


## Checking resource group and resources ##
# sudo pcs resource group list
name-rg: rlvm lv1_name lv2_name lv3_name lv4_name rip rweb





## Testing Cluster Failover ##



## Check the web service that is running on node1 LNXSRV1 ##
*Open your web browser and enter the URL below:
http://V.V.V.V/index.html



## Test cluster takeover by putting node1 into standby ##
# sudo pcs node standby LNXSRV1
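*You can watch the resources migrate in real time with crm_mon (press Ctrl+C to exit), or take a one-shot snapshot with -1:
# sudo crm_mon
# sudo crm_mon -1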



## Then check if all resources became online on the second node ##

# sudo pcs status

NOTE: run the commands below on the second node:

# df -h fs1name
# df -h fs2name
# df -h fs3name
# df -h fs4name

# ip -f inet addr show enpXsY
3: enpXsY: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
inet Y.Y.Y.Y brd 192.168.2.255 scope global noprefixroute enpXsY
valid_lft forever preferred_lft forever
inet V.V.V.V brd 192.168.2.255 scope global secondary enpXsY
valid_lft forever preferred_lft forever

# sudo vgs nameVG
VG #PV #LV #SN Attr VSize VFree
nameVG 1 1 0 wz--n- <4.00g 4.00m

## Check if the web service is running on node2 LNXSRV2 ##
*Open your web browser and enter the URL below:
http://V.V.V.V/index.html



## Roll back cluster resources to the original node ##
# sudo pcs node unstandby LNXSRV1
# sudo pcs resource move name-rg LNXSRV1
Location constraint to move resource 'name-rg' has been created
Waiting for the cluster to apply configuration changes...
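*NOTE: pcs resource move works by creating a location constraint. If a constraint lingers after the move, you can list the constraints and clear the ones created by move/ban:
# sudo pcs constraint
# sudo pcs resource clear name-rg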



## Then check if all resources became online on the primary node ##

# sudo pcs status

NOTE: run the commands below on the primary node:
# df -h fs1name
# df -h fs2name
# df -h fs3name
# df -h fs4name
# ip -f inet addr show enpXsY
# sudo vgs nameVG
*Open your web browser and enter the URL below:
http://V.V.V.V/index.html



## Run a shutdown on node LNXSRV1, then check if the resources activate automatically on node LNXSRV2 ##

*On node LNXSRV1
# sudo shutdown -h now

*On node LNXSRV2
# sudo pcs status
Full List of Resources:
* Resource Group: name-rg:
* rlvm (ocf:heartbeat:LVM-activate): Started LNXSRV2
* lv1_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv2_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv3_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv4_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* rip (ocf:heartbeat:IPaddr2): Started LNXSRV2
* rweb (ocf:heartbeat:apache): Started LNXSRV2
# df -h fs1name
# df -h fs2name
# df -h fs3name
# df -h fs4name
# ip -f inet addr show enpXsY
# sudo vgs nameVG
*Check if the web service is running on node2 LNXSRV2. Open your web browser and enter the URL below*
http://V.V.V.V/index.html



## Reactivate node LNXSRV1 from VMware ##



## After activating the node LNXSRV1, check if resources are still on the secondary node ##
# sudo pcs status
Cluster name: name-cls
Cluster Summary:
...
Node List:
* Online: [ LNXSRV1 LNXSRV2 ]
Full List of Resources:
* Resource Group: name-rg:
* rlvm (ocf:heartbeat:LVM-activate): Started LNXSRV2
* lv1_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv2_name (ocf:heartbeat:Filesystem): Started LNXSRV2
* lv3_name (ocf:heartbeat:Filesystem): Started LNXSRV2
....



## Roll back cluster resources to the original node ##
# sudo pcs resource move name-rg LNXSRV1
Location constraint to move resource 'name-rg' has been created
Waiting for the cluster to apply configuration changes...



## Now check if resources were moved to the main node ##

# sudo pcs status
Cluster name: name-cls
Cluster Summary:
....
Full List of Resources:
* Resource Group: name-rg:
* rlvm (ocf:heartbeat:LVM-activate): Started LNXSRV1
* lv1_name (ocf:heartbeat:Filesystem): Started LNXSRV1
* lv2_name (ocf:heartbeat:Filesystem): Started LNXSRV1
....

# df -h fs1name
# df -h fs2name
# df -h fs3name
# df -h fs4name
# ip -f inet addr show enpXsY
# sudo vgs nameVG
*Check if the web service is running on node1 LNXSRV1. Open your web browser and enter the URL below*
http://V.V.V.V/index.html
 
 

Other PCS Commands:

Command                                   Operation
pcs --version                             Show the Pacemaker/pcs version
pcs status                                Pacemaker overview
pcs resource                              Pacemaker resource overview
lcmap                                     Determine path-ownership in a cluster
pcs resource enable resource_name         Enable (start) a resource
pcs resource debug-start resource_name    Start a pcs resource with debugging
pcs resource config resource_name         Review a pcs resource's configuration settings
pcs resource disable resource_name        Disable (stop) a resource
pcs resource cleanup resource_name        Clean up a failed resource so it can restart
pcs cluster stop [--force]                Stop pacemaker on a node
pcs cluster start [--all]                 Start pacemaker
pcs node standby node_name                Put a node in standby
pcs node unstandby node_name              Bring a node out of standby


TROUBLESHOOTING PACEMAKER

1. Check the cluster status:
# pcs status
Look for Offline or UNCLEAN nodes and for Stopped or failed resources.

2. Check the logs. Pacemaker logs to:
/var/log/pacemaker/pacemaker.log
/var/log/messages
# journalctl -xe

3. Check resource errors:
# pcs resource failcount show
# pcs resource status <resource-name>
Clear the fail counts with:
# pcs resource cleanup <resource-name>

4. Check the node status:
# pcs status nodes
# crm_node -l

5. Check the CIB (Cluster Information Base); this can help identify misconfigurations:
# pcs cluster cib

6. Force start or reset. If the cluster is stuck or misbehaving:
# pcs resource cleanup
# pcs cluster stop --all
# pcs cluster start --all
Or reboot one node at a time if needed.
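*A hedged one-liner to pull recent warnings and errors from the pacemaker and corosync units via the journal:
# sudo journalctl -u pacemaker -u corosync --since "10 minutes ago" -p warning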

Useful log and configuration paths:

/var/log/messages - Global system messages regarding system resources and services.
  Supplemental: grep 'pacemaker.*\(error\|warning\)' /var/log/messages
/var/log/pacemaker/pacemaker.log - Default Pacemaker log for resources and functions.
/var/log/pcsd/pcsd.log - Default pcsd (pacemaker service/daemon) log.
/var/log/cluster/corosync.log - Default Pacemaker node-communication (corosync) log.
/usr/sbin/nw_hae.log - NetWorker (nws) resource start log, as defined in /usr/lib/ocf/resource.d/EMC_NetWorker/Server.
/usr/lib/ocf/resource.d/EMC_NetWorker/Server - NetWorker Pacemaker configuration file; this is what operations are performed/managed by pcs.
