# High Availability Cluster
NNF software supports provisioning of Red Hat GFS2 (Global File System 2) storage. Per Red Hat:

> GFS2 allows multiple nodes to share storage at a block level as if the storage were connected locally to each cluster node. The GFS2 cluster file system requires a cluster infrastructure.

Therefore, in order to use GFS2, the NNF node and its associated compute nodes must form a high availability cluster.
## Cluster Setup
Red Hat provides instructions for creating a high availability cluster with Pacemaker, including instructions for installing cluster software and creating a high availability cluster. When following these instructions, each high availability cluster that is created should be named after the hostname of the NNF node. In the Red Hat examples the cluster name is `my_cluster`.
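
As a rough sketch only, creating such a cluster on an NNF node named `rabbit-node-1` with two of its compute nodes might look like the following. The hostnames are placeholders, and the exact packages and pcs syntax vary by release, so take the authoritative commands from the Red Hat instructions above.

```bash
# Install the cluster software and enable the pcs daemon (run on every node).
dnf install -y pcs pacemaker fence-agents-all
systemctl enable --now pcsd

# Authenticate the nodes and create the cluster, named after the NNF node
# (run on one node only). Hostnames here are illustrative.
pcs host auth rabbit-node-1 rabbit-compute-1 rabbit-compute-2 -u hacluster
pcs cluster setup rabbit-node-1 rabbit-node-1 rabbit-compute-1 rabbit-compute-2
pcs cluster start --all
```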
## Fencing Agents
Fencing is the process of restricting and releasing access to resources that a failed cluster node may have access to. Since a failed node may be unresponsive, an external device must exist that can restrict access to that node's shared resources, or issue a hard reboot of the node. More information can be found from Red Hat: 1.2.1 Fencing.
HPE hardware implements software known as the Hardware System Supervisor (HSS), which conforms to the SNIA Redfish/Swordfish standard. This provides the means to manage hardware from outside the host OS.
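
For illustration only, a Redfish Systems resource can be queried over HTTP with curl; the controller address and credentials below are placeholders. The fencing agents described next use this same out-of-band interface.

```bash
# Query the Redfish Systems resource on an HSS controller (address and
# credentials are illustrative).
curl -su USER:PASSWORD http://192.168.0.1/redfish/v1/Systems/1
```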
### NNF Fencing
#### Source
The NNF fencing agent is available at https://github.com/NearNodeFlash/fence-agents under the `nnf` branch.
#### Build
Refer to the NNF.md file in the root directory of the fence-agents repository. The fencing agents must be installed on every node in the cluster.
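
NNF.md is the authoritative reference; as a rough sketch, the ClusterLabs fence-agents tree normally builds with autotools, so an illustrative build of the `nnf` branch might look like the following. The exact dependencies and configure options are assumptions; check NNF.md before building.

```bash
# Illustrative autotools build of the fence-agents tree (see NNF.md for the
# authoritative, NNF-specific instructions). The resulting agents must be
# installed on every node in the cluster.
git clone --branch nnf https://github.com/NearNodeFlash/fence-agents.git
cd fence-agents
./autogen.sh
./configure
make
make install
```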
#### Setup
Configure the NNF agent with the following parameters:

| Argument | Definition |
| --- | --- |
| `kubernetes-service-host=[ADDRESS]` | The IP address of the Kubernetes API server |
| `kubernetes-service-port=[PORT]` | The listening port of the Kubernetes API server |
| `service-token-file=[PATH]` | The location of the service token file. The file must be present on all nodes within the cluster |
| `service-cert-file=[PATH]` | The location of the service certificate file. The file must be present on all nodes within the cluster |
| `nnf-node-name=[NNF-NODE-NAME]` | The name of the NNF node as it appears in the System Configuration |
| `api-version=[VERSION]` | The API version of the NNF Node resource. Defaults to `v1alpha1` |
The token and certificate can be found in the Kubernetes Secrets resource for the `nnf-system/nnf-fence-agent` ServiceAccount. This ServiceAccount carries RBAC rules that limit the fencing agent to only the Kubernetes resources it needs access to.
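
As a hedged sketch, the token and certificate could be extracted with kubectl along the lines shown below. The Secret name `nnf-fence-agent` is an assumption, so confirm the Secret that is actually associated with the ServiceAccount on your system.

```bash
# Copy the ServiceAccount's token and CA certificate out of its Secret
# (Secret name is illustrative; verify it on your system).
kubectl -n nnf-system get secret nnf-fence-agent -o jsonpath='{.data.token}' \
    | base64 -d > /etc/nnf/fence/service.token
kubectl -n nnf-system get secret nnf-fence-agent -o jsonpath='{.data.ca\.crt}' \
    | base64 -d > /etc/nnf/fence/service.cert

# Distribute the two files to every node in the cluster, e.g. with scp.
```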
For example, the following sets up the NNF fencing agent on `rabbit-node-1`, with the Kubernetes API server running at `192.168.0.1:6443` and the service token and certificate copied to `/etc/nnf/fence/`. This needs to be run on one node in the cluster.

```bash
pcs stonith create rabbit-node-1 fence_nnf pcmk_host_list=rabbit-node-1 kubernetes-service-host=192.168.0.1 kubernetes-service-port=6443 service-token-file=/etc/nnf/fence/service.token service-cert-file=/etc/nnf/fence/service.cert nnf-node-name=rabbit-node-1
```
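
After creating the resource, it may be worth confirming it with standard pcs queries (these are generic Pacemaker commands, not NNF-specific):

```bash
# Confirm the fencing resource was created and has started.
pcs stonith status
pcs stonith config rabbit-node-1
```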
#### Recovery
Since the NNF node is connected to 16 compute blades, careful coordination around fencing of an NNF node is required to minimize the impact of the outage. When a Rabbit node is fenced, the status of the corresponding DWS Storage resource (`storages.dws.cray.hpe.com`) changes. The workload manager must observe this change and follow the procedure below to recover from the fencing status; a hedged kubectl sketch of the same sequence follows the list.
- Observe that `storage.Status` has changed and that `storage.Status.RequiresReboot == True`
- Set `storage.Spec.State := Disabled`
- Wait for the Storage status to show `storage.Status.State == Disabled`
- Reboot the NNF node
- Set `storage.Spec.State := Enabled`
- Wait for `storage.Status.State == Enabled`
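
As a rough illustration only, the sequence above might be driven with kubectl as shown below. The YAML field names (`spec.state`, `status.state`, `status.requiresReboot`) are assumptions about how the Storage resource is serialized, and a workload manager would normally use the DWS API directly rather than kubectl.

```bash
# Illustrative only: field names are assumptions.
STORAGE=rabbit-node-1

# 1. Check whether the node requires a reboot after being fenced.
kubectl get storages.dws.cray.hpe.com $STORAGE -o jsonpath='{.status.requiresReboot}'

# 2. Disable the storage and wait for the status to report Disabled.
kubectl patch storages.dws.cray.hpe.com $STORAGE --type merge -p '{"spec":{"state":"Disabled"}}'
kubectl get storages.dws.cray.hpe.com $STORAGE -o jsonpath='{.status.state}'

# 3. Reboot the NNF node, then re-enable the storage and wait for Enabled.
kubectl patch storages.dws.cray.hpe.com $STORAGE --type merge -p '{"spec":{"state":"Enabled"}}'
kubectl get storages.dws.cray.hpe.com $STORAGE -o jsonpath='{.status.state}'
```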
### Compute Fencing
The Redfish fencing agent from ClusterLabs should be used for compute nodes in the cluster. It is also included in the https://github.com/NearNodeFlash/fence-agents repository and can be built at the same time as the NNF fencing agent. Configure the agent with the following parameters:

| Argument | Definition |
| --- | --- |
| `ip=[ADDRESS]` | The IP address or hostname of the HSS controller |
| `port=80` | The port of the HSS controller. Must be 80 |
| `systems-uri=/redfish/v1/Systems/1` | The URI of the Systems object. Must be `/redfish/v1/Systems/1` |
| `ssl-insecure=true` | Instructs the use of an insecure SSL exchange. Must be true |
| `username=[USER]` | The user name for connecting to the HSS controller |
| `password=[PASSWORD]` | The password for connecting to the HSS controller |
For example, the following sets up the Redfish fencing agent on `rabbit-compute-2`, with the Redfish service at `192.168.0.1`. This needs to be run on one node in the cluster.

```bash
pcs stonith create rabbit-compute-2 fence_redfish pcmk_host_list=rabbit-compute-2 ip=192.168.0.1 systems-uri=/redfish/v1/Systems/1 username=root password=password ssl_insecure=true
```
### Dummy Fencing
On an early access development system, the dummy fencing agent from ClusterLabs can be used for nodes in the cluster.
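
As a minimal sketch, assuming the dummy agent is installed as `fence_dummy` and using a placeholder node name:

```bash
# Dummy fencing for a development system; provides no real isolation.
pcs stonith create rabbit-compute-3 fence_dummy pcmk_host_list=rabbit-compute-3
```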
## Configuring a GFS2 file system in a cluster
Follow steps 1-8 of the procedure from Red Hat: Configuring a GFS2 file system in a cluster.
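
For orientation only, the first steps of that Red Hat procedure establish the cluster-wide locking infrastructure that GFS2 requires; the resource and group names below follow the Red Hat examples and are not NNF-specific.

```bash
# Freeze (rather than stop) resources if quorum is lost, as GFS2 requires.
pcs property set no-quorum-policy=freeze

# Set up dlm and lvmlockd as a cloned resource group that runs on every node.
pcs resource create dlm --group locking ocf:pacemaker:controld op monitor interval=30s on-fail=fence
pcs resource create lvmlockd --group locking ocf:heartbeat:lvmlockd op monitor interval=30s on-fail=fence
pcs resource clone locking interleave=true
```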