# Lustre External MGT
## Background
Lustre has a limitation where only a single MGT can be mounted on a node at a time. In some situations it may be desirable to share an MGT between multiple Lustre file systems to increase the number of Lustre file systems that can be created and to decrease scheduling complexity. This guide provides instructions on how to configure NNF to share MGTs. There are three methods that can be used:
- Use a Lustre MGT from outside the NNF cluster
- Create a persistent Lustre file system through DWS and use the MGT it provides
- Create a pool of standalone persistent Lustre MGTs, and have the NNF software select one of them
These three methods are not mutually exclusive on the system as a whole. Individual file systems can use any of options 1-3 or create their own MGT.
## Configuration with an External MGT
### Storage Profile
An existing MGT external to the NNF cluster can be used to manage the Lustre file systems on the NNF nodes. An advantage to this configuration is that the MGT can be highly available through multiple MGSs. A disadvantage is that there is only a single MGT. An MGT shared between more than a handful of Lustre file systems is not a common use case, so the Lustre code may prove less stable.
The following yaml provides an example of what the `NnfStorageProfile` should contain to use an MGT on an external server.
```yaml
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
  name: external-mgt
  namespace: nnf-system
data:
  [...]
  lustreStorage:
    externalMgs: 1.2.3.4@eth0:1.2.3.5@eth0
    combinedMgtMdt: false
    standaloneMgtPoolName: ""
  [...]
```
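A file system can then be created against this profile by referencing it in a DW directive. A hypothetical example, where the name and capacity are placeholders:

```
#DW jobdw name=my-lustre capacity=100GiB type=lustre profile=external-mgt
```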
### NnfLustreMGT
A `NnfLustreMGT` resource tracks which fsnames have been used on the MGT to prevent fsname re-use. Any Lustre file systems that are created through the NNF software will request an fsname to use from a `NnfLustreMGT` resource. Every MGT must have a corresponding `NnfLustreMGT` resource. For MGTs that are hosted on NNF hardware, the `NnfLustreMGT` resources are created automatically. The NNF software also erases any unused fsnames from the MGT disk for any internally hosted MGTs.

For an MGT hosted on an external node, an admin must create an `NnfLustreMGT` resource. This resource ensures that fsnames will be created in a sequential order without any fsname re-use. However, after an fsname is no longer in use by a file system, it will not be erased from the MGT disk. An admin may decide to periodically run the `lctl erase_lcfg [fsname]` command to remove fsnames that are no longer in use.
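For example, to retire a hypothetical fsname `aaaaaaab` that is no longer in use, an admin could run the following on the node hosting the external MGS:

```
lctl erase_lcfg aaaaaaab
```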
Below is an example `NnfLustreMGT` resource. The `NnfLustreMGT` resource for external MGSs must be created in the `nnf-system` namespace.
```yaml
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfLustreMGT
metadata:
  name: external-mgt
  namespace: nnf-system
spec:
  addresses:
  - "1.2.3.4@eth0:1.2.3.5@eth0"
  fsNameStart: "aaaaaaaa"
  fsNameBlackList:
  - "mylustre"
  fsNameStartReference:
    name: external-mgt
    namespace: default
    kind: ConfigMap
```
- `addresses` - This is a list of LNet addresses that could be used for this MGT. This should match any values that are used in the `externalMgs` field in the `NnfStorageProfiles`.
- `fsNameStart` - The first fsname to use. Subsequent fsnames will be incremented based on this starting fsname (e.g., `aaaaaaaa`, `aaaaaaab`, `aaaaaaac`). fsnames use lowercase letters `'a'`-`'z'`. `fsNameStart` should be exactly 8 characters long.
- `fsNameBlackList` - This is a list of fsnames that should not be given to any NNF Lustre file systems. If the MGT is hosting any non-NNF Lustre file systems, their fsnames should be included in this blacklist.
- `fsNameStartReference` - This is an optional `ObjectReference` to a `ConfigMap` that holds a starting fsname. If this field is specified, it takes precedence over the `fsNameStart` field in the spec. The `ConfigMap` will be updated to the next available fsname every time an fsname is assigned to a new Lustre file system.
### ConfigMap
For external MGTs, the `fsNameStartReference` should be used to point to a `ConfigMap` in the `default` namespace. The `ConfigMap` should be left empty initially. The `ConfigMap` is used to hold the value of the next available fsname, and it should not be deleted or modified while a `NnfLustreMGT` resource is referencing it. Removing the `ConfigMap` will cause the Rabbit software to lose track of which fsnames have already been used on the MGT. This is undesirable unless the external MGT is no longer being used by Rabbit software, or an admin has erased all previously used fsnames with the `lctl erase_lcfg [fsname]` command.
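A minimal sketch of that initial, empty ConfigMap, matching the name and namespace given in the `fsNameStartReference` example above:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: external-mgt
  namespace: default
```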
When using the `ConfigMap`, the nnf-sos software may be undeployed and redeployed without losing track of the next fsname value. During an undeploy, the `NnfLustreMGT` resource will be removed. During a deploy, the `NnfLustreMGT` resource will read the fsname value from the `ConfigMap` if it is present. The value in the `ConfigMap` will override the fsname in the `fsNameStart` field.
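The tracked value can be inspected at any time; the data layout inside the ConfigMap is internal to nnf-sos, so it should be treated as read-only:

```
kubectl get configmap external-mgt -n default -o yaml
```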
## Configuration with Persistent Lustre
The MGT from a persistent Lustre file system hosted on the NNF nodes can also be used as the MGT for other NNF Lustre file systems. This configuration has the advantage of not relying on any hardware outside of the cluster. However, there is no high availability, and a single MGT is still shared between all Lustre file systems created on the cluster.
To configure a persistent Lustre file system that can share its MGT, a `NnfStorageProfile` should be used that does not specify `externalMgs`. The MGT can either share a volume with the MDT or not (`combinedMgtMdt`).
```yaml
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
  name: persistent-lustre-shared-mgt
  namespace: nnf-system
data:
  [...]
  lustreStorage:
    externalMgs: ""
    combinedMgtMdt: false
    standaloneMgtPoolName: ""
  [...]
```
The persistent storage is created with the following DW directive:
```
#DW create_persistent name=shared-lustre capacity=100GiB type=lustre profile=persistent-lustre-shared-mgt
```
After the persistent Lustre file system is created, an admin can discover the MGS address by looking at the `NnfStorage` resource with the same name as the persistent storage that was created (`shared-lustre` in the above example).
```yaml
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorage
metadata:
  name: shared-lustre
  namespace: default
[...]
status:
  mgsNode: 5.6.7.8@eth1
[...]
```
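The address can also be pulled directly from the resource status, assuming the standard lowercase resource name for the `NnfStorage` CRD:

```
kubectl get nnfstorage shared-lustre -n default -o jsonpath='{.status.mgsNode}'
```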
A separate `NnfStorageProfile` can be created that specifies the MGS address.
```yaml
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
  name: internal-mgt
  namespace: nnf-system
data:
  [...]
  lustreStorage:
    externalMgs: 5.6.7.8@eth1
    combinedMgtMdt: false
    standaloneMgtPoolName: ""
  [...]
```
With this configuration, an admin must determine that no file systems are using the shared MGT before destroying the persistent Lustre instance.
## Configuration with an Internal MGT Pool
Another method NNF supports is to create a number of persistent Lustre MGTs on NNF nodes. These MGTs are not part of a full file system, but are instead added to a pool of MGTs available for other Lustre file systems to use. Lustre file systems that are created will choose one of the MGTs at random to use and add a reference to make sure it isn't destroyed. This configuration has the advantage of spreading the Lustre management load across multiple servers. The disadvantage of this configuration is that it does not provide high availability.
To configure the system this way, the first step is to make a pool of Lustre MGTs. This is done by creating a persistent instance from a storage profile that specifies the `standaloneMgtPoolName` option. This option tells NNF software to only create an MGT, and to add it to a named pool. The following `NnfStorageProfile` provides an example where the MGT is added to the `example-pool` pool:
```yaml
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
  name: mgt-pool-member
  namespace: nnf-system
data:
  [...]
  lustreStorage:
    externalMgs: ""
    combinedMgtMdt: false
    standaloneMgtPoolName: "example-pool"
  [...]
```
A persistent storage MGT can be created with a DW directive like the following, where the instance name and capacity are placeholders:
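```
#DW create_persistent name=mgt-pool-1 capacity=1GiB type=lustre profile=mgt-pool-member
```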
Multiple persistent instances with different names can be created using the `mgt-pool-member` profile to add more than one MGT to the pool.
To create a Lustre file system that uses one of the MGTs from the pool, an `NnfStorageProfile` should be created that uses the special notation `pool:[pool-name]` in the `externalMgs` field.
```yaml
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
  name: mgt-pool-consumer
  namespace: nnf-system
data:
  [...]
  lustreStorage:
    externalMgs: "pool:example-pool"
    combinedMgtMdt: false
    standaloneMgtPoolName: ""
  [...]
```
The following is an example DW directive that uses an MGT from the MGT pool, shown here as a job-local file system with a placeholder name and capacity:
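```
#DW jobdw name=example-lustre capacity=100GiB type=lustre profile=mgt-pool-consumer
```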
MGT pools are named, so separate pools can hold different collections of MGTs. Each pool needs its own storage profile that targets it.