Storage Profile Overview
Storage Profiles allow for customization of the Rabbit storage provisioning process. Examples of content that can be customized via storage profiles is
- The RAID type used for storage
- Any mkfs or LVM args used
- An external MGS NID for Lustre
- A boolean value indicating the Lustre MGT and MDT should be combined on the same target device
DW directives that allocate storage on Rabbit nodes allow a profile
parameter to be specified to control how the storage is configured. NNF software provides a set of canned profiles to choose from, and the administrator may create more profiles.
The administrator shall choose one profile to be the default profile that is used when a profile parameter is not specified.
Specifying a Profile
To specify a profile name on a #DW directive, use the profile
option
Setting A Default Profile
A default profile must be defined at all times. Any #DW line that does not specify a profile will use the default profile. If a default profile is not defined, then any new workflows will be rejected. If more than one profile is marked as default then any new workflows will be rejected.
To query existing profiles
$ kubectl get nnfstorageprofiles -A
NAMESPACE NAME DEFAULT AGE
nnf-system durable true 14s
nnf-system performance false 6s
To set the default flag on a profile
kubectl patch nnfstorageprofile performance -n nnf-system --type merge -p '{"data":{"default":true}}'
To clear the default flag on a profile
Creating The Initial Default Profile
Create the initial default profile from scratch or by using the NnfStorageProfile/template resource as a template. If nnf-deploy
was used to install nnf-sos then the default profile described below will have been created automatically.
To use the template
resource begin by obtaining a copy of it either from the nnf-sos repo or from a live system. To get it from a live system use the following command:
Edit the profile.yaml
file to trim the metadata section to contain only a name and namespace. The namespace must be left as nnf-system, but the name should be set to signify that this is the new default profile. In this example we will name it default
. The metadata section will look like the following, and will contain no other fields:
Mark this new profile as the default profile by setting default: true
in the data section of the resource:
Apply this resource to the system and verify that it is the only one marked as the default resource:
The output will appear similar to the following:
The administrator should edit the default
profile to record any cluster-specific settings.
Maintain a copy of this resource YAML in a safe place so it isn't lost across upgrades.
Keeping The Default Profile Updated
An upgrade of nnf-sos may include updates to the template
profile. It may be necessary to manually copy these updates into the default
profile.
Profile Parameters
XFS
The following shows how to specify command line options for pvcreate, vgcreate, lvcreate, and mkfs for XFS storage. Optional mount options are specified one per line
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
name: xfs-stripe-example
namespace: nnf-system
data:
[...]
xfsStorage:
commandlines:
pvCreate: $DEVICE
vgCreate: $VG_NAME $DEVICE_LIST
lvCreate: -l 100%VG --stripes $DEVICE_NUM --stripesize=32KiB --name $LV_NAME $VG_NAME
mkfs: $DEVICE
options:
mountRabbit:
- noatime
- nodiratime
[...]
GFS2
The following shows how to specify command line options for pvcreate, lvcreate, and mkfs for GFS2.
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
name: gfs2-stripe-example
namespace: nnf-system
data:
[...]
gfs2Storage:
commandlines:
pvCreate: $DEVICE
vgCreate: $VG_NAME $DEVICE_LIST
lvCreate: -l 100%VG --stripes $DEVICE_NUM --stripesize=32KiB --name $LV_NAME $VG_NAME
mkfs: -j2 -p $PROTOCOL -t $CLUSTER_NAME:$LOCK_SPACE $DEVICE
[...]
Lustre / ZFS
The following shows how to specify a zpool virtual device (vdev). In this case the default vdev is a stripe. See zpoolconcepts(7) for virtual device descriptions.
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
name: zpool-stripe-example
namespace: nnf-system
data:
[...]
lustreStorage:
mgtCommandlines:
zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST
mkfs: --mgs $VOL_NAME
mdtCommandlines:
zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST
mkfs: --mdt --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME
mgtMdtCommandlines:
zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST
mkfs: --mgs --mdt --fsname=$FS_NAME --index=$INDEX $VOL_NAME
ostCommandlines:
zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST
mkfs: --ost --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME
[...]
ZFS dataset properties
The following shows how to specify ZFS dataset properties in the --mkfsoptions
arg for mkfs.lustre. See zfsprops(7).
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
name: zpool-stripe-example
namespace: nnf-system
data:
[...]
lustreStorage:
[...]
ostCommandlines:
zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST
mkfs: --ost --mkfsoptions="recordsize=1024K -o compression=lz4" --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME
[...]
Mount Options for Targets
Persistent Mount Options
Use the mkfs.lustre --mountfsoptions
parameter to set persistent mount options for Lustre targets.
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
name: target-mount-option-example
namespace: nnf-system
data:
[...]
lustreStorage:
[...]
ostCommandlines:
zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST
mkfs: --ost --mountfsoptions="errors=remount-ro,mballoc" --mkfsoptions="recordsize=1024K -o compression=lz4" --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME
[...]
Non-Persistent Mount Options
Non-persistent mount options can be specified with the ostOptions.mountTarget parameter to the NnfStorageProfile:
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
name: target-mount-option-example
namespace: nnf-system
data:
[...]
lustreStorage:
[...]
ostCommandlines:
zpoolCreate: -O canmount=off -o cachefile=none $POOL_NAME $DEVICE_LIST
mkfs: --ost --mountfsoptions="errors=remount-ro" --mkfsoptions="recordsize=1024K -o compression=lz4" --fsname=$FS_NAME --mgsnode=$MGS_NID --index=$INDEX $VOL_NAME
ostOptions:
mountTarget:
- mballoc
[...]
Target Layout
Users may want Lustre file systems with different performance characteristics. For example, a user job with a single compute node accessing the Lustre file system would see acceptable performance from a single OSS. An FPP workload might want as many OSSs as posible to avoid contention.
The NnfStorageProfile
allows admins to specify where and how many Lustre targets are allocated by the WLM. During the proposal phase of the workflow, the NNF software uses the information in the NnfStorageProfile
to add extra constraints in the DirectiveBreakdown
. The WLM uses these constraints when picking storage.
The NnfStorageProfile
has three fields in the mgtOptions
, mdtOptions
, and ostOptions
to specify target layout. The fields are:
count
- A static value for how many Lustre targets to create.scale
- A value from 1-10 that the WLM can use to determine how many Lustre targets to allocate. This is up to the WLM and the admins to agree on how to interpret this field. A value of 1 might indicate the minimum number of NNF nodes needed to reach the minimum capacity, while 10 might result in a Lustre target on every Rabbit attached to the computes in the job. Scale takes into account allocation size, compute node count, and Rabbit count.colocateComputes
- true/false value. When "true", this adds a location constraint in theDirectiveBreakdown
that limits the WLM to picking storage with a physical connection to the compute resources. In practice this means that Rabbit storage is restricted to the chassis used by the job. This can be set individually for each of the Lustre target types. When this is "false", any Rabbit storage can be picked, even if the Rabbit doesn't share a chassis with any of the compute nodes in the job.
Only one of scale
and count
can be set for a particular target type.
The DirectiveBreakdown
for create_persistent
#DWs won't include the constraint from colocateCompute=true
since there may not be any compute nodes associated with the job.
apiVersion: nnf.cray.hpe.com/v1alpha1
kind: NnfStorageProfile
metadata:
name: high-metadata
namespace: default
data:
default: false
...
lustreStorage:
combinedMgtMdt: false
capacityMdt: 500GiB
capacityMgt: 1GiB
[...]
ostOptions:
scale: 5
colocateComputes: true
mdtOptions:
count: 10
Example Layouts
scale
with colocateComputes=true
will likely be the most common layout type to use for jobdw
directives. This will result in a Lustre file system whose performance scales with the number of compute nodes in the job.
count
may be used when a specific performance characteristic is desired such as a single shared file workload that has low metadata requirements and only needs a single MDT. It may also be useful when a consistently performing file system is required across different jobs.
colocatedComputes=false
may be useful for placing MDTs on NNF nodes without an OST (within the same file system).
The count
field may be useful when creating a persistent file system since the job with the create_persistent
directive may only have a single compute node.
In general, scale
gives a simple way for users to get a filesystem that has performance consistent with their job size. count
is useful for times when a user wants full control of the file system layout.
Command Line Variables
pvcreate
$DEVICE
- expands to the/dev/<path>
value for one device that has been allocated
vgcreate
$VG_NAME
- expands to a volume group name that is controlled by Rabbit software.$DEVICE_LIST
- expands to a list of space-separated/dev/<path>
devices. This list will contain the devices that were iterated over for the pvcreate step.
lvcreate
$VG_NAME
- see vgcreate above.$LV_NAME
- expands to a logical volume name that is controlled by Rabbit software.$DEVICE_NUM
- expands to a number indicating the number of devices allocated for the volume group.$DEVICE1, $DEVICE2, ..., $DEVICEn
- each expands to one of the devices from the$DEVICE_LIST
above.
XFS mkfs
$DEVICE
- expands to the/dev/<path>
value for the logical volume that was created by the lvcreate step above.
GFS2 mkfs
$DEVICE
- expands to the/dev/<path>
value for the logical volume that was created by the lvcreate step above.$CLUSTER_NAME
- expands to a cluster name that is controlled by Rabbit Software$LOCK_SPACE
- expands to a lock space key that is controlled by Rabbit Software.$PROTOCOL
- expands to a locking protocol that is controlled by Rabbit Software.
zpool create
$DEVICE_LIST
- expands to a list of space-separated/dev/<path>
devices. This list will contain the devices that were allocated for this storage request.$POOL_NAME
- expands to a pool name that is controlled by Rabbit software.$DEVICE_NUM
- expands to a number indicating the number of devices allocated for this storage request.$DEVICE1, $DEVICE2, ..., $DEVICEn
- each expands to one of the devices from the$DEVICE_LIST
above.
lustre mkfs
$FS_NAME
- expands to the filesystem name that was passed to Rabbit software from the workflow's #DW line.$MGS_NID
- expands to the NID of the MGS. If the MGS was orchestrated by nnf-sos then an appropriate internal value will be used.$POOL_NAME
- see zpool create above.$VOL_NAME
- expands to the volume name that will be created. This value will be<pool_name>/<dataset>
, and is controlled by Rabbit software.$INDEX
- expands to the index value of the target and is controlled by Rabbit software.
PostMount/PreUnmount and PostActivate/PreDeactivate
$MOUNT_PATH
- expands to the mount path of the fileystem to perform certain actions on the mounted filesystem
Lustre Specific
These variables are for lustre only and can be used to perform PostMount activities such are setting lustre striping.
$NUM_MDTS
- expands to the number of MDTs for the lustre filesystem$NUM_MGTS
- expands to the number of MGTs for the lustre filesystem$NUM_MGTMDTS
- expands to the number of combined MGTMDTs for the lustre filesystem$NUM_OSTS
- expands to the number of OSTs for the lustre filesystem$NUM_NNFNODES
- expands to the number of NNF Nodes for the lustre filesystem