Debugging NVMe Namespaces
Total Space Available or Used
Find the total space available, and the total space used, on a Rabbit node using the Redfish API. One way to access the API is to use the nnf-node-manager
pod on that node.
To view the space on node ee50, find its nnf-node-manager
pod and then exec into it to query the Redfish API:
[richerso@ee1:~]$ kubectl get pods -A -o wide | grep ee50 | grep node-manager
nnf-system nnf-node-manager-jhglm 1/1 Running 0 61m 10.85.71.11 ee50 <none> <none>
Then query the Redfish API to view the AllocatedBytes
and GuaranteedBytes
:
[richerso@ee1:~]$ kubectl exec --stdin --tty -n nnf-system nnf-node-manager-jhglm -- curl -S localhost:50057/redfish/v1/StorageServices/NNF/CapacitySource | jq
{
"@odata.id": "/redfish/v1/StorageServices/NNF/CapacitySource",
"@odata.type": "#CapacitySource.v1_0_0.CapacitySource",
"Id": "0",
"Name": "Capacity Source",
"ProvidedCapacity": {
"Data": {
"AllocatedBytes": 128849888,
"ConsumedBytes": 128849888,
"GuaranteedBytes": 307132496928,
"ProvisionedBytes": 307261342816
},
"Metadata": {},
"Snapshot": {}
},
"ProvidedClassOfService": {},
"ProvidingDrives": {},
"ProvidingPools": {},
"ProvidingVolumes": {},
"Actions": {},
"ProvidingMemory": {},
"ProvidingMemoryChunks": {}
}
Total Orphaned or Leaked Space
To determine the amount of orphaned space, look at the Rabbit node when there are no allocations on it. If there are no allocations then there should be no NnfNodeBlockStorages
in the k8s namespace with the Rabbit's name:
To check that there are no orphaned namespaces, you can use the nvme command while logged into that Rabbit node:
[root@ee50:~]# nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S666NN0TB11877 SAMSUNG MZ1L21T9HCLS-00A07 1 8.57 GB / 1.92 TB 512 B + 0 B GDC7302Q
There should be no namespaces on the kioxia drives:
If there are namespaces listed, and there weren't any NnfNodeBlockStorages
on the node, then they need to be deleted through the Rabbit software. The NnfNodeECData
resource is a persistent data store for the allocations that should exist on the Rabbit. By deleting it, and then deleting the nnf-node-manager pod, it causes nnf-node-manager to delete the orphaned namespaces. This can take a few minutes after you actually delete the pod: