Hello,
I’m having an issue with two of my three FSTs (nfs12, nfs13) while setting up a new EOS system.
Two FSTs show bootfailure in eos fs ls; the error is "filesystem has a different label ... than the configuration".
I tried to remove them so I could re-add them cleanly, but eos fs rm fails with "not empty" and eos node rm fails with "filesystems are not all in empty state".
What is the correct procedure to clean up these inconsistent filesystem entries so I can re-register the nodes?
Our setup:
MGM/QDB: 3 nodes (grid04, grid05, grid06)
FST: 3 nodes (nfs11, nfs12, nfs13)
[root@grid04 ~]# eos fs boot 2
success: boot message sent to nfs12.aligrid.hiroshima-u.ac.jp:/eos1
[root@grid04 ~]# eos fs boot 3
success: boot message sent to nfs13.aligrid.hiroshima-u.ac.jp:/eos1
[root@grid04 ~]# eos fs ls
┌───────────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬──────┬────────┬────────────────┐
│host │port│ id│ path│ schedgroup│ geotag│ boot│ configstatus│ drain│ usage│ active│ health│
└───────────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴──────┴────────┴────────────────┘
nfs11.aligrid.hiroshima-u.ac.jp 1095 1 /eos1 default.0 local::geo booted rw nodrain 0.70 no smartctl
nfs12.aligrid.hiroshima-u.ac.jp 1095 2 /eos1 default.1 local::geo bootfailure rw nodrain 0.00 no smartctl
nfs13.aligrid.hiroshima-u.ac.jp 1095 3 /eos1 default.2 local::geo bootfailure rw nodrain 0.00 no smartctl
[root@grid04 ~]# eos node ls
┌──────────┬────────────────────────────────────┬────────────────┬──────────┬────────────┬────────────────┬─────┐
│type │ hostport│ geotag│ status│ activated│ heartbeatdelta│ nofs│
└──────────┴────────────────────────────────────┴────────────────┴──────────┴────────────┴────────────────┴─────┘
nodesview nfs11.aligrid.hiroshima-u.ac.jp:1095 local::geo online on 1 1
nodesview nfs12.aligrid.hiroshima-u.ac.jp:1095 local::geo online on 0 1
nodesview nfs13.aligrid.hiroshima-u.ac.jp:1095 local::geo online on 1 1
[root@grid04 ~]# sudo eos fs status -l 2 | grep uuid
stat.errmsg := filesystem has a different label (fsid=2, uuid=f995a3de-c6e1-4806-9082-3c5e3c20691f) than the configuration
uuid := f995a3de-c6e1-4806-9082-3c5e3c20691f
[root@grid04 ~]# sudo eos fs status -l 3 | grep uuid
stat.errmsg := filesystem has a different label (fsid=3, uuid=7dee6080-89f2-4e7b-a0ac-7e9d58823e2e) than the configuration
uuid := 7dee6080-89f2-4e7b-a0ac-7e9d58823e2e
First of all, what version of the EOS software are you running? Are you still using the MQ service or not? Check the actual labels stored in the .eosfsid and .eosfsuuid files at the filesystem mountpoint, i.e. /eos1/.eosfsid etc. Depending on your answers there are different things to investigate.
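A quick way to gather that information, assuming the FST mountpoint is /eos1 (adjust paths and hosts to your setup):

```shell
# On the MGM: report the EOS server version
eos version

# On each failing FST (nfs12, nfs13): read the on-disk labels and
# compare them with the uuid the MGM reports in `eos fs status <fsid>`.
cat /eos1/.eosfsid    # numeric filesystem id written at registration time
cat /eos1/.eosfsuuid  # uuid label; must match the uuid configured for that fsid
```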
I think the problem comes from the fact that all three of your filesystems are stored in the same location, namely /eos1. This is why filesystem 1 manages to boot while the other two fail. Make sure that each filesystem has its own location where it writes its files. For example:
fsid 1 → /eos1/
fsid 2 → /eos2/
fsid 3 → /eos3/
etc
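A minimal sketch of preparing the new mountpoints on the two FST hosts (the `daemon:daemon` ownership is an assumption based on the usual FST setup; adapt to however your storage is mounted):

```shell
# On nfs12: give filesystem 2 its own directory
mkdir -p /eos2
chown daemon:daemon /eos2   # the FST daemon typically runs as the daemon user

# On nfs13: same for filesystem 3
mkdir -p /eos3
chown daemon:daemon /eos3
```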
Thank you for your advice.
Following your advice, I have reconfigured the FSTs nfs12 and nfs13 to use /eos2 and /eos3.
The remaining problem is that I can’t clean up the old, incorrect filesystem entries from the MGM.
I tried to remove the old /eos1 entries for nfs12 and nfs13, but the commands fail:
eos fs rm <fsid> fails with error: you can only remove file systems which are in 'empty' status.
eos node rm <hostname> subsequently fails with error: ...filesystems are not all in empty state.
[root@grid04 ~]# eos fs rm 2
error: you can only remove file systems which are in 'empty' status
[root@grid04 ~]# eos node rm nfs12.aligrid.hiroshima-u.ac.jp
error: unable to remove node '/eos/nfs12.aligrid.hiroshima-u.ac.jp:1095/fst' - filesystems are not all in empty state
How can I remove these stale metadata entries?
[root@grid04 ~]# eos space ls -l default
┌──────────┬────────────────┬────────────┬────────────┬──────┬─────────┬───────────────┬──────────────┬─────────────┬─────────────┬──────┐
│type │ name│ groupsize│ groupmod│ N(fs)│ N(fs-rw)│ sum(usedbytes)│ sum(capacity)│ capacity(rw)│ nom.capacity│ quota│
└──────────┴────────────────┴────────────┴────────────┴──────┴─────────┴───────────────┴──────────────┴─────────────┴─────────────┴──────┘
spaceview default 0 24 5 0 6.04 TB 1.62 PB 0 B 0 B off
┌────────────────────────┬────┬────────────────────────┬──────────┬──────┬────────────────────────────────────┬────────────────────────────────┬────────────────┬──────────┬────────────┬──────────────┬────────────┬──────┬────────┬──────────────┬────────────────────┬────────────────┬────────────────────────┐
│host │port│alias.host │alias.port│ id│ uuid│ path│ schedgroup│ headroom│ boot│ configstatus│ drain│ usage│ active│ scaninterval│ scan_rain_interval│ health│ statuscomment│
└────────────────────────┴────┴────────────────────────┴──────────┴──────┴────────────────────────────────────┴────────────────────────────────┴────────────────┴──────────┴────────────┴──────────────┴────────────┴──────┴────────┴──────────────┴────────────────────┴────────────────┴────────────────────────┘
nfs11 1095 1 9ac28717-c447-49dc-abed-a50701741da3 /eos1 default.0 5.10 G booted rw nodrain 0.70 1814400 2419200 no smartctl
nfs12 1095 2 f995a3de-c6e1-4806-9082-3c5e3c20691f /eos1 default.1 5.10 G opserror off nodrain 21.29 1814400 2419200 no smartctl
nfs13 1095 3 7dee6080-89f2-4e7b-a0ac-7e9d58823e2e /eos1 default.2 5.10 G opserror off nodrain 17.01 1814400 2419200 no smartctl
nfs12 1095 4 f38276e4-0e30-48b4-8fde-33eb6eec0b00 /eos2 default.3 5.10 G bootfailure rw nodrain 0.00 1814400 2419200 no smartctl
nfs13 1095 5 3aa4b474-5a09-4884-a66e-57d1ea547128 /eos3 default.4 5.10 G bootfailure rw nodrain 0.00 online 1814400 2419200 no smartctl
The cleanup is done. I will now register the new /eos2 and /eos3 filesystems correctly using the following sequence of commands: fs add, group set, fs config, and fs boot.
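For reference, a sketch of that sequence for the new /eos2 filesystem on nfs12. The uuid, scheduling group, and fsid are illustrative; double-check the exact argument order against `eos fs add --help` on your version before running anything:

```shell
# Generate a fresh uuid for the new filesystem
UUID=$(uuidgen)

# Register the mountpoint with the MGM (run on the MGM host)
eos fs add $UUID nfs12.aligrid.hiroshima-u.ac.jp:1095 /eos2 default.3

# Enable the scheduling group, mark the filesystem writable, and boot it
# (<new-fsid> is whatever id the add step assigned)
eos group set default.3 on
eos fs config <new-fsid> configstatus=rw
eos fs boot <new-fsid>
```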
You can first try to set the filesystems you want to remove to empty status. This will not work if you have managed to write any files to them, but I guess that should not be the case: eos fs config 2 configstatus=empty, and then you can try removing them with eos fs rm 2.
If this does not work, then you need to delete them from the configuration in QDB. Stop the MGM and FST daemons and search for the corresponding entry in QDB: redis-cli -p 7777 hgetall eos-config:default | grep -B 1 "id=2"
Then you take the value on the first line of the output (the hash key) and use it to perform the deletion: redis-cli -p 7777 hdel eos-config:default <the_key_from_the_step_before>
You can now restart the services and the offending file systems should go away.
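Putting the QDB steps together as one sequence (the port 7777 and the config name `default` are taken from the commands above; run this only with the MGM and FST daemons stopped):

```shell
# On the QDB node: locate the hash field holding the stale filesystem entry.
# With `grep -B 1`, the line printed above the match is the field name (key).
redis-cli -p 7777 hgetall eos-config:default | grep -B 1 "id=2"

# Delete that field, substituting the key found in the previous step
redis-cli -p 7777 hdel eos-config:default <the_key_from_the_step_before>

# Restart the MGM and FST daemons; the stale filesystem should no longer
# appear in `eos fs ls`.
```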