FST Boot Failure and Removal Issue

Hello,
I’m setting up an EOS system and I’m having an issue with two of my three FSTs (nfs12, nfs13).
Both show bootfailure in eos fs ls; the error is "filesystem has a different label ... than the configuration".

I tried to remove them so I could re-add them cleanly, but eos fs rm fails because the filesystems are not in 'empty' status, and eos node rm fails with "filesystems are not all in empty state".

What is the correct procedure to clean up these inconsistent filesystem entries so I can re-register the nodes?

Our setup:

  • MGM/QDB
    3 nodes (grid04, grid05, grid06)
  • FST
    3 nodes (nfs11, nfs12, nfs13)
[root@grid04 ~]# eos fs boot 2
success: boot message sent to nfs12.aligrid.hiroshima-u.ac.jp:/eos1
[root@grid04 ~]# eos fs boot 3
success: boot message sent to nfs13.aligrid.hiroshima-u.ac.jp:/eos1
[root@grid04 ~]# eos fs ls
┌───────────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬──────┬────────┬────────────────┐
│host                           │port│    id│                            path│      schedgroup│          geotag│        boot│  configstatus│       drain│ usage│  active│          health│
└───────────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴──────┴────────┴────────────────┘
 nfs11.aligrid.hiroshima-u.ac.jp 1095      1                            /eos1        default.0       local::geo       booted             rw      nodrain   0.70               no smartctl
 nfs12.aligrid.hiroshima-u.ac.jp 1095      2                            /eos1        default.1       local::geo  bootfailure             rw      nodrain   0.00               no smartctl
 nfs13.aligrid.hiroshima-u.ac.jp 1095      3                            /eos1        default.2       local::geo  bootfailure             rw      nodrain   0.00               no smartctl


[root@grid04 ~]# eos node ls
┌──────────┬────────────────────────────────────┬────────────────┬──────────┬────────────┬────────────────┬─────┐
│type      │                            hostport│          geotag│    status│   activated│  heartbeatdelta│ nofs│
└──────────┴────────────────────────────────────┴────────────────┴──────────┴────────────┴────────────────┴─────┘
 nodesview  nfs11.aligrid.hiroshima-u.ac.jp:1095       local::geo     online           on                1     1
 nodesview  nfs12.aligrid.hiroshima-u.ac.jp:1095       local::geo     online           on                0     1
 nodesview  nfs13.aligrid.hiroshima-u.ac.jp:1095       local::geo     online           on                1     1


[root@grid04 ~]# sudo eos fs status -l 2 | grep uuid
stat.errmsg                      := filesystem has a different label (fsid=2, uuid=f995a3de-c6e1-4806-9082-3c5e3c20691f) than the configuration
uuid                             := f995a3de-c6e1-4806-9082-3c5e3c20691f

[root@grid04 ~]# sudo eos fs status -l 3 | grep uuid
stat.errmsg                      := filesystem has a different label (fsid=3, uuid=7dee6080-89f2-4e7b-a0ac-7e9d58823e2e) than the configuration
uuid                             := 7dee6080-89f2-4e7b-a0ac-7e9d58823e2e

Best regards,
Takuma

Hi Takuma,

First of all, what version of the EOS software are you running? Are you still using the MQ service or not? Check the actual labels stored in the .eosfsid and .eosfsuuid files at the file system mountpoint, i.e. /eos1/.eosfsid etc. Depending on your answers there are different things to investigate.
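As a rough illustration of that label check, a small helper (purely hypothetical, not an official EOS tool; the /eosN paths and label file names are taken from this thread) could compare the on-disk label files against the fsid/uuid that the MGM reports:

```shell
# Hypothetical helper: compare a mountpoint's on-disk label files with the
# fsid/uuid the MGM configuration expects (file names as in this thread).
check_fs_label() {
  local mnt="$1" want_fsid="$2" want_uuid="$3"
  [ "$(cat "$mnt/.eosfsid" 2>/dev/null)" = "$want_fsid" ] &&
    [ "$(cat "$mnt/.eosfsuuid" 2>/dev/null)" = "$want_uuid" ]
}

# Example usage (values would come from 'eos fs status <fsid>'):
# check_fs_label /eos1 2 f995a3de-c6e1-4806-9082-3c5e3c20691f || echo "label mismatch"
```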

Thanks,
Elvin

Hi Elvin,

Thank you for your help.
Here is the information you asked for.

Version

EOS 5.3.15 is running.

[root@grid04 ~]# eos -v
EOS 5.3.15 (2020)

About MQ service

I am not using the MQ service; messaging is now backed by QuarkDB.
I have set EOS_USE_MQ_ON_QDB=1 on all nodes.

[root@grid04 ~]# grep EOS_USE_MQ_ON_QDB /etc/sysconfig/eos_env
EOS_USE_MQ_ON_QDB=1

About labels

I found a difference between the FSTs.
On nfs12

[root@nfs12 eos1]# ls -a
.  ..  .eosattrconverted  .eosfsuuid
[root@nfs12 eos1]# cat .eosfsuuid
1cc9fe75-2bb7-4443-a974-873f6e142b63

On nfs13

[root@nfs13 eos1]# ls -a
.  ..  .eosattrconverted  .eosdeletions  .eosfsid  .eosfsuuid  .eosorphans  scrub.re-write.1  scrub.write-once.1
[root@nfs13 eos1]# cat .eosfsid
1
[root@nfs13 eos1]# cat .eosfsuuid
44c3172f-43f6-410b-92e8-5ca48ffb8ffc

On nfs12 the .eosfsid file is missing entirely, and on nfs13 the id in .eosfsid (1) differs from the fsid shown in eos fs ls (3).

Thank you for your support.
Best regards,

Takuma

Hi Takuma,

I think the problem comes from the fact that all 3 of your file systems are stored in the same location, namely /eos1. This is why file system 1 manages to boot, while the other ones fail. Make sure that each file system has its own location where to write the files. For example:
fsid 1 → /eos1/
fsid 2 → /eos2/
fsid 3 → /eos3/
etc
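For instance, with one block device per file system, the mount layout on an FST might look like this (device names are purely illustrative):

```
# Illustrative /etc/fstab fragment: one dedicated mountpoint per fsid
# (device names are hypothetical)
/dev/sdb1  /eos1  xfs  defaults  0 0
/dev/sdc1  /eos2  xfs  defaults  0 0
/dev/sdd1  /eos3  xfs  defaults  0 0
```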

Cheers,
Elvin

Hi Elvin,

Thank you for your advice.
Following your advice, I have reconfigured FSTs nfs12 and nfs13 to use /eos2 and /eos3.
The current problem is that I can’t clean up the old, incorrect filesystem entries from the MGM.
I tried to remove the old /eos1 entries for nfs12 and nfs13, but the commands fail:

  • eos fs rm <fsid> fails with error: you can only remove file systems which are in 'empty' status.
  • eos node rm <hostname> subsequently fails with error: ...filesystems are not all in empty state.
[root@grid04 ~]# eos fs rm 2
error: you can only remove file systems which are in 'empty' status
[root@grid04 ~]# eos node rm nfs12.aligrid.hiroshima-u.ac.jp
error: unable to remove node '/eos/nfs12.aligrid.hiroshima-u.ac.jp:1095/fst' - filesystems are not all in empty state

How can I remove these stale metadata entries?

[root@grid04 ~]# eos space ls -l default
┌──────────┬────────────────┬────────────┬────────────┬──────┬─────────┬───────────────┬──────────────┬─────────────┬─────────────┬──────┐
│type      │            name│   groupsize│    groupmod│ N(fs)│ N(fs-rw)│ sum(usedbytes)│ sum(capacity)│ capacity(rw)│ nom.capacity│ quota│
└──────────┴────────────────┴────────────┴────────────┴──────┴─────────┴───────────────┴──────────────┴─────────────┴─────────────┴──────┘
 spaceview           default            0           24      5         0         6.04 TB        1.62 PB           0 B           0 B    off

┌────────────────────────┬────┬────────────────────────┬──────────┬──────┬────────────────────────────────────┬────────────────────────────────┬────────────────┬──────────┬────────────┬──────────────┬────────────┬──────┬────────┬──────────────┬────────────────────┬────────────────┬────────────────────────┐
│host                    │port│alias.host              │alias.port│    id│                                uuid│                            path│      schedgroup│  headroom│        boot│  configstatus│       drain│ usage│  active│  scaninterval│  scan_rain_interval│          health│           statuscomment│
└────────────────────────┴────┴────────────────────────┴──────────┴──────┴────────────────────────────────────┴────────────────────────────────┴────────────────┴──────────┴────────────┴──────────────┴────────────┴──────┴────────┴──────────────┴────────────────────┴────────────────┴────────────────────────┘
 nfs11                    1095                                          1 9ac28717-c447-49dc-abed-a50701741da3                            /eos1        default.0     5.10 G       booted             rw      nodrain   0.70                 1814400              2419200      no smartctl
 nfs12                    1095                                          2 f995a3de-c6e1-4806-9082-3c5e3c20691f                            /eos1        default.1     5.10 G     opserror            off      nodrain  21.29                 1814400              2419200      no smartctl
 nfs13                    1095                                          3 7dee6080-89f2-4e7b-a0ac-7e9d58823e2e                            /eos1        default.2     5.10 G     opserror            off      nodrain  17.01                 1814400              2419200      no smartctl
 nfs12                    1095                                          4 f38276e4-0e30-48b4-8fde-33eb6eec0b00                            /eos2        default.3     5.10 G  bootfailure             rw      nodrain   0.00                 1814400              2419200      no smartctl
 nfs13                    1095                                          5 3aa4b474-5a09-4884-a66e-57d1ea547128                            /eos3        default.4     5.10 G  bootfailure             rw      nodrain   0.00   online        1814400              2419200      no smartctl

Once the cleanup is done, I will register the new /eos2 and /eos3 filesystems correctly using the following sequence of commands: fs add, group set, fs config, and fs boot.
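For reference, that sequence would look roughly like this for /eos2 on nfs12 (a sketch only; the uuid comes from /eos2/.eosfsuuid on the FST, and the fsid/group values here are illustrative, taken from the table above):

```shell
# Sketch of the re-registration sequence (values illustrative).
eos fs add f38276e4-0e30-48b4-8fde-33eb6eec0b00 nfs12.aligrid.hiroshima-u.ac.jp:1095 /eos2 default.3
eos group set default.3 on
eos fs config 4 configstatus=rw
eos fs boot 4
```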

Best regards,
Takuma

Hi Takuma,

You can first try to set those file systems that you want to remove to empty status. This will not work if you have managed to write any files to them, but I guess this should not be the case:

eos fs config 2 configstatus=empty

and then you can try removing them with eos fs rm 2.

If this does not work then you need to delete them from the configuration in QDB. You stop the MGM and FST daemons and search for the corresponding entry in QDB:
redis-cli -p 7777 hgetall eos-config:default | grep -B 1 "id=2"
Then you take the value on the first line and use it as a key to perform the deletion:
redis-cli -p 7777 hdel eos-config:default <the_key_from_the_step_before>
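Since hgetall prints alternating key/value lines, the key-extraction step can be sketched as a tiny helper (just an illustration of the grep logic above, not an official tool; note that a plain "id=2" pattern also matches e.g. "id=20" or a uuid starting with "2", so tighten it if your config has many entries):

```shell
# Sketch: from 'redis-cli hgetall' output (alternating key and value
# lines), print the key whose value line contains "id=<fsid>".
# Caveat: "id=2" also matches "id=20" or "uuid=2..."; anchor if needed.
find_config_key() {
  grep -B1 "id=$1" | head -n1
}

# Usage (illustrative):
# redis-cli -p 7777 hgetall eos-config:default | find_config_key 2
```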

You can now restart the services and the offending file systems should go away.

Cheers,
Elvin


Hi Elvin,

Thank you for your help.
The first method you suggested worked well.
I appreciate your advice.
Thank you!!

Best regards,
Takuma