Hi,
We are facing a namespace inconsistency in our EOS storage. We have four nodes (let’s call them A, B, C and D), each running MGM, MQ, FST and QDB, forming an HA cluster. Below are the problems we are experiencing:
file reading
On node A, we can see our files normally:
➜ ~ eos ls -alh /eos/lhcb/
Secsss (getKeyTab): Unable to open /etc/eos.keytab; permission denied
Unable to open keytab file.
drwxr-xr-x 1 root root 1.23 T Nov 7 15:59 .
drwxrwxr-x 1 root root 1.43 T Nov 7 00:49 ..
drwxr-xr-+ 1 laf lhcb 14.80 G Feb 18 03:57 laf
drwxr-xr-x 1 qinning lhcb 700.20 M Jan 18 16:25 qinning
drwxr-xr-x 1 yinghua lhcb 1.22 T Jan 18 18:48 yinghua
➜ ~ eos ls -alh /eos/lhcb/laf
Secsss (getKeyTab): Unable to open /etc/eos.keytab; permission denied
Unable to open keytab file.
drwxr-xr-+ 1 laf lhcb 14.80 G Feb 18 03:57 .
drwxr-xr-x 1 root root 1.23 T Nov 7 15:59 ..
drwxr-sr-+ 1 laf lhcb 56.28 M Dec 11 04:56 Analysis
drwxr-sr-+ 1 laf lhcb 60.93 M Dec 20 17:07 AnalysisDev_v41r15
drwxr-sr-+ 1 laf lhcb 34.74 M Dec 11 07:00 davinci_run3_test
drwxr-sr-+ 1 laf lhcb 8.59 G Dec 5 16:30 layout_test
drwxr-xr-+ 1 laf lhcb 3.70 G Nov 9 01:26 ssmb_log
drwxr-sr-+ 1 laf lhcb 62.67 M Dec 29 01:22 wangjq_debug
drwxr-sr-+ 1 laf lhcb 2.29 G Jan 22 08:29 zfit_test
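(As a side note, we believe the recurring Secsss keytab warnings are unrelated to the inconsistency: they show up because we run the eos CLI as an unprivileged user that cannot read /etc/eos.keytab. Running the same command as root, e.g.

sudo eos ls -alh /eos/lhcb/

silences the warning, and the listings appear to be the same either way.)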
However, on nodes B, C and D, we only see empty top-level directories with incorrect metadata (zero sizes and epoch timestamps):
➜ ~ eos ls -alh /eos/lhcb/
Secsss (getKeyTab): Unable to open /etc/eos.keytab; permission denied
Unable to open keytab file.
drwxr-xr-x 1 root root 0 Mar 11 00:47 .
drwxr-xr-x 1 root root 20.48 k Jan 1 1970 ..
drwxr-xr-x 1 root root 0 Jan 1 1970 laf
drwxr-xr-x 1 root root 0 Jan 1 1970 qinning
drwxr-xr-x 1 root root 0 Jan 1 1970 yinghua
➜ ~ eos ls -alh /eos/lhcb/laf
Secsss (getKeyTab): Unable to open /etc/eos.keytab; permission denied
Unable to open keytab file.
drwxr-xr-x 1 root root 0 Jan 1 1970 .
drwxr-xr-x 1 root root 0 Mar 11 00:47 ..
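To quantify how far the two views have diverged, comparing the namespace counters reported by each MGM should help. Running something like

eos ns | grep -E 'Files|Containers'

on node A and on one of B/C/D (eos ns is the standard namespace statistics command; the grep just trims the output) should show clearly different file and container counts if the two sides really hold different namespaces.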
file operations
When node A is the master MGM, I can move files into and out of EOS normally on node A, but the changes are not visible on nodes B, C or D, which still only show the empty directories described in the section above.
When node B, C or D is the master MGM, I can create new directories on that node, change their owner, or move files into them. These changes are synced among B, C and D, but are not visible on node A, which still shows only the original files. The namespace appears to be split in two.
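To rule out two MGMs both acting as master at the same time, we can ask each MGM for its view of the master; if we understand the tooling correctly, the replication/master line of

eos ns | grep -i master

run on every node should report the same master identity everywhere. On our cluster we would expect node A to disagree with B, C and D here.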
The QDB status looks good:
➜ ~ sudo eos daemon config qdb qdb info
[putenv] EOS_USE_MQ_ON_QDB=1
[putenv] EOS_XROOTD=/opt/eos/xrootd/
[putenv] GEO_TAG=local
[putenv] INSTANCE_NAME=eosdev
[putenv] LD_LIBRARY_PATH=/opt/eos/xrootd//lib64:/opt/eos/grpc/lib64
[putenv] LD_PRELOAD=/usr/lib64/libjemalloc.so
[putenv] QDB_CLUSTER_ID=eosdev
[putenv] QDB_HOST=hepfarm40.hep.tsinghua.edu.cn
[putenv] QDB_NODE=hepfarm40.hep.tsinghua.edu.cn:7777
[putenv] QDB_NODES=hepfarm40.hep.tsinghua.edu.cn:7777
[putenv] QDB_PATH=/var/lib/qdb
[putenv] QDB_PORT=7777
[putenv] SERVER_HOST=hepfarm40.hep.tsinghua.edu.cn
1) TERM 28
2) LOG-START 0
3) LOG-SIZE 17674580
4) LEADER hepfarm40.hep.tsinghua.edu.cn:7777
5) CLUSTER-ID eosdev
6) COMMIT-INDEX 17674579
7) LAST-APPLIED 17674579
8) BLOCKED-WRITES 0
9) LAST-STATE-CHANGE 2058205 (23 days, 19 hours, 43 minutes, 25 seconds)
10) ----------
11) MYSELF hepfarm40.hep.tsinghua.edu.cn:7777
12) VERSION 5.2.14.1
13) STATUS LEADER
14) NODE-HEALTH GREEN
15) JOURNAL-FSYNC-POLICY sync-important-updates
16) ----------
17) MEMBERSHIP-EPOCH 72376
18) NODES hepfarm41.hep.tsinghua.edu.cn:7777,hepfarm40.hep.tsinghua.edu.cn:7777,hepfarm30.hep.tsinghua.edu.cn:7777,hepfarm21.hep.tsinghua.edu.cn:7777
19) OBSERVERS
20) QUORUM-SIZE 3
21) ----------
22) REPLICA hepfarm21.hep.tsinghua.edu.cn:7777 | ONLINE | UP-TO-DATE | LOG-SIZE 17674580 | VERSION 5.2.14.1
23) REPLICA hepfarm30.hep.tsinghua.edu.cn:7777 | ONLINE | UP-TO-DATE | LOG-SIZE 17674580 | VERSION 5.2.17.1
24) REPLICA hepfarm41.hep.tsinghua.edu.cn:7777 | ONLINE | UP-TO-DATE | LOG-SIZE 17674580 | VERSION 5.2.17.1
info: run 'export REDISCLI_AUTH=`cat /etc/eos.keytab`; redis-cli -p `cat /var/run/eos/xrd.cf.qdb|grep xrd.port | cut -d ' ' -f 2` <<< raft-info' retc=0
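One thing the raft-info above does not prove is that all four MGMs actually talk to this same QDB cluster. In our EOS5 daemon setup the generated MGM configuration should contain the mgmofs.qdbcluster directive (the path below matches our layout and may differ elsewhere):

grep -i qdb /var/run/eos/xrd.cf.mgm

If node A’s MGM pointed at a different cluster ID or node list, the split view would follow naturally.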
These inconsistencies only appeared in recent weeks, and I am not sure whether they were caused by an upgrade.
The EOS_SERVER_VERSION on node A is 5.2.14, while on nodes B, C and D it is 5.2.17 or 5.2.18. Since node A is the only node from which we can read our files, we are keeping its version pinned to avoid losing data. Any advice on resolving this issue would be appreciated.
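(For reference, the versions above can be cross-checked on each node with

eos version

which reports the EOS_SERVER_VERSION of the contacted MGM together with the local EOS_CLIENT_VERSION.)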