quarkdb experts,
Our SE enters an unavailable state, which appears caused by quark silently failing though reporting healthy.
FSTs log access errors like:
200221 00:42:32 9241 FstOfs_stat: ? Unable to locate /fsidmnt/8026/000004
51/00a8bccf; no such file or directory
200221 00:42:32 9241 FstOfs_remove: ? Unable to remove /fsidmnt/8026/0000
0451/00a8bccf; no such file or directory
200221 00:42:32 9241 FstOfs_close: jalien.28432:391@alientest07.cern.ch U
nable to store file - file has been cleaned because the stored file does
not match the provided targetsize /14/08994/0bcf0a58-fe99-11e7-8109-776bf
0cf739d; input/output error
200221 00:42:32 8961 FstOfs_CallManager: ? Unable to Unable to commit fil
esize change - file is already removed [EIDRM] ; No such file or director
y /eos/aliceornleos/grid/06/51359/0bd018c0-0a4d-11e8-a341-ab1ca558aec2; i
dentifier removed
200221 00:42:32 8961 FstOfs_stat: ? Unable to locate /fsidmnt/8032/000004
51/00a8bc8f; no such file or directory
Quark reports healthy state:
[root@ornl-eos-01 ~]# redis-cli -p 7001 -a '$REDISCLI_AUTH' raft_info
1) TERM 109222
2) LOG-START 0
3) LOG-SIZE 24098052
4) LEADER ornl-eos-01.ornl.gov:7001
5) CLUSTER-ID 0123456789
6) COMMIT-INDEX 24098051
7) LAST-APPLIED 24098051
8) BLOCKED-WRITES 0
9) LAST-STATE-CHANGE 39675 (11 hours, 1 minutes, 15 seconds)
10) ----------
11) MYSELF ornl-eos-01.ornl.gov:7001
12) VERSION 0.4.1
13) STATUS LEADER
14) NODE-HEALTH GREEN
15) JOURNAL-FSYNC-POLICY sync-important-updates
16) ----------
17) MEMBERSHIP-EPOCH 95141
18) NODES ornl-eos-01.ornl.gov:7001,ornl-eos-02.ornl.gov:7001,warp-ornl-cern-05.ornl.gov:7001
19) OBSERVERS
20) QUORUM-SIZE 2
21) ----------
22) REPLICA ornl-eos-02.ornl.gov:7001 | ONLINE | UP-TO-DATE | NEXT-INDEX 24098052 | VERSION 0.3.8
23) REPLICA warp-ornl-cern-05.ornl.gov:7001 | ONLINE | UP-TO-DATE | NEXT-INDEX 24098052 | VERSION 0.4.1
[root@ornl-eos-01 ~]# redis-cli -p 7001 -a '$REDISCLI_AUTH' quarkdb_health
1) NODE-HEALTH GREEN
2) NODE ornl-eos-01.ornl.gov:7001
3) VERSION 0.4.1
4) ----------
5) GREEN >> FREE-SPACE-SM 1213132754944 bytes (64.9225%)
6) GREEN >> PART-OF-QUORUM Yes | LEADER ornl-eos-01.ornl.gov:7001
7) GREEN >> QUORUM-STABILITY Good
8) GREEN >> REPLICA ornl-eos-02.ornl.gov:7001 | ONLINE | UP-TO-DATE | NEXT-INDEX 24102612 | VERSION 0.3.8
9) GREEN >> REPLICA warp-ornl-cern-05.ornl.gov:7001 | ONLINE | UP-TO-DATE | NEXT-INDEX 24102612 | VERSION 0.4.1
The mgm, mq, and fst service all report running status, none failed.
[root@ornl-eos-01 ~]# systemctl status eos@mgm -l
● eos@mgm.service - EOS mgm
Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2020-02-20 13:41:26 CET; 10h ago
Process: 27537 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
Process: 27552 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
Main PID: 27584 (sh)
CGroup: /system.slice/system-eos.slice/eos@mgm.service
├─27584 /bin/sh -c XRDPROG=/usr/bin/xrootd; test -e /opt/eos/xrootd/bin/xrootd && XRDPROG=/opt/eos/xrootd/bin/xrootd; echo $XRDPROG ; $XRDPROG -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaemon
├─27585 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaemon
├─27693 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaemon
└─27973 eos -b console log _MGMID_
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: /opt/eos/xrootd/bin/xrootd
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: Register objects provided by NsQuarkdbPlugin ...
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Public key in use is EXPORT PATH:/ VO:* ACCESS:ALLOW CERT:*
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Public key in use is RULE PATH:/ AUTHZ:delete|read|write|write-once| NOAUTHZ:| VO:*| CERT:IGNORE
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Public key in use is RULE PATH:/eos/alice/user/ AUTHZ:| NOAUTHZ:delete|read|write|write-once| VO:*| CERT:IGNORE
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Public key in use is KEY VO:* PRIVKEY:/etc/grid-security/xrootd/privkey.pem PUBKEY:/etc/grid-security/xrootd/pubkey.pem
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Cannot load public key !
[root@ornl-eos-01 ~]# systemctl status eos@mq
● eos@mq.service - EOS mq
Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-02-18 15:24:20 CET; 2 days ago
Process: 4621 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
Process: 4640 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
Main PID: 4672 (sh)
CGroup: /system.slice/system-eos.slice/eos@mq.service
├─4672 /bin/sh -c XRDPROG=/usr/bin/xrootd; test -e /opt/eos/xrootd/bin/xroo...
└─4673 /opt/eos/xrootd/bin/xrootd -n mq -c /etc/xrd.cf.mq -l /var/log/eos/x...
Restarting quarkdb to elect new master, followed by restarting mgm and fst services return SE to serving files once again.
/var/log/quarkdb/xrootd-quarkdb_1/xrootd.log not showing root cause, though can post or provide.
Cheers,
Pete