Eos services silently failing

quarkdb experts,

Our SE enters an unavailable state, which appears caused by quark silently failing though reporting healthy.

FSTs log access errors like:

200221 00:42:32 9241 FstOfs_stat: ? Unable to locate /fsidmnt/8026/000004
51/00a8bccf; no such file or directory
200221 00:42:32 9241 FstOfs_remove: ? Unable to remove /fsidmnt/8026/0000
0451/00a8bccf; no such file or directory
200221 00:42:32 9241 FstOfs_close: jalien.28432:391@alientest07.cern.ch U
nable to store file - file has been cleaned because the stored file does 
not match the provided targetsize /14/08994/0bcf0a58-fe99-11e7-8109-776bf
0cf739d; input/output error
200221 00:42:32 8961 FstOfs_CallManager: ? Unable to Unable to commit fil
esize change - file is already removed [EIDRM] ; No such file or director
y /eos/aliceornleos/grid/06/51359/0bd018c0-0a4d-11e8-a341-ab1ca558aec2; i
dentifier removed
200221 00:42:32 8961 FstOfs_stat: ? Unable to locate /fsidmnt/8032/000004
51/00a8bc8f; no such file or directory

Quark reports healthy state:

[root@ornl-eos-01 ~]# redis-cli -p 7001 -a '$REDISCLI_AUTH'  raft_info     
 1) TERM 109222
 2) LOG-START 0
 3) LOG-SIZE 24098052
 4) LEADER ornl-eos-01.ornl.gov:7001
 5) CLUSTER-ID 0123456789
 6) COMMIT-INDEX 24098051
 7) LAST-APPLIED 24098051
 8) BLOCKED-WRITES 0
 9) LAST-STATE-CHANGE 39675 (11 hours, 1 minutes, 15 seconds)
10) ----------
11) MYSELF ornl-eos-01.ornl.gov:7001
12) VERSION 0.4.1
13) STATUS LEADER
14) NODE-HEALTH GREEN
15) JOURNAL-FSYNC-POLICY sync-important-updates
16) ----------
17) MEMBERSHIP-EPOCH 95141
18) NODES ornl-eos-01.ornl.gov:7001,ornl-eos-02.ornl.gov:7001,warp-ornl-cern-05.ornl.gov:7001
19) OBSERVERS 
20) QUORUM-SIZE 2
21) ----------
22) REPLICA ornl-eos-02.ornl.gov:7001 | ONLINE | UP-TO-DATE | NEXT-INDEX 24098052 | VERSION 0.3.8
23) REPLICA warp-ornl-cern-05.ornl.gov:7001 | ONLINE | UP-TO-DATE | NEXT-INDEX 24098052 | VERSION 0.4.1
[root@ornl-eos-01 ~]# redis-cli -p 7001 -a '$REDISCLI_AUTH' quarkdb_health   
1) NODE-HEALTH GREEN
2) NODE ornl-eos-01.ornl.gov:7001
3) VERSION 0.4.1
4) ----------
5) GREEN  >> FREE-SPACE-SM 1213132754944 bytes (64.9225%)
6) GREEN  >> PART-OF-QUORUM Yes | LEADER ornl-eos-01.ornl.gov:7001
7) GREEN  >> QUORUM-STABILITY Good
8) GREEN  >> REPLICA ornl-eos-02.ornl.gov:7001 | ONLINE | UP-TO-DATE | NEXT-INDEX 24102612 | VERSION 0.3.8
9) GREEN  >> REPLICA warp-ornl-cern-05.ornl.gov:7001 | ONLINE | UP-TO-DATE | NEXT-INDEX 24102612 | VERSION 0.4.1

The mgm, mq, and fst service all report running status, none failed.

[root@ornl-eos-01 ~]# systemctl status eos@mgm -l    
● eos@mgm.service - EOS mgm
   Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-02-20 13:41:26 CET; 10h ago
  Process: 27537 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
  Process: 27552 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
 Main PID: 27584 (sh)
   CGroup: /system.slice/system-eos.slice/eos@mgm.service
           ├─27584 /bin/sh -c XRDPROG=/usr/bin/xrootd; test -e /opt/eos/xrootd/bin/xrootd && XRDPROG=/opt/eos/xrootd/bin/xrootd; echo $XRDPROG ; $XRDPROG -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaemon
           ├─27585 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaemon
           ├─27693 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaemon
           └─27973 eos -b console log _MGMID_

Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: /opt/eos/xrootd/bin/xrootd
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: Register objects provided by NsQuarkdbPlugin ...
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Public key in use is EXPORT   PATH:/ VO:*     ACCESS:ALLOW CERT:*
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Public key in use is RULE     PATH:/ AUTHZ:delete|read|write|write-once| NOAUTHZ:| VO:*| CERT:IGNORE
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Public key in use is RULE     PATH:/eos/alice/user/ AUTHZ:| NOAUTHZ:delete|read|write|write-once| VO:*| CERT:IGNORE
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Public key in use is KEY VO:* PRIVKEY:/etc/grid-security/xrootd/privkey.pem PUBKEY:/etc/grid-security/xrootd/pubkey.pem
Feb 20 13:41:26 ornl-eos-01.ornl.gov sh[27584]: =====> XrdAliceTokenAcc: Cannot load public key !
[root@ornl-eos-01 ~]# systemctl status eos@mq
● eos@mq.service - EOS mq
   Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2020-02-18 15:24:20 CET; 2 days ago
  Process: 4621 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
  Process: 4640 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
 Main PID: 4672 (sh)
   CGroup: /system.slice/system-eos.slice/eos@mq.service
           ├─4672 /bin/sh -c XRDPROG=/usr/bin/xrootd; test -e /opt/eos/xrootd/bin/xroo...
           └─4673 /opt/eos/xrootd/bin/xrootd -n mq -c /etc/xrd.cf.mq -l /var/log/eos/x...

Restarting quarkdb to elect new master, followed by restarting mgm and fst services return SE to serving files once again.

/var/log/quarkdb/xrootd-quarkdb_1/xrootd.log not showing root cause, though can post or provide.

Cheers,
Pete

Hi Pete,

It’s hard to tell what might be wrong here – you mentioned the service becomes available again once you restart QDB, but it looks like you’re restarting everything?

Next time it happens, could you restart just QDB first and see if the problem resolves itself? If not, then try restarting the MGM / FSTs, too.

  • Could it be loss of heartbeats between the MGM and the FSTs? What’s eos fs ls / eos node ls?
  • What errors do end-clients receive? How do the transfers fail?
  • Which MGM / FST versions are you using?

Cheers,
Georgios

Hi Georgios,

Indeed, the problem is with eos services and not quarkdb failing.

When this occurs the mgm, mq and fst services continue to run and systemctl status exits with 0 – though occasionally status eos@mgm exits 130

When things enters this inconsistent state eos node ls times out, though eos fs ls returns a list of fsids. The FST maintains connections to various sites, though with only a few KB of traffic

eos*-4.5.9 and xrootd-*-4.9.1

And example which just occured below. Restarting all services resolves the issue – for a while.

Digging in logs has not revealed anything fruitful as yet.

Any suggestions are appreciated.

Cheers,
Pete

(awscli) [root@ornl-eos-01 ~]# systemctl status eos@mgm
● eos@mgm.service - EOS mgm
Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2020-02-24 19:00:01 CET; 49min ago
Process: 22299 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
Process: 22362 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
Main PID: 22396 (sh)
CGroup: /system.slice/system-eos.slice/eos@mgm.service
├─22396 /bin/sh -c XRDPROG=/usr/bin/xrootd; test -e /opt/eos/xrootd/bin/xrootd && XRDPROG=/opt/eos/xrootd/bin/xrootd; echo $XRDPROG ; $XRDPROG -n mgm -c /etc/xrd.cf.mgm -l…
├─22398 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaemon
├─22507 /opt/eos/xrootd/bin/xrootd -n mgm -c /etc/xrd.cf.mgm -l /var/log/eos/xrdlog.mgm -s /tmp/xrootd.mgm.pid -Rdaemon
└─22805 eos -b console log MGMID

Feb 24 19:00:02 ornl-eos-01.ornl.gov sh[22396]: =====> XrdAliceTokenAcc: Public key in use is EXPORT PATH:/ VO:* ACCESS:ALLOW CERT:*
Feb 24 19:00:02 ornl-eos-01.ornl.gov sh[22396]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 24 19:00:02 ornl-eos-01.ornl.gov sh[22396]: =====> XrdAliceTokenAcc: Public key in use is RULE PATH:/ AUTHZ:delete|read|write|write-once| NOAUTHZ:| VO:| CERT:IGNORE
Feb 24 19:00:02 ornl-eos-01.ornl.gov sh[22396]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 24 19:00:02 ornl-eos-01.ornl.gov sh[22396]: =====> XrdAliceTokenAcc: Public key in use is RULE PATH:/eos/alice/user/ AUTHZ:| NOAUTHZ:delete|read|write|write-once| …CERT:IGNORE
Feb 24 19:00:02 ornl-eos-01.ornl.gov sh[22396]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 24 19:00:02 ornl-eos-01.ornl.gov sh[22396]: =====> XrdAliceTokenAcc: Public key in use is KEY VO:
PRIVKEY:/etc/grid-security/xrootd/privkey.pem PUBKEY:/etc/grid-secur…/pubkey.pem
Feb 24 19:00:02 ornl-eos-01.ornl.gov sh[22396]: =====> XrdAliceTokenAcc: Cannot load public key !
Feb 24 19:00:08 ornl-eos-01.ornl.gov sh[22396]: ==> new Broker root://localhost:1097//eos/ornl-eos-01.ornl.gov/mgm?&xmqclient.advisory.status=1&xmqclient.advisory.query=1&…shbacklog=1
Feb 24 19:00:08 ornl-eos-01.ornl.gov sh[22396]: ==> new Broker root://daemon@localhost:1097//eos/ornl-eos-01.ornl.gov/report?&xmqclient.advisory.status=0&xmqclient.advisor…shbacklog=0
Hint: Some lines were ellipsized, use -l to show in full.

(awscli) [root@ornl-eos-01 ~]# echo $?
0
(awscli) [root@ornl-eos-01 ~]# eos node ls
error: MGM root://localhost not online/reachable

(awscli) [root@ornl-eos-01 ~]# systemctl status eos@mq
● eos@mq.service - EOS mq
Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2020-02-24 19:00:01 CET; 49min ago
Process: 22298 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
Process: 22311 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
Main PID: 22342 (sh)
CGroup: /system.slice/system-eos.slice/eos@mq.service
├─22342 /bin/sh -c XRDPROG=/usr/bin/xrootd; test -e /opt/eos/xrootd/bin/xrootd && XRDPROG=/opt/eos/xrootd/bin/xrootd; echo $XRDPROG ; $XRDPROG -n mq -c /etc/xrd.cf.mq -l /…
└─22345 /opt/eos/xrootd/bin/xrootd -n mq -c /etc/xrd.cf.mq -l /var/log/eos/xrdlog.mq -s /tmp/xrootd.mq.pid -Rdaemon

Meanwhile connections still established on FST

[root@warp-ornl-cern-08 ~]# netstat -atlp | grep SYN_RECV
tcp        0      0 warp-ornl-cern:nicelink aliendb7.cern.ch:56328  SYN_RECV    -                   
tcp        0      0 warp-ornl-cern:nicelink alientest07.cern.:38146 SYN_RECV    -                   
tcp        0      0 warp-ornl-cern:nicelink aliendb4.ce:filenet-nch SYN_RECV    -                   
tcp        0      0 warp-ornl-cern:nicelink aliendb10.cern.ch:52636 SYN_RECV    -                   
tcp        0      0 warp-ornl-cern:nicelink aliendb5.cern.ch:33536  SYN_RECV    -                   
tcp        0      0 warp-ornl-cern:nicelink 192.188.182.183:37178   SYN_RECV    -                   
tcp        0      0 warp-ornl-cern:nicelink alientest07.cern.:38104 SYN_RECV    - 
..blah blah