Hi everyone,
Our production EOS instance stopped working this afternoon.
QuarkDB was still alive, while eos@mgm and eos@mq were in a failed state.
We tried to restart all the services a few times, without success.
After a restart all the services are up, but when we run a command on the MGM, such as eos, this is the output:
[root@s214p mgm]# eos
# ---------------------------------------------------------------------------
# EOS Copyright (C) 2011-2019 CERN/Switzerland
# This program comes with ABSOLUTELY NO WARRANTY; for details type `license'.
# This is free software, and you are welcome to redistribute it
# under certain conditions; type `license' for details.
# ---------------------------------------------------------------------------
EOS_INSTANCE=jeodpp
EOS_SERVER_VERSION=4.5.15 EOS_SERVER_RELEASE=1
EOS_CLIENT_VERSION=4.5.15 EOS_CLIENT_RELEASE=1
The command never returns. The same happens with eos node ls, and when we try to mount EOS from our clients.
QuarkDB seems to work fine:
# redis-cli -p 7777 quarkdb-health
1) NODE-HEALTH GREEN
2) NODE s214p:7777
3) VERSION 0.4.2
4) ----------
5) GREEN >> SM-FREE-SPACE 548267864064 bytes (68.5558%)
6) GREEN >> SM-MANIFEST-TIMEDIFF 48 sec
7) GREEN >> PART-OF-QUORUM Yes | LEADER s214p:7777
8) GREEN >> QUORUM-STABILITY Good
9) GREEN >> REPLICA s215p:7777 | ONLINE | UP-TO-DATE | NEXT-INDEX 7912107985 | VERSION 0.4.2
10) GREEN >> REPLICA s216p:7777 | ONLINE | UP-TO-DATE | NEXT-INDEX 7912107985 | VERSION 0.4.2
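For anyone who wants to check the same thing from a script, here is a minimal sketch that reads the text of a quarkdb-health reply like the one above and lists replicas that are not GREEN, ONLINE and UP-TO-DATE. The function names and the regular expression are my own assumptions based only on the line shape shown here, not on any QuarkDB documentation.

```python
import re

# Assumed shape of a replica line, taken from the output above:
#   GREEN >> REPLICA s215p:7777 | ONLINE | UP-TO-DATE | NEXT-INDEX ... | VERSION ...
REPLICA = re.compile(r'([A-Z]+) >> REPLICA (\S+) \| (.*)')

def replicas(health_text):
    """Return one dict per REPLICA line found in the health output text."""
    found = []
    for match in REPLICA.finditer(health_text):
        color, node, rest = match.groups()
        fields = [f.strip() for f in rest.split("|")]
        found.append({"color": color, "node": node, "fields": fields})
    return found

def unhealthy(health_text):
    """Return the replicas that are not GREEN, ONLINE and UP-TO-DATE."""
    return [
        r for r in replicas(health_text)
        if r["color"] != "GREEN"
        or "ONLINE" not in r["fields"]
        or "UP-TO-DATE" not in r["fields"]
    ]
```

With the output pasted above, unhealthy() returns an empty list, which matches what we see by eye.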
# eos ns
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL Files 994764740 [booted] (0s)
ALL Directories 197852986
ALL Total boot time 1 s
# ------------------------------------------------------------------------------------
ALL Replication is_master=true master_id=s214p.jrc.it:1094
# ------------------------------------------------------------------------------------
ALL files created since boot 0
ALL container created since boot 0
# ------------------------------------------------------------------------------------
ALL current file id 1677261284
ALL current container id 310379520
# ------------------------------------------------------------------------------------
ALL eosxd caps 0
ALL eosxd clients 26
ALL eosxd active clients 16
ALL eosxd locked clients 7
# ------------------------------------------------------------------------------------
ALL File cache max num 30000000
ALL File cache occupancy 231
ALL In-flight FileMD 0
ALL Container cache max num 3000000
ALL Container cache occupancy 570
ALL In-flight ContainerMD 0
# ------------------------------------------------------------------------------------
ALL memory virtual 10.86 GB
ALL memory resident 2.01 GB
ALL memory share 24.76 MB
ALL memory growths 6.99 GB
ALL threads 1003
ALL fds 1192
ALL uptime 1733
# ------------------------------------------------------------------------------------
ALL drain info thread_pool=central_drain min=80 max=400 size=80 queue_size=0
# ------------------------------------------------------------------------------------
In the mq log, after restarting the service, I see entries like these:
200731 18:35:54 time=1596213354.217430 func=ShouldRedirectQdb level=NOTE logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4af39fd700 source=XrdMqOfs:881 tident=<service> sec= uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.247373 func=ShouldRedirectQdb level=NOTE logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4af9afe700 source=XrdMqOfs:881 tident=<service> sec= uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.247415 func=ShouldRedirectQdb level=NOTE logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4af9afe700 source=XrdMqOfs:881 tident=<service> sec= uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.548520 func=ShouldRedirectQdb level=NOTE logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aedafe700 source=XrdMqOfs:881 tident=<service> sec= uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.548564 func=ShouldRedirectQdb level=NOTE logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aedafe700 source=XrdMqOfs:881 tident=<service> sec= uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.971688 func=ShouldRedirectQdb level=NOTE logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4af29fd700 source=XrdMqOfs:881 tident=<service> sec= uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.979266 func=ShouldRedirectQdb level=NOTE logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aeeafe700 source=XrdMqOfs:881 tident=<service> sec= uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:55 70032 XrootdXeq: daemon.70589:69@s214p pub IPv4 login as daemon
200731 18:35:55 time=1596213355.332348 func=open level=INFO logid=e7cedcf2-d34b-11ea-8f00-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aee9fd700 source=XrdMqOfs:94 tident=<service> sec= uid=0 gid=0 name= geo="" connecting queue: /eos/s214p.jrc.it/mgm
200731 18:35:55 time=1596213355.332411 func=open level=INFO logid=e7cedcf2-d34b-11ea-8f00-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aee9fd700 source=XrdMqOfs:137 tident=<service> sec= uid=0 gid=0 name= geo="" connected queue: /eos/s214p.jrc.it/mgm
200731 18:36:00 70033 XrootdXeq: root.71093:70@localhost pvt IPv4 login as daemon
200731 18:36:00 time=1596213360.564362 func=open level=INFO logid=eaed34a6-d34b-11ea-99c0-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aedbff700 source=XrdMqOfs:94 tident=<service> sec= uid=0 gid=0 name= geo="" connecting queue: /eos/:71093:1/errorreport
200731 18:36:00 time=1596213360.564437 func=open level=INFO logid=eaed34a6-d34b-11ea-99c0-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aedbff700 source=XrdMqOfs:137 tident=<service> sec= uid=0 gid=0 name= geo="" connected queue: /eos/:71093:1/errorreport
200731 18:37:13 71045 XrdProtocol: ?:71@s-jrciprcid54v terminated handshake not received
200731 18:38:43 71045 XrootdXeq: daemon.70589:69@s214p disc 0:02:48 (idle timeout)
200731 18:38:43 time=1596213523.286064 func=close level=INFO logid=e7cedcf2-d34b-11ea-8f00-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aed9fd700 source=XrdMqOfs:255 tident=<service> sec= uid=0 gid=0 name= geo="" disconnecting queue: /eos/s214p.jrc.it/mgm
200731 18:38:43 time=1596213523.399616 func=close level=INFO logid=e7cedcf2-d34b-11ea-8f00-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aed9fd700 source=XrdMqOfs:291 tident=<service> sec= uid=0 gid=0 name= geo="" disconnected queue: /eos/s214p.jrc.it/mgm
And the file xrdlog.mgm contains:
200731 18:36:01 time=1596213361.064449 func=Schedule2Balance level=INFO logid=FstOfsStorage unit=mgm@s214p.jrc.it:1094 tid=00007f2d6e3fe700 source=Schedule2Balance:346 tident=daemon.8132:624@s241p sec=sss uid=2 gid=2 name=daemon geo="JRC" cmd=schedule2balance fsid=1417 freebytes=0 logid=FstOfsStorage
200731 18:36:01 time=1596213361.064580 func=open level=INFO logid=eb394512-d34b-11ea-b7fc-48df374dec7c unit=mgm@s214p.jrc.it:1094 tid=00007f2d60bfe700 source=XrdMgmOfsFile:198 tident=nobody.1527:674@s180p sec=unix uid=99 gid=99 name=root geo="JRC" op=read path=/eos/jeodpp/nagios-test.txt info=eos.app=fuse&eos.checksum=ignore&eos.encodepath=1&xrd.wantprot=unix
200731 18:36:01 time=1596213361.065136 func=Schedule2Balance level=INFO logid=FstOfsStorage unit=mgm@s214p.jrc.it:1094 tid=00007f2d77bfc700 source=Schedule2Balance:346 tident=daemon.14993:649@s233p sec=sss uid=2 gid=2 name=daemon geo="JRC" cmd=schedule2balance fsid=0 freebytes=0 logid=FstOfsStorage
200731 18:36:01 time=1596213361.065226 func=BalanceGetFsSrc level=ERROR logid=FstOfsStorage unit=mgm@s214p.jrc.it:1094 tid=00007f2d77bfc700 source=Schedule2Balance:206 tident=daemon.14993:649@s233p sec= uid=0 gid=0 name= geo="" msg="target filesystem not found in the view" fsid=0
200731 18:36:01 time=1596213361.065290 func=Emsg level=ERROR logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@s214p.jrc.it:1094 tid=00007f2d77bfc700 source=XrdMgmOfs:968 tident=<single-exec> sec= uid=0 gid=0 name= geo="" Unable to schedule - fsid not known [EINVAL] 0; Invalid argument
200731 18:36:01 time=1596213361.065406 func=Schedule2Balance level=INFO logid=FstOfsStorage unit=mgm@s214p.jrc.it:1094 tid=00007f2d6f3fe700 source=Schedule2Balance:346 tident=daemon.1670:636@s238p sec=sss uid=2 gid=2 name=daemon geo="JRC" cmd=schedule2balance fsid=1280 freebytes=0 logid=FstOfsStorage
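Since these log lines are mostly space-separated key=value fields, they can be tallied by level and function to see which messages dominate. The following is only a sketch: the field names (level, func, msg) come from the excerpts above, while the regular expression and the helper names are my own assumptions about the format.

```python
import re
from collections import Counter

# key=value or key="quoted value" pairs as they appear in the log lines above.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S*)')

def parse_line(line):
    """Return a dict of the key=value fields found in one log line."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}

def summarize(lines):
    """Count log lines per (level, func); lines without a level field are skipped."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if "level" in fields:
            counts[(fields["level"], fields.get("func", "?"))] += 1
    return counts

def print_summary(lines):
    """Print the most frequent (level, func) pairs first."""
    for (level, func), n in summarize(lines).most_common():
        print(f"{n:6d} {level:5s} {func}")
```

It can be run on a log with something like print_summary(open("xrdlog.mgm")), adjusting the path to wherever the file actually lives.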
Any suggestions? Let me know if you need more info.
Many thanks!
Marco