EOS stopped working

Hi Everyone

Our production EOS instance stopped working this afternoon.
QuarkDB was still alive, while eos@mgm and eos@mq were in a failed state.

We tried to restart all the services a few times, without success.
After a restart, all the services are up, but once we run any command on the MGM, such as eos, this is all the output we get:

[root@s214p mgm]# eos
 # ---------------------------------------------------------------------------
 # EOS  Copyright (C) 2011-2019 CERN/Switzerland
 # This program comes with ABSOLUTELY NO WARRANTY; for details type `license'.
 # This is free software, and you are welcome to redistribute it
 # under certain conditions; type `license' for details.
 # ---------------------------------------------------------------------------
 EOS_INSTANCE=jeodpp
 EOS_SERVER_VERSION=4.5.15 EOS_SERVER_RELEASE=1
 EOS_CLIENT_VERSION=4.5.15 EOS_CLIENT_RELEASE=1

The command never returns. The same happens with eos node ls, and when we try to mount EOS from our clients.

QuarkDB seems to work fine:

# redis-cli -p 7777 quarkdb-health
 1) NODE-HEALTH GREEN
 2) NODE s214p:7777
 3) VERSION 0.4.2
 4) ----------
 5) GREEN  >> SM-FREE-SPACE 548267864064 bytes (68.5558%)
 6) GREEN  >> SM-MANIFEST-TIMEDIFF 48 sec
 7) GREEN  >> PART-OF-QUORUM Yes | LEADER s214p:7777
 8) GREEN  >> QUORUM-STABILITY Good
 9) GREEN  >> REPLICA s215p:7777 | ONLINE | UP-TO-DATE | NEXT-INDEX 7912107985 | VERSION 0.4.2
10) GREEN  >> REPLICA s216p:7777 | ONLINE | UP-TO-DATE | NEXT-INDEX 7912107985 | VERSION 0.4.2

# eos ns
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL      Files                            994764740 [booted] (0s)
ALL      Directories                      197852986
ALL      Total boot time                  1 s
# ------------------------------------------------------------------------------------
ALL      Replication                      is_master=true master_id=s214p.jrc.it:1094
# ------------------------------------------------------------------------------------
ALL      files created since boot         0
ALL      container created since boot     0
# ------------------------------------------------------------------------------------
ALL      current file id                  1677261284
ALL      current container id             310379520
# ------------------------------------------------------------------------------------
ALL      eosxd caps                       0
ALL      eosxd clients                    26
ALL      eosxd active clients             16
ALL      eosxd locked clients             7
# ------------------------------------------------------------------------------------
ALL      File cache max num               30000000
ALL      File cache occupancy             231
ALL      In-flight FileMD                 0
ALL      Container cache max num          3000000
ALL      Container cache occupancy        570
ALL      In-flight ContainerMD            0
# ------------------------------------------------------------------------------------
ALL      memory virtual                   10.86 GB
ALL      memory resident                  2.01 GB
ALL      memory share                     24.76 MB
ALL      memory growths                   6.99 GB
ALL      threads                          1003
ALL      fds                              1192
ALL      uptime                           1733
# ------------------------------------------------------------------------------------
ALL      drain info                       thread_pool=central_drain min=80 max=400 size=80 queue_size=0
# ------------------------------------------------------------------------------------

In the MQ log, once I restart the service, I see something like this:

200731 18:35:54 time=1596213354.217430 func=ShouldRedirectQdb        level=NOTE  logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4af39fd700 source=XrdMqOfs:881                   tident=<service> sec=      uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.247373 func=ShouldRedirectQdb        level=NOTE  logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4af9afe700 source=XrdMqOfs:881                   tident=<service> sec=      uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.247415 func=ShouldRedirectQdb        level=NOTE  logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4af9afe700 source=XrdMqOfs:881                   tident=<service> sec=      uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.548520 func=ShouldRedirectQdb        level=NOTE  logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aedafe700 source=XrdMqOfs:881                   tident=<service> sec=      uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.548564 func=ShouldRedirectQdb        level=NOTE  logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aedafe700 source=XrdMqOfs:881                   tident=<service> sec=      uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.971688 func=ShouldRedirectQdb        level=NOTE  logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4af29fd700 source=XrdMqOfs:881                   tident=<service> sec=      uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:54 time=1596213354.979266 func=ShouldRedirectQdb        level=NOTE  logid=755f17cc-d34b-11ea-9c37-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aeeafe700 source=XrdMqOfs:881                   tident=<service> sec=      uid=0 gid=0 name= geo="" msg="unset or unexpected master identity format" mMasterId=""
200731 18:35:55 70032 XrootdXeq: daemon.70589:69@s214p pub IPv4 login as daemon
200731 18:35:55 time=1596213355.332348 func=open                     level=INFO  logid=e7cedcf2-d34b-11ea-8f00-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aee9fd700 source=XrdMqOfs:94                    tident=<service> sec=      uid=0 gid=0 name= geo="" connecting queue: /eos/s214p.jrc.it/mgm
200731 18:35:55 time=1596213355.332411 func=open                     level=INFO  logid=e7cedcf2-d34b-11ea-8f00-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aee9fd700 source=XrdMqOfs:137                   tident=<service> sec=      uid=0 gid=0 name= geo="" connected queue: /eos/s214p.jrc.it/mgm
200731 18:36:00 70033 XrootdXeq: root.71093:70@localhost pvt IPv4 login as daemon
200731 18:36:00 time=1596213360.564362 func=open                     level=INFO  logid=eaed34a6-d34b-11ea-99c0-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aedbff700 source=XrdMqOfs:94                    tident=<service> sec=      uid=0 gid=0 name= geo="" connecting queue: /eos/:71093:1/errorreport
200731 18:36:00 time=1596213360.564437 func=open                     level=INFO  logid=eaed34a6-d34b-11ea-99c0-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aedbff700 source=XrdMqOfs:137                   tident=<service> sec=      uid=0 gid=0 name= geo="" connected queue: /eos/:71093:1/errorreport
200731 18:37:13 71045 XrdProtocol: ?:71@s-jrciprcid54v terminated handshake not received
200731 18:38:43 71045 XrootdXeq: daemon.70589:69@s214p disc 0:02:48 (idle timeout)
200731 18:38:43 time=1596213523.286064 func=close                    level=INFO  logid=e7cedcf2-d34b-11ea-8f00-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aed9fd700 source=XrdMqOfs:255                   tident=<service> sec=      uid=0 gid=0 name= geo="" disconnecting queue: /eos/s214p.jrc.it/mgm
200731 18:38:43 time=1596213523.399616 func=close                    level=INFO  logid=e7cedcf2-d34b-11ea-8f00-48df374dec7c unit=mq@s214p.jrc.it:1097 tid=00007f4aed9fd700 source=XrdMqOfs:291                   tident=<service> sec=      uid=0 gid=0 name= geo="" disconnected queue: /eos/s214p.jrc.it/mgm

And the xrdlog.mgm file contains:

200731 18:36:01 time=1596213361.064449 func=Schedule2Balance         level=INFO  logid=FstOfsStorage unit=mgm@s214p.jrc.it:1094 tid=00007f2d6e3fe700 source=Schedule2Balance:346           tident=daemon.8132:624@s241p sec=sss   uid=2 gid=2 name=daemon geo="JRC" cmd=schedule2balance fsid=1417 freebytes=0 logid=FstOfsStorage
200731 18:36:01 time=1596213361.064580 func=open                     level=INFO  logid=eb394512-d34b-11ea-b7fc-48df374dec7c unit=mgm@s214p.jrc.it:1094 tid=00007f2d60bfe700 source=XrdMgmOfsFile:198              tident=nobody.1527:674@s180p sec=unix  uid=99 gid=99 name=root geo="JRC" op=read path=/eos/jeodpp/nagios-test.txt info=eos.app=fuse&eos.checksum=ignore&eos.encodepath=1&xrd.wantprot=unix
200731 18:36:01 time=1596213361.065136 func=Schedule2Balance         level=INFO  logid=FstOfsStorage unit=mgm@s214p.jrc.it:1094 tid=00007f2d77bfc700 source=Schedule2Balance:346           tident=daemon.14993:649@s233p sec=sss   uid=2 gid=2 name=daemon geo="JRC" cmd=schedule2balance fsid=0 freebytes=0 logid=FstOfsStorage
200731 18:36:01 time=1596213361.065226 func=BalanceGetFsSrc          level=ERROR logid=FstOfsStorage unit=mgm@s214p.jrc.it:1094 tid=00007f2d77bfc700 source=Schedule2Balance:206           tident=daemon.14993:649@s233p sec=      uid=0 gid=0 name= geo="" msg="target filesystem not found in the view" fsid=0
200731 18:36:01 time=1596213361.065290 func=Emsg                     level=ERROR logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@s214p.jrc.it:1094 tid=00007f2d77bfc700 source=XrdMgmOfs:968                  tident=<single-exec> sec=      uid=0 gid=0 name= geo="" Unable to schedule - fsid not known [EINVAL] 0; Invalid argument
200731 18:36:01 time=1596213361.065406 func=Schedule2Balance         level=INFO  logid=FstOfsStorage unit=mgm@s214p.jrc.it:1094 tid=00007f2d6f3fe700 source=Schedule2Balance:346           tident=daemon.1670:636@s238p sec=sss   uid=2 gid=2 name=daemon geo="JRC" cmd=schedule2balance fsid=1280 freebytes=0 logid=FstOfsStorage

Any suggestions? Let me know if you need more info.

Many thanks!
Marco

Here is some more information about the issue reported above by @marco.

Here is a short backtrace we found in the xrdlog.mgm file at the time of the first issue, which might explain the first crash:

error: received signal 11:
/lib64/libXrdEosMgm.so(_Z20xrdmgmofs_stacktracei+0x44)[0x7fb32e2e8cb4]
/lib64/libc.so.6(+0x36280)[0x7fb33355d280]
/usr/lib64/libjemalloc.so.1(+0xd5cd)[0x7fb334cb65cd]
/usr/lib64/libjemalloc.so.1(+0x266ff)[0x7fb334ccf6ff]
/usr/lib64/libjemalloc.so.1(malloc+0x22f)[0x7fb334caf14f]
/lib64/libstdc++.so.6(_Znwm+0x1d)[0x7fb333e6aecd]
/lib64/libstdc++.so.6(_ZNSs4_Rep9_S_createEmmRKSaIcE+0x59)[0x7fb333ec9a19]
/lib64/libstdc++.so.6(_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0x21)[0x7fb333ecb2a1]
/lib64/libstdc++.so.6(_ZNSsC2EPKcRKSaIcE+0x38)[0x7fb333ecb6d8]
/lib64/libXrdEosMgm.so(_ZN9TableCellC1EdRKSsS1_b19TableFormatterColor+0x1a8)[0x7fb32e53c508]

At that time, the GroupBalancer was enabled and active, busy balancing our groups. I do not know if this had an impact; it had been running correctly for around a week.

After that, the daemon was still alive, but only the regular messages from the AcquireLease, Supervisor and Converter functions were logged.

Currently, when the mgm daemon is started, it starts booting correctly, but after a few seconds everything freezes again, as if a deadlock were occurring somewhere and preventing it from operating. The only messages logged after this are the lease management ones. At one point I managed to get correct output from eos ls and eos node ls (the FSTs were reporting their information), but it stopped working again after a while. I can provide the logs via email if that helps.

I tried to disable the balancer (which caused some issues during startup in the past) and the groupbalancer, with the following commands on the QuarkDB cluster:

redis-cli -p 7777 hset eos-config:default global:/config/jeodpp/space/default#balancer off  
redis-cli -p 7777 hset eos-config:default global:/config/jeodpp/space/default#groupbalancer off  
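
To double-check that the values were actually stored, they can be read back from the same configuration hash, for example:

redis-cli -p 7777 hget eos-config:default global:/config/jeodpp/space/default#balancer
redis-cli -p 7777 hget eos-config:default global:/config/jeodpp/space/default#groupbalancer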

But this didn’t change anything in the behavior at startup.

So anything that could help us get the MGM running again would be much appreciated.

Thank you.

Hi Franck,

Is this your production instance or some test instance?
What version of eos are you running?
Can you attach gdb to the mgm process and get a full stacktrace of all the threads with "thread apply all bt"?
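
Something like this should dump all thread backtraces non-interactively; the pgrep pattern is only a guess at how the MGM xrootd process is named on your node, so adjust it if needed:

gdb -batch -p $(pgrep -f "xrootd -n mgm") -ex "set pagination off" -ex "thread apply all bt" > /tmp/mgm-threads.txt 2>&1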

Thanks,
Elvin

Ah, sorry, I read the initial post more carefully - forget the first two questions. Can you send me the stacktrace of the mgm?

Thanks,
Elvin

Thank you so much for answering at this hour, @esindril; really sorry to disturb you during a summer weekend.

I sent you the backtrace via email, along with a log file.

Please do not ruin your night or your weekend, but if you have a quick solution it would be very helpful.

Franck

Thanks to the help of @esindril in analysing the backtrace and log file, we were able to restart our instance successfully.

It appeared that some fusex clients were causing a deadlock in the MGM at restart time. By shutting down all fusex clients before restarting the EOS MGM, the deadlock no longer occurred and the MGM started correctly.
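
For the record, "shutting down all fusex clients" meant, on each client node, unmounting the EOS fuse mount and stopping the eosxd process before the MGM restart; roughly something like the lines below, where the mount point is ours and the exact commands depend on how the clients are started (our eosd clients were stopped in the equivalent way):

umount -fl /eos/jeodpp
pkill eosxd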

Is it possible that this deadlock issue is already solved in a more recent version of the MGM?

Could it be that this deadlock is somehow linked to the cause of the first crash?

Hi Franck,

Glad to hear that!
The crash has nothing to do with the deadlock. We also saw this type of crash in the past and I believe it is fixed in newer versions. There were also some (similar) deadlock-related issues fixed since 4.5.15. I am not 100% sure this one falls into one of those categories, but I will have another look at the stacktrace to make sure it is fixed.

Cheers,
Elvin

Hi Elvin,

Again, many thanks for your help! With your suggestion we managed to quickly recover our instance.

We have been very happy with 4.5.15; it had been working well since the end of 2019.
Now, having to upgrade our instance (MGM 4.5.15, FST 4.5.17, QuarkDB 0.4.2, mix of eosd/eosxd clients 4.5.x), which version would you recommend? It will have to work well for the next year, since we do not upgrade often.

Many thanks
Marco

Hi Marco,

We have been running 4.8.9/4.8.10 in production for some time now and, stability-wise, it looks OK. Therefore, this would be a good candidate for an update.

Cheers,
Elvin

Ok, great! Will take this into account for the next upgrade of our instance.

Many thanks!
Marco

Thank you again Elvin.

One extra question about the version upgrade: is there any particular care to take when jumping minor versions from 4.5.x to 4.8.x? Would you advise upgrading the MGM first and then the FSTs, or the opposite? And are there any known compatibility issues with the FUSE(x) clients to be aware of?

Anyway, we will first try the upgrade on some test instance.

Hi Franck,

Yes, one thing you need to take particular care with when updating from any pre-4.8 release is to disable the converter until all the FSTs and the MGM are updated to the new 4.8.* version.

This is because the converter has been reimplemented and does not support running with a mix of versions. Otherwise, I am not aware of any restrictions.
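
For reference, the converter is normally toggled per space from the eos CLI; a rough sketch for the default space, assuming your space is called "default" (double-check the current setting with eos space status default on your version):

eos space config default space.converter=off
# ... update the MGM and all FSTs to 4.8.* ...
eos space config default space.converter=on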

Cheers,
Elvin