Regular eosxd client crashes

We are experiencing regular, unexplained crashes of the eosxd daemon used by our processing nodes (on the order of once every 1-3 days).

Is there a way to troubleshoot this behavior? The log files say very little; essentially just:

# umounthandler: executing fusermount -u -z /eos/jeodpp# umounthandler: sighandler received signal 11 - emitting signal 11 again

The signal is sometimes 11, sometimes 15; sometimes errors are logged beforehand, sometimes not, so it is difficult to pin down the cause. This happens with both fusex v4.4.23 and the latest v4.5.9.

Do you also experience such unwanted crashes? Is there any information we can provide to help debug this?

We tried autofs to mitigate the situation and avoid remounting EOS manually, but it wasn’t very successful for us (unstable and not very reliable).
On the other hand, the systemd automount feature seems to work well in our case: if the daemon dies, the mountpoint is restored on the next access.
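A minimal sketch of such a systemd automount setup (the unit names, paths, and fuse options below are illustrative assumptions for a mount at /eos/jeodpp, not our exact configuration):

```ini
# /etc/systemd/system/eos-jeodpp.mount  (sketch; names are illustrative)
[Unit]
Description=EOS fuse mount for /eos/jeodpp

[Mount]
What=eosxd
Where=/eos/jeodpp
Type=fuse
Options=fsname=jeodpp

# /etc/systemd/system/eos-jeodpp.automount
[Unit]
Description=Automount for /eos/jeodpp

[Automount]
Where=/eos/jeodpp

[Install]
WantedBy=multi-user.target
```

With `systemctl enable --now eos-jeodpp.automount`, systemd holds the mountpoint and (re)starts the mount unit on the first access after the daemon dies.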

Hi Franck,
if you set

"backtrace" : 2,

you will get the backtrace of the thread responsible for the SEGV.
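For context, this is a top-level key in the eosxd JSON configuration (a sketch; the exact file location, typically under /etc/eos/, depends on your setup):

```json
{
  "backtrace" : 2
}
```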

Otherwise you can configure core dumping and then we can get a full insight into the problem.
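A minimal sketch of enabling core dumps (the core-pattern value is illustrative; on systems using abrt or systemd-coredump the pattern points at a pipe handler instead):

```shell
# Remove the core-size limit for processes started from this shell.
# For a daemon managed by systemd, set LimitCORE=infinity in its unit instead.
ulimit -c unlimited

# Inspect where the kernel writes core files.
cat /proc/sys/kernel/core_pattern

# Illustrative (requires root): write cores to /var/tmp with the program
# name, PID and timestamp in the file name.
# echo '/var/tmp/core.%e.%p.%t' > /proc/sys/kernel/core_pattern
```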

OK, thank you Andreas, we will enable this option.

Another question: we sometimes get a lot of messages like this: pre-fetch failed error=[FATAL] Unknown error code: Cannot allocate memory: no free read-ahead buffer

They don’t always seem to be linked to the crashes, but we were wondering whether they are something to worry about, and whether there is a parameter to enlarge this read-ahead buffer (we didn’t tweak any of the cache options on these mounts).

Hi Franck,
this is not really a problem … I should silence this message. By default we allow 128 MB of read-ahead in flight … if you open many files, you regularly hit the case where all buffers are currently in use …

Some nodes definitely open many files and read them simultaneously; 128 MB is probably quite low for that. Is there a way to increase this value, or the number of buffers?

In addition, the number of open files has recently increased a lot on the instance, and the eosxd clients have entries in the lsof -p output pointing to files like /var/cache/eos/fusex/cache/jeodpp/A66/51707A660000000.dc (deleted) or /var/cache/eos/fusex/cache/jeodpp/0000014c/0032bdc0.jc (deleted). Could this be linked to the missing-buffer messages, or just to a faulty application that does not close its files correctly?
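To quantify this, something like the following one-liner can count the deleted-but-still-open cache files held by a client (a sketch; the PID 12345 is a placeholder for the eosxd process to inspect):

```shell
# Count open-but-unlinked files under the fusex cache for a given
# eosxd PID (12345 is a placeholder).
lsof -p 12345 2>/dev/null \
  | awk '/\/var\/cache\/eos\/fusex\/.*\(deleted\)/ {n++} END {print n+0}'
```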

These settings can be changed in the FUSEX configuration file:

"cache" : {
  "type" : "disk",
  "size-mb" : 512,
  "size-ino" : 65536,
  "journal-mb" : 2048,
  "journal-ino" : 65536,
  "clean-threshold" : 85.0,
  "location" : "/var/cache/eos/fusex/cache/",
  "journal" : "/var/cache/eos/fusex/journal/",
  "read-ahead-strategy" : "static",
  "read-ahead-bytes-nominal" : 262144,
  "read-ahead-bytes-max" : 2097152,
  "read-ahead-blocks-max" : 16,
  "max-read-ahead-buffer" : 134217728,
  "max-write-buffer" : 134217728
}
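As a sketch of applying such a change safely, the snippet below doubles max-read-ahead-buffer in a scratch copy of a config and verifies the JSON stays valid (the real file location under /etc/eos/ and the 256 MB value are illustrative assumptions):

```shell
# Create a scratch copy standing in for the real eosxd config.
cat > /tmp/fuse.demo.conf <<'EOF'
{ "cache" : { "max-read-ahead-buffer" : 134217728 } }
EOF

# Edit it with python3 so the result is guaranteed to remain valid JSON.
python3 - /tmp/fuse.demo.conf <<'PY'
import json, sys
path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)
cfg["cache"]["max-read-ahead-buffer"] = 268435456  # 256 MB in bytes
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
PY

grep -q 268435456 /tmp/fuse.demo.conf && echo "config updated"
```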

If you have a suspicious mount, you can inspect it locally using:

eosxd get eos.stats /eos/…

Or you can see most of this information on the MGM using:

eos fusex ls -l

Maybe you can paste the output here, then I can tell you if there is a problem somewhere.

The unlinked journal file is not really unusual, but it should go away at some point.

Thank you for your answer. The client with the largest number of unlinked journal files crashed (signal 6) before we could inspect it with your suggestions. The crash closed ~140K files open for reading and ~28K open for writing, and those numbers went back to normal, so this specific client alone was responsible for the large number of open files.

Hello,

We have some news about the eosxd crashes. After activating the backtrace at level 2, we had one crash with a v4.5.15 client which produced this output:

191213 08:10:04 t=1576221004.001608 f=worker           l=WARN  tid=00007fa9d03fa700 s=EnvironmentReader:144    Reading /proc/39600/environ took 14ms (uid=33)
191213 08:18:22 t=1576221502.815290 f=worker           l=WARN  tid=00007fa9d23fe700 s=EnvironmentReader:144    Reading /proc/21213/environ took 15ms (uid=33)
Stack trace (most recent call last) in thread 26682:
#10   Object ", at 0xffffffffffffffff, in 
#9    Object "/usr/lib64/libc.so.6, at 0x7fa9d74d5eac, in clone
#8    Source "pthread_create.c", line 0, in start_thread [0x7fa9d77acdd4]
#7    Object "/usr/lib64/libfuse.so.2, at 0x7fa9db769400, in fuse_session_loop
#6    Object "/usr/lib64/libfuse.so.2, at 0x7fa9db76cb6a, in fuse_reply_iov
BFD: Dwarf Error: Could not find abbrev number 90.
BFD: Dwarf Error: Could not find abbrev number 84.
BFD: Dwarf Error: Could not find abbrev number 5766.
BFD: Dwarf Error: Could not find abbrev number 5766.
#5    Source "/usr/src/debug/eos-4.5.15-1/fusex/eosfuse.cc", line 2455, in  [0x5dfd8a]
       2452:   memset(&e, 0, sizeof(e));
       2453:   {
       2454:     metad::shared_md md;
      >2455:     md = Instance().mds.lookup(req, parent, name);
       2456: 
       2457:     if (md->id() && !md->deleted()) {
       2458:       cap::shared_cap pcap = Instance().caps.acquire(req, parent,
#4    Source "/usr/src/debug/eos-4.5.15-1/fusex/md/md.cc", line 231, in  [0x63eb5d]
        228:     // try to get the meta data record
        229:     // --------------------------------------------------
        230:     pmd->Locker().UnLock();
      > 231:     md = get(req, inode, "", false, pmd, name);
        232: 
        233:     if (md) {
        234:       md->Locker().Lock();
#3    Source "/usr/src/debug/eos-4.5.15-1/fusex/md/md.cc", line 802, in  [0x63da3e]
        799:         uint64_t l_ino;
        800: 
        801:         // store the retrieved meta data blob
      > 802:         if (!(l_ino = apply(req, *it, listing))) {
        803:           eos_static_crit("msg=\"failed to apply response\"");
        804:         } else {
        805:           ino = l_ino;
#2    Source "/opt/rh/devtoolset-6/root/usr/include/c++/6.3.1/bits/stl_map.h", line 981, in  [0x635f31]
#1    Source "/opt/rh/devtoolset-6/root/usr/include/c++/6.3.1/bits/stl_tree.h", line 2274, in  [0x5fe291]
#0    Object "/usr/lib64/libstdc++.so.6, at 0x7fa9d7f4e5a4, in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)
Segmentation fault (Address not mapped to object [(nil)])
# umounthandler: executing fusermount -u -z /eos/jeodpp# umounthandler: sighandler received signal 11 - emitting signal 11 again
###### cleaning stale cache directory '/var/cache/eos/fusex/md-cache/jeodpp/fb91bd84-11f6-11ea-a19f-e04f4306cd94'
###### cleaning stale cache directory '/var/cache/eos/fusex/md-cache/jeodpp/7d6a3950-166e-11ea-914d-e04f4306cd94'
###### cleaning stale cache directory '/var/cache/eos/fusex/md-cache/jeodpp/309ca5ae-1d85-11ea-b155-e04f4306cd94'
###### cleaning stale cache directory '/var/cache/eos/fusex/md-cache/jeodpp/7aaf34fe-1d85-11ea-9ef6-e04f4306cd94'
191213 09:52:32 t=1576227152.989587 f=mdcommunicate    l=WARN  tid=00007fc98e3fb700 s=md:2691                  MGM asked us to set our heartbeat interval to 10 seconds, enable dentry-messaging, enable writesizeflush, accepts appname, accepts mdquery and server-version=4.5.15::1
191213 09:52:33 t=1576227153.082138 f=run              l=WARN  tid=00007fc9a1df4e00 s=eosfuse:1541             ********************************************************************************
191213 09:52:33 t=1576227153.082193 f=run              l=WARN  tid=00007fc9a1df4e00 s=eosfuse:1543             eosxd started version 4.5.15 - FUSE protocol version 28

Hope this helps to understand it.

Good morning and happy new year.

We have another trace regarding a fusex client crash. It seems similar, but not identical, to the previous one.

191211 23:09:00 t=1576102140.221317 f=prefetch         l=ERROR ino:8000000054d4aebc s=data:772                 pre-fetch failed error=[FATAL] Unknown error code: Cannot allocate memory: no free read-ahead buffer
                 ---- high rate error messages suppressed ----
Stack trace (most recent call last) in thread 72055:
#10   Object ", at 0xffffffffffffffff, in 
#9    Object "/usr/lib64/libc.so.6, at 0x7f2c0661602c, in clone
#8    Source "pthread_create.c", line 0, in start_thread [0x7f2c068ecdd4]
#7    Object "/usr/lib64/libfuse.so.2, at 0x7f2c0a8c2400, in fuse_session_loop
#6    Object "/usr/lib64/libfuse.so.2, at 0x7f2c0a8c5b6a, in fuse_reply_iov
BFD: Dwarf Error: Could not find abbrev number 90.
BFD: Dwarf Error: Could not find abbrev number 84.
BFD: Dwarf Error: Could not find abbrev number 5766.
BFD: Dwarf Error: Could not find abbrev number 5766.
#5    Source "/usr/src/debug/eos-4.5.15-1/fusex/eosfuse.cc", line 2455, in  [0x5dfd8a]
       2452:   memset(&e, 0, sizeof(e));
       2453:   {
       2454:     metad::shared_md md;
      >2455:     md = Instance().mds.lookup(req, parent, name);
       2456: 
       2457:     if (md->id() && !md->deleted()) {
       2458:       cap::shared_cap pcap = Instance().caps.acquire(req, parent,
#4    Source "/usr/src/debug/eos-4.5.15-1/fusex/md/md.cc", line 167, in  [0x63e900]
        164:   // --------------------------------------------------
        165:   // STEP 1 : retrieve the required parent MD
        166:   // --------------------------------------------------
      > 167:   shared_md pmd = get(req, parent, "", false);
        168:   shared_md md;
        169: 
        170:   if (pmd->id() == parent) {
#3    Source "/usr/src/debug/eos-4.5.15-1/fusex/md/md.cc", line 600, in  [0x63d387]
        597:   shared_md md;
        598: 
        599:   if (ino) {
      > 600:     if (!mdmap.retrieveTS(ino, md)) {
        601:       md = std::make_shared<mdx>();
        602:       md->set_md_ino(inomap.backward(ino));
        603:     } else {
#2    Source "/usr/src/debug/eos-4.5.15-1/fusex/md/md.cc", line 3197, in  [0x63183c]
       3194: metad::pmap::retrieveTS(fuse_ino_t ino, shared_md& ret)
       3195: {
       3196:   XrdSysMutexHelper mLock(this);
      >3197:   return this->retrieve(ino, ret);
       3198: }
       3199: 
       3200: /* -------------------------------------------------------------------------- */
#1    Source "/usr/src/debug/eos-4.5.15-1/fusex/md/md.cc", line 3249, in  [0x631677]
       3246: void
       3247: metad::pmap::lru_add(fuse_ino_t ino, shared_md md)
       3248: {
      >3249:   if (ino == 1) {
       3250:     return;
       3251:   }
#0    Source "/opt/rh/devtoolset-6/root/usr/include/c++/6.3.1/bits/atomic_base.h", line 374, in  [0x62b6bb]
Segmentation fault (Address not mapped to object [0x330])
# umounthandler: executing fusermount -u -z /eos/jeodpp# umounthandler: sighandler received signal 11 - emitting signal 11 again

Hello,

We had another stack trace from a fusex client crash (v4.5.15):

Stack trace (most recent call last) in thread 52293:
#10   Object ", at 0xffffffffffffffff, in 
#9    Object "/usr/lib64/libc.so.6, at 0x7f70b452588c, in clone
#8    Source "pthread_create.c", line 0, in start_thread [0x7f70b47fce64]
#7    Object "/usr/lib64/libfuse.so.2, at 0x7f70b87d4400, in fuse_session_loop
#6    Object "/usr/lib64/libfuse.so.2, at 0x7f70b87d7b6a, in fuse_reply_iov
BFD: Dwarf Error: Could not find abbrev number 90.
BFD: Dwarf Error: Could not find abbrev number 84.
BFD: Dwarf Error: Could not find abbrev number 5766.
BFD: Dwarf Error: Could not find abbrev number 5766.
#5    Source "/usr/src/debug/eos-4.5.15-1/fusex/eosfuse.cc", line 2455, in  [0x5dfd8a]
       2452:   memset(&e, 0, sizeof(e));
       2453:   {
       2454:     metad::shared_md md;
      >2455:     md = Instance().mds.lookup(req, parent, name);
       2456: 
       2457:     if (md->id() && !md->deleted()) {
       2458:       cap::shared_cap pcap = Instance().caps.acquire(req, parent,
#4    Source "/usr/src/debug/eos-4.5.15-1/fusex/md/md.cc", line 231, in  [0x63eb5d]
        228:     // try to get the meta data record
        229:     // --------------------------------------------------
        230:     pmd->Locker().UnLock();
      > 231:     md = get(req, inode, "", false, pmd, name);
        232: 
        233:     if (md) {
        234:       md->Locker().Lock();
#3    Source "/usr/src/debug/eos-4.5.15-1/fusex/md/md.cc", line 802, in  [0x63da3e]
        799:         uint64_t l_ino;
        800: 
        801:         // store the retrieved meta data blob
      > 802:         if (!(l_ino = apply(req, *it, listing))) {
        803:           eos_static_crit("msg=\"failed to apply response\"");
        804:         } else {
        805:           ino = l_ino;
#2    Source "/opt/rh/devtoolset-6/root/usr/include/c++/6.3.1/bits/stl_map.h", line 981, in  [0x635f31]
#1    Source "/opt/rh/devtoolset-6/root/usr/include/c++/6.3.1/bits/stl_tree.h", line 2274, in  [0x5fe291]
#0    Object "/usr/lib64/libstdc++.so.6, at 0x7f70b4f9e5aa, in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)
Segmentation fault (Address not mapped to object [0x10])
# umounthandler: executing fusermount -u -z /eos/jeodpp# umounthandler: sighandler received signal 11 - emitting signal 11 again

If this is already fixed, which version should we use to get the fix?

Thank you

Hello, this eosxd crash doesn’t look like the previous ones (v4.5.15):

200321 04:01:03 t=1584759663.712832 f=attach           l=CRIT  ino:8000000006ddebed s=data:592                 attach to cache failed - ino=0x8000000006ddebed errno=2
terminate called after throwing an instance of 'std::runtime_error'
  what():  attach to cache failed - ino=0x8000000006ddebed errno=2
Stack trace (most recent call last) in thread 123505:
#13   Object ", at 0xffffffffffffffff, in 
#12   Object "/usr/lib64/libc.so.6, at 0x7f8c81ff188c, in clone
#11   Source "pthread_create.c", line 0, in start_thread [0x7f8c822c8e64]
#10   Object "/usr/lib64/libfuse.so.2, at 0x7f8c862a0400, in fuse_session_loop
#9    Object "/usr/lib64/libfuse.so.2, at 0x7f8c862a3b6a, in fuse_reply_iov
#8    Object "/usr/lib64/libfuse.so.2, at 0x7f8c862a452b, in fuse_reply_open
BFD: Dwarf Error: Could not find abbrev number 90.
BFD: Dwarf Error: Could not find abbrev number 84.
BFD: Dwarf Error: Could not find abbrev number 5766.
BFD: Dwarf Error: Could not find abbrev number 5766.
#7    Source "/usr/src/debug/eos-4.5.15-1/fusex/eosfuse.cc", line 3768, in  [0x5dd8d8]
       3765:                                   md_pino,
       3766:                                   req,
       3767:                                   (mode == U_OK));
      >3768:           bool outdated = (io->ioctx()->attach(req, cookie,
       3769:                                                fi->flags | cache_flag) == EKEYEXPIRED);
       3770:           fi->keep_cache = outdated ? 0 : Instance().Config().options.data_kernelcache;
BFD: Dwarf Error: Offset (1994106625) greater than or equal to .debug_str size (19800455).
#6    Source "/usr/src/debug/eos-4.5.15-1/fusex/data/data.cc", line 593, in  [0x6562cc]
        591:     snprintf(msg, sizeof(msg), "attach to cache failed - ino=%#lx errno=%d", id(), errno);
        592:     eos_crit("%s", msg);
      > 593:     throw std::runtime_error(msg);
        594:   }
        595: 
        596:   if (jcache < 0) {
#5    Object "/usr/lib64/libstdc++.so.6, at 0x7f8c82a53992, in __cxa_throw
#4    Object "/usr/lib64/libstdc++.so.6, at 0x7f8c82a53772, in std::terminate()
#3    Object "/usr/lib64/libstdc++.so.6, at 0x7f8c82a53745, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
#2    Object "/usr/lib64/libstdc++.so.6, at 0x7f8c82a557d4, in __gnu_cxx::__verbose_terminate_handler()
#1    Object "/usr/lib64/libc.so.6, at 0x7f8c81f2aa27, in abort
#0    Object "/usr/lib64/libc.so.6, at 0x7f8c81f29337, in gsignal
Aborted (Signal sent by tkill() 265262 0)
# umounthandler: executing fusermount -u -z /eos/jeodpp# umounthandler: sighandler received signal 6 - emitting signal 6 again

We have another crash to report, with a generated backtrace:

Stack trace (most recent call last) in thread 35074:
#10   Object ", at 0xffffffffffffffff, in 
#9    Object "/usr/lib64/libc.so.6, at 0x7ff8c583302c, in clone
#8    Source "pthread_create.c", line 0, in start_thread [0x7ff8c5b09dd4]
#7    Object "/opt/eos/xrootd/lib64/libXrdCl.so.2, at 0x7ff8c98752e8, in XrdCl::JobManager::RunJobs()
#6    Object "/opt/eos/xrootd/lib64/libXrdCl.so.2, at 0x7ff8c987508e, in XrdCl::JobManager::RunJobs()
#5    Object "/opt/eos/xrootd/lib64/libXrdCl.so.2, at 0x7ff8c980c72d, in XrdCl::Stream::HandleIncMsgJob::Run(void*)
#4    Object "/opt/eos/xrootd/lib64/libXrdCl.so.2, at 0x7ff8c982bc47, in XrdCl::XRootDMsgHandler::Process(XrdCl::Message*)
#3    Object "/opt/eos/xrootd/lib64/libXrdCl.so.2, at 0x7ff8c98281c8, in XrdCl::XRootDMsgHandler::HandleResponse()
#2    Object "/opt/eos/xrootd/lib64/libXrdCl.so.2, at 0x7ff8c9846fd5, in XrdCl::FileStateHandler::OnStateError(XrdCl::XRootDStatus*, XrdCl::Message*, XrdCl::ResponseHandler*, XrdCl::MessageSendParams&)
BFD: Dwarf Error: Could not find abbrev number 90.
BFD: Dwarf Error: Could not find abbrev number 84.
BFD: Dwarf Error: Could not find abbrev number 5766.
BFD: Dwarf Error: Could not find abbrev number 5766.
BFD: Dwarf Error: Offset (1994106625) greater than or equal to .debug_str size (19800455).
#1    Source "/opt/eos/xrootd/include/xrootd/XrdCl/XrdClXRootDResponses.hh", line 873, in  [0x692e3d]
        870:                                             HostList     *hostList )
        871:       {
        872:         delete hostList;
      > 873:         HandleResponse( status, response );
        874:       }
        875: 
        876:       //------------------------------------------------------------------------
#0    Source "/opt/rh/devtoolset-6/root/usr/include/c++/6.3.1/bits/stl_vector.h", line 656, in  [0x69ad3f]
Segmentation fault (Address not mapped to object [0x8])
# umounthandler: executing fusermount -u -z /eos/jeodpp# umounthandler: sighandler received signal 11 - emitting signal 11 again
###### cleaning stale cache directory '/var/cache/eos/fusex/md-cache/jeodpp/f3b1d0f6-51a5-11ea-affe-39c430d8ae18'
200721 17:45:34 t=1595346334.431071 f=mdcommunicate    l=WARN  tid=00007f202f3fa700 s=md:2691                  MGM asked us to set our heartbeat interval to 10 seconds, enable dentry-messaging, enable writesizeflush, accepts appname, accepts mdquery and server-version=4.5.15::1
200721 17:45:34 t=1595346334.529307 f=run              l=WARN  tid=00007f204499fe00 s=eosfuse:1541             ********************************************************************************
200721 17:45:34 t=1595346334.529331 f=run              l=WARN  tid=00007f204499fe00 s=eosfuse:1543             eosxd started version 4.5.15 - FUSE protocol version 28