EOS fst daemon terminated with error 134

Hello, everyone.

Three hours ago, I experienced a situation where eight FSTs simultaneously stopped with a 134 error.

For reference, I upgraded to 5.3.13 last Friday.

I have occasionally seen individual FST daemons die with exit code 139 (SIGSEGV) or 134 (SIGABRT), but this is the first time multiple FSTs have died simultaneously.

The only message left in the xrootd log is as follows.

terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_M_create

If anyone knows of any additional logs to check or has any insight into the cause, please share your knowledge.

Regards,
– Geonmo

Hi Geonmo,

You can enable the following environment variable on your FSTs so that we get a full stack trace when such a crash happens:
EOS_FST_ENABLE_STACKTRACE=1 in /etc/sysconfig/eos_env

Once this is enabled and an FST crash happens, you should see extra information printed in the FST log file. Is this crash reproducible, or was it a one-off?
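
For reference, on an FST node this is just an extra line in the sysconfig file followed by a restart of the FST daemon; how you restart depends on your deployment (systemd unit vs. podman container), so take the snippet below as a sketch:

# append to /etc/sysconfig/eos_env on every FST node
EOS_FST_ENABLE_STACKTRACE=1
# then restart the FST daemon so the new environment is picked up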

Thanks,
Elvin

Hello, Elvin.

I checked some core files to debug this issue.
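
For reference, each core was opened in gdb against the FST's xrootd binary roughly as below; the binary and core file paths are only illustrative for our containers:

gdb /opt/eos/xrootd/bin/xrootd core.8   # load the core (debuginfo packages installed for symbols)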

Here is the backtrace from gdb:

(gdb) bt
#0 0x00007fd1b7847ad4 in AssistedThread::join (this=0xbf0) at /root/rpmbuild/BUILD/eos-5.3.13-1/common/AssistedThread.hh:246
#1 eos::fst::Storage::ShutdownThreads (this=0x0) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/storage/Storage.cc:418
#2 0x00007fd1b7848179 in eos::fst::Storage::Shutdown (this=0x0) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/storage/Storage.cc:383
#3 0x00007fd1b77d07a6 in eos::fst::XrdFstOfs::xrdfstofs_shutdown (sig=15) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/XrdFstOfs.cc:268
#4 <signal handler called>
#5 0x00007fd1bb370f0a in read () from /lib64/libc.so.6
#6 0x00007fd1b6847b0c in eos::common::ShellExecutor::run_child (this=0x7fd1b69031b0 eos::common::ShellExecutor::instance()::executor) at /root/rpmbuild/BUILD/eos-5.3.13-1/common/ShellExecutor.cc:171
#7 0x00007fd1b684838f in eos::common::ShellExecutor::instance () at /root/rpmbuild/BUILD/eos-5.3.13-1/common/ShellExecutor.hh:83
#8 eos::common::ShellCmd::ShellCmd (this=this@entry=0x7ffe4a671490, cmd="uname -a") at /root/rpmbuild/BUILD/eos-5.3.13-1/common/ShellCmd.cc:69
#9 0x00007fd1b77d4344 in eos::fst::XrdFstOfs::Configure (this=this@entry=0x7fd1b78f2d40 eos::fst::gOFS, Eroute=..., envP=envP@entry=0x7ffe4a671950) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/XrdFstOfs.cc:502
#10 0x00007fd1b77d6ced in XrdSfsGetFileSystem2 (nativeFS=, Logger=, configFn=, envP=0x7ffe4a671950) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/XrdFstOfs.cc:131
#11 0x00007fd1bb96ee19 in XrdXrootdloadFileSystem (eDest=eDest@entry=0x7fd1bb9ced90 XrdXrootd::eLog, prevFS=prevFS@entry=0x0, fslib=0x7fd1ba854260 "libXrdEosFst.so",
cfn=cfn@entry=0x7fd1ba8541d0 "/etc/xrd.cf.fst", envP=envP@entry=0x7ffe4a671950) at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/XrdSys/XrdSysError.hh:144
#12 0x00007fd1bb963ab0 in XrdXrootdProtocol::ConfigFS (xEnv=..., cfn=0x7fd1ba8541d0 "/etc/xrd.cf.fst") at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/XrdXrootd/XrdXrootdConfig.cc:675
#13 0x00007fd1bb967d1c in XrdXrootdProtocol::Configure (parms=parms@entry=0x0, pi=pi@entry=0x415400 XrdMain::Config) at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/XrdXrootd/XrdXrootdConfig.cc:303
#14 0x00007fd1bb9781b1 in XrdgetProtocol (pname=, parms=0x0, pi=0x415400 XrdMain::Config) at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/XrdXrootd/XrdXrootdProtocol.cc:211
#15 0x000000000040e048 in XrdProtLoad::Load (lname=0x0, pname=0x7fd1ba808040 "xroot", parms=0x0, pi=pi@entry=0x415400 XrdMain::Config, istls=)
at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/Xrd/XrdProtLoad.cc:135
#16 0x000000000040a090 in XrdConfig::Setup (this=this@entry=0x415400 XrdMain::Config, dfltp=dfltp@entry=0x7fd1ba808030 "xroot", libProt=libProt@entry=0x0)
at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/Xrd/XrdConfig.cc:1376
#17 0x000000000040c040 in XrdConfig::Configure (this=this@entry=0x415400 XrdMain::Config, argc=argc@entry=13, argv=argv@entry=0x7ffe4a673398)
at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/Xrd/XrdConfig.cc:753
#18 0x0000000000406088 in main (argc=13, argv=0x7ffe4a673398) at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/Xrd/XrdMain.cc:191

(gdb) bt
#0 eos::common::RWMutexReadLock::Grab (this=0x7ffe4a670580, mutex=..., function=0x7fd1b78b4c4d "GetFSCount", file=0x7fd1b78b3d80 "/root/rpmbuild/BUILD/eos-5.3.13-1/fst/storage/Storage.cc", line=1392)
at /root/rpmbuild/BUILD/eos-5.3.13-1/common/RWMutex.hh:112
#1 0x00007fd1b51773a4 in eos::common::RWMutexReadLock::RWMutexReadLock (this=, mutex=..., function=, file=, line=)
at /root/rpmbuild/BUILD/eos-5.3.13-1/common/RWMutex.cc:1431
#2 0x00007fd1b7845201 in eos::fst::Storage::GetFSCount (this=0x0) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/storage/Storage.cc:1392
#3 0x00007fd1b77d096a in eos::fst::XrdFstOfs::xrdfstofs_shutdown (sig=15) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/XrdFstOfs.cc:246
#4 <signal handler called>
#5 0x00007fd1bb370f0a in read () from /lib64/libc.so.6
#6 0x00007fd1b6847b0c in eos::common::ShellExecutor::run_child (this=0x7fd1b69031b0 eos::common::ShellExecutor::instance()::executor) at /root/rpmbuild/BUILD/eos-5.3.13-1/common/ShellExecutor.cc:171
#7 0x00007fd1b684838f in eos::common::ShellExecutor::instance () at /root/rpmbuild/BUILD/eos-5.3.13-1/common/ShellExecutor.hh:83
#8 eos::common::ShellCmd::ShellCmd (this=this@entry=0x7ffe4a671490, cmd="uname -a") at /root/rpmbuild/BUILD/eos-5.3.13-1/common/ShellCmd.cc:69
#9 0x00007fd1b77d4344 in eos::fst::XrdFstOfs::Configure (this=this@entry=0x7fd1b78f2d40 eos::fst::gOFS, Eroute=..., envP=envP@entry=0x7ffe4a671950) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/XrdFstOfs.cc:502
#10 0x00007fd1b77d6ced in XrdSfsGetFileSystem2 (nativeFS=, Logger=, configFn=, envP=0x7ffe4a671950) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/XrdFstOfs.cc:131
#11 0x00007fd1bb96ee19 in XrdXrootdloadFileSystem (eDest=eDest@entry=0x7fd1bb9ced90 XrdXrootd::eLog, prevFS=prevFS@entry=0x0, fslib=0x7fd1ba854260 "libXrdEosFst.so",
cfn=cfn@entry=0x7fd1ba8541d0 "/etc/xrd.cf.fst", envP=envP@entry=0x7ffe4a671950) at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/XrdSys/XrdSysError.hh:144
#12 0x00007fd1bb963ab0 in XrdXrootdProtocol::ConfigFS (xEnv=..., cfn=0x7fd1ba8541d0 "/etc/xrd.cf.fst") at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/XrdXrootd/XrdXrootdConfig.cc:675
#13 0x00007fd1bb967d1c in XrdXrootdProtocol::Configure (parms=parms@entry=0x0, pi=pi@entry=0x415400 XrdMain::Config) at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/XrdXrootd/XrdXrootdConfig.cc:303
#14 0x00007fd1bb9781b1 in XrdgetProtocol (pname=, parms=0x0, pi=0x415400 XrdMain::Config) at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/XrdXrootd/XrdXrootdProtocol.cc:211
#15 0x000000000040e048 in XrdProtLoad::Load (lname=0x0, pname=0x7fd1ba808040 "xroot", parms=0x0, pi=pi@entry=0x415400 XrdMain::Config, istls=)
at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/Xrd/XrdProtLoad.cc:135
#16 0x000000000040a090 in XrdConfig::Setup (this=this@entry=0x415400 XrdMain::Config, dfltp=dfltp@entry=0x7fd1ba808030 "xroot", libProt=libProt@entry=0x0)
at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/Xrd/XrdConfig.cc:1376
#17 0x000000000040c040 in XrdConfig::Configure (this=this@entry=0x415400 XrdMain::Config, argc=argc@entry=13, argv=argv@entry=0x7ffe4a673398)
at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/Xrd/XrdConfig.cc:753
#18 0x0000000000406088 in main (argc=13, argv=0x7ffe4a673398) at /usr/src/debug/eos-xrootd-5.8.2-1.el9.x86_64/src/Xrd/XrdMain.cc:191

The backtrace information is shown above; below is an analysis of it produced by ChatGPT.

While analyzing recent segmentation faults, I encountered two different stack traces which point to a common issue related to the eos::fst::Storage class. In both cases, a nullptr is being dereferenced during the shutdown process, leading to a crash.

Here are the findings:

  1. In the first trace, the segfault occurs in Storage::ShutdownThreads():

    #1  eos::fst::Storage::ShutdownThreads(this=0x0)
    

    The this pointer is 0x0, which means the method is being called on a null object.

  2. In the second trace, the crash occurs in Storage::GetFSCount():

    #2  eos::fst::Storage::GetFSCount(this=0x0)
    

    Again, this is null and the code attempts to lock a mutex through RWMutexReadLock, which results in a crash.

In both cases, the calls originate from xrdfstofs_shutdown(), which suggests that the Storage object might have already been destroyed or never initialized properly at the time of shutdown.

Please let me know if you’d like me to prepare a patch or dive deeper into any specific part of the code.

Best regards,

Sorry for the confusion.

The two gdb backtraces above appear to come from the routine that runs when the FST exits, so they don't seem to reflect the production issue. I'm reattaching a different backtrace below instead.

We believe the backtrace below is the one generated when the issue occurred during the actual service, and we would appreciate it if you could review it.

(gdb) 
#0  0x00007fc3b7c80e2c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007fc3b7c33b46 in raise () from /lib64/libc.so.6
#2  0x00007fc3b7c1d833 in abort () from /lib64/libc.so.6
#3  0x00007fc3b7f96b21 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#4  0x00007fc3b7fa252c in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#5  0x00007fc3b7fa14f9 in __cxa_call_terminate () from /lib64/libstdc++.so.6
#6  0x00007fc3b7fa1c7a in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#7  0x00007fc3b7e102c4 in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
#8  0x00007fc3b7e10cfe in _Unwind_Resume () from /lib64/libgcc_s.so.1
#9  0x00007fc3b3f9966c in eos::fst::ScanDir::CheckFile (this=this@entry=0x7fc39afc3c00, fpath="/jbod/box_03_disk_082/00000127/002d2139") at /usr/include/c++/11/ext/new_allocator.h:89
#10 0x00007fc3b3fbde1b in eos::fst::ScanDir::ScanSubtree (this=this@entry=0x7fc39afc3c00, assistant=...) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/ScanDir.cc:547
#11 0x00007fc3b3fbe5ae in eos::fst::ScanDir::RunDiskScan (this=0x7fc39afc3c00, assistant=...) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/ScanDir.cc:485
#12 0x00007fc3b7fd0ad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#13 0x00007fc3b7c7f0ea in start_thread () from /lib64/libc.so.6
#14 0x00007fc3b7d03444 in clone () from /lib64/libc.so.6

Hi Geonmo,

Thanks for the traces. Indeed, this last one looks like the right one to explain the initial abort message that you get in the xrootd logs. I checked the code and it's not immediately obvious where the problem comes from. If you are able to attach with gdb, can you go to frame 9 and print all the available information? In gdb this would mean:

frame 9
info local
print *this
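
If it is easier to work from the core file, the same information can also be extracted non-interactively, e.g. (the xrootd binary and core file paths below are placeholders for your setup):

gdb -batch -ex 'frame 9' -ex 'info locals' -ex 'print *this' /opt/eos/xrootd/bin/xrootd core.8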

Thank you,
Elvin

Our system runs two EOS FSTs per machine with podman. Since I deleted the core file I showed earlier, I'm sharing the information from the core file of the other EOS FST on the same machine.

The content is presumably the same.

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007fe8e9073e93 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007fe8e9026b46 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fe8e9010833 in __GI_abort () at abort.c:79
#4  0x00007fe8e9389b21 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#5  0x00007fe8e939552c in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#6  0x00007fe8e93944f9 in __cxa_call_terminate () from /lib64/libstdc++.so.6
#7  0x00007fe8e9394c7a in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#8  0x00007fe8e92032c4 in _Unwind_RaiseException_Phase2 () from /lib64/libgcc_s.so.1
#9  0x00007fe8e9203cfe in _Unwind_Resume () from /lib64/libgcc_s.so.1
#10 0x00007fe8e539966c in eos::fst::ScanDir::CheckFile (this=this@entry=0x7fe8cc00c200, fpath="/jbod/box_04_disk_011/00000188/003bf615") at /usr/include/c++/11/ext/new_allocator.h:89
#11 0x00007fe8e53bde1b in eos::fst::ScanDir::ScanSubtree (this=this@entry=0x7fe8cc00c200, assistant=...) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/ScanDir.cc:547
#12 0x00007fe8e53be5ae in eos::fst::ScanDir::RunDiskScan (this=0x7fe8cc00c200, assistant=...) at /root/rpmbuild/BUILD/eos-5.3.13-1/fst/ScanDir.cc:485
#13 0x00007fe8e93c3ad4 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#14 0x00007fe8e90720ea in start_thread (arg=<optimized out>) at pthread_create.c:443
#15 0x00007fe8e90f6444 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
(gdb) frame 10
#10 0x00007fe8e539966c in eos::fst::ScanDir::CheckFile (this=this@entry=0x7fe8cc00c200, fpath="/jbod/box_04_disk_011/00000188/003bf615") at /usr/include/c++/11/ext/new_allocator.h:89
warning: 89     /usr/include/c++/11/ext/new_allocator.h: No such file or directory
(gdb) info local
__FUNCTION__ = "CheckFile"
io = std::unique_ptr<eos::fst::FileIo> = {get() = 0x7fe8ab3def00}
fid = <optimized out>
info = {st_dev = 64856, st_ino = 3291398962, st_nlink = 1, st_mode = 33152, st_uid = 2, st_gid = 2, __pad0 = 0, st_rdev = 0, st_size = 155193344, st_blksize = 4096, st_blocks = 303112, st_atim = {
    tv_sec = 1651201478, tv_nsec = 739552645}, st_mtim = {tv_sec = 1651202685, tv_nsec = 495433298}, st_ctim = {tv_sec = 1745004769, tv_nsec = 626498083}, __glibc_reserved = {0, 0, 0}}
fmd = std::unique_ptr<eos::common::FmdHelper> = {get() = 0x7fe8ab3e0800}
(gdb) print *this
$1 = {<eos::common::LogId> = {_vptr.LogId = 0x7fe8e54ea960 <vtable for eos::fst::ScanDir+16>, logId = "8cafcb58-3c30-11f0-a243-b8599fa512a0\000\245\245\245", 
    cident = "<service>\000", '\245' <repeats 246 times>, vid = {uid = 0, gid = 0, uid_string = "", gid_string = "", allowed_uids = std::set with 0 elements, allowed_gids = std::set with 0 elements, tident = {
        _vptr.XrdOucString = 0x413c10 <vtable for XrdOucString+16>, str = 0x7fe8cc001218 "", len = 0, siz = 1, static blksize = -1}, name = {_vptr.XrdOucString = 0x413c10 <vtable for XrdOucString+16>, 
        str = 0x7fe8cc001210 "", len = 0, siz = 1, static blksize = -1}, prot = {_vptr.XrdOucString = 0x413c10 <vtable for XrdOucString+16>, str = 0x7fe8cc001220 "", len = 0, siz = 1, static blksize = -1}, 
      host = "", domain = "", grps = "", role = "", dn = "", geolocation = "", app = "", key = "", email = "", fullname = "", federation = "", scope = "", trace = "", onbehalf = "", sudoer = false, 
      gateway = false, token = std::shared_ptr<Token> (empty) = {get() = 0x0}}}, static sDefaultNsScanRate = 50, mFstLoad = 0x7fe8e066d938, mFsId = 264, mDirPath = "/jbod/box_04_disk_011", 
  mRateBandwidth = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 100}, static is_always_lock_free = true}, mEntryIntervalSec = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, 
      _M_i = 518400}, static is_always_lock_free = true}, mRainEntryIntervalSec = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 2419200}, static is_always_lock_free = true}, 
  mDiskIntervalSec = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 14400}, static is_always_lock_free = true}, mNsIntervalSec = {<std::__atomic_base<unsigned long>> = {
      static _S_alignment = 8, _M_i = 259200}, static is_always_lock_free = true}, mConfDiskIntervalSec = 14400, mNumScannedFiles = 0, mNumCorruptedFiles = 0, mNumHWCorruptedFiles = 0, mTotalScanSize = 0, 
  mNumTotalFiles = 23654, mNumSkippedFiles = 1302, 
  mBuffer = 0x7fe8aaaff000 "\215fz~0\277L֛\355\357\271\006\301\320\036\235\200Dn\201\037\002+\206\225ZD߬\017\240.D\b\261\250L\247\216\312(\211\231\230\377\216%\271\034\021\031\215\257\231-\270\344\353*\v\236x_\222Ҵ\231\2720\274\363]\241\240'\276\255\363m\374\037\343\333u|M\215C8G\215'k\017\267&[J*\235%>\244b\224t\223\214\345\312cD\241{\036\032c\301\362i\270\255\347r\322`\331bM\302\215b%\037\341\345O\376H\332]J\357 \367`\232t[\322\304\327\343!т\324}йԮ\302\300z\215\323SAOV[J\271\234<\271\252үY\225\031ާ]뱾\027\361\242\022\320­"..., mBufferSize = 1048576, mBgThread = true, mDiskThread = {
    _vptr.AssistedThread = 0x7fe8e54e96a8 <vtable for AssistedThread+16>, assistant = std::unique_ptr<ThreadAssistant> = {get() = 0x7fe8cc011b00}, joined = false, th = {_M_id = {
        _M_thread = 140637267555904}}}, mNsThread = {_vptr.AssistedThread = 0x7fe8e54e96a8 <vtable for AssistedThread+16>, assistant = std::unique_ptr<ThreadAssistant> = {get() = 0x7fe8cc011b80}, 
    joined = false, th = {_M_id = {_M_thread = 140637259163200}}}, mClock = {mFake = false, mtx = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, 
            __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, <No data fields>}, fakeTimepoint = {__d = {__r = 0}}}, 
  mRateLimit = std::unique_ptr<eos::common::IRateLimit> = {get() = 0x7fe8cc11c700}}

Hi Geonmo,

Thank you for the trace. Just to make sure, do you get the same error message in the log?

terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_M_create

One more thing I forgot to ask, can you also print the fpath value in gdb:
p fpath

Thanks,
Elvin

Hello, Elvin.

First of all, the fpath value is:

(gdb) p fpath
$2 = "/jbod/box_04_disk_011/00000188/003bf615"

The std::length_error message is present in the log, and its timestamp is around the time the core file was created.

The core file was created at 6/1 03:34 KST (UTC+9):

[root@jbod-mgmt-02 core]# ls -alh
total 36G
drwxrwxrwx. 2 daemon daemon  20 Jun  4 17:51 .
drwxr-xr-x. 6 daemon daemon  52 May  8 18:18 ..
-rw-------. 1 root   daemon 89G Jun  1 03:34 core.8

The std::length_error occurred at 5/31 18:33 UTC (6/1 03:33 KST):

[root@jbod-mgmt-02 fst]# cat xrootd.fst.log-20250602 | grep -A5 -B1 "std::length_error"
250531 18:33:47 time=1748716427.315724 func=open                     level=ERROR logid=52f89c70-3e45-11f0-a67c-b8599f9c4330 unit=fst@jbod-mgmt-02.sdfarm.kr:1096 tid=00007fe5a4d1f640 source=XrdFstOfsFile:194              tident=4.2:5959@jbod-mgmt-11 sec=unix  uid=0 gid=0 name=daemon geo="" xt="" ob="" msg="failed while processing TPC/open opaque" path="/eos/gsdc/grid/12/57549/cfbacb7c-8db4-11ef-800d-0242e036e049"
terminate called after throwing an instance of 'std::length_error'
250531 18:33:47 time=1748716427.389361 func=_close_wr                level=ERROR logid=cbf12c24-3e49-11f0-af8c-b8599f9c4330 unit=fst@jbod-mgmt-02.sdfarm.kr:1096 tid=00007fe7ff784640 source=XrdFstOfsFile:1912             tident=189.7:3228@jbod-mgmt-05 sec=      uid=10367 gid=1395 name=nobody geo="" xt="" ob="" msg="delete on close" fxid=02dad199 ns_path="/eos/gsdc/grid/06/61093/b2169347-a15b-11ef-800c-0242743e8647" 
  what():  basic_string::_M_create
250531 21:59:22 008 Removed log file /var/log/eos/fst/xrootd.fst.log-20250526.gz
250531 21:59:22 008 Removed log file /var/log/eos/fst/xrootd.fst.log-20250527.gz
250531 21:59:22 008 Removed log file /var/log/eos/fst/xrootd.fst.log-20250528.gz
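
For reference, converting the epoch timestamp from the log line confirms this is the same event as the core dump (KST is UTC+9):

date -u -d @1748716427 +'%F %T %Z'              # 2025-05-31 18:33:47 UTC
TZ=Asia/Seoul date -d @1748716427 +'%F %T %Z'   # 2025-06-01 03:33:47 KST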

I hope this information is valuable to you.

Regards,

– Geonmo

Hi Geonmo,

Can you please paste here the output of the command eos-fmd-tool inspect --path /jbod/box_04_disk_011/00000188/003bf615.

Thanks

Regards,
Gianmaria

Hello, Gianmaria.

Here it is.

[root@jbod-mgmt-02 /]# eos-fmd-tool inspect --path /jbod/box_04_disk_011/00000188/003bf615
fid: 3929621
cid: 7260
fsid: 264
ctime: 1651201477
ctime_ns: 320171159
mtime: 1749235360
mtime_ns: 959433000
atime: 1749235360
atime_ns: 959433000
checktime: 1744978984
size: 1860734464
disksize: 155193344
mgmsize: 1860734464
checksum: "a47921b0"
diskchecksum: ""
mgmchecksum: "a47921b0"
lid: 1080299346
uid: 10367
gid: 1395
filecxerror: 0
blockcxerror: 0
layouterror: 0
locations: "1440,768,264,852,12,516,1356,96,600,936,1020,684,1104,23011,1272,22011"
25: "7f1e2a28"

Hello, everyone.

We've noticed something about memory usage that we'd like to share. We were running 5.3.13 and have downgraded to 5.3.11 because 5.3.13 appears to be broken, but I left the jbod-mgmt-12 machine on 5.3.13, and its memory usage is very different compared to the other machines.

The other machines have not been restarted in a while, so a direct comparison may not be entirely fair, but if they reached ~111 GB of RSS like jbod-mgmt-12, they would run out of memory, since they each run two FSTs (the older machines have 192 GB of RAM).
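
For a quick per-node cross-check, the per-process memory could also be read with ps; the xrootd process name here is an assumption based on how the FST daemon runs in our containers:

ps -C xrootd -o pid,rss,vsz,etime,args --sort=-rss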

Regards,
– Geonmo

EOS Console [root://localhost] |/eos/gsdc/grid/> node ls --sys
┌────────────────────────────────┬────────────────┬────────────┬────────────┬────────────┬──────────┬────────┬──────────────┬────────────┬──────────────────────────────┬────────────────────────────────┬─────────────────────────────────────────────────────────────────┐
│hostport                        │          geotag│       vsize│         rss│     threads│   sockets│kworkers│           eos│      xrootd│                kernel version│                           start│                                                           uptime│
└────────────────────────────────┴────────────────┴────────────┴────────────┴────────────┴──────────┴────────┴──────────────┴────────────┴──────────────────────────────┴────────────────────────────────┴─────────────────────────────────────────────────────────────────┘
 jbod-mgmt-01.sdfarm.kr:1095      kisti::gsdc::g01      48.01 G       8.87 G       1.13 K       7013        0       5.3.11-1        5.8.0   3.10.0-1160.119.1.el7.x86_64         Fri Jun  6 23:25:39 2025 02:03:46 up 199 days, 17:13, load average: 132.35, 127.49, 123.18 
 jbod-mgmt-01.sdfarm.kr:1096      kisti::gsdc::g01      43.91 G       8.35 G       1.09 K       7011        0       5.3.11-1        5.8.0   3.10.0-1160.119.1.el7.x86_64         Fri Jun  6 23:36:41 2025 02:03:58 up 199 days, 17:13, load average: 136.18, 128.49, 123.56 
 jbod-mgmt-02.sdfarm.kr:1095      kisti::gsdc::g01      45.26 G       8.79 G       1.04 K       4331        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:27:25 2025     02:04:08 up 199 days, 18:3, load average: 66.35, 64.19, 63.50 
 jbod-mgmt-02.sdfarm.kr:1096      kisti::gsdc::g01       7.23 G       1.94 G          811       4320        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Mon Jun  9 01:44:51 2025     02:04:04 up 199 days, 18:3, load average: 65.07, 63.90, 63.41 
 jbod-mgmt-03.sdfarm.kr:1095      kisti::gsdc::g01      37.09 G       9.65 G          991       6685        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:47:21 2025    02:03:58 up 199 days, 18:3, load average: 90.64, 99.34, 101.57 
 jbod-mgmt-03.sdfarm.kr:1096      kisti::gsdc::g01      41.62 G       9.48 G       1.07 K       6758        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:48:02 2025   02:03:34 up 199 days, 18:3, load average: 95.57, 101.01, 102.16 
 jbod-mgmt-04.sdfarm.kr:1095      kisti::gsdc::g02      46.34 G       8.84 G       1.10 K       7071        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:37:52 2025   02:04:06 up 83 days, 0:25, load average: 132.21, 137.00, 136.36 
 jbod-mgmt-04.sdfarm.kr:1096      kisti::gsdc::g02      44.94 G       8.73 G       1.06 K       7071        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:29:01 2025   02:04:07 up 83 days, 0:25, load average: 132.21, 137.00, 136.36 
 jbod-mgmt-05.sdfarm.kr:1095      kisti::gsdc::g02      39.30 G       9.59 G       1.05 K       6278        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:39:43 2025 02:04:04 up 272 days, 17:39, load average: 114.20, 109.67, 111.35 
 jbod-mgmt-05.sdfarm.kr:1096      kisti::gsdc::g02      35.82 G       9.52 G       1.00 K       6277        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:30:13 2025 02:04:01 up 272 days, 17:39, load average: 114.04, 109.56, 111.32 
 jbod-mgmt-06.sdfarm.kr:1095      kisti::gsdc::g02      41.65 G       8.50 G       1.02 K       6175        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:31:18 2025 02:04:03 up 272 days, 16:21, load average: 137.02, 137.72, 138.67 
 jbod-mgmt-06.sdfarm.kr:1096      kisti::gsdc::g02      40.90 G       8.50 G       1.07 K       6206        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:44:07 2025 02:03:51 up 272 days, 16:21, load average: 138.45, 138.00, 138.77 
 jbod-mgmt-07.sdfarm.kr:1095      kisti::gsdc::g03      34.08 G      10.71 G          955       5444        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:45:31 2025    02:04:03 up 271 days, 23:56, load average: 97.18, 97.34, 93.40 
 jbod-mgmt-07.sdfarm.kr:1096      kisti::gsdc::g03      10.17 G       3.76 G          990       5444        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Mon Jun  9 00:49:14 2025    02:04:01 up 271 days, 23:56, load average: 97.20, 97.35, 93.38 
 jbod-mgmt-08.sdfarm.kr:1095      kisti::gsdc::g03      43.66 G       8.58 G       1.01 K       5076        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:32:59 2025 02:04:04 up 271 days, 21:57, load average: 107.35, 114.22, 113.56 
 jbod-mgmt-08.sdfarm.kr:1096      kisti::gsdc::g03       8.95 G       3.68 G          899       5079        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Mon Jun  9 00:41:20 2025 02:04:02 up 271 days, 21:57, load average: 110.69, 114.98, 113.80 
 jbod-mgmt-09.sdfarm.kr:1095      kisti::gsdc::g03      41.17 G       9.63 G       1.10 K       6426        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:34:53 2025  02:04:04 up 168 days, 0:34, load average: 110.83, 111.98, 114.44 
 jbod-mgmt-09.sdfarm.kr:1096      kisti::gsdc::g03      41.11 G       9.76 G       1.07 K       6435        0       5.3.11-1        5.8.0    3.10.0-1160.62.1.el7.x86_64         Fri Jun  6 23:35:33 2025  02:04:05 up 168 days, 0:34, load average: 110.83, 111.98, 114.44 
 jbod-mgmt-11.sdfarm.kr:1095      kisti::gsdc::e01      22.25 G       8.40 G          520       2305        0       5.3.11-1        5.8.0   5.14.0-503.31.1.el9_5.x86_64         Fri Jun  6 23:57:55 2025        02:04:04 up 59 days, 20:33, load average: 6.02, 5.79, 6.58 
 jbod-mgmt-11.sdfarm.kr:1096      kisti::gsdc::e01      25.98 G       9.58 G          586       2304        0       5.3.11-1        5.8.0   5.14.0-503.31.1.el9_5.x86_64         Fri Jun  6 23:59:47 2025        02:04:06 up 59 days, 20:33, load average: 6.02, 5.79, 6.58 
 jbod-mgmt-12.sdfarm.kr:1095      kisti::gsdc::e01     316.48 G     111.47 G          616       1162        0       5.3.13-1        5.8.2   5.14.0-503.31.1.el9_5.x86_64         Wed Jun  4 04:57:04 2025        02:04:03 up 59 days, 20:56, load average: 5.21, 5.84, 6.09 

Hi Geonmo,

Thank you for the report! Indeed, there is a memory leak when scanning RAIN files in 5.3.13 which we've identified and fixed. We are also looking into providing a fix for the crashes that you have seen, and we'll shortly release 5.3.14 containing both fixes.
I will notify you here once the release is ready and you can upgrade your FSTs.

Cheers,
Elvin

Hi Geonmo,

5.3.14 is now out and contains the fixes for the memory leak and the crashes that you observed. Please let us know how things go after the upgrade.

Thanks,
Elvin