Hello,
It appears when a find or fileinfo command, requested either via HTTP or Xrootd, produces a result file that’s over 2GiB, the non-QDB MGM thread crashes. I have not managed to capture a core dump to be used, as it wipes itself for some reason, and the core dump is approximately 200GB as well.
What I have captured, is the log entries from when this occurs. I don’t understand how the sending of results from /tmp/eos.mgm/ works, so I haven’t been able to write a patch.
In this case, we’ve had a user successfully write in 6.8 million files into a single directory, and they can readily fetch them singly or do a simple ls over the contents. If I request metadata in the form of a find -I, or a fileinfo in json format, the result of the query is successfully staged through /tmp/eos.mgm/ as a multi-gigabyte file, but the MGM thread associated crashes and the client never sees any reply to the query.
The log entry when doing this via HTTP reads (paths somewhat sanitised):
190725 09:50:28 time=1564048228.435535 func=Emsg level=ERROR logid=57393b86-aebe-11e9-b6f8-506b4b385e7c unit=mgm@gdpt-a1.cdn.aarnet.edu.au:1094 tid=00007f404c7df700 source=XrdMgmOfsFile:2813 tident=minio.1:1@crlt-a2.cdn.aarnet.edu.au sec=http uid=48 gid=48 name=48 geo="crlt" Unable to read ; File too large
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::append
error: received signal 6:
/lib64/libXrdEosMgm.so(_Z20xrdmgmofs_stacktracei+0x44)[0x7f41928f8c74]
/lib64/libc.so.6(+0x36340)[0x7f419886c340]
/lib64/libc.so.6(gsignal+0x37)[0x7f419886c2c7]
/lib64/libc.so.6(abort+0x148)[0x7f419886d9b8]
/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x7f419917b7d5]
/lib64/libstdc++.so.6(+0x5e746)[0x7f4199179746]
/lib64/libstdc++.so.6(+0x5e773)[0x7f4199179773]
/lib64/libstdc++.so.6(+0x5e993)[0x7f4199179993]
/lib64/libstdc++.so.6(_ZSt20__throw_length_errorPKc+0x77)[0x7f41991ce7a7]
/lib64/libstdc++.so.6(+0xbe9ed)[0x7f41991d99ed]
#########################################################################
# stack trace exec=/usr/bin/xrootd pid=1 what='thread apply all bt'
#########################################################################
.....
#########################################################################
# -----------------------------------------------------------------------
# Responsible thread =>
# -----------------------------------------------------------------------
# Thread 5 (Thread 0x7f404c7df700 (LWP 491)):
#########################################################################
#4 <signal handler called>
#5 0x00007f419886c2c7 in raise () from /lib64/libc.so.6
#6 0x00007f419886d9b8 in abort () from /lib64/libc.so.6
#7 0x00007f419917b7d5 in __gnu_cxx::__verbose_terminate_handler() ()
from /lib64/libstdc++.so.6
#8 0x00007f4199179746 in ?? () from /lib64/libstdc++.so.6
#9 0x00007f4199179773 in std::terminate() () from /lib64/libstdc++.so.6
#10 0x00007f4199179993 in __cxa_throw () from /lib64/libstdc++.so.6
#11 0x00007f41991ce7a7 in std::__throw_length_error(char const*) ()
from /lib64/libstdc++.so.6
#12 0x00007f41991d99ed in std::string::append(char const*, unsigned long) ()
from /lib64/libstdc++.so.6
#13 0x00007f4192ae4072 in eos::mgm::HttpHandler::Get(eos::common::HttpRequest*, bool) () from /lib64/libXrdEosMgm.so
#14 0x00007f4192ae8aec in eos::mgm::HttpHandler::HandleRequest(eos::common::HttpRequest*) () from /lib64/libXrdEosMgm.so
#15 0x00007f4192adc5da in eos::mgm::HttpServer::Handler(void*, MHD_Connection*, char const*, char const*, char const*, char const*, unsigned long*, void**) ()
from /lib64/libXrdEosMgm.so
#16 0x00007f4192269858 in eos::common::HttpServer::StaticHandler(void*, MHD_Connection*, char const*, char const*, char const*, char const*, unsigned long*, void**) () from /usr/lib64/libeosCommonServer.so.4
#17 0x00007f418ea91df0 in call_connection_handler ()
from /usr/lib64/libmicrohttpd.so.10
#18 0x00007f418ea93a67 in MHD_connection_handle_idle ()
from /usr/lib64/libmicrohttpd.so.10
#19 0x00007f418ea99730 in MHD_epoll () from /usr/lib64/libmicrohttpd.so.10
#20 0x00007f418ea998fe in MHD_select_thread ()
from /usr/lib64/libmicrohttpd.so.10
#21 0x00007f4199631dd5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f419893402d in clone () from /lib64/libc.so.6
#########################################################################
# stack trace exec=/usr/bin/xrootd pid=1 what='generate-core-file'
#########################################################################
If I search for matching logid lines, I only see:
xrdlog.mgm:190725 09:26:54 time=1564046814.190081 func=open level=INFO logid=57393b86-aebe-11e9-b6f8-506b4b385e7c unit=mgm@gdpt-a1.cdn.aarnet.edu.au:1094 tid=00007f404c7df700 source=XrdMgmOfsFile:193 tident=minio.1:1@crlt-a2.cdn.aarnet.edu.au sec=http uid=48 gid=48 name=48 geo="crlt" op=read path=/proc/user/ info=mgm.cmd=find&mgm.option=fI&mgm.path=/eos/..../commvault-library/files/&eos.app=http
xrdlog.mgm:190725 09:50:28 time=1564048228.435535 func=Emsg level=ERROR logid=57393b86-aebe-11e9-b6f8-506b4b385e7c unit=mgm@gdpt-a1.cdn.aarnet.edu.au:1094 tid=00007f404c7df700 source=XrdMgmOfsFile:2813 tident=minio.1:1@crlt-a2.cdn.aarnet.edu.au sec=http uid=48 gid=48 name=48 geo="crlt" Unable to read ; File too large
On the MGM after the crash, I have:
[root@gdpt-a1 eos.mgm]# pwd
/tmp/eos.mgm
[root@gdpt-a1 eos.mgm]# ls -al
total 8768708
drwx------ 2 daemon daemon 43 Jul 25 09:50 .
drwxrwxrwt 1 root root 233 Jul 25 09:53 ..
-rw-r--r-- 1 daemon daemon 8979155718 Jul 25 09:50 139914137958144.mResultstream
The contents of 139914137958144.mResultstream is valid and complete, and has the correct starting and terminating fields the MGM writes into the stream.
Repeating the same test using xrdcp, adjusting the syntax for source appropriately, I get:
190726 01:03:03 055 XrdLink: Unable to send to root.242004:172@crlt-a2; broken pipe
190726 01:03:03 055 XrootdXeq: root.242004:172@crlt-a2 disc 0:14:44 (send failure)
190726 01:05:18 043 XrdLink: Unable to send to root.242004:175@crlt-a2; broken pipe
190726 01:05:18 043 XrootdXeq: root.242004:175@crlt-a2 disc 0:13:59 (send failure)
Nothing in the error.log when doing this with xrdcp.
The impact of this is that there’s no fast way for me to request a metadata listing across this directory, and the only alternative is to literally throw millions of stat calls against the contents.