ccaffy
(Cedric Caffy)
October 19, 2023, 12:12pm
21
useren:
ccaffy:
p adminpath
[Switching to Thread 0x7f98a37fd700 (LWP 254145)]
0x00007f9905d07fc7 in malloc () from /usr/lib64/libjemalloc.so.1
Missing separate debuginfos, use: debuginfo-install nss-pem-1.0.3-7.el7.x86_64 nss-softokn-3.67.0-3.el7_9.x86_64 nss-sysinit-3.67.0-4.el7_9.x86_64 sqlite-3.7.17-8.el7_7.1.x86_64
(gdb) f 5
#5 0x00007f9905732dfa in XrdTlsTempCA::Maintenance (this=0x7f98eabf7780) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:396
396 std::string ca_tmp_dir = std::string(adminpath) + "/.xrdtls";
(gdb) p adminpath
$1 = <optimized out>
Haha, I was sure you would get this…
Anyway, may I first suggest you change your configuration by removing http.cert, http.key and http.cadir and replacing them with the following:
xrd.tls /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem
xrd.tlsca certdir /etc/grid-security/certificates/
Please tell me if that unblocks you. In the meantime, I'll investigate that issue further…
Cheers,
Cedric
amadio
(Guilherme Amadio)
October 19, 2023, 12:13pm
22
Can you please try to print out the environment variable and also ca_tmp_dir, which is the local variable?
It seems ca_tmp_dir, which is a variable on the stack, is being passed by reference in
std::unique_ptr<TempCAGuard> new_file(TempCAGuard::create(m_log, ca_tmp_dir));
so eventually there is a use after free problem:
std::unique_ptr<XrdTlsTempCA::TempCAGuard>
XrdTlsTempCA::TempCAGuard::create(XrdSysError &err, const std::string &ca_tmp_dir) {
useren
(Uemit Seren)
October 19, 2023, 12:20pm
23
Ok I changed the config and restarted the fst service. Let's wait 15 minutes and see if it fixes the crashes.
I checked the other fst that is already running 5.2.0 and it contains the same config as the one that is crashing.
amadio
(Guilherme Amadio)
October 19, 2023, 12:34pm
24
Could you please check with coredumpctl, if that's available, whether you have more information about this crash? The string is copied inside the function I mentioned above, so the problem is likely somewhere else. Thank you.
useren
(Uemit Seren)
October 19, 2023, 12:52pm
25
Running the above in GDB got me this error message:
0x00007f4730be7fc7 in malloc () from /usr/lib64/libjemalloc.so.1
(gdb) f 5
#5 0x00007f4730612dfa in XrdTlsTempCA::Maintenance (this=0x7f4715bf1dc0) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:396
396 std::string ca_tmp_dir = std::string(adminpath) + "/.xrdtls";
(gdb) p (char*)getenv("XRDADMINPATH")
[New Thread 0x7f46767ee700 (LWP 82332)]
[New Thread 0x7f4676fef700 (LWP 82333)]
[Thread 0x7f46767ee700 (LWP 82332) exited]
Thread 89 "xrootd" received signal SIGSEGV, Segmentation fault.
0x00007f4730be7fc7 in malloc () from /usr/lib64/libjemalloc.so.1
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(malloc) will be abandoned.
When the function is done executing, GDB will silently stop.
(gdb) bt
#0 0x00007f4730be7fc7 in malloc () from /usr/lib64/libjemalloc.so.1
#1 <function called from gdb>
#2 0x00007f4730be7fc7 in malloc () from /usr/lib64/libjemalloc.so.1
#3 0x00007f472fefa18d in operator new(unsigned long) () from /lib64/libstdc++.so.6
#4 0x00007f472ff58cd9 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#5 0x00007f472ff5a561 in char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) () from /lib64/libstdc++.so.6
#6 0x00007f472ff5a998 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#7 0x00007f4730612dfa in XrdTlsTempCA::Maintenance (this=0x7f4715bf1dc0) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:396
#8 0x00007f47306142c8 in XrdTlsTempCA::MaintenanceThread (myself_raw=0x7f4715bf1dc0) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:495
#9 0x00007f473060ad77 in XrdSysThread_Xeq (myargs=0x7f47163ffca0) at /usr/src/debug/xrootd-5.6.2/src/XrdSys/XrdSysPthread.cc:86
#10 0x00007f472f76eea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f472f497b0d in clone () from /lib64/libc.so.6
useren
(Uemit Seren)
October 19, 2023, 12:52pm
26
Unfortunately this didn't help. The segfault is still occurring.
useren
(Uemit Seren)
October 19, 2023, 12:58pm
27
It's available, but currently no coredump is generated when the segfault occurs. I suspect this is due to SELinux. I will try to fix this.
ccaffy
(Cedric Caffy)
October 19, 2023, 1:09pm
28
Very funny, I tried on my local box and I see no such thing. Can you please paste/attach the logs you have, from the point you start the FST process until the segmentation fault happens?
Thanks again
useren
(Uemit Seren)
October 19, 2023, 1:56pm
29
Here is the fst log until the crash: xrd.cf.fst.crash
The last log line is:
231019 15:28:09 114496 TPC_TempCA: Reloading the list of CAs and CRLs in directory
useren
(Uemit Seren)
October 19, 2023, 2:50pm
30
I managed to generate a coredump. I can upload it, or you can tell me what exactly to look for.
(gdb) f 5
#5 0x00007f0444a37dfa in XrdTlsTempCA::Maintenance (this=0x7f0429b98dc0) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:396
396 std::string ca_tmp_dir = std::string(adminpath) + "/.xrdtls";
(gdb) info locals
adminpath = <optimized out>
ca_tmp_dir = <error reading variable: Cannot access memory at address 0xffffffffffffffe8>
new_file = std::unique_ptr<XrdTlsTempCA::TempCAGuard> = {get() = 0x0}
fddir = <optimized out>
dirp = <optimized out>
result = <optimized out>
(gdb) list
391 auto adminpath = getenv("XRDADMINPATH");
392 if (!adminpath) {
393 m_log.Emsg("TempCA", "Admin path is not set!");
394 return false;
395 }
396 std::string ca_tmp_dir = std::string(adminpath) + "/.xrdtls";
397
398 std::unique_ptr<TempCAGuard> new_file(TempCAGuard::create(m_log, ca_tmp_dir));
399 if (!new_file) {
400 m_log.Emsg("TempCA", "Failed to create a new temp CA / CRL file");
ccaffy
(Cedric Caffy)
October 19, 2023, 3:02pm
31
Thanks, if you can upload it that would be great as well. Then we have everything for debugging.
useren
(Uemit Seren)
October 19, 2023, 3:48pm
32
Here it is: core_xrootd_267646
Interestingly, since the last crash (with the coredump), the FST service has now been running without a crash for an hour. I didn't change anything except setting fs.suid_dumpable=2 and pointing kernel.core_pattern to /tmp.
Let's see how it goes overnight.
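For the record, those two settings can be applied like this (a sketch; the exact core_pattern value is an assumption, since only the target directory /tmp was mentioned):

```shell
# Allow daemons that change privileges (like xrootd) to dump core:
sysctl -w fs.suid_dumpable=2
# Write cores under /tmp; %e = executable name, %p = pid (assumed pattern):
sysctl -w kernel.core_pattern=/tmp/core_%e_%p
```

Note that a raw kernel.core_pattern bypasses systemd-coredump, so coredumpctl will not list dumps written this way.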
ccaffy
(Cedric Caffy)
October 19, 2023, 4:17pm
33
Thank you. Will take a look at your coredump tomorrow morning.
I will keep you updated!
Cheers,
Cedric
ccaffy
(Cedric Caffy)
October 20, 2023, 9:55am
34
Hi Uemit,
I haven't managed to see what the problem is yet.
Though from the logs of the FST, I see the following line: Plugin unreleased XrdHttpProtocolTest v5.7-rc20231013 is using unreleased EOSFSTHTTP v5.7-rc20230918 version in exthandlerlib /usr/lib64/libEosFstHttp.so.
Can you please give me the output of systemctl status eos@fst on your problematic FST and on your non-problematic one?
Can you please also give me the output of rpm -qa | grep eos on the problematic FST and on the non-problematic one?
Thanks again!
useren
(Uemit Seren)
October 23, 2023, 8:44am
35
ccaffy:
Can you please give me the output of systemctl status eos@fst on your problematic FST and on your non-problematic one?
Can you please also give me the output of rpm -qa | grep eos on the problematic FST and on the non-problematic one?
Ok, for some reason the segfaults stopped on all FSTs on October 19th (we are not sure why, because we didn't change anything). Nevertheless, below are the outputs you asked for.
FST-1 (broken fst):
[root@fst-1 fst]# systemctl status eos@fst
● eos@fst.service - EOS fst
Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2023-10-19 16:44:01 CEST; 3 days ago
Process: 264152 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
Process: 4295 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
Main PID: 4327 (xrootd)
CGroup: /system.slice/system-eos.slice/eos@fst.service
├─4327 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon
└─4349 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon
Oct 23 08:48:15 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.05 files=96461 scanduration=3640 [s] scansize=233028010380 [Bytes] [ 233028 MB ] scannedfiles=5423 corruptedfiles=0 hwcorrupted=0 skippedfiles=91038 disk_scan_interval_sec=14400
Oct 23 08:59:58 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.19 files=95928 scanduration=4864 [s] scansize=223052778587 [Bytes] [ 223053 MB ] scannedfiles=5468 corruptedfiles=0 hwcorrupted=0 skippedfiles=90460 disk_scan_interval_sec=14400
Oct 23 09:29:36 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.01 files=112608 scanduration=3618 [s] scansize=234448306636 [Bytes] [ 234448 MB ] scannedfiles=6725 corruptedfiles=0 hwcorrupted=0 skippedfiles=105883 disk_scan_interval_sec=14400
Oct 23 09:29:50 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.09 files=98118 scanduration=3095 [s] scansize=198831128576 [Bytes] [ 198831 MB ] scannedfiles=5412 corruptedfiles=0 hwcorrupted=0 skippedfiles=92706 disk_scan_interval_sec=14400
Oct 23 09:40:20 fst-1.eos.grid.vbc.ac.at scandir[4327]: skipping scan w-open file: localpath=/srv/data/data.04/000014f4/03329012 fsid=5 fxid=03329012
Oct 23 09:40:20 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.04 files=95987 scanduration=3852 [s] scansize=217912942592 [Bytes] [ 217913 MB ] scannedfiles=5370 corruptedfiles=0 hwcorrupted=0 skippedfiles=90616 disk_scan_interval_sec=14400
Oct 23 09:51:56 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.26 files=19401 scanduration=882 [s] scansize=54085201920 [Bytes] [ 54085.2 MB ] scannedfiles=939 corruptedfiles=0 hwcorrupted=0 skippedfiles=18462 disk_scan_interval_sec=14400
Oct 23 09:53:29 fst-1.eos.grid.vbc.ac.at scandir[4327]: skipping scan w-open file: localpath=/srv/data/data.12/000014f4/0332912c fsid=13 fxid=0332912c
Oct 23 09:53:29 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.12 files=98805 scanduration=3240 [s] scansize=214743099574 [Bytes] [ 214743 MB ] scannedfiles=5631 corruptedfiles=0 hwcorrupted=0 skippedfiles=93173 disk_scan_interval_sec=14400
Oct 23 10:17:09 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.20 files=97095 scanduration=3389 [s] scansize=229245644800 [Bytes] [ 229246 MB ] scannedfiles=5552 corruptedfiles=0 hwcorrupted=0 skippedfiles=91543 disk_scan_interval_sec=14400
[root@fst-1 fst]# rpm -qa | grep eos
eos-xrootd-5.6.2-1.el7.cern.x86_64
eos-libmicrohttpd-0.9.38-eos.el7.cern.x86_64
eos-grpc-1.56.1-2.el7.x86_64
eos-server-5.2.0-1.el7.cern.x86_64
eos-folly-deps-2019.11.11.00-1.el7.cern.x86_64
eos-client-5.2.0-1.el7.cern.x86_64
eos-grpc-gateway-0.1-1.el7.x86_64
eos-xrootd-debuginfo-5.6.2-1.el7.cern.x86_64
eos-folly-2019.11.11.00-1.el7.cern.x86_64
FST-3 (working fst):
[root@fst-3 ~]# systemctl status eos@fst
● eos@fst.service - EOS fst
Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2023-10-18 14:52:03 CEST; 4 days ago
Process: 3351 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
Main PID: 3418 (xrootd)
CGroup: /system.slice/system-eos.slice/eos@fst.service
├─3418 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon
└─3763 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon
Oct 23 08:58:08 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.20 files=102543 scanduration=3341 [s] scansize=223979351399 [Bytes] [ 223979 MB ] scannedfiles=5702 corruptedfiles=0 hwcorrupted=0 skippedfiles=96841 disk_scan_interval_sec=14400
Oct 23 09:37:34 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.19 files=85633 scanduration=3674 [s] scansize=242564161536 [Bytes] [ 242564 MB ] scannedfiles=4663 corruptedfiles=0 hwcorrupted=0 skippedfiles=80970 disk_scan_interval_sec=14400
Oct 23 09:46:45 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.17 files=99693 scanduration=3940 [s] scansize=264593406193 [Bytes] [ 264593 MB ] scannedfiles=5604 corruptedfiles=0 hwcorrupted=0 skippedfiles=94089 disk_scan_interval_sec=14400
Oct 23 09:51:36 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.04 files=94374 scanduration=3524 [s] scansize=222502768640 [Bytes] [ 222503 MB ] scannedfiles=5165 corruptedfiles=0 hwcorrupted=0 skippedfiles=89209 disk_scan_interval_sec=14400
Oct 23 09:59:32 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.21 files=101136 scanduration=3481 [s] scansize=232748916216 [Bytes] [ 232749 MB ] scannedfiles=5640 corruptedfiles=0 hwcorrupted=0 skippedfiles=95496 disk_scan_interval_sec=14400
Oct 23 10:01:38 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.05 files=104374 scanduration=3821 [s] scansize=261810569389 [Bytes] [ 261811 MB ] scannedfiles=5894 corruptedfiles=0 hwcorrupted=0 skippedfiles=98480 disk_scan_interval_sec=14400
Oct 23 10:02:08 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.07 files=98853 scanduration=3168 [s] scansize=207125298223 [Bytes] [ 207125 MB ] scannedfiles=5365 corruptedfiles=0 hwcorrupted=0 skippedfiles=93488 disk_scan_interval_sec=14400
Oct 23 10:04:33 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.27 files=101660 scanduration=3730 [s] scansize=248612135372 [Bytes] [ 248612 MB ] scannedfiles=5620 corruptedfiles=0 hwcorrupted=0 skippedfiles=96040 disk_scan_interval_sec=14400
Oct 23 10:25:06 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.11 files=98548 scanduration=3732 [s] scansize=246690361344 [Bytes] [ 246690 MB ] scannedfiles=5443 corruptedfiles=0 hwcorrupted=0 skippedfiles=93105 disk_scan_interval_sec=14400
Oct 23 10:25:40 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.03 files=84067 scanduration=3282 [s] scansize=225487174410 [Bytes] [ 225487 MB ] scannedfiles=4598 corruptedfiles=0 hwcorrupted=0 skippedfiles=79469 disk_scan_interval_sec=14400
[root@fst-3 ~]# rpm -qa | grep eos
eos-grpc-1.56.1-2.el7.x86_64
eos-folly-2019.11.11.00-1.el7.cern.x86_64
eos-grpc-gateway-0.1-1.el7.x86_64
eos-server-5.2.0-1.el7.cern.x86_64
eos-libmicrohttpd-0.9.38-eos.el7.cern.x86_64
eos-client-5.2.0-1.el7.cern.x86_64
eos-folly-deps-2019.11.11.00-1.el7.cern.x86_64
eos-jemalloc-5.2.1-0.x86_64
eos-xrootd-5.6.2-1.el7.cern.x86_64
However, on 2 nodes we now see that the fst service is in a crash loop with the following error:
231020 15:06:56 time=1697807216.526681 func=Boot level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f1d807fd700 source=Storage:456 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="files don't have Fmd info in xattr" fs_path="/srv/data/data.26"
231020 15:06:56 time=1697807216.526699 func=Boot level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f1d807fd700 source=Storage:458 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="process will abort now, please convert your file systems to drop LeveDB and use xattrs"
This is very peculiar, because the marker file exists on all filesystems on that fst (we have 28 filesystems on each fst):
[root@fst-2 ~]# ls -la /srv/data/data.??/.eosattrconverted | wc -l
28
According to the fst logs this error started to occur some time around the 18th. Before that we didnβt see the error.
ccaffy
(Cedric Caffy)
October 23, 2023, 10:01am
36
Hi Uemit,
Thanks for your input. The only difference I see between the two FSTs in rpm -qa | grep eos is that the eos-jemalloc package is not installed on the FST that is 'broken'. But I am not sure this has an impact… This segfault is very weird; I don't manage to reproduce it in my environment.
For the other problem, it looks like you have file metadata that hasn't been properly converted from LevelDB to xattr.
Do you see in the logs something like file not matching condition?
Can you spot, under /srv/data/data.26, some files that do not have the "user.eos.fmd" xattr set?
You can check that by taking some random files from the /srv/data/data.26 directory and doing getfattr -d /srv/data/data.26/path/to/file | grep 'user.eos.fmd'.
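To scan a whole filesystem directory rather than spot-checking random files, a loop like the following could be used (a sketch, not from this thread; the DATADIR default and the getfattr options are assumptions):

```shell
# List files under DATADIR that lack the user.eos.fmd extended attribute.
DATADIR=${DATADIR:-/srv/data/data.26}
find "$DATADIR" -type f | while read -r f; do
    # getfattr exits non-zero when the named attribute is absent
    if ! getfattr --only-values -n user.eos.fmd -- "$f" >/dev/null 2>&1; then
        echo "missing user.eos.fmd: $f"
    fi
done
```

This prints one line per unconverted file, which also gives a quick count via `| wc -l`.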
Thanks again.
Cheers,
Cedric
useren
(Uemit Seren)
October 23, 2023, 11:41am
37
Hi Cedric,
yes, we uninstalled the eos-jemalloc package to see if it would fix the issue, which it didn't. But right now we don't see any of the segfaults on any of the FSTs. So it looks like the issue 'magically' went away/fixed itself.
Yes, we can see file not matching condition logs on the FSTs that are in the crash loop:
231023 13:36:26 time=1698060986.246534 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fc9887fd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:36:33 time=1698060993.988287 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f580bbfd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:36:41 time=1698061001.777586 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fb1af7fd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:36:49 time=1698061009.477813 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fee9cffe700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:36:57 time=1698061017.272236 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fd2b3bfd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:05 time=1698061025.015028 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007ff23cbfd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:12 time=1698061032.766090 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f3bc4ffd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:20 time=1698061040.491600 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f2c433fd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:28 time=1698061048.268569 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f5e96bfd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:36 time=1698061056.001946 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f9ce43fd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:43 time=1698061063.763183 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fe50c7fd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:51 time=1698061071.502266 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f6390ffd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:59 time=1698061079.244662 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f72e83fd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:07 time=1698061087.008612 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fe6e4bfd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:14 time=1698061094.756390 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f89e8bfd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:22 time=1698061102.508131 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007ffb80bfd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:30 time=1698061110.258241 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f9db07fd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:38 time=1698061118.016373 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fdaf77fd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:45 time=1698061125.740076 func=WalkFsTreeCheckCond level=CRIT logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fd292bfd700 source=FTSWalkTree:138 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
and running getfattr -d on one of the files in the log shows an empty response:
[root@fst-2 ~]# getfattr -d /srv/data/data.26/00000190/003d20fb | grep 'user.eos.fmd'
However other files in the same directory have the attribute:
[root@fst-2 ~]# getfattr -d /srv/data/data.26/00000190/003d26bf | grep 'user.eos.fmd'
getfattr: Removing leading '/' from absolute path names
user.eos.fmd=0sCb8mPQAAAAAAEfIIAQAAAAAAHTcAAAAl2ZtmXy0FRXATNUnkO1M9AAAAAEWqFlpkTagPuRtVh2gnZVl2AQAAAAAAAGEAEBAAAAAAAGl2AQAAAAAAAHIIZmNjMjc1M2N6AIIBCGZjYzI3NTNjjQFCBmQglQFNLwAAnQENLgAAoAEAqAEAsAEAugEaMjIzLDU1LDE2NywzNjMsNDE5LDI3OSwzMDc=
We had converted all filesystems to the stateless mode before upgrading to 5.2.0 though, so we're not sure why there are still files without the attributes.
ccaffy
(Cedric Caffy)
October 23, 2023, 12:28pm
38
Hi Uemit,
OK, can you give the output of eos fileinfo fxid:003d20fb?
I'm afraid that you may need to downgrade your FST and re-trigger the file MD conversion…
useren
(Uemit Seren)
October 23, 2023, 12:43pm
39
[root@fst-2 ~]# eosadmin fileinfo fxid:003d20fb
File: '/eos/vbc/group/darkmatter/experiments/cresst/hephy_at_project/cresst/backup_nuc/2019_07_01/guetlein/.eclipse/org.eclipse.platform_4.3.2_1473617060_linux_gtk_x86_64/p2/org.eclipse.equinox.p2.engine/profileRegistry/epp.package.standard.profile/.lock' Flags: 0770 Clock: 17909229c50026e2
Size: 0
Status: healthy
Modify: Mon Mar 3 10:33:09 2014 Timestamp: 1393839189.000000000
Change: Sun Sep 20 01:58:45 2020 Timestamp: 1600559925.791750260
Access: Thu Jan 1 01:00:00 1970 Timestamp: 0.139757569018342
Birth: Sun Sep 20 01:58:45 2020 Timestamp: 1600559925.760428130
CUid: 12109 CGid: 11789 Fxid: 003d20fb Fid: 4006139 Pid: 67661 Pxid: 0001084d
XStype: adler XS: 00 00 00 01 ETAGs: "1075389749264384:00000001"
Layout: raid6 Stripes: 7 Blocksize: 1M LayoutId: 20640642 Redundancy: d3::t0
#Rep: 7
┌────┬──────┬──────────────────────────┬─────────────┬───────────────────┬────────┬──────────────┬─────────────┬────────┬──────────────────┐
│no. │ fs-id│ host                     │ schedgroup  │ path              │ boot   │ configstatus │ drain       │ active │ geotag           │
└────┴──────┴──────────────────────────┴─────────────┴───────────────────┴────────┴──────────────┴─────────────┴────────┴──────────────────┘
0 27 fst-1.eos.grid.vbc.ac.at default.26 /srv/data/data.26 booted drain failed online vbc::rack1::pod1
1 223 fst-8.eos.grid.vbc.ac.at default.26 /srv/data/data.26 booted rw nodrain online vbc::rack1::pod3
2 251 fst-9.eos.grid.vbc.ac.at default.26 /srv/data/data.26 booted rw nodrain online vbc::rack1::pod3
3 167 fst-6.eos.grid.vbc.ac.at default.26 /srv/data/data.26 booted rw nodrain online vbc::rack1::pod2
4 195 fst-7.eos.grid.vbc.ac.at default.26 /srv/data/data.26 booted rw nodrain online vbc::rack1::pod3
5 83 fst-3.eos.grid.vbc.ac.at default.26 /srv/data/data.26 booted rw nodrain online vbc::rack1::pod1
6 139 fst-5.eos.grid.vbc.ac.at default.26 /srv/data/data.26 down rw nodrain online vbc::rack1::pod2
*******
fst-5.eos.grid.vbc.ac.at is the other fst that has the same crash loop as fst-2, due to files not having been converted.
Shouldn't one of the stripes for the file also be located on fst-2?
ccaffy
(Cedric Caffy)
October 23, 2023, 12:53pm
40
Very interesting, it's a 0-size file, isn't it?
Would ls -alhrt /srv/data/data.26/00000190/003d26bf give 0 as the size?
EDIT: I just tried uploading a 0-size file to EOS, no problem having the extended attributes set properly…