EOS 5.2.0 update progress (was: mgm startup error: libXrdEosMgm.so not found)

Haha I was sure you would get this… :smiley:

Anyway, may I first suggest you change your configuration by removing http.cert, http.key and http.cadir and replacing them with the following:

xrd.tls  /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem
xrd.tlsca  certdir /etc/grid-security/certificates/

Please tell me if that unblocks you. In the meantime, I’ll investigate that issue further…

Cheers,
Cedric

Can you please try to print out the environment variable and also ca_tmp_dir, which is the local variable?

It seems ca_tmp_dir, which is a variable on the stack, is being passed by reference in

std::unique_ptr<TempCAGuard> new_file(TempCAGuard::create(m_log, ca_tmp_dir));

so eventually there is a use after free problem:

std::unique_ptr<XrdTlsTempCA::TempCAGuard>
XrdTlsTempCA::TempCAGuard::create(XrdSysError &err, const std::string &ca_tmp_dir) {

Ok I changed the config and restarted the fst service. Let’s wait 15 minutes and see if it fixes the crashes.

I checked the other fst that is already running 5.2.0 and it contains the same config as the one that is crashing.

Could you please check with coredumpctl, if that’s available, whether you have more information about this crash? The string is copied inside the function I mentioned above, so the problem is likely somewhere else. Thank you.
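
If it is set up, something along these lines should show whether a dump was captured (a sketch; matching by command name is an assumption about your setup):

coredumpctl list xrootd   # recent crashes of the xrootd binary
coredumpctl info xrootd   # signal, timestamp and executable of the latest one
coredumpctl gdb xrootd    # open the latest coredump directly in gdb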

Running the above in GDB got me this error message:

0x00007f4730be7fc7 in malloc () from /usr/lib64/libjemalloc.so.1
(gdb) f 5
#5  0x00007f4730612dfa in XrdTlsTempCA::Maintenance (this=0x7f4715bf1dc0) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:396
396         std::string ca_tmp_dir = std::string(adminpath) + "/.xrdtls";
(gdb) p (char*)getenv("XRDADMINPATH")
[New Thread 0x7f46767ee700 (LWP 82332)]
[New Thread 0x7f4676fef700 (LWP 82333)]
[Thread 0x7f46767ee700 (LWP 82332) exited]

Thread 89 "xrootd" received signal SIGSEGV, Segmentation fault.
0x00007f4730be7fc7 in malloc () from /usr/lib64/libjemalloc.so.1
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(malloc) will be abandoned.
When the function is done executing, GDB will silently stop.
(gdb) bt
#0  0x00007f4730be7fc7 in malloc () from /usr/lib64/libjemalloc.so.1
#1  <function called from gdb>
#2  0x00007f4730be7fc7 in malloc () from /usr/lib64/libjemalloc.so.1
#3  0x00007f472fefa18d in operator new(unsigned long) () from /lib64/libstdc++.so.6
#4  0x00007f472ff58cd9 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#5  0x00007f472ff5a561 in char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) () from /lib64/libstdc++.so.6
#6  0x00007f472ff5a998 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#7  0x00007f4730612dfa in XrdTlsTempCA::Maintenance (this=0x7f4715bf1dc0) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:396
#8  0x00007f47306142c8 in XrdTlsTempCA::MaintenanceThread (myself_raw=0x7f4715bf1dc0) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:495
#9  0x00007f473060ad77 in XrdSysThread_Xeq (myargs=0x7f47163ffca0) at /usr/src/debug/xrootd-5.6.2/src/XrdSys/XrdSysPthread.cc:86
#10 0x00007f472f76eea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f472f497b0d in clone () from /lib64/libc.so.6
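
(Side note: evaluating the string literal makes gdb call malloc() in the target, which was already stopped inside malloc() here. The environment of the running FST can also be read from /proc without calling into the process; a sketch, assuming the pgrep pattern matches the xrootd command line:)

# read the inferior's environment directly instead of calling getenv() via gdb
tr '\0' '\n' < /proc/$(pgrep -f 'xrd.cf.fst' | head -1)/environ | grep XRDADMINPATH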

Unfortunately this didn’t help. The segfault is still occurring.

It’s available, but currently no coredump is generated when the segfault occurs. I suspect this is due to SELinux. I will try to fix this.
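
A quick way to check for SELinux denials around the crash time (a sketch; needs the audit tools installed):

getenforce                   # whether SELinux is enforcing
ausearch -m avc -ts recent   # recent AVC denials, if any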

Very funny, I tried on my local box and I see no such thing. Can you please paste/attach the logs that you have from the point you start the FST process and the segmentation fault happening?

Thanks again

Here is the fst log until the crash: xrd.cf.fst.crash

The last log line is:

231019 15:28:09 114496 TPC_TempCA: Reloading the list of CAs and CRLs in directory

I managed to generate a coredump. I can upload it, or you can tell me what to look for exactly.

(gdb) f 5
#5  0x00007f0444a37dfa in XrdTlsTempCA::Maintenance (this=0x7f0429b98dc0) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:396
396         std::string ca_tmp_dir = std::string(adminpath) + "/.xrdtls";
(gdb) info locals
adminpath = <optimized out>
ca_tmp_dir = <error reading variable: Cannot access memory at address 0xffffffffffffffe8>
new_file = std::unique_ptr<XrdTlsTempCA::TempCAGuard> = {get() = 0x0}
fddir = <optimized out>
dirp = <optimized out>
result = <optimized out>
(gdb) list
391         auto adminpath = getenv("XRDADMINPATH");
392         if (!adminpath) {
393             m_log.Emsg("TempCA", "Admin path is not set!");
394             return false;
395         }
396         std::string ca_tmp_dir = std::string(adminpath) + "/.xrdtls";
397
398         std::unique_ptr<TempCAGuard> new_file(TempCAGuard::create(m_log, ca_tmp_dir));
399         if (!new_file) {
400             m_log.Emsg("TempCA", "Failed to create a new temp CA / CRL file");

Thanks, if you can upload it that would be great as well :slight_smile: Then we have everything for debugging :slight_smile:

Here it is: core_xrootd_267646

Interestingly, since the last crash (the one with the coredump), the FST service has now been running without a crash for an hour. I didn’t change anything except setting fs.suid_dumpable=2 and pointing kernel.core_pattern to /tmp :face_with_monocle: :thinking: Let’s see how it goes overnight.
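
For reference, that was roughly the following (the exact core_pattern template is an assumption based on the dump filename above):

sysctl -w fs.suid_dumpable=2
sysctl -w kernel.core_pattern='/tmp/core_%e_%p'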

Thank you. Will take a look at your coredump tomorrow morning.
I will keep you updated!

Cheers,
Cedric

Hi Uemit,

I haven’t managed to figure out what the problem is yet.

Though from the logs of the FST, I see the following line: Plugin unreleased XrdHttpProtocolTest v5.7-rc20231013 is using unreleased EOSFSTHTTP v5.7-rc20230918 version in exthandlerlib /usr/lib64/libEosFstHttp.so.

Can you please give me the output of systemctl status eos@fst on your problematic FST and on your non-problematic one?

Can you please also give me the output of rpm -qa | grep eos on the problematic FST and on the non-problematic one?

Thanks again!

Ok, for some reason the segfaults stopped on all FSTs on October 19th (we are not sure why, because we didn’t change anything). Nevertheless, below are the outputs you asked for.

FST-1 (broken fst):

[root@fst-1 fst]# systemctl status eos@fst
● eos@fst.service - EOS fst
   Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2023-10-19 16:44:01 CEST; 3 days ago
  Process: 264152 ExecStop=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-stop %i (code=exited, status=0/SUCCESS)
  Process: 4295 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
 Main PID: 4327 (xrootd)
   CGroup: /system.slice/system-eos.slice/eos@fst.service
           ├─4327 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon
           └─4349 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon

Oct 23 08:48:15 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.05 files=96461 scanduration=3640 [s] scansize=233028010380 [Bytes] [ 233028 MB ] scannedfiles=5423 corruptedfiles=0 hwcorrupted=0 skippedfiles=91038 disk_scan_interval_sec=14400
Oct 23 08:59:58 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.19 files=95928 scanduration=4864 [s] scansize=223052778587 [Bytes] [ 223053 MB ] scannedfiles=5468 corruptedfiles=0 hwcorrupted=0 skippedfiles=90460 disk_scan_interval_sec=14400
Oct 23 09:29:36 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.01 files=112608 scanduration=3618 [s] scansize=234448306636 [Bytes] [ 234448 MB ] scannedfiles=6725 corruptedfiles=0 hwcorrupted=0 skippedfiles=105883 disk_scan_interval_sec=14400
Oct 23 09:29:50 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.09 files=98118 scanduration=3095 [s] scansize=198831128576 [Bytes] [ 198831 MB ] scannedfiles=5412 corruptedfiles=0 hwcorrupted=0 skippedfiles=92706 disk_scan_interval_sec=14400
Oct 23 09:40:20 fst-1.eos.grid.vbc.ac.at scandir[4327]: skipping scan w-open file: localpath=/srv/data/data.04/000014f4/03329012 fsid=5 fxid=03329012
Oct 23 09:40:20 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.04 files=95987 scanduration=3852 [s] scansize=217912942592 [Bytes] [ 217913 MB ] scannedfiles=5370 corruptedfiles=0 hwcorrupted=0 skippedfiles=90616 disk_scan_interval_sec=14400
Oct 23 09:51:56 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.26 files=19401 scanduration=882 [s] scansize=54085201920 [Bytes] [ 54085.2 MB ] scannedfiles=939 corruptedfiles=0 hwcorrupted=0 skippedfiles=18462 disk_scan_interval_sec=14400
Oct 23 09:53:29 fst-1.eos.grid.vbc.ac.at scandir[4327]: skipping scan w-open file: localpath=/srv/data/data.12/000014f4/0332912c fsid=13 fxid=0332912c
Oct 23 09:53:29 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.12 files=98805 scanduration=3240 [s] scansize=214743099574 [Bytes] [ 214743 MB ] scannedfiles=5631 corruptedfiles=0 hwcorrupted=0 skippedfiles=93173 disk_scan_interval_sec=14400
Oct 23 10:17:09 fst-1.eos.grid.vbc.ac.at scandir[4327]: [ScanDir] Directory: /srv/data/data.20 files=97095 scanduration=3389 [s] scansize=229245644800 [Bytes] [ 229246 MB ] scannedfiles=5552 corruptedfiles=0 hwcorrupted=0 skippedfiles=91543 disk_scan_interval_sec=14400
[root@fst-1 fst]# rpm -qa | grep eos
eos-xrootd-5.6.2-1.el7.cern.x86_64
eos-libmicrohttpd-0.9.38-eos.el7.cern.x86_64
eos-grpc-1.56.1-2.el7.x86_64
eos-server-5.2.0-1.el7.cern.x86_64
eos-folly-deps-2019.11.11.00-1.el7.cern.x86_64
eos-client-5.2.0-1.el7.cern.x86_64
eos-grpc-gateway-0.1-1.el7.x86_64
eos-xrootd-debuginfo-5.6.2-1.el7.cern.x86_64
eos-folly-2019.11.11.00-1.el7.cern.x86_64

FST-3 (working fst):

[root@fst-3 ~]# systemctl status eos@fst
● eos@fst.service - EOS fst
   Loaded: loaded (/usr/lib/systemd/system/eos@.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-10-18 14:52:03 CEST; 4 days ago
  Process: 3351 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-start-pre %i (code=exited, status=0/SUCCESS)
 Main PID: 3418 (xrootd)
   CGroup: /system.slice/system-eos.slice/eos@fst.service
           ├─3418 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon
           └─3763 /opt/eos/xrootd/bin/xrootd -n fst -c /etc/xrd.cf.fst -l /var/log/eos/xrdlog.fst -Rdaemon

Oct 23 08:58:08 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.20 files=102543 scanduration=3341 [s] scansize=223979351399 [Bytes] [ 223979 MB ] scannedfiles=5702 corruptedfiles=0 hwcorrupted=0 skippedfiles=96841 disk_scan_interval_sec=14400
Oct 23 09:37:34 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.19 files=85633 scanduration=3674 [s] scansize=242564161536 [Bytes] [ 242564 MB ] scannedfiles=4663 corruptedfiles=0 hwcorrupted=0 skippedfiles=80970 disk_scan_interval_sec=14400
Oct 23 09:46:45 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.17 files=99693 scanduration=3940 [s] scansize=264593406193 [Bytes] [ 264593 MB ] scannedfiles=5604 corruptedfiles=0 hwcorrupted=0 skippedfiles=94089 disk_scan_interval_sec=14400
Oct 23 09:51:36 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.04 files=94374 scanduration=3524 [s] scansize=222502768640 [Bytes] [ 222503 MB ] scannedfiles=5165 corruptedfiles=0 hwcorrupted=0 skippedfiles=89209 disk_scan_interval_sec=14400
Oct 23 09:59:32 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.21 files=101136 scanduration=3481 [s] scansize=232748916216 [Bytes] [ 232749 MB ] scannedfiles=5640 corruptedfiles=0 hwcorrupted=0 skippedfiles=95496 disk_scan_interval_sec=14400
Oct 23 10:01:38 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.05 files=104374 scanduration=3821 [s] scansize=261810569389 [Bytes] [ 261811 MB ] scannedfiles=5894 corruptedfiles=0 hwcorrupted=0 skippedfiles=98480 disk_scan_interval_sec=14400
Oct 23 10:02:08 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.07 files=98853 scanduration=3168 [s] scansize=207125298223 [Bytes] [ 207125 MB ] scannedfiles=5365 corruptedfiles=0 hwcorrupted=0 skippedfiles=93488 disk_scan_interval_sec=14400
Oct 23 10:04:33 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.27 files=101660 scanduration=3730 [s] scansize=248612135372 [Bytes] [ 248612 MB ] scannedfiles=5620 corruptedfiles=0 hwcorrupted=0 skippedfiles=96040 disk_scan_interval_sec=14400
Oct 23 10:25:06 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.11 files=98548 scanduration=3732 [s] scansize=246690361344 [Bytes] [ 246690 MB ] scannedfiles=5443 corruptedfiles=0 hwcorrupted=0 skippedfiles=93105 disk_scan_interval_sec=14400
Oct 23 10:25:40 fst-3.eos.grid.vbc.ac.at scandir[3418]: [ScanDir] Directory: /srv/data/data.03 files=84067 scanduration=3282 [s] scansize=225487174410 [Bytes] [ 225487 MB ] scannedfiles=4598 corruptedfiles=0 hwcorrupted=0 skippedfiles=79469 disk_scan_interval_sec=14400
[root@fst-3 ~]# rpm -qa | grep eos
eos-grpc-1.56.1-2.el7.x86_64
eos-folly-2019.11.11.00-1.el7.cern.x86_64
eos-grpc-gateway-0.1-1.el7.x86_64
eos-server-5.2.0-1.el7.cern.x86_64
eos-libmicrohttpd-0.9.38-eos.el7.cern.x86_64
eos-client-5.2.0-1.el7.cern.x86_64
eos-folly-deps-2019.11.11.00-1.el7.cern.x86_64
eos-jemalloc-5.2.1-0.x86_64
eos-xrootd-5.6.2-1.el7.cern.x86_64

However, on 2 nodes we now see that the fst service is in a crash loop with the following error:

231020 15:06:56 time=1697807216.526681 func=Boot                     level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f1d807fd700 source=Storage:456                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="files don't have Fmd info in xattr" fs_path="/srv/data/data.26"
231020 15:06:56 time=1697807216.526699 func=Boot                     level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f1d807fd700 source=Storage:458                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="process will abort now, please convert your file systems to drop LeveDB and use xattrs"

This is very peculiar because the marker file exists on all filesystems on that fst (we have 28 filesystems on each fst):

[root@fst-2 ~]# ls -la /srv/data/data.??/.eosattrconverted | wc -l
28

According to the fst logs this error started to occur some time around the 18th. Before that we didn’t see the error.

Hi Uemit,

Thanks for your input. The only difference I see between the rpm -qa | grep eos outputs of the two FSTs is that the eos-jemalloc package is not installed on the FST that is “broken”. But I am not sure this has an impact… This segfault is very weird; I can’t manage to reproduce it in my environment :smiley:

For the other problem, it looks like you have files’ metadata that haven’t been properly converted from LevelDB to xattr.
Do you see in the logs something like file not matching condition?
Can you spot, under /srv/data/data.26, some files that do not have the "user.eos.fmd" xattr set?
You can check that by picking some random files from the /srv/data/data.26 directory and running getfattr -d /srv/data/data.26/path/to/file | grep 'user.eos.fmd'.
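
If it helps, a whole filesystem directory could also be scanned for files missing the attribute with something along these lines (a sketch, not an official EOS tool):

# print every regular file that has no user.eos.fmd xattr (skipping the hidden marker files)
find /srv/data/data.26 -type f ! -name '.*' \
  -exec sh -c 'getfattr -n user.eos.fmd --only-values "$1" >/dev/null 2>&1 || echo "$1"' _ {} \;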

Thanks again.

Cheers,
Cedric

Hi Cedric,

Yes, we uninstalled the eos-jemalloc package to see if it would fix the issue, which it didn’t. But right now we don’t see any of the segfaults on any of the FSTs, so it looks like the issue “magically” went away/fixed itself.

Yes we can see file not matching condition logs on the FSTs that are in the crash loop:

231023 13:36:26 time=1698060986.246534 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fc9887fd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:36:33 time=1698060993.988287 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f580bbfd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:36:41 time=1698061001.777586 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fb1af7fd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:36:49 time=1698061009.477813 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fee9cffe700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:36:57 time=1698061017.272236 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fd2b3bfd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:05 time=1698061025.015028 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007ff23cbfd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:12 time=1698061032.766090 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f3bc4ffd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:20 time=1698061040.491600 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f2c433fd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:28 time=1698061048.268569 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f5e96bfd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:36 time=1698061056.001946 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f9ce43fd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:43 time=1698061063.763183 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fe50c7fd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:51 time=1698061071.502266 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f6390ffd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:37:59 time=1698061079.244662 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f72e83fd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:07 time=1698061087.008612 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fe6e4bfd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:14 time=1698061094.756390 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f89e8bfd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:22 time=1698061102.508131 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007ffb80bfd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:30 time=1698061110.258241 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007f9db07fd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:38 time=1698061118.016373 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fdaf77fd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333
231023 13:38:45 time=1698061125.740076 func=WalkFsTreeCheckCond      level=CRIT  logid=static.............................. unit=fst@fst-2.eos.grid.vbc.ac.at:1095 tid=00007fd292bfd700 source=FTSWalkTree:138                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="file not matching condition" fn="/srv/data/data.26/00000190/003d20fb" index=33333

and running getfattr -d on one of the files in the log shows an empty response:

[root@fst-2 ~]# getfattr -d /srv/data/data.26/00000190/003d20fb | grep 'user.eos.fmd'

However other files in the same directory have the attribute:

[root@fst-2 ~]# getfattr -d /srv/data/data.26/00000190/003d26bf | grep 'user.eos.fmd'
getfattr: Removing leading '/' from absolute path names
user.eos.fmd=0sCb8mPQAAAAAAEfIIAQAAAAAAHTcAAAAl2ZtmXy0FRXATNUnkO1M9AAAAAEWqFlpkTagPuRtVh2gnZVl2AQAAAAAAAGEAEBAAAAAAAGl2AQAAAAAAAHIIZmNjMjc1M2N6AIIBCGZjYzI3NTNjjQFCBmQglQFNLwAAnQENLgAAoAEAqAEAsAEAugEaMjIzLDU1LDE2NywzNjMsNDE5LDI3OSwzMDc=

We had converted all filesystems to the stateless mode before upgrading to 5.2.0 though, so we’re not sure why there are still files without the attributes.

Hi Uemit,

OK, can you give the output of eos fileinfo fxid:003d20fb ?

I’m afraid that you may need to downgrade your FST and re-trigger the file MD conversion…

[root@fst-2 ~]# eosadmin fileinfo fxid:003d20fb
  File: '/eos/vbc/group/darkmatter/experiments/cresst/hephy_at_project/cresst/backup_nuc/2019_07_01/guetlein/.eclipse/org.eclipse.platform_4.3.2_1473617060_linux_gtk_x86_64/p2/org.eclipse.equinox.p2.engine/profileRegistry/epp.package.standard.profile/.lock'  Flags: 0770  Clock: 17909229c50026e2
  Size: 0
Status: healthy
Modify: Mon Mar  3 10:33:09 2014 Timestamp: 1393839189.000000000
Change: Sun Sep 20 01:58:45 2020 Timestamp: 1600559925.791750260
Access: Thu Jan  1 01:00:00 1970 Timestamp: 0.139757569018342
 Birth: Sun Sep 20 01:58:45 2020 Timestamp: 1600559925.760428130
  CUid: 12109 CGid: 11789 Fxid: 003d20fb Fid: 4006139 Pid: 67661 Pxid: 0001084d
XStype: adler    XS: 00 00 00 01    ETAGs: "1075389749264384:00000001"
Layout: raid6 Stripes: 7 Blocksize: 1M LayoutId: 20640642 Redundancy: d3::t0
  #Rep: 7
┌───┬──────┬────────────────────────┬────────────────┬─────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│             path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴─────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0       27 fst-1.eos.grid.vbc.ac.at       default.26 /srv/data/data.26     booted          drain       failed   online         vbc::rack1::pod1
 1      223 fst-8.eos.grid.vbc.ac.at       default.26 /srv/data/data.26     booted             rw      nodrain   online         vbc::rack1::pod3
 2      251 fst-9.eos.grid.vbc.ac.at       default.26 /srv/data/data.26     booted             rw      nodrain   online         vbc::rack1::pod3
 3      167 fst-6.eos.grid.vbc.ac.at       default.26 /srv/data/data.26     booted             rw      nodrain   online         vbc::rack1::pod2
 4      195 fst-7.eos.grid.vbc.ac.at       default.26 /srv/data/data.26     booted             rw      nodrain   online         vbc::rack1::pod3
 5       83 fst-3.eos.grid.vbc.ac.at       default.26 /srv/data/data.26     booted             rw      nodrain   online         vbc::rack1::pod1
 6      139 fst-5.eos.grid.vbc.ac.at       default.26 /srv/data/data.26       down             rw      nodrain   online         vbc::rack1::pod2

*******

fst-5.eos.grid.vbc.ac.at is the other fst that has the same crash loop as fst-2, due to files not having been converted.
Shouldn’t one of the stripes for the file also be located on fst-2?

Very interesting, it’s a 0-size file, isn’t it?
ls -alhrt /srv/data/data.26/00000190/003d26bf would give 0 as size?

EDIT: I just tried uploading a 0-size file to EOS; the extended attributes were set properly, no problem…