EOS 5.2.0 update progress (was: mgm startup error: libXrdEosMgm.so not found)

Hi all,
We’re testing the update to 5.2.0 (from 5.1.16).
Package installation went fine, with no conflicts, but the MGM fails to start (log below). These are my currently installed packages (on CentOS 7).
The same happens if I uninstall eos-folly, eos-libmicrohttpd, eos-server and eos-client and then reinstall them.

# rpm -qa '*eos*' '*xrootd*' | sort
eos-client-5.2.0-1.el7.cern.x86_64
eos-folly-2019.11.11.00-1.el7.cern.x86_64
eos-folly-deps-2019.11.11.00-1.el7.cern.x86_64
eos-grpc-1.56.1-2.el7.x86_64
eos-grpc-devel-1.56.1-2.el7.x86_64
eos-jemalloc-5.2.1-0.x86_64
eos-jemalloc-debuginfo-5.2.1-0.x86_64
eos-libmicrohttpd-0.9.38-eos.el7.cern.x86_64
eos-nginx-1.24.0-0.x86_64
eos-ns-inspect-5.2.0-1.el7.cern.x86_64
eos-quarkdb-5.2.0-1.el7.cern.x86_64
eos-server-5.2.0-1.el7.cern.x86_64
eos-xrootd-5.6.2-1.el7.cern.x86_64
eos-xrootd-debuginfo-5.6.2-1.el7.cern.x86_64
xrootd-client-5.2.0-1.el7.x86_64
xrootd-client-libs-5.2.0-1.el7.x86_64
xrootd-libs-5.2.0-1.el7.x86_64
xrootd-scitokens-5.2.0-1.el7.x86_64
xrootd-server-5.2.0-1.el7.x86_64
xrootd-server-libs-5.2.0-1.el7.x86_64
xrootd-voms-5.2.0-1.el7.x86_64

I have these references to libxrd*.so files in the config:

# grep -i libxrd /etc/xrd.cf.mgm
xrootd.fslib libXrdEosMgm.so
xrootd.seclib libXrdSec.so
sec.protocol gsi -cert:/etc/grid-security/daemon/test-eos-mgm-1.vbc.ac.at.crt -key:/etc/grid-security/daemon/test-eos-mgm-1.vbc.ac.at.key -gridmap:/etc/grid-security/grid-mapfile -crl:1 -d:1 -gmapopt:11 -gmapto:60 -vomsat:1 -moninfo:1 -exppxy:/var/eos/auth/gsi#<uid> -vomsfun:libXrdVoms.so -vomsfunparms:grpopt=0
mgmofs.authlib /usr/lib64/libXrdAliceTokenAcc.so
xrd.protocol XrdHttp:8443 libXrdHttp.so
http.secxtractor libXrdVoms.so
http.exthandler xrdtpc libXrdHttpTPC.so
mgmofs.macaroonslib libXrdMacaroons.so libXrdAccSciTokens.so 

The eos_env config has an LD_LIBRARY_PATH setting:

# set LD_LIBRARY_PATH order
LD_LIBRARY_PATH="/opt/eos/xrootd/lib64/:$LD_LIBRARY_PATH"

I’ve tried several symlinks, but to no avail:

# ll /usr/lib64/libXrdEos* /opt/eos/xrootd/lib64/libXrdEos*
lrwxrwxrwx. 1 root root        28 Oct 12 13:58 /opt/eos/xrootd/lib64/libXrdEosMgm.so -> /usr/lib64/libXrdEosMgm-5.so
lrwxrwxrwx. 1 root root        28 Oct 12 13:58 /opt/eos/xrootd/lib64/libXrdEosMgm.so.5 -> /usr/lib64/libXrdEosMgm-5.so
lrwxrwxrwx. 1 root root        28 Oct 12 13:58 /opt/eos/xrootd/lib64/libXrdEosMgm.so.5.2.0 -> /usr/lib64/libXrdEosMgm-5.so
-rwxr-xr-x. 1 root root  45872456 Oct 10 13:55 /usr/lib64/libXrdEosFst-5.so
-rwxr-xr-x. 1 root root 465899456 Oct 10 13:57 /usr/lib64/libXrdEosMgm-5.so
lrwxrwxrwx. 1 root root        17 Oct 12 13:37 /usr/lib64/libXrdEosMgm.so -> libXrdEosMgm-5.so

This is what I get in the logs: the MGM fails to start, while the mq and quarkdb services are OK.

# tail /var/log/eos/mgm/xrdlog.mgm
------ Protection system initialization completed.
Config Routing for 172.24.19.67: local pub4 prv4 
Config Route all4: 172.24.19.67 Dest=[::172.24.19.67]:1094
Plugin fslib libXrdEosMgm-5.so not found; falling back to using libXrdEosMgm.so
Plugin No such file or directory loading fslib libXrdEosMgm.so
231012 13:59:39 82194 XrootdConfig: Unable to load file system via libXrdEosMgm.so
231012 13:59:39 82194 XrootdConfig: Unable to load base file system using libXrdEosMgm.so
------ xroot protocol initialization failed.
231012 13:59:39 82194 XrdProtocol: Protocol xroot could not be loaded
------ xrootd mgm@test-eos-mgm-1.vbc.ac.at:-1 initialization failed.

Any help is appreciated,
Best
Erich
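
The log above shows xrootd trying a version-suffixed plugin name first (libXrdEosMgm-5.so) and only then falling back to the name written in the config. A minimal sketch, assuming the config is fed on stdin and that the suffix is always "-5" (inferred from that one log line, not from XRootD documentation), lists every plugin library the config names and derives the versioned file name to look for. Both function names are made up for illustration.

```shell
# List the plugin libraries referenced in an xrootd config read from stdin.
# Comment lines are skipped; each library name is printed once.
plugin_libs() {
    grep -v '^[[:space:]]*#' | grep -o 'lib[A-Za-z0-9_]*\.so' | sort -u
}

# Map libFoo.so to libFoo-5.so.
# ASSUMPTION: the "-5" suffix is taken from the log line in this thread.
versioned_name() {
    echo "$1" | sed 's/\.so$/-5.so/'
}
```

For example, `plugin_libs < /etc/xrd.cf.mgm` prints the names to check, and `ls -l "/usr/lib64/$(versioned_name libXrdEosMgm.so)"` shows whether the versioned file exists. As the rest of the thread shows, a file that exists can still fail to load if one of its own dependencies is missing.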

Hi Erich,
try to see what is missing:

ldd /usr/lib64/libXrdEosMgm-5.so

Cheers Andreas.

Hi Erich,

Due to a regex mismatch, the eos-grpc-gateway package is not brought in automatically when installing eos-server. So please install this package and your service should start fine.

Cheers,
Elvin

Hi Erich,

The error message is misleading. This is caused by a missing /usr/lib64/libEosGrpcGateway.so, which is now required by libXrdEosMgm-5.so, but the package dependency was not added.
“yum install eos-grpc-gateway” should fix this.

Thanks for all your quick answers. Indeed:

# ldd /usr/lib64/libXrdEosMgm-5.so
[...]
	/usr/lib64/libEosGrpcGateway.so => not found
[...]

and installing eos-grpc-gateway fixes it.
I’ll continue the update on the test instance and will let you know if I find anything else.
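
The ldd check above can be reduced to just the unresolved entries. This is a small sketch, with a made-up function name, that reads ldd output on stdin and prints only the libraries reported as "not found". It explains why the plugin error is misleading: the plugin file itself is present, but loading it fails because one of its dependencies cannot be resolved.

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it can be tried on saved output as well.
missing_libs() {
    awk '/=> not found/ { print $1 }' | sort -u
}
```

Typical use would be `ldd /usr/lib64/libXrdEosMgm-5.so | missing_libs`. For each name printed, something like `yum provides '*/libEosGrpcGateway.so'` can then point at the package that ships it, provided the repository metadata includes file lists.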

Hi all,
Quick update from our end: we have EOS 5.2.0 running on the test cluster. (I had to remove and re-add a node, which was my own fault: its files had not been converted to xattr. I deleted the files on the filesystem and in the namespace, removed the leveldb directories, removed the node from the MGM config and then rejoined everything.)
Now the cluster is up and running. We’ll try a few test cases and let you know if anything pops up.

I’ve updated 2 FST nodes to 5.2.0 - they are looking good so far. I’ll leave them as canaries over night and continue with the rest of the cluster tomorrow, if all is fine.

One thing caught my eye: I’m still seeing the ReadV "vector read error" messages in the logs, but I’ll revisit this topic once the whole cluster is on 5.2.0.

On the older FSTs (which have been upgraded step by step from 4.8.x to 5.1.23), I’m now seeing a package conflict between grpc and eos-grpc:

Transaction check error:
  file /opt/eos/grpc/share/grpc/roots.pem from install of eos-grpc-1.56.1-2.el7.x86_64 conflicts with file from package grpc-1.36.0-1.el7.x86_64

Both packages ship the same file; this can be worked around with yum swap:

[root@fst-2 ~]# yum swap grpc eos-grpc
Loaded plugins: enabled_repos_upload, package_upload, product-id, search-disabled-repos, subscription-manager, tracer_upload, versionlock
CLIP_Batch_EOS_eos5                                                                                                                                                                                                        | 1.5 kB  00:00:00     
CLIP_Batch_EOS_eos5-depend                                                                                                                                                                                                 | 1.5 kB  00:00:00     
CLIP_Batch_EPEL_EPEL7                                                                                                                                                                                                      | 2.3 kB  00:00:00     
Resolving Dependencies
--> Running transaction check
---> Package eos-grpc.x86_64 0:1.41.0-1.el7 will be updated
--> Processing Dependency: libabsl_base.so.2103.0.1()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
--> Processing Dependency: libabsl_synchronization.so.2103.0.1()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
--> Processing Dependency: libgpr.so.19()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
--> Processing Dependency: libgrpc++.so.1.41()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
--> Processing Dependency: libgrpc.so.19()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
---> Package eos-grpc.x86_64 0:1.56.1-2.el7 will be obsoleting
---> Package eos-protobuf3.x86_64 0:3.17.3-1.el7.cern.eos will be obsoleted
--> Processing Dependency: eos-protobuf3 >= 3.3 for package: eos-client-5.1.23-1.el7.cern.x86_64
---> Package grpc.x86_64 0:1.36.0-1.el7 will be erased
--> Processing Dependency: grpc(x86-64) = 1.36.0-1.el7 for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libaddress_sorting.so.15()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgpr.so.15()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgrpc++.so.1()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgrpc++_alts.so.1()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgrpc++_error_details.so.1()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgrpc++_reflection.so.1()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgrpc++_unsecure.so.1()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgrpc.so.15()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgrpc_plugin_support.so.1()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgrpc_unsecure.so.15()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libgrpcpp_channelz.so.1()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libprotobuf-lite.so.3.14.0.0()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libprotobuf.so.3.14.0.0()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libprotoc.so.3.14.0.0()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Processing Dependency: libupb.so.15()(64bit) for package: grpc-devel-1.36.0-1.el7.x86_64
--> Running transaction check
---> Package eos-client.x86_64 0:5.1.23-1.el7.cern will be updated
---> Package eos-client.x86_64 0:5.2.0-1.el7.cern will be an update
--> Processing Dependency: eos-xrootd = 5.6.2 for package: eos-client-5.2.0-1.el7.cern.x86_64
---> Package eos-server.x86_64 0:5.1.23-1.el7.cern will be updated
---> Package eos-server.x86_64 0:5.2.0-1.el7.cern will be an update
---> Package grpc-devel.x86_64 0:1.36.0-1.el7 will be erased
--> Running transaction check
---> Package eos-xrootd.x86_64 0:5.5.10-1.el7.cern will be updated
---> Package eos-xrootd.x86_64 0:5.6.2-1.el7.cern will be an update
--> Finished Dependency Resolution

Dependencies Resolved

==================================================================================================================================================================================================================================================
 Package                                               Arch                                              Version                                                     Repository                                                              Size
==================================================================================================================================================================================================================================================
Installing:
 eos-grpc                                              x86_64                                            1.56.1-2.el7                                                CLIP_Batch_EOS_eos5-depend                                             8.0 M
     replacing  eos-protobuf3.x86_64 3.17.3-1.el7.cern.eos
Removing:
 grpc                                                  x86_64                                            1.36.0-1.el7                                                @CLIP_Batch_EOS_eos5-depend                                             18 M
Updating for dependencies:
 eos-client                                            x86_64                                            5.2.0-1.el7.cern                                            CLIP_Batch_EOS_eos5                                                     34 M
 eos-server                                            x86_64                                            5.2.0-1.el7.cern                                            CLIP_Batch_EOS_eos5                                                    179 M
 eos-xrootd                                            x86_64                                            5.6.2-1.el7.cern                                            CLIP_Batch_EOS_eos5-depend                                             3.4 M
Removing for dependencies:
 grpc-devel                                            x86_64                                            1.36.0-1.el7                                                @CLIP_Batch_EOS_eos5-depend                                            5.5 M

Transaction Summary
==================================================================================================================================================================================================================================================
Install  1 Package
Upgrade             ( 3 Dependent packages)
Remove   1 Package  (+1 Dependent package)

Total size: 225 M
Is this ok [y/d/N]: y
[installs packages, conflict resolved]
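
Since the conflict only appears on nodes that still carry the legacy grpc package, it can be detected before starting the update. The following is a sketch under the assumption that installed package names arrive one per line on stdin; the function name is invented here, and the command it prints is simply the workaround shown above.

```shell
# Read installed package names (one per line) on stdin and print the
# workaround from this thread if the legacy "grpc" package is present.
# Prints nothing when only eos-grpc (or neither) is installed.
suggest_grpc_swap() {
    if grep -qx 'grpc'; then
        echo 'yum swap grpc eos-grpc'
    fi
}
```

On a node this could be fed with `rpm -qa --qf '%{NAME}\n' | suggest_grpc_swap`. The exact match on the whole line matters: a substring match would also hit eos-grpc and grpc-devel.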

Full trace of the conflict for reference, from my first, naive attempt at resolving it:

[root@fst-2 ~]# yum update eos-grpc
Loaded plugins: enabled_repos_upload, package_upload, product-id, search-disabled-repos, subscription-manager, tracer_upload, versionlock
CLIP_Batch_EOS_eos5                                                                                                                                       | 1.5 kB  00:00:00     
CLIP_Batch_EOS_eos5-depend                                                                                                                                | 1.5 kB  00:00:00     
CLIP_Batch_EPEL_EPEL7                                                                                                                                     | 2.3 kB  00:00:00     
Resolving Dependencies
--> Running transaction check
---> Package eos-grpc.x86_64 0:1.41.0-1.el7 will be updated
--> Processing Dependency: libabsl_base.so.2103.0.1()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
--> Processing Dependency: libabsl_synchronization.so.2103.0.1()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
--> Processing Dependency: libgpr.so.19()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
--> Processing Dependency: libgrpc++.so.1.41()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
--> Processing Dependency: libgrpc.so.19()(64bit) for package: eos-server-5.1.23-1.el7.cern.x86_64
---> Package eos-grpc.x86_64 0:1.56.1-2.el7 will be obsoleting
---> Package eos-protobuf3.x86_64 0:3.17.3-1.el7.cern.eos will be obsoleted
--> Processing Dependency: eos-protobuf3 >= 3.3 for package: eos-client-5.1.23-1.el7.cern.x86_64
--> Running transaction check
---> Package eos-client.x86_64 0:5.1.23-1.el7.cern will be updated
---> Package eos-client.x86_64 0:5.2.0-1.el7.cern will be an update
--> Processing Dependency: eos-xrootd = 5.6.2 for package: eos-client-5.2.0-1.el7.cern.x86_64
---> Package eos-server.x86_64 0:5.1.23-1.el7.cern will be updated
---> Package eos-server.x86_64 0:5.2.0-1.el7.cern will be an update
--> Running transaction check
---> Package eos-xrootd.x86_64 0:5.5.10-1.el7.cern will be updated
---> Package eos-xrootd.x86_64 0:5.6.2-1.el7.cern will be an update
--> Finished Dependency Resolution

Dependencies Resolved

=================================================================================================================================================================================
 Package                               Arch                              Version                                     Repository                                             Size
=================================================================================================================================================================================
Installing:
 eos-grpc                              x86_64                            1.56.1-2.el7                                CLIP_Batch_EOS_eos5-depend                            8.0 M
     replacing  eos-protobuf3.x86_64 3.17.3-1.el7.cern.eos
Updating for dependencies:
 eos-client                            x86_64                            5.2.0-1.el7.cern                            CLIP_Batch_EOS_eos5                                    34 M
 eos-server                            x86_64                            5.2.0-1.el7.cern                            CLIP_Batch_EOS_eos5                                   179 M
 eos-xrootd                            x86_64                            5.6.2-1.el7.cern                            CLIP_Batch_EOS_eos5-depend                            3.4 M

Transaction Summary
=================================================================================================================================================================================
Install  1 Package
Upgrade             ( 3 Dependent packages)

Total size: 225 M
Is this ok [y/d/N]: y
Downloading packages:
Running transaction check
Running transaction test


Transaction check error:
  file /opt/eos/grpc/share/grpc/roots.pem from install of eos-grpc-1.56.1-2.el7.x86_64 conflicts with file from package grpc-1.36.0-1.el7.x86_64

Error Summary
-------------

Uploading Enabled Repositories Report
Loaded plugins: product-id, subscription-manager, versionlock

The installed packages at the moment:

[root@fst-2 ~]# rpm -qa | grep eos
eos-xrootd-debuginfo-5.5.10-1.el7.cern.x86_64
eos-protobuf3-3.17.3-1.el7.cern.eos.x86_64
eos-folly-2019.11.11.00-1.el7.cern.x86_64
eos-jemalloc-5.0.1-0.x86_64
eos-xrootd-5.5.10-1.el7.cern.x86_64
eos-client-5.1.23-1.el7.cern.x86_64
eos-server-5.1.23-1.el7.cern.x86_64
eos-libmicrohttpd-0.9.38-eos.el7.cern.x86_64
eos-grpc-1.41.0-1.el7.x86_64
eos-folly-deps-2019.11.11.00-1.el7.cern.x86_64

Hi Erich,

Yes, this problem has been there since we moved from EOS4, which used the unfortunate name grpc for the EOS-compiled gRPC package, to EOS5, where we switched to eos-grpc.
In EOS 5.2.0 we added eos-grpc as a mandatory dependency, so you are now forced to install it and therefore hit this conflict. The workaround you used is exactly what needs to be done.

Cheers,
Elvin

I have updated 13 out of 15 FST nodes. I’m halting here for the moment, as I’ve hit an interesting problem: some FSTs are segfaulting almost exactly every 15 minutes. I’m unable to determine what those hosts have in common:

[root@fst-1 ~]# dmesg -T | grep -i segfault
[Wed Oct 18 12:13:33 2023] xrootd[5323]: segfault at 30 ip 00007fe225d30dff sp 00007ffca42f54e0 error 4 in libjemalloc.so.1[7fe225d14000+31000]
[Wed Oct 18 12:28:34 2023] xrootd[5588]: segfault at 30 ip 00007f435254cfc7 sp 00007f42f61fb9d0 error 4 in libjemalloc.so.1[7f4352547000+31000]
[Wed Oct 18 12:43:39 2023] xrootd[36519]: segfault at 30 ip 00007f3214b9bfc7 sp 00007f31b83fc9d0 error 4 in libjemalloc.so.1[7f3214b96000+31000]
[Wed Oct 18 12:58:45 2023] xrootd[67316]: segfault at 30 ip 00007f6be3e89fc7 sp 00007f6b83bfc9d0 error 4 in libjemalloc.so.1[7f6be3e84000+31000]
[Wed Oct 18 13:13:52 2023] xrootd[98717]: segfault at 30 ip 00007f10c63ecfc7 sp 00007f1065ffc9d0 error 4 in libjemalloc.so.1[7f10c63e7000+31000]
[Wed Oct 18 13:28:58 2023] xrootd[129565]: segfault at 30 ip 00007f3d26fb5fc7 sp 00007f3cc4ffc9d0 error 4 in libjemalloc.so.1[7f3d26fb0000+31000]
[Wed Oct 18 13:44:04 2023] xrootd[160529]: segfault at 30 ip 00007f3e16f6ffc7 sp 00007f3daeffc9d0 error 4 in libjemalloc.so.1[7f3e16f6a000+31000]
[Wed Oct 18 13:59:10 2023] xrootd[191452]: segfault at 30 ip 00007fdbc1ca5fc7 sp 00007fdb66ffc9d0 error 4 in libjemalloc.so.1[7fdbc1ca0000+31000]
[Wed Oct 18 14:14:16 2023] xrootd[222628]: segfault at 30 ip 00007f9890a56fc7 sp 00007f98333fc9d0 error 4 in libjemalloc.so.1[7f9890a51000+31000]
[Wed Oct 18 14:29:22 2023] xrootd[254038]: segfault at 30 ip 00007f687f570fc7 sp 00007f681effc9d0 error 4 in libjemalloc.so.1[7f687f56b000+31000]
[Wed Oct 18 14:44:29 2023] xrootd[285053]: segfault at 30 ip 00007f512bb34fc7 sp 00007f50cb3fc9d0 error 4 in libjemalloc.so.1[7f512bb2f000+31000]
[Wed Oct 18 14:59:35 2023] xrootd[22227]: segfault at 30 ip 00007f5cb79b1fc7 sp 00007f5c5cbfc9d0 error 4 in libjemalloc.so.1[7f5cb79ac000+31000]
[Wed Oct 18 15:14:41 2023] xrootd[59009]: segfault at 30 ip 00007ff096c43fc7 sp 00007ff033ffc9d0 error 4 in libjemalloc.so.1[7ff096c3e000+31000]
[Wed Oct 18 15:29:47 2023] xrootd[98071]: segfault at 30 ip 00007f975afa2fc7 sp 00007f9702bfd9d0 error 4 in libjemalloc.so.1[7f975af9d000+31000]
[Wed Oct 18 15:44:53 2023] xrootd[136637]: segfault at 30 ip 00007f5d450aefc7 sp 00007f5cec7fc9d0 error 4 in libjemalloc.so.1[7f5d450a9000+31000]
[Wed Oct 18 15:59:59 2023] xrootd[171160]: segfault at 30 ip 00007f4e3fb39fc7 sp 00007f4ddb7fc9d0 error 4 in libjemalloc.so.1[7f4e3fb34000+31000]
[Wed Oct 18 16:15:05 2023] xrootd[201871]: segfault at 30 ip 00007ff860382fc7 sp 00007ff7ffbfc9d0 error 4 in libjemalloc.so.1[7ff86037d000+31000]
[Wed Oct 18 16:30:11 2023] xrootd[233074]: segfault at 30 ip 00007f8f31ef7fc7 sp 00007f8ed3bfc9d0 error 4 in libjemalloc.so.1[7f8f31ef2000+31000]
[Wed Oct 18 16:45:17 2023] xrootd[264362]: segfault at 30 ip 00007fd9e87befc7 sp 00007fd98bffc9d0 error 4 in libjemalloc.so.1[7fd9e87b9000+31000]
[Wed Oct 18 17:00:23 2023] xrootd[793]: segfault at 30 ip 00007fd4da08dfc7 sp 00007fd47bbfc9d0 error 4 in libjemalloc.so.1[7fd4da088000+31000]
[Wed Oct 18 17:15:29 2023] xrootd[32307]: segfault at 30 ip 00007f9f28778fc7 sp 00007f9ec43fc9d0 error 4 in libjemalloc.so.1[7f9f28773000+31000]
[Wed Oct 18 17:30:35 2023] xrootd[63027]: segfault at 30 ip 00007f12659a2fc7 sp 00007f12037fc9d0 error 4 in libjemalloc.so.1[7f126599d000+31000]
[Wed Oct 18 17:45:41 2023] xrootd[93877]: segfault at 30 ip 00007feb03b3cfc7 sp 00007feaa6bfc9d0 error 4 in libjemalloc.so.1[7feb03b37000+31000]
[Wed Oct 18 18:00:47 2023] xrootd[125463]: segfault at 30 ip 00007f50f0735fc7 sp 00007f50903fc9d0 error 4 in libjemalloc.so.1[7f50f0730000+31000]
[Wed Oct 18 18:15:53 2023] xrootd[156347]: segfault at 30 ip 00007f9db2790fc7 sp 00007f9d597fd9d0 error 4 in libjemalloc.so.1[7f9db278b000+31000]
[Wed Oct 18 18:19:05 2023] xrootd[188330]: segfault at 30 ip 00007f5dfa689dff sp 00007fff7afb0a60 error 4 in libjemalloc.so.1[7f5dfa66d000+31000]
[Wed Oct 18 18:34:06 2023] xrootd[188660]: segfault at 30 ip 00007f28b9735fc7 sp 00007f285cffd9d0 error 4 in libjemalloc.so.1[7f28b9730000+31000]
[Wed Oct 18 18:49:12 2023] xrootd[219551]: segfault at 30 ip 00007f1ae236efc7 sp 00007f1a8affc9d0 error 4 in libjemalloc.so.1[7f1ae2369000+31000]
[Wed Oct 18 19:04:18 2023] xrootd[250956]: segfault at 30 ip 00007fad65feafc7 sp 00007fad05bfc9d0 error 4 in libjemalloc.so.1[7fad65fe5000+31000]
[Wed Oct 18 19:12:10 2023] gdb[285156]: segfault at 7ffc6990fff8 ip 000000000079c820 sp 00007ffc6990ff20 error 6 in gdb[400000+659000]
[Wed Oct 18 19:16:21 2023] gdb[17884]: segfault at 7ffd45116f58 ip 000000000079d355 sp 00007ffd45116f60 error 6 in gdb[400000+659000]
[Wed Oct 18 19:17:01 2023] gdb[18056]: segfault at 7ffe6ff5af88 ip 000000000079dd28 sp 00007ffe6ff5af90 error 6 in gdb[400000+659000]
[Wed Oct 18 19:19:24 2023] xrootd[281924]: segfault at 30 ip 00007f6a8a12afc7 sp 00007f6a2dbfc9d0 error 4 in libjemalloc.so.1[7f6a8a125000+31000]
[Wed Oct 18 19:34:30 2023] xrootd[19221]: segfault at 30 ip 00007f445156ffc7 sp 00007f43f8ffc9d0 error 4 in libjemalloc.so.1[7f445156a000+31000]

the gdb segfaults are me trying to attach to the running process, but then eos-folly … and I don’t have the devtools installed atm.

This is where it gets a bit tricky: almost all FSTs are on 5.2.0, while the MGMs are still on 5.1.23:

 % for II in $(seq 1 15); do echo -n "fst-${II}: "; ssh fst-${II}.eos dmesg -T | grep -c segfault; done
fst-1: 33
fst-2: 6
fst-3: 0
fst-4: 0
fst-5: 16
fst-6: 12
fst-7: 7
fst-8: 0 # EOS 5.1.23
fst-9: 0  # EOS 5.1.23
fst-10: 0
fst-11: 0
fst-12: 0
fst-13: 0
fst-14: 0 # first updated to EOS 5.2.0, > 24h ago, running stable since
fst-15: 0 # first updated to EOS 5.2.0, > 24h ago, running stable since

The counts cover the period since this morning, when I continued the update process (after the initial two updates of fst-14 and fst-15 yesterday).
I’m unable to determine the difference between those machines. The EOS packages are the same, and so is the config (one working node has devtools installed, but the others don’t).
The update process should be consistent across the nodes, as I’m using an Ansible playbook for the whole process.
fst-10 to fst-15 are newer hardware, but with an identical processor.
The segfault happens every 15 minutes, +/- a few seconds. This is what I find odd: it looks as if it were triggered by some housekeeping call from the MGM, but this is just guesswork at this point.
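
The 15-minute spacing can be checked mechanically instead of by eye. This sketch, with an invented function name, reads `dmesg -T` output on stdin, keeps only the xrootd segfault lines (so the gdb segfaults are ignored) and prints the number of seconds between consecutive crashes. It assumes GNU date, which understands the timestamp format dmesg prints.

```shell
# Print the gap in seconds between consecutive xrootd segfaults.
# Input: `dmesg -T` output on stdin. Requires GNU date for `date -d`.
segfault_intervals() {
    grep 'xrootd\[[0-9]*]: segfault' |
    sed 's/^\[\([^]]*\)].*/\1/' |      # keep only the text between [ and ]
    {
        prev=""
        while IFS= read -r ts; do
            # Parse in a fixed time zone so only the differences matter.
            epoch=$(TZ=UTC date -d "$ts" +%s) || continue
            [ -n "$prev" ] && echo $(( epoch - prev ))
            prev=$epoch
        done
    }
}
```

Run as `dmesg -T | segfault_intervals` on an affected FST. Values clustering slightly above 900 would support the idea of a periodic task that runs roughly every 15 minutes after each restart, though `dmesg -T` timestamps are only approximate.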

Both jemalloc and eos-jemalloc are installed. The former is preloaded in eos_env:

% ssh fst-5.eos grep -i jemalloc /etc/sysconfig/eos_env
# Preload jemalloc
# for centos jemalloc
LD_PRELOAD="/usr/lib64/libjemalloc.so.1"
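
With two jemalloc packages installed, it can help to confirm which copy a running xrootd process has actually mapped, rather than relying on the eos_env setting alone. A minimal sketch, assuming the content of /proc/<pid>/maps is fed on stdin; the function name is made up.

```shell
# Print the distinct jemalloc library paths found in /proc/<pid>/maps
# content read from stdin. The path is the sixth field of each mapping line.
mapped_jemalloc() {
    awk '$6 ~ /jemalloc/ { print $6 }' | sort -u
}
```

For a given FST process this would be `mapped_jemalloc < /proc/<pid>/maps`. A single line showing /usr/lib64/libjemalloc.so.1 would match both the LD_PRELOAD setting and the library named in the dmesg segfault lines.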

Hi Erich,

Can you please install devtoolset-9-gdb and attach gdb to one of these regularly crashing FSTs and wait for a crash and then provide a trace for it?

Thanks,
Elvin
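
Because the crash only shows up about every 15 minutes, waiting at an interactive gdb prompt is tedious. One possible way to automate the request above is sketched here, assuming the gdb from devtoolset-9-gdb is first in PATH (for instance inside `scl enable devtoolset-9 bash`). The function name, the DRY_RUN switch and the default output path are all invented for this example; only the gdb options `-p`, `-batch` and `-ex` are standard.

```shell
# Attach gdb to a running process, let it continue until gdb stops on a
# signal (such as SIGSEGV), then write a backtrace of the stopped thread
# to a file. With DRY_RUN set, only print the gdb command line.
capture_crash_trace() {
    pid=$1
    out=${2:-/tmp/xrootd-crash-$pid.txt}   # hypothetical default output path
    set -- gdb -p "$pid" -batch -ex continue -ex bt
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
        return 0
    fi
    "$@" > "$out" 2>&1
}
```

Calling `capture_crash_trace <pid>` blocks until the process stops. Note that gdb also stops on signals other than SIGSEGV, so the trace should be checked for the signal name before it is posted.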

Hi Elvin,

Erich is on holiday today, so I attached to the crashing FST process. Here is the backtrace:

Thread 88 "xrootd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff0d97fd700 (LWP 6180)]
0x00007ff13f508fc7 in malloc () from /usr/lib64/libjemalloc.so.1
#0  0x00007ff13f508fc7 in malloc () from /usr/lib64/libjemalloc.so.1
#1  0x00007ff13e81b18d in operator new(unsigned long) () from /lib64/libstdc++.so.6
#2  0x00007ff13e879cd9 in std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#3  0x00007ff13e87b561 in char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag) () from /lib64/libstdc++.so.6
#4  0x00007ff13e87b998 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) () from /lib64/libstdc++.so.6
#5  0x00007ff13ef33dfa in XrdTlsTempCA::Maintenance (this=0x7ff1243f7780) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:396
#6  0x00007ff13ef352c8 in XrdTlsTempCA::MaintenanceThread (myself_raw=0x7ff1243f7780) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:495
#7  0x00007ff13ef2bd77 in XrdSysThread_Xeq (myargs=0x7ff123bff640) at /usr/src/debug/xrootd-5.6.2/src/XrdSys/XrdSysPthread.cc:86
#8  0x00007ff13e08fea5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007ff13ddb8b0d in clone () from /lib64/libc.so.6
No symbol table info available.

Hi Uemit,

Thanks for the trace, we’ll look into it. At first glance, it looks like this is coming from the XRootD framework. Do you have proper certificates configured for your FSTs?
By this I mean: do you have certificates installed in /etc/grid-security-certificates? These are generally added by the following package, which bundles the certificates for all the grid CAs: ca-policy-egi-core-1.123-1.noarch

Either way, this is a bug in XRootD, which should not crash no matter what. I will keep you updated on the progress.

Thanks,
Elvin

Hi Uemit,

Thanks for reporting this.

Can you please try to dump the content of the variable adminpath?
In your GDB, go to frame 5 “f 5”, then “p adminpath”.

Many thanks in advance.

Cheers,
Cedric

Hi Elvin,

Thanks for the update.
The package ca-policy-egi-core-1.123-1.noarch is already installed on the FST that is crashing.

[root@fst-1 ~]# rpm -ql ca-policy-egi-core-1.123-1.noarch
/etc/grid-security/certificates
/etc/grid-security/certificates/policy-egi-core.info
[root@fst-1 ~]# ls -la /etc/grid-security/certificates | wc -l
1239

The certificates, however, are not in /etc/grid-security-certificates but in /etc/grid-security/certificates.

In the meantime, can you also paste here your XRootD server configuration file?

Thanks very much again!

###########################################################
set MGM=$EOS_MGM_ALIAS
###########################################################

xrootd.fslib -2 libXrdEosFst.so
xrootd.async off nosf
xrd.network keepalive
xrootd.redirect $(MGM):1094 chksum

###########################################################
xrootd.seclib libXrdSec.so
sec.protocol unix
sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
sec.protbind * only unix sss
###########################################################
#TODO is this the export to the global namespace? must be /? don't put actual fileystem path in here
all.export / nolock
all.trace none
all.manager localhost 2131
#ofs.trace open
###########################################################
xrd.port 1095
ofs.persist off
ofs.osslib libEosFstOss.so
ofs.tpc pgm /opt/eos/xrootd/bin/xrdcp
###########################################################
# this URL can be overwritten by EOS_BROKER_URL defined /etc/sysconfig/xrd
fstofs.broker root://eos.grid.vbc.ac.at:1097//eos/
fstofs.autoboot true
fstofs.quotainterval 10
fstofs.metalog /var/eos/md/
#fstofs.authdir /var/eos/auth/
#fstofs.trace client

# FSCK attr + quarkdb conversion
# default is leveldb
fstofs.filemd_handler attr


###########################################################

# Quarkdb custer configuration used for the namespace
fstofs.qdbcluster mgm-1.eos.grid.vbc.ac.at:9999 mgm-2.eos.grid.vbc.ac.at:9999 mgm-3.eos.grid.vbc.ac.at:9999
fstofs.qdbpassword *******

#-------------------------------------------------------------------------------
# Configuration for XrdHttp http(s) service on port 11000
#-------------------------------------------------------------------------------

# Enable the XrdHttp plugin and listen on port 9001 for connections
xrd.protocol XrdHttp:9001 /usr/lib64/libXrdHttp.so
# Load the libEosFstHttp external handler
http.exthandler EosFstHttp /usr/lib64/libEosFstHttp.so none
# Load the XrdTpc external handler which deals with COPY and OPTIONS http
# verbs and provides the default HTTP TPC functionality
http.exthandler xrdtpc /usr/lib64/libXrdHttpTPC.so

# will need to add certs for this, but _should_ also work without
http.cert /etc/grid-security/hostcert.pem
http.key /etc/grid-security/hostkey.pem
http.cadir /etc/grid-security/certificates/
# http.cafile /etc/grid-security/daemon/ca.cert

[Switching to Thread 0x7f98a37fd700 (LWP 254145)]
0x00007f9905d07fc7 in malloc () from /usr/lib64/libjemalloc.so.1
Missing separate debuginfos, use: debuginfo-install nss-pem-1.0.3-7.el7.x86_64 nss-softokn-3.67.0-3.el7_9.x86_64 nss-sysinit-3.67.0-4.el7_9.x86_64 sqlite-3.7.17-8.el7_7.1.x86_64
(gdb) f 5
#5  0x00007f9905732dfa in XrdTlsTempCA::Maintenance (this=0x7f98eabf7780) at /usr/src/debug/xrootd-5.6.2/src/XrdTls/XrdTlsTempCA.cc:396
396	    std::string ca_tmp_dir = std::string(adminpath) + "/.xrdtls";
(gdb) p adminpath
$1 = <optimized out>

Yes, the path you have is correct, I made a typo in my previous post.

Thanks!

What about: p (char*)getenv("XRDADMINPATH")