We have been trying to debug a long-standing issue with TPC transfers/tests from EOS ALICE storage nodes at CERN to EOS instances
at RAL. In a nutshell, the tests sometimes fail and sometimes succeed, and we are trying to understand why.
During his latest debugging attempt, our ALICE liaison (Alex) found out that EOS reuses the connection to EOS storage nodes; the same connection can be used for multiple transfers.
He shared a packet capture to point out that there is only one SYN exchange at the beginning and two rounds of exchange corresponding to two transfers.
As you can see, the eos95 server sent multiple packets to initiate a transfer, but never got an ACK from the EOS storage server (st-096-…). Alex thinks that these stale connections are likely the cause of the problem.
Can you please confirm whether this connection re-use is an EOS feature and, if so, indicate how it can be turned off?
The mechanism of reusing TCP connections is not something specific to EOS but actually comes from the XRootD framework. There is nothing we can do at the EOS level to influence this. Normally, a physical connection can be reused and can multiplex several transfers originating from the same client/process.
At this point, looking at the network interaction is very low level. I would be more interested in the errors that you get from the application layer, i.e. the XRootD client performing such transfers: do you get any error messages at all, or do things just hang?
Of course, this does not rule out some faulty network equipment in between that could explain such transitory errors, but we would need more data on this, with the exact storage nodes on the CERN EOS side that exhibit this behavior. I suggest you open a ticket about this on the CERN Service Now portal so that it can be properly investigated by the EOS ALICE service manager.
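If it helps, here is a minimal sketch of how to capture client-side errors for one of these transfers, using the standard XrdCl client environment variables (the log path is just an example):

export XRD_LOGLEVEL=Debug            # verbose XrdCl client logging
export XRD_LOGFILE=/tmp/xrdcl-debug.log
# then re-run the failing transfer with the usual client (xrdcp, the TPC driver, etc.)
# and attach the resulting log to the ticket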
The error messages we are getting are not very meaningful; they usually refer to an expired TPC token. The errors happen when the server gives up on re-using the stale connection and establishes a new one.
Sorry for the late reply on this, I was on holidays.
Would it be possible to debug one such recent failing transfer? For example, send me the FST logs for the day from your instance (where the reference transfer fails) and I can correlate the transfer with the activity on the EOS instance at CERN; then we can probably find more clues about the underlying cause of these failures.
I’ll let @georgep post the logs, since he is a service admin.
In the meantime we have enabled the TCP keepalive feature on our preprod instance (by adding the XRD_TCPKEEPALIVE.* env variables). This feature should detect and close stale connections, and so far there have been no failures on the preprod instance. Fingers crossed.
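For reference, a sketch of the kind of settings involved, as we understand the XRootD client keepalive variables (the values below are illustrative, not necessarily what we deployed):

export XRD_TCPKEEPALIVE=1            # enable TCP keepalive on client sockets
export XRD_TCPKEEPALIVETIME=60       # idle seconds before the first probe
export XRD_TCPKEEPALIVEINTERVAL=30   # seconds between probes
export XRD_TCPKEEPALIVEPROBES=4      # unanswered probes before the connection is declared dead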
Thanks for the logs, these are very useful in understanding what actually happens to these transfers. At first look, there does indeed seem to be some networking issue between RAL and CERN, but it is not possible to pinpoint the exact cause from the logs.
What happens for this particular transfer is that the client triggers the TPC transfer, but the open from antares-eos03.scd.rl.ac.uk takes around 15 minutes to arrive at the storage node at CERN, namely st-096-100gb054.cern.ch.
Below you can see the logs. The first open matches the timestamps from your logs: this is the client driving the TPC transfer, monalisa.2901490:9964@pcalimonitor, connecting to the CERN EOS storage at 240307 09:27:00. The TPC protocol then assumes that the destination will connect and transfer the file. That connection only happens at 240307 09:43:32, and by then the timeout on the client side has triggered and closed both connections to the source and the destination, and the TPC key validity has also expired.
At this point, I would suggest running some checks on the infrastructure to make sure there are no transient or systematic network problems (some router/switch dropping packets, etc.), for example by running traceroute or mtr between the two storage endpoints involved in this transfer. This is the output that I get from st-096-100gb054.cern.ch towards antares-eos03.scd.rl.ac.uk:
[root@st-096-100gb054 fst]# traceroute -4 antares-eos03.scd.rl.ac.uk
traceroute to antares-eos03.scd.rl.ac.uk (130.246.216.68), 30 hops max, 60 byte packets
1 l513-c-rjuxl-15-ip178.cern.ch (128.142.217.34) 3.712 ms 0.817 ms l513-c-rjuxl-16-ip178.cern.ch (128.142.217.35) 0.908 ms
2 l513-b-rjuxl-2-hb55.cern.ch (185.249.56.129) 0.754 ms l773-b-rjupl-1-hp55.cern.ch (185.249.56.137) 0.233 ms 0.239 ms
3 b513-b-rjuxl-1-ht2.cern.ch (192.65.196.209) 0.893 ms 1.155 ms b513-b-rjuxl-1-ht4.cern.ch (172.24.112.53) 0.864 ms
4 * * *
5 g513-e-fpa78-1-fi1.cern.ch (192.65.196.150) 0.744 ms 0.749 ms 0.761 ms
6 * * *
7 e513-e-rjup1-1-pe4.cern.ch (192.65.184.82) 1.033 ms 0.929 ms 0.951 ms
8 cern.mx1.gen.ch.geant.net (62.40.124.157) 1.122 ms 1.190 ms 1.644 ms
9 ae7.mx1.par.fr.geant.net (62.40.98.239) 15.763 ms 15.744 ms 15.547 ms
10 ae8.mx1.lon2.uk.geant.net (62.40.98.106) 15.119 ms 15.106 ms 15.066 ms
11 ae2.mx1.lon.uk.geant.net (62.40.98.64) 15.881 ms 15.771 ms 15.758 ms
12 janet-gw.mx1.lon.uk.geant.net (62.40.124.198) 15.895 ms 15.935 ms 16.030 ms
13 ae29.londpg-sbr2.ja.net (146.97.33.2) 19.577 ms 19.593 ms 19.466 ms
14 ral-r26.ja.net (146.97.41.34) 17.676 ms 17.906 ms 17.885 ms
15 192.100.78.51 (192.100.78.51) 18.618 ms 18.509 ms 18.578 ms
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
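In addition to the plain traceroute above, an mtr report between the same endpoints would show per-hop packet loss accumulated over many probes; something along these lines (the flags are just a typical choice, adjust as needed):

[root@st-096-100gb054 fst]# mtr -rw -c 100 antares-eos03.scd.rl.ac.uk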
Thanks for your analysis.
Here is a traceroute from antares-eos03.scd.rl.ac.uk to st-096-100gb054.cern.ch. It looks the same as yours to me (so probably no routing issue is present):
[root@antares-eos03 ~]# traceroute st-096-100gb054.cern.ch
traceroute to st-096-100gb054.cern.ch (128.142.217.41), 30 hops max, 60 byte packets
1 gateway (130.246.216.65) 0.325 ms 0.296 ms 0.393 ms
2 10.0.10.190 (10.0.10.190) 0.060 ms 0.045 ms 0.055 ms
3 10.0.10.241 (10.0.10.241) 0.455 ms 0.574 ms 0.559 ms
4 10.0.10.254 (10.0.10.254) 0.146 ms 0.132 ms 0.116 ms
5 192.100.78.55 (192.100.78.55) 0.495 ms 0.564 ms 0.549 ms
6 ae3.londpg-sbr2.ja.net (146.97.41.33) 2.271 ms 2.219 ms 2.284 ms
7 ae29.londhx-sbr1.ja.net (146.97.33.1) 2.721 ms 2.725 ms 2.656 ms
8 janet.mx1.lon.uk.geant.net (62.40.124.197) 2.864 ms 2.785 ms 2.955 ms
9 ae2.mx1.lon2.uk.geant.net (62.40.98.65) 17.923 ms 17.888 ms 17.864 ms
10 ae8.mx1.par.fr.geant.net (62.40.98.107) 9.874 ms 9.469 ms 9.445 ms
11 ae7.mx1.gen.ch.geant.net (62.40.98.238) 17.938 ms 17.814 ms 17.723 ms
12 cern-gw.mx1.gen.ch.geant.net (62.40.124.158) 17.781 ms 17.428 ms 17.407 ms
13 e513-e-rjup1-1-ne0.cern.ch (192.65.184.189) 17.808 ms e513-e-rjuxm-v10-pe4.cern.ch (192.65.184.81) 17.599 ms 17.953 ms
14 * * e513-e-rjuxm-v10-pe4.cern.ch (192.65.184.81) 18.044 ms
15 * * *
16 * * *
As said above, the problem seems to affect only long-lived inactive connections, so it may be tricky to reproduce without actually establishing a connection and leaving it idle for some time…
Looking at the output of mtr from the CERN EOS storage node towards RAL, I see there is 100% packet loss after a few hops. Is this actually expected? Do you know what could be the explanation? Is there some packet filtering taking place somewhere along the route?
Going back to the logging on the storage server at CERN, this looks to me like a new TCP connection, since the previous one was closed a few hours before:
240307 01:57:03 3301519 XrootdXeq: daemon.5179:11095@antares-eos03.scd.rl.ac.uk disc 2:13:54 (socket error)
Also, this socket error message on disconnect is not particularly comforting. I'll ping @amadio, maybe he can set up a reproducer for this issue, as it seems to come from the XRootD layer rather than from EOS.
Can you or @georgep let us know exactly what version of eos/eos-xrootd and xrootd you are using?
I think I've been able to reproduce this locally with both TPC endpoints on the same machine, so I'd say that it's probably not networking. Since the ticket mentions that things used to work with XRootD 5.5.x, I will try to bisect this to the commit that introduced the problem, if I confirm that things do work with the earlier version for the case I saw.
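For context, a minimal sketch of the kind of local reproducer I mean, assuming two xrootd servers running on the same host on different ports (hostnames, ports and paths are placeholders):

xrdcp --tpc only root://localhost:1094//data/src.dat root://localhost:1095//data/dst.dat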
Looking at the output of mtr from the CERN EOS storage node towards RAL, I see there is 100% packet loss after a few hops. Is this actually expected? Do you know what could be the explanation?
Some network routers discard packets whose TTL has expired and do not send an ICMP time-exceeded packet back to the sender (e.g. this can happen when the device is busy, or when it is set up to behave like this).
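If ICMP/UDP probes are being filtered along the route, a TCP-based traceroute towards the xrootd port may get further and help distinguish filtering from real loss; a possible check (1094 is the usual xrootd port, adjust to your setup):

[root@st-096-100gb054 fst]# traceroute -T -p 1094 antares-eos03.scd.rl.ac.uk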
Can you or @georgep let us know exactly what version of eos/eos-xrootd and xrootd you are using?
We use eos-server-5.1.28-1.el7.cern.x86_64 and eos-xrootd-5.5.10-1.el7.cern.x86_64.
Thanks for letting us know; I initially misunderstood which TCP keepalive feature you were referring to. I was thinking of the OS-level one, but now I see you enabled the one embedded in the XRootD client. We never needed to do this, but it's good that it had the desired effect!
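For reference, the OS-level knobs I had in mind are the kernel-wide sysctls below; my understanding is that they only take effect for sockets that explicitly enable SO_KEEPALIVE, which is what the XRD_TCPKEEPALIVE.* variables make the XRootD client do (with its own per-connection values):

sysctl net.ipv4.tcp_keepalive_time     # idle seconds before the first probe (default 7200)
sysctl net.ipv4.tcp_keepalive_intvl    # seconds between probes (default 75)
sysctl net.ipv4.tcp_keepalive_probes   # unanswered probes before the connection is dropped (default 9)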