We have been trying to debug a long-standing issue with TPC transfers/tests from EOS ALICE storage nodes at CERN to EOS instances
at RAL. In a nutshell, the tests sometimes fail and sometimes succeed, and we are trying to understand why.
During his latest debugging attempt, our ALICE liaison (Alex) found out that EOS reuses the connection to EOS storage nodes; the same connection can be used for multiple transfers.
He shared a packet capture to point out that there is only one SYN exchange at the beginning and two rounds of exchange corresponding to two transfers.
As you can see, the eos95 server sent multiple packets to initiate a transfer, but never got an ACK from the EOS storage server (st-096-…). Alex thinks that these stale connections are likely the cause of the problem.
Can you please confirm whether this connection re-use is an EOS feature and, if so, indicate how it can be turned off?
The mechanism of reusing TCP connections is not something specific to EOS but actually comes from the XRootD framework. There is nothing we can do at the EOS level to influence this. Normally, a physical connection can be reused and can multiplex several transfers originating from the same client/process.
At this point, looking at the network interaction is very low level. I would be more interested in the errors that you get from the application layer, i.e. the XRootD client performing such transfers: do you get any error messages at all, or do things just hang?
Of course, this does not rule out some faulty network equipment in between that could explain such transitory errors, but we would need more data on this, with the exact storage nodes on the CERN EOS side that exhibit this behavior. I suggest you open a ticket about this on the CERN Service Now portal so that it can be properly investigated by the EOS ALICE service manager.
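If it helps, here is a minimal sketch of how to capture client-side errors for one of these transfers, using the standard XrdCl client environment variables (the log path is just an example):

export XRD_LOGLEVEL=Debug            # verbose XrdCl client logging
export XRD_LOGFILE=/tmp/xrdcl-debug.log
# then re-run the failing transfer with the usual client (xrdcp, the TPC driver, etc.)
# and attach the resulting log to the ticket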
The error messages we are getting are not very meaningful; they usually refer to an expired TPC token. The errors happen when the server gives up on re-using the stale connection and establishes a new one.
Sorry for the late reply on this, I was on holidays.
Would it be possible to debug one such recent failing transfer? For example, send me the FST logs for the day from your instance (where the reference transfer fails) and I can correlate the transfer with the activity on the EOS instance at CERN; then we can probably find more clues about the underlying cause of these failures.
I’ll let @georgep post the logs, since he is a service admin.
In the meantime we have enabled the TCP keepalive feature on our preprod instance (by adding the XRD_TCPKEEPALIVE.* env variables). This feature should detect and close stale connections, and so far there have been no failures on the preprod instance. Fingers crossed.
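For reference, a sketch of the kind of settings involved, as we understand the XRootD client keepalive variables (the values below are illustrative, not necessarily what we deployed):

export XRD_TCPKEEPALIVE=1            # enable TCP keepalive on client sockets
export XRD_TCPKEEPALIVETIME=60       # idle seconds before the first probe
export XRD_TCPKEEPALIVEINTERVAL=30   # seconds between probes
export XRD_TCPKEEPALIVEPROBES=4      # unanswered probes before the connection is declared dead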
Thanks for the logs, these are very useful in understanding what actually happens to these transfers. At first look, there does indeed seem to be some networking issue between RAL and CERN, but it is not possible to pinpoint the exact cause from the logs.
What happens for this particular transfer is that the client triggers the TPC transfer, but the open from antares-eos03.scd.rl.ac.uk takes around 15 minutes to arrive at the storage node at CERN, namely st-096-100gb054.cern.ch.
Below you can see the logs. The first open matches the timestamps from your logs: this is the client driving the TPC transfer, monalisa.2901490:9964@pcalimonitor, connecting to the CERN EOS storage at 240307 09:27:00. The TPC protocol then assumes that the destination will connect and transfer the file. That connection only happens at 240307 09:43:32, and by then the timeout on the client side has triggered and closed both connections to the source and the destination, and the TPC key validity has also expired.
At this point, I would suggest running some checks on the infrastructure to make sure there are no transient or systematic network problems (some router/switch dropping packets, etc.), for example by running traceroute or mtr between the two storage endpoints involved in this transfer. This is the output that I get from st-096-100gb054.cern.ch towards antares-eos03.scd.rl.ac.uk:
[root@st-096-100gb054 fst]# traceroute -4 antares-eos03.scd.rl.ac.uk
traceroute to antares-eos03.scd.rl.ac.uk (130.246.216.68), 30 hops max, 60 byte packets
1 l513-c-rjuxl-15-ip178.cern.ch (128.142.217.34) 3.712 ms 0.817 ms l513-c-rjuxl-16-ip178.cern.ch (128.142.217.35) 0.908 ms
2 l513-b-rjuxl-2-hb55.cern.ch (185.249.56.129) 0.754 ms l773-b-rjupl-1-hp55.cern.ch (185.249.56.137) 0.233 ms 0.239 ms
3 b513-b-rjuxl-1-ht2.cern.ch (192.65.196.209) 0.893 ms 1.155 ms b513-b-rjuxl-1-ht4.cern.ch (172.24.112.53) 0.864 ms
4 * * *
5 g513-e-fpa78-1-fi1.cern.ch (192.65.196.150) 0.744 ms 0.749 ms 0.761 ms
6 * * *
7 e513-e-rjup1-1-pe4.cern.ch (192.65.184.82) 1.033 ms 0.929 ms 0.951 ms
8 cern.mx1.gen.ch.geant.net (62.40.124.157) 1.122 ms 1.190 ms 1.644 ms
9 ae7.mx1.par.fr.geant.net (62.40.98.239) 15.763 ms 15.744 ms 15.547 ms
10 ae8.mx1.lon2.uk.geant.net (62.40.98.106) 15.119 ms 15.106 ms 15.066 ms
11 ae2.mx1.lon.uk.geant.net (62.40.98.64) 15.881 ms 15.771 ms 15.758 ms
12 janet-gw.mx1.lon.uk.geant.net (62.40.124.198) 15.895 ms 15.935 ms 16.030 ms
13 ae29.londpg-sbr2.ja.net (146.97.33.2) 19.577 ms 19.593 ms 19.466 ms
14 ral-r26.ja.net (146.97.41.34) 17.676 ms 17.906 ms 17.885 ms
15 192.100.78.51 (192.100.78.51) 18.618 ms 18.509 ms 18.578 ms
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
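In addition to the plain traceroute above, an mtr report between the same endpoints would show per-hop packet loss accumulated over many probes; something along these lines (the flags are just a typical choice, adjust as needed):

[root@st-096-100gb054 fst]# mtr -rw -c 100 antares-eos03.scd.rl.ac.uk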
Thanks for your analysis.
Here is a traceroute from antares-eos03.scd.rl.ac.uk to st-096-100gb054.cern.ch. It looks the same as yours to me (so probably no routing issue is present):
[root@antares-eos03 ~]# traceroute st-096-100gb054.cern.ch
traceroute to st-096-100gb054.cern.ch (128.142.217.41), 30 hops max, 60 byte packets
1 gateway (130.246.216.65) 0.325 ms 0.296 ms 0.393 ms
2 10.0.10.190 (10.0.10.190) 0.060 ms 0.045 ms 0.055 ms
3 10.0.10.241 (10.0.10.241) 0.455 ms 0.574 ms 0.559 ms
4 10.0.10.254 (10.0.10.254) 0.146 ms 0.132 ms 0.116 ms
5 192.100.78.55 (192.100.78.55) 0.495 ms 0.564 ms 0.549 ms
6 ae3.londpg-sbr2.ja.net (146.97.41.33) 2.271 ms 2.219 ms 2.284 ms
7 ae29.londhx-sbr1.ja.net (146.97.33.1) 2.721 ms 2.725 ms 2.656 ms
8 janet.mx1.lon.uk.geant.net (62.40.124.197) 2.864 ms 2.785 ms 2.955 ms
9 ae2.mx1.lon2.uk.geant.net (62.40.98.65) 17.923 ms 17.888 ms 17.864 ms
10 ae8.mx1.par.fr.geant.net (62.40.98.107) 9.874 ms 9.469 ms 9.445 ms
11 ae7.mx1.gen.ch.geant.net (62.40.98.238) 17.938 ms 17.814 ms 17.723 ms
12 cern-gw.mx1.gen.ch.geant.net (62.40.124.158) 17.781 ms 17.428 ms 17.407 ms
13 e513-e-rjup1-1-ne0.cern.ch (192.65.184.189) 17.808 ms e513-e-rjuxm-v10-pe4.cern.ch (192.65.184.81) 17.599 ms 17.953 ms
14 * * e513-e-rjuxm-v10-pe4.cern.ch (192.65.184.81) 18.044 ms
15 * * *
16 * * *
As said above, the problem seems to affect only long-lived inactive connections, so it may be tricky to reproduce without actually establishing a connection and leaving it idle for some time…
Looking at the output of mtr from the CERN EOS storage node towards RAL, I see there is 100% packet loss after a few hops. Is this actually expected? Do you know what could be the explanation? Is there some packet filtering taking place somewhere along the route?
Going back to the logging on the storage server at CERN, this looks to me like a new TCP connection, since the previous one was closed a few hours before:
240307 01:57:03 3301519 XrootdXeq: daemon.5179:11095@antares-eos03.scd.rl.ac.uk disc 2:13:54 (socket error)
Also, this socket error message on disconnect is not particularly comforting. I'll ping @amadio, maybe he can set up a reproducer for this issue, as it seems to come from the XRootD layer rather than from EOS.
Can you or @georgep let us know exactly what version of eos/eos-xrootd and xrootd you are using?
I think I've been able to reproduce this locally with both TPC endpoints on the same machine, so I'd say that it's probably not networking. Since the ticket mentions that things used to work with XRootD 5.5.x, I will try to bisect this to the commit that introduced the problem, if I confirm that things do work with the earlier version for the case I saw.
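For context, a minimal sketch of the kind of local reproducer I mean, assuming two xrootd servers running on the same host on different ports (hostnames, ports and paths are placeholders):

xrdcp --tpc only root://localhost:1094//data/src.dat root://localhost:1095//data/dst.dat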
Looking at the output of mtr from the CERN EOS storage node towards RAL, I see there is 100% packet loss after a few hops. Is this actually expected? Do you know what could be the explanation?
Some network routers discard packets whose TTL has expired and do not send an ICMP time-exceeded packet back to the sender (e.g. this can happen when the device is busy, or when it is set up to behave like this).
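If ICMP/UDP probes are being filtered along the route, a TCP-based traceroute towards the xrootd port may get further and help distinguish filtering from real loss; a possible check (1094 is the usual xrootd port, adjust to your setup):

[root@st-096-100gb054 fst]# traceroute -T -p 1094 antares-eos03.scd.rl.ac.uk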
Can you or @georgep let us know exactly what version of eos/eos-xrootd and xrootd you are using?
We use eos-server-5.1.28-1.el7.cern.x86_64 and eos-xrootd-5.5.10-1.el7.cern.x86_64.
Thanks for letting us know; I initially misunderstood which TCP keepalive feature you were referring to. I was thinking of the OS-level one, but now I see you enabled the one embedded in the XRootD client. We never needed to do this, but it's good that it had the desired effect!
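For reference, the OS-level knobs I had in mind are the kernel-wide sysctls below; my understanding is that they only take effect for sockets that explicitly enable SO_KEEPALIVE, which is what the XRD_TCPKEEPALIVE.* variables make the XRootD client do (with its own per-connection values):

sysctl net.ipv4.tcp_keepalive_time     # idle seconds before the first probe (default 7200)
sysctl net.ipv4.tcp_keepalive_intvl    # seconds between probes (default 75)
sysctl net.ipv4.tcp_keepalive_probes   # unanswered probes before the connection is dropped (default 9)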