EOS connection re-use

Hello,

We have been trying to debug a long ongoing issue we have with TPC transfers/tests from EOS ALICE storage nodes at CERN to EOS instances
at RAL. In a nutshell, the tests some times fail and some times succeed and we are trying to understand why this happens.

During his latest debugging attempt, our ALICE liason (Alex) found out that EOS reuses the connection to EOS storage nodes; the same connection can be used for multiple transfers.

He points to the tcpdump,
http://www-public.gridpp.rl.ac.uk/filelists/tcpdump_eos94

to point out that there is only one SYN exchange in the beginning and two rounds of exchange corresponding to two transfers.

As you can see, eos95 server sent multiple packets to initiate a transfer, but never got an ACK from EOS storage server (st-096-…). Alex thinks that these stale connections seem to be the cause of the problem.

Can you please confirm if this connection re-use is an EOS feature and if, so, can you indicate how can it turned off?

Thanks,

George

Sorry for the pinging @esindril. Do you have any comments on the above?

Thanks,

George

Hi George,

The mechanism of reutilizing TCP connections is not something specific to EOS but actually comes from the XRootD framework. There is nothing we can do at the EOS level to influence this. Normaly, a physical connection can be reused and can multiplex several transfers originating from the same client/process.

At this point looking at the network interaction is very low level. I would be more interested in the errors that you get from the application layer, ie. the XRootD client when performing such transfers and if you have any error messages at all or things just hang.

Of course, this does not rule out some faulty network equipment in between that could explain such transitory errors, but we would need more data on this, with exact storage nodes on the CERN EOS side that exhibit this behavior. I suggest you open a ticket concerning this on the CERN Service Now portal so that it can properly be investigated by the EOS ALICE service manager.

Cheers,
Elvin

Hi Elvin,

/* Sorry for jumping in */

The error messages we are getting are not very meaningful, they usually refered to expired tpc token. The errors happen when the server gives up on re-using stale connection, and establishes a new one.

Example:

240130 13:09:14 time=1706620154.516796 func=stat                     level=INFO  logid=8214e9aa-bf70-11ee-b6c5-e8ebd34116f4 unit=fst@antares-eos95.scd.rl.ac.uk:1095 tid=00007fd2c33d3700 source=XrdFstOfsFile:1308             tident=arogovsk.2173756:77@lxplus973.cern.ch sec=      uid=36350 gid=24277 name=nobody geo="" path=/eos/antarespreprodtier1/prod/alice/04/53413/1a476b28-bbca-11ee-8070-024282dad15e fxid=2faf1349 size=0 mtime=1706620043.63563771
240130 13:09:17 time=1706620157.034517 func=stat                     level=INFO  logid=8214e9aa-bf70-11ee-b6c5-e8ebd34116f4 unit=fst@antares-eos95.scd.rl.ac.uk:1095 tid=00007fd2c33d3700 source=XrdFstOfsFile:1308             tident=arogovsk.2173756:77@lxplus973.cern.ch sec=      uid=36350 gid=24277 name=nobody geo="" path=/eos/antarespreprodtier1/prod/alice/04/53413/1a476b28-bbca-11ee-8070-024282dad15e fxid=2faf1349 size=0 mtime=1706620043.63563771
240130 13:09:19 time=1706620159.555109 func=stat                     level=INFO  logid=8214e9aa-bf70-11ee-b6c5-e8ebd34116f4 unit=fst@antares-eos95.scd.rl.ac.uk:1095 tid=00007fd2c33d3700 source=XrdFstOfsFile:1308             tident=arogovsk.2173756:77@lxplus973.cern.ch sec=      uid=36350 gid=24277 name=nobody geo="" path=/eos/antarespreprodtier1/prod/alice/04/53413/1a476b28-bbca-11ee-8070-024282dad15e fxid=2faf1349 size=0 mtime=1706620043.63563771
240130 13:09:20 time=1706620160.641037 func=_close                   level=INFO  logid=8214e9aa-bf70-11ee-b6c5-e8ebd34116f4 unit=fst@antares-eos95.scd.rl.ac.uk:1095 tid=00007fd2c33d3700 source=XrdFstOfsFile:1505             tident=arogovsk.2173756:77@lxplus973.cern.ch sec=      uid=36350 gid=24277 name=nobody geo="" msg="remove tpc key" key=16eb57e02b3b2b3c65b8f48b
240130 13:23:52 time=1706621032.741816 func=fileOpen                 level=ERROR logid=8214e9aa-bf70-11ee-b6c5-e8ebd34116f4 unit=fst@antares-eos95.scd.rl.ac.uk:1095 tid=00007fd29a7fc700 source=XrdIo:284                      tident=<service> sec=      uid=0 gid=0 name= geo="" error= "open failed url=root://st-096-hh15112c.cern.ch:1095//14/30518/455060cb-6368-11ee-8c24-e83935241a72?fst.readahead=true&fst.valid=1706620102&tpc.key=16eb57e02b3b2b3c65b8f48b&tpc.org=arogovsk.2173756@lxplus973.cern.ch&tpc.stage=copy, errno=3010, errc=400, msg=[ERROR] Error response: permission denied"
240130 13:23:52 time=1706621032.741896 func=DoTpcTransfer            level=ERROR logid=8214e9aa-bf70-11ee-b6c5-e8ebd34116f4 unit=fst@antares-eos95.scd.rl.ac.uk:1095 tid=00007fd29a7fc700 source=XrdFstOfsFile:3813             tident=arogovsk.2173756:77@lxplus973.cern.ch sec=      uid=36350 gid=24277 name=nobody geo="" msg="TPC open failed for url=root://st-096-hh15112c.cern.ch:1095//14/30518/455060cb-6368-11ee-8c24-e83935241a72?fst.readahead=true cgi=tpc.key=16eb57e02b3b2b3c65b8f48b&tpc.org=arogovsk.2173756@lxplus973.cern.ch&tpc.stage=copy"

Note that the error to fileOpen request was returned almost 15 minutes after client disconnected.

Best Regards,
Alex