TTL in ofs.tpc is not respected in EOS 4?

Hello,

In the recent EOS workshop, Elvin mentioned that the ttl value set in the
ofs.tpc directive in /etc/xrd.cf.fst is not respected by EOS (or is it the MGM?). Can you
please confirm that this is indeed the case, and what the possible workarounds are?

I have been trying to modify ofs.tpc (including increasing the ttl) in /etc/xrd.cf.fst with EOS 4.8.88 (upgraded last week to 4.8.98), but without any effect, and this behaviour could explain why.

Thanks,

George

Hi George,

It’s true that EOS does not use any of the parameters specified in the ofs.tpc directive since the TPC mechanism is different from the vanilla XRootD one. By default the ttl value in EOS 4 is 120 seconds.

Starting with EOS 5.1.5 this can be controlled by increasing EOS_FST_TPC_KEY_MIN_VALIDITY_SEC, which is still 120 seconds by default. The ttl value can also be requested by the client, but in EOS 4 this request is not taken into account, so the value is 120 seconds no matter what the client asks for.
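To make the behaviour described above concrete, here is a minimal Python sketch of how the effective key validity could be derived. It is not the EOS implementation. It assumes that EOS_FST_TPC_KEY_MIN_VALIDITY_SEC is read from the FST environment and that it acts as a floor under the ttl requested by the client (inferred only from the name of the setting). The function name and its parameters are hypothetical, and the commit referenced in this post remains the authoritative description.

```python
import os
from typing import Mapping, Optional

# Default validity quoted in this thread (seconds).
DEFAULT_MIN_VALIDITY_SEC = 120


def effective_tpc_key_ttl(
    client_ttl_sec: Optional[int],
    env: Mapping[str, str] = os.environ,
    honour_client_ttl: bool = True,
) -> int:
    """Hypothetical helper, for illustration only.

    ASSUMPTION: the setting is an environment variable and is used as a
    lower bound on the client-requested ttl. Pass honour_client_ttl=False
    to model EOS 4, where the client request is ignored.
    """
    raw = env.get("EOS_FST_TPC_KEY_MIN_VALIDITY_SEC", "")
    try:
        floor = int(raw)
    except ValueError:
        # Unset or not a number: fall back to the default.
        floor = DEFAULT_MIN_VALIDITY_SEC
    if floor <= 0:
        floor = DEFAULT_MIN_VALIDITY_SEC

    if not honour_client_ttl or client_ttl_sec is None:
        return floor
    return max(client_ttl_sec, floor)
```

Under these assumptions, a client asking for 90 seconds would still get 120 seconds with the default setting, and raising the setting to 300 would raise the validity to 300 seconds.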

Nevertheless, having to wait for 120 seconds (or more) for the client to come with the key is quite a long time and might point to some other more systemic issues with the other endpoint that requires so much time for such a simple action.

The ability to control this ttl value was added in the following commit [1] and you can find further details in the commit message.

Cheers,
Elvin

[1] https://gitlab.cern.ch/dss/eos/-/commit/2387cb85e50ac6236e4a5acc08eced8059fcc4b8

Hi Elvin,

Thanks for clarifying this Elvin.

120 seconds is larger than the [min,max] of [80,90] that I attempted to set in the FST config. The client error “The following command has timed out and was killed after 300s:” does indeed point to a different cause, which we will try to identify with the help of a simple TPC script.

George

Hi Elvin,

Just to note, in case it is of any help: the MonALISA client issues repeated stat requests (>100) to EOS after “tpc running → 2nd sync” and at some point (after 5 minutes, i.e. 300 s) gives up.

I cannot reproduce the issue on a dev instance, which is not used by any external users other than ALICE, and this makes me wonder whether production load has anything to do with it. I have also noticed that when eos@fst is restarted, the ALICE TPC test becomes successful for some time but then fails again.

George

Hi George,

Can you describe a bit more the problem that you are having? I’m not sure I get the full picture. What you describe here is the normal implementation of the XRootD client TPC functionality and there is no surprise about the stat, this happens every 2.5 seconds [1].
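As a rough cross-check of the numbers in this thread, the short Python sketch below estimates how many stat requests a client polling every 2.5 seconds would send before a 300 second timeout. The function name is hypothetical, and the estimate ignores request latency and any delay before polling starts.

```python
import math


def max_stat_polls(timeout_sec: float, poll_interval_sec: float = 2.5) -> int:
    """Upper bound on the number of polls that fit into the timeout window."""
    if poll_interval_sec <= 0:
        raise ValueError("poll_interval_sec must be positive")
    return math.floor(timeout_sec / poll_interval_sec)


# 300 s client timeout and 2.5 s polling interval, both taken from this thread.
print(max_stat_polls(300))
```

With these inputs the bound is 120 polls, which is consistent with the ">100" stat requests reported earlier in the thread.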

Thanks,
Elvin

[1] https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClThirdPartyCopyJob.cc#L732

Hi Elvin,

I am not surprised about the “stat” itself. In the dev EOS instance where the TPC test is constantly passing, there is one stat after the 1st tpc sync, another after the 2nd tpc sync and then the file is copied. Whereas in the production EOS instance (where I have the problem), I see all these tens of stats after the 2nd sync after which the client is timed out (300 seconds)…

Maybe setting the following in the FST config

xrootd.trace all -debug
xrd.trace all

will give a clue, because right now I’m totally clueless about what is happening!

George

Hi Elvin,

Just a quick check with you: do the ofs.tpc parameters continue to be ignored by EOS 5 (as was the case with EOS 4)?

If this is the case with EOS 5 as well, we will try playing with EOS_FST_TPC_KEY_MIN_VALIDITY_SEC.

Thanks,

George

Hi George,

Indeed, nothing changed in EOS 5 in comparison to 4 when it comes to this functionality.

Cheers,
Elvin

Hi Elvin,

Thanks for confirming.

George