In the recent EOS workshop, Elvin mentioned that the ttl value set in the
ofs.tpc directive in /etc/xrd.cf.fst is not respected by EOS (or is it the MGM?). Can you
please confirm that this is indeed the case and say what the possible workarounds are?
I have been trying to modify ofs.tpc (including increasing the ttl) in /etc/xrd.cf.fst with EOS 4.8.88 (upgraded last week to 4.8.98) but without result, and this bug could explain why.
It’s true that EOS does not use any of the parameters specified in the
ofs.tpc directive since the TPC mechanism is different from the vanilla XRootD one. By default the
ttl value in EOS 4 is 120 seconds.
Starting with EOS 5.1.5 this can be controlled by increasing
EOS_FST_TPC_KEY_MIN_VALIDITY_SEC, which is still 120 seconds by default. The
ttl value can also be controlled by the client, but in EOS 4 this is not taken into account, so the value is 120 seconds no matter what the client requests.
Nevertheless, having to wait 120 seconds (or more) for the client to come back with the key is quite a long time, and it might point to a more systemic issue with the other endpoint if it needs that long for such a simple action.
The ability to control this
ttl value was added in the following commit, and you can find further details in the commit message.
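To make the behaviour described above concrete, here is a minimal Python sketch of the rule as I read it from this thread. It is not the EOS implementation. The function names are hypothetical, reading EOS_FST_TPC_KEY_MIN_VALIDITY_SEC from the environment is an assumption, and taking the larger of the configured minimum and the client-requested ttl is also an assumption that the thread does not confirm.

```python
import os

DEFAULT_MIN_VALIDITY_SEC = 120  # default value quoted in this thread


def min_validity_sec(env=os.environ):
    """Return the configured minimum key validity, or the 120 s default.

    ASSUMPTION: the setting is read as an environment variable.
    """
    raw = env.get("EOS_FST_TPC_KEY_MIN_VALIDITY_SEC", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MIN_VALIDITY_SEC
    return value if value > 0 else DEFAULT_MIN_VALIDITY_SEC


def effective_ttl_sec(client_ttl, configurable, env=os.environ):
    """Model the TPC key ttl described in the thread.

    configurable=False models EOS 4: always 120 s, whatever is requested.
    configurable=True models EOS >= 5.1.5.
    """
    if not configurable:
        return DEFAULT_MIN_VALIDITY_SEC
    floor = min_validity_sec(env)
    if client_ttl is None:
        return floor
    # ASSUMPTION: the larger of the two values wins; not confirmed here.
    return max(floor, client_ttl)
```

Under this model, a ttl of 80 or 90 configured against EOS 4 has no effect, which would match what was observed with the ofs.tpc changes.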
Thanks for clarifying this, Elvin.
120 seconds is larger than the [min,max] of [80,90] I attempted to set in the FST config. The client error “The following command has timed out and was killed after 300s:” does indeed point to another kind of cause, which we will try to identify with the help of a simple TPC script.
Just to note, in case it is of any help: the monalisa client issues repeated stat requests (>100) to EOS after “tpc running → 2nd sync” and at some point (after 5 min/300 s) gives up.
I cannot reproduce the issue on a dev instance, which is not used by external users other than ALICE, and this makes me wonder whether production load has anything to do with it. I have also noticed that when eos@fst is restarted the ALICE TPC test becomes successful for some time but then fails again.
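A simple TPC test script of the kind mentioned above could look roughly like the following Python sketch. It only wraps the standard xrdcp client with its `--tpc only` and `--force` options and applies the same 300 s limit as the failing test. The function names and the shape of the result are hypothetical, and the source and destination URLs have to be supplied by the caller.

```python
import subprocess
import time


def build_tpc_command(src_url, dst_url):
    """Build an xrdcp call that must use third-party copy or fail."""
    # --tpc only: do not fall back to a normal client-side copy
    # --force: overwrite the destination if it already exists
    return ["xrdcp", "--force", "--tpc", "only", src_url, dst_url]


def run_tpc_test(src_url, dst_url, timeout_sec=300, runner=subprocess.run):
    """Run one TPC transfer and report status, duration and stderr."""
    cmd = build_tpc_command(src_url, dst_url)
    start = time.monotonic()
    try:
        result = runner(cmd, capture_output=True, text=True,
                        timeout=timeout_sec)
        status = "ok" if result.returncode == 0 else "failed"
        stderr = result.stderr
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run on timeout.
        status, stderr = "timeout", ""
    return {
        "status": status,
        "elapsed_sec": time.monotonic() - start,
        "stderr": stderr,
    }
```

Running this in a loop against both the dev and the production instance, and recording `elapsed_sec` together with the time of the last eos@fst restart, may show whether the failures correlate with FST uptime or load.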
Can you describe the problem that you are having in a bit more detail? I’m not sure I get the full picture. What you describe here is the normal implementation of the XRootD client TPC functionality and there is no surprise about the stat; it happens every 2.5 seconds.
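As a quick sanity check on the numbers in this thread, the following Python sketch estimates how many stat requests a client would issue if it polled at a fixed 2.5 second interval until a 300 second timeout. The assumption that polling continues at a constant rate for the whole window is mine. The function name is hypothetical.

```python
STAT_INTERVAL_SEC = 2.5   # polling interval quoted above
CLIENT_TIMEOUT_SEC = 300  # timeout reported in the client error


def expected_stat_count(timeout_sec, interval_sec):
    """Number of polls that fit into the timeout window."""
    return int(timeout_sec // interval_sec)


print(expected_stat_count(CLIENT_TIMEOUT_SEC, STAT_INTERVAL_SEC))  # 120
```

About 120 polls is consistent with the ">100" stat requests reported before the client gives up, so the count itself only shows that the transfer never completed within the window.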
I am not surprised about the “stat” itself. In the dev EOS instance, where the TPC test is constantly passing, there is one stat after the 1st tpc sync, another after the 2nd tpc sync, and then the file is copied. In the production EOS instance (where I have the problem), on the other hand, I see tens of stats after the 2nd sync, after which the client times out (300 seconds)…
Maybe setting the following in the FST config
xrootd.trace all -debug
will give a clue, because right now I’m totally clueless about what is happening!