FST validity expired when trying to read a file from Slave

Has anyone encountered the following situation:

  • A client who is connected MGM slave gets a message “Network unreachable” whne trying to open a file for reading?

After digging in, we found that:

  • for some reason the Slave MGM gets wrong health status from some of the FSTs; and
  • reports them as offline; and
  • this on leads into redirect target on the call from the client; and
  • ends up with an error message from the FST: Unable to open - fst validity expired

Can it be wrong settings or some kind of a bug?

Greetings,
Veselin

Hi Veselin,

The error FST validity expired happens due to a slow connection between two of the replicas. This we’ve put in place to avoid the XRootD client from triggering automatic open recovery which was buggy and leading to corruption of the data on disk.

Therefore, at this point I would say that the client was properly redirected from the slave to the FSTs but there is a problem with a slow connection or smth along those lines. The fully understand this could you provides us with a trace of such a transfer with XRootD logging enabled e.g.

XRD_LOGLEVEL=Dump xrdcp -f /some/file/ root://eos-slave//location/in/eos/

Thanks,
Elvin

Ciao Elvin from the client when I run:

I get many messages like that (with main error(?) FST still misses the required capability key) if I rerun the copy command it eventually succeeds:

[2018-07-05 11:25:42.897386 +0200][Dump ][TaskMgr ] Running task: “TickGeneratorTask for: eos-jeodpp-slave:1094”
[2018-07-05 11:25:42.897477 +0200][Dump ][TaskMgr ] Will rerun task “TickGeneratorTask for: eos-jeodpp-slave:1094” at [2018-07-05 11:25:57 +0200]
[2018-07-05 11:25:42.897520 +0200][Dump ][TaskMgr ] Running task: “TickGeneratorTask for: s-jrciprjeos145p.cidsn.jrc.it:1095”
[2018-07-05 11:25:42.897554 +0200][Dump ][TaskMgr ] Will rerun task “TickGeneratorTask for: s-jrciprjeos145p.cidsn.jrc.it:1095” at [2018-07-05 11:25:57 +0200]
[2018-07-05 11:25:42.902540 +0200][Dump ][XRootDTransport ] [eos-jeodpp-slave:1094 #0.0] Stream inactive since 165 seconds, TTL: 1200, allocated SIDs: 0, open files: 0
[2018-07-05 11:25:42.902589 +0200][Dump ][XRootDTransport ] [eos-jeodpp-slave:1094 #0.0] Stream inactive since 165 seconds, stream timeout: 60, allocated SIDs: 0, wait barrier: 2018-07-05 11:22:57 +0200
[2018-07-05 11:25:43.897688 +0200][Dump ][TaskMgr ] Running task: “FileTimer task”
[2018-07-05 11:25:43.897760 +0200][Dump ][TaskMgr ] Will rerun task “FileTimer task” at [2018-07-05 11:25:58 +0200]
[2018-07-05 11:25:47.898261 +0200][Dump ][TaskMgr ] Running task: “WaitTask for: 0x0x220b7a0”
[2018-07-05 11:25:47.898381 +0200][Dump ][PostMaster ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0] Sending message kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) (0x220b7a0) through substream 0 expecting answer at 0
[2018-07-05 11:25:47.898445 +0200][Debug ][TaskMgr ] Done with task: “WaitTask for: 0x0x220b7a0”
[2018-07-05 11:25:47.898526 +0200][Dump ][AsyncSock ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0.0] Wrote a message: kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) (0x220b7a0), 1255 bytes
[2018-07-05 11:25:47.898590 +0200][Dump ][AsyncSock ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0.0] Successfully sent message: kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) (0x220b7a0).
[2018-07-05 11:25:47.898625 +0200][Dump ][XRootD ] [s-jrciprjeos145p.cidsn.jrc.it:1095] Message kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) has been successfully sent.
[2018-07-05 11:25:47.898650 +0200][Dump ][PostMaster ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0.0] All messages consumed, disable uplink
[2018-07-05 11:25:47.898981 +0200][Dump ][XRootDTransport ] [msg: 0xa8007bd0] Expecting 74 bytes of message body
[2018-07-05 11:25:47.899010 +0200][Dump ][AsyncSock ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0.0] Received message header for 0xa8007bd0 size: 8
[2018-07-05 11:25:47.899031 +0200][Dump ][AsyncSock ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0.0] Received message 0xa8007bd0 of 82 bytes
[2018-07-05 11:25:47.899060 +0200][Dump ][PostMaster ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0] Handling received message: 0xa8007bd0.
[2018-07-05 11:25:47.899182 +0200][Dump ][XRootD ] [s-jrciprjeos145p.cidsn.jrc.it:1095] Got kXR_wait response of 10 seconds to message kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ): FST still misses the required capability key; come back in 10 seconds!
[2018-07-05 11:25:47.899295 +0200][Debug ][TaskMgr ] Registering task: “WaitTask for: 0x0x220b7a0” to be run at: [2018-07-05 11:25:57 +0200]

Greetings,
Veselin

Hi Veselin,

Can you make sure that the MGM and the diskserver have the same sss key?
Could you please post the output of the following commands?
on the master MGM: sum /etc/eos.keytab
on the slave MGM : sum /etc/eos.keytab
on host s-jrciprjeos145p.cidsn.jrc.it: sum /etc/eos.keytab

Thanks,
Elvin

Ciao Elvin,

We have several other users in the keytabs of the MGMs, the one that maters the daemon one is in the last row so here is the output:
Master: tail -n 1 /etc/eos.keytab | sum
29985 1
Slave: tail -n 1 /etc/eos.keytab | sum
29985 1
FST: tail -n 1 /etc/eos.keytab | sum
29985 1

And keep in mind that the read succeeds every once in a while.

Cheers,
Veselin

Hi Veselin,

I might have found the reason for the flipping online/offline status for the FSTs. I’m about to tag 4.2.28 and maybe you can give it a try. This should fix the issues that you’ve encountered. The new tag should be in the repos tomorrow.

Cheers,
Elvin

Hi Elvin,

Thank you for your answer, Veselin has now left, but I will try to upgrade the slave at convenient moment, and let you know.

Franck