vessko
(Veselin Vasilev)
June 25, 2018, 9:58am
1
Has anyone encountered the following situation:
A client who is connected MGM slave gets a message “Network unreachable” whne trying to open a file for reading?
After digging in, we found that:
for some reason the Slave MGM gets wrong health status from some of the FSTs; and
reports them as offline; and
this on leads into redirect target on the call from the client; and
ends up with an error message from the FST: Unable to open - fst validity expired
Can it be wrong settings or some kind of a bug?
Greetings,
Veselin
esindril
(Elvin Alin Sindrilaru)
July 5, 2018, 7:31am
2
Hi Veselin,
The error FST validity expired happens due to a slow connection between two of the replicas. This we’ve put in place to avoid the XRootD client from triggering automatic open recovery which was buggy and leading to corruption of the data on disk.
Therefore, at this point I would say that the client was properly redirected from the slave to the FSTs but there is a problem with a slow connection or smth along those lines. The fully understand this could you provides us with a trace of such a transfer with XRootD logging enabled e.g.
XRD_LOGLEVEL=Dump xrdcp -f /some/file/ root://eos-slave//location/in/eos/
Thanks,
Elvin
vessko
(Veselin Vasilev)
July 5, 2018, 9:42am
3
Ciao Elvin from the client when I run:
I get many messages like that (with main error(?) FST still misses the required capability key) if I rerun the copy command it eventually succeeds:
[2018-07-05 11:25:42.897386 +0200][Dump ][TaskMgr ] Running task: “TickGeneratorTask for: eos-jeodpp-slave:1094”
[2018-07-05 11:25:42.897477 +0200][Dump ][TaskMgr ] Will rerun task “TickGeneratorTask for: eos-jeodpp-slave:1094” at [2018-07-05 11:25:57 +0200]
[2018-07-05 11:25:42.897520 +0200][Dump ][TaskMgr ] Running task: “TickGeneratorTask for: s-jrciprjeos145p.cidsn.jrc.it:1095”
[2018-07-05 11:25:42.897554 +0200][Dump ][TaskMgr ] Will rerun task “TickGeneratorTask for: s-jrciprjeos145p.cidsn.jrc.it:1095” at [2018-07-05 11:25:57 +0200]
[2018-07-05 11:25:42.902540 +0200][Dump ][XRootDTransport ] [eos-jeodpp-slave:1094 #0 .0] Stream inactive since 165 seconds, TTL: 1200, allocated SIDs: 0, open files: 0
[2018-07-05 11:25:42.902589 +0200][Dump ][XRootDTransport ] [eos-jeodpp-slave:1094 #0 .0] Stream inactive since 165 seconds, stream timeout: 60, allocated SIDs: 0, wait barrier: 2018-07-05 11:22:57 +0200
[2018-07-05 11:25:43.897688 +0200][Dump ][TaskMgr ] Running task: “FileTimer task”
[2018-07-05 11:25:43.897760 +0200][Dump ][TaskMgr ] Will rerun task “FileTimer task” at [2018-07-05 11:25:58 +0200]
[2018-07-05 11:25:47.898261 +0200][Dump ][TaskMgr ] Running task: “WaitTask for: 0x0x220b7a0”
[2018-07-05 11:25:47.898381 +0200][Dump ][PostMaster ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0 ] Sending message kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) (0x220b7a0) through substream 0 expecting answer at 0
[2018-07-05 11:25:47.898445 +0200][Debug ][TaskMgr ] Done with task: “WaitTask for: 0x0x220b7a0”
[2018-07-05 11:25:47.898526 +0200][Dump ][AsyncSock ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0 .0] Wrote a message: kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) (0x220b7a0), 1255 bytes
[2018-07-05 11:25:47.898590 +0200][Dump ][AsyncSock ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0 .0] Successfully sent message: kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) (0x220b7a0).
[2018-07-05 11:25:47.898625 +0200][Dump ][XRootD ] [s-jrciprjeos145p.cidsn.jrc.it:1095] Message kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) has been successfully sent.
[2018-07-05 11:25:47.898650 +0200][Dump ][PostMaster ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0 .0] All messages consumed, disable uplink
[2018-07-05 11:25:47.898981 +0200][Dump ][XRootDTransport ] [msg: 0xa8007bd0] Expecting 74 bytes of message body
[2018-07-05 11:25:47.899010 +0200][Dump ][AsyncSock ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0 .0] Received message header for 0xa8007bd0 size: 8
[2018-07-05 11:25:47.899031 +0200][Dump ][AsyncSock ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0 .0] Received message 0xa8007bd0 of 82 bytes
[2018-07-05 11:25:47.899060 +0200][Dump ][PostMaster ] [s-jrciprjeos145p.cidsn.jrc.it:1095 #0 ] Handling received message: 0xa8007bd0.
[2018-07-05 11:25:47.899182 +0200][Dump ][XRootD ] [s-jrciprjeos145p.cidsn.jrc.it:1095] Got kXR_wait response of 10 seconds to message kXR_open (file: /eos/jeodpp/data/SRS/Copernicus/S2/scenes/source/L1C/2017/10/14/065/S2A_MSIL1C_20171014T102021_N0205_R065_T32TMR_20171014T102235.SAFE/GRANULE/L1C_T32TMR_A012074_20171014T102235/IMG_DATA/T32TMR_20171014T102021_B08.jp2?cap.msg=Mzjar8HN1qTY6iTKcoW3AD2wXQbqDPKXwseW/D9FeM9IrVMXPOCTSxxMSQzoCCHu2RG50vAr0qSMh/y4UxZ/8BfZjR70v4dwuuY8IE/hoXKWsGsflcOt6jjiAsoF/HuQ/lbjrjIMIKdAe3KnRNEb2fnI9ABoUqHfG4hjVFOZVxPMaPt2yWEl20ot6VoCOZTh9JGUv9y4G+ssWYkxRv4btyXGC2vsmTVA820FVSw99Awcr2g11KkRtryugINXGIN82obwvAYHcBnEY2iWZhfqaC0R0CQ34ErdQKwpI51VERoyvUpS4VW9HsugPEAZGyfgqIStiFikney1aLpAxJg0a9xUBncCNpP8xzI5panbVlsaEGRqZHzIftAWTUaRMRSnh2335JVGaMCc0EWkKUQY1+TIXKk+UBhDf25+VLlIjzquI1a9Fhtb0Oyvksi6/YgcwVnxhH+4/h0GEsTlqVnUBFfROq8MhhLWk1M+c/qw2rOeohDRXdhrslaooEDG4hkNsucaRyMbQ6pW0Ja2PDWwpNCiPRmdEeG0YLRlBY+aoCIKKCFeOm1KHhc2PSxOh2c8DXQAm1QLhLbV52Lp1cy5tgqwKpsEwnojYj88QP5XE0/6euERzWP5JNLCfN5cKWdDFMYYMMEx/qtJN6krpyZz10Hyn0UlPXtSiwdGd7vzdOrIVw4RP1qB+Yj7W8If2f06nSA8RWkhzVDLRcwriY1Ylx1qB7P4b2zH9LpqcnuWD7BQ6kt0cW6UtKLDRyUHq5PJfoSzuXEnmXJitxnLaHRU4f8V79DPW3US2k/zDwkMCze1lGuFIgBsoHrgObBdFQwPOvMISOxZu0Y=&cap.sym=RR8usaHHf8g85rnqHzoBqQENPgg=&mgm.blockchecksum=ignore&mgm.id=0bcb6e61&mgm.logid=0156642c-8035-11e8-b042-6c0b84b7fa2e&mgm.replicahead=1&mgm.replicaindex=1, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ): FST still misses the required capability key; come back in 10 seconds!
[2018-07-05 11:25:47.899295 +0200][Debug ][TaskMgr ] Registering task: “WaitTask for: 0x0x220b7a0” to be run at: [2018-07-05 11:25:57 +0200]
Greetings,
Veselin
esindril
(Elvin Alin Sindrilaru)
July 5, 2018, 12:39pm
4
Hi Veselin,
Can you make sure that the MGM and the diskserver have the same sss key?
Could you please post the output of the following commands?
on the master MGM: sum /etc/eos.keytab
on the slave MGM : sum /etc/eos.keytab
on host s-jrciprjeos145p.cidsn.jrc.it: sum /etc/eos.keytab
Thanks,
Elvin
vessko
(Veselin Vasilev)
July 5, 2018, 1:12pm
5
Ciao Elvin,
We have several other users in the keytabs of the MGMs, the one that maters the daemon one is in the last row so here is the output:
Master: tail -n 1 /etc/eos.keytab | sum
29985 1
Slave: tail -n 1 /etc/eos.keytab | sum
29985 1
FST: tail -n 1 /etc/eos.keytab | sum
29985 1
And keep in mind that the read succeeds every once in a while.
Cheers,
Veselin
esindril
(Elvin Alin Sindrilaru)
July 9, 2018, 4:44pm
6
Hi Veselin,
I might have found the reason for the flipping online/offline status for the FSTs. I’m about to tag 4.2.28 and maybe you can give it a try. This should fix the issues that you’ve encountered. The new tag should be in the repos tomorrow.
Cheers,
Elvin
franck-jrc
(Franck Eyraud)
July 16, 2018, 1:26pm
7
Hi Elvin,
Thank you for your answer, Veselin has now left, but I will try to upgrade the slave at convenient moment, and let you know.
Franck