All copy commands work except “eos cp”

Hello,

I am testing EOS with a RAIN layout striping 12 chunks across 18 FSTs (with 4 parities). Copying into EOS works well with all commands: eos cp, xrdcp, and the POSIX cp command on the EOS FUSE mount. However, eos cp alone fails when copying back from EOS to local space. For example,

sh-4.2# eos cp /eos/gsdc/testrain/file2g-$(hostname -s) file2g-cp
[eoscp] file2g-jbod-mgmt-03      Total 2048.00 MB       |>...................| 0.00 % [0.0 MB/s]
error: file size difference between source and target file source=root://jbod-mgmt-01.eoscluster.sdfarm.kr:1094//eos/gsdc/testrain/file2g-jbod-mgmt-03?eos.app=eoscp [2.15 GB] target=/file2g-cp [0 B]
#WARNING [eos-cp] copied 0/1 files and 0 B in 0.72 seconds with 0 B/s

FST log says:

2020-05-20 16:01:23 200520 07:01:23 18622 FstOfs_ProcessTpcOpaque: root.2491:422@jbod-mgmt-03.eoscluster Unable to open - capability illegal /eos/gsdc/testrain/file2g-jbod-mgmt-03; key was rejected by service
2020-05-20 16:01:23 200520 07:01:23 18622 FstOfs_stat: root.2491:422@jbod-mgmt-03.eoscluster Unable to stat file /eos/gsdc/testrain/file2g-jbod-mgmt-03; no such file or directory

The file does exist under /eos/gsdc/testrain/ (listed here via the FUSE mount):

sh-4.2# ls -l /mnt/eos/gsdc/testrain/file2g-$(hostname -s)
-rw-r-----. 1 daemon daemon 2147483648 May 20 06:36 /mnt/eos/gsdc/testrain/file2g-jbod-mgmt-03

xrdcp and the POSIX cp command on the EOS FUSE mount both work:

sh-4.2# xrdcp -f root://jbod-mgmt-01.sdfarm.kr:1094//eos/gsdc/testrain/file2g-$(hostname -s) file2g-xrdcp
[2GB/2GB][100%][==================================================][682.7MB/s]

sh-4.2# cp /mnt/eos/gsdc/testrain/file2g-$(hostname -s) file2g-cp
sh-4.2# ls -ltr file2g-*
-rw-r--r--. 1 root root 2147483648 May 20 06:47 file2g-xrdcp
-rw-r-----. 1 root root 2147483648 May 20 06:48 file2g-cp
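
For completeness, the local copies could also be checked against the adler32 checksum that EOS reports (shown as XS in the fileinfo output below), e.g. with xrdadler32 from the xrootd client tools, assuming it is available in the container:

sh-4.2# xrdadler32 file2g-xrdcp
sh-4.2# xrdadler32 file2g-cp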

Here is the output of the “eos file info” command on that file:

sh-4.2# eos file info /eos/gsdc/testrain/file2g-$(hostname -s)
  File: '/eos/gsdc/testrain/file2g-jbod-mgmt-03'  Flags: 0640
  Size: 2147483648
Modify: Wed May 20 06:36:56 2020 Timestamp: 1589956616.550563000
Change: Wed May 20 06:36:49 2020 Timestamp: 1589956609.846617010
Birth : Wed May 20 06:36:49 2020 Timestamp: 1589956609.846617010
  CUid: 2 CGid: 2  Fxid: 000009d7 Fid: 2519    Pid: 43   Pxid: 0000002b
XStype: adler    XS: a7 c5 6f 2a    ETAGs: "676188913664:a7c56f2a"
Layout: qrain Stripes: 12 Blocksize: 1M LayoutId: 40640b52
  #Rep: 12
┌───┬──────┬─────────────────────────────────┬────────────────┬─────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                             host│      schedgroup│                 path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴─────────────────────────────────┴────────────────┴─────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0      907 jbod-mgmt-06.eoscluster.sdfarm.kr       default.66 /jbod/box_11_disk_066     booted             rw      nodrain   online         kisti::gsdc::g02
 1     1495 jbod-mgmt-09.eoscluster.sdfarm.kr       default.66 /jbod/box_18_disk_066     booted             rw      nodrain   online         kisti::gsdc::g03
 2      739 jbod-mgmt-05.eoscluster.sdfarm.kr       default.66 /jbod/box_09_disk_066     booted             rw      nodrain   online         kisti::gsdc::g02
 3      319 jbod-mgmt-02.eoscluster.sdfarm.kr       default.66 /jbod/box_04_disk_066     booted             rw      nodrain   online         kisti::gsdc::g01
 4     1411 jbod-mgmt-09.eoscluster.sdfarm.kr       default.66 /jbod/box_17_disk_066     booted             rw      nodrain   online         kisti::gsdc::g03
 5      823 jbod-mgmt-05.eoscluster.sdfarm.kr       default.66 /jbod/box_10_disk_066     booted             rw      nodrain   online         kisti::gsdc::g02
 6      235 jbod-mgmt-02.eoscluster.sdfarm.kr       default.66 /jbod/box_03_disk_066     booted             rw      nodrain   online         kisti::gsdc::g01
 7     1243 jbod-mgmt-08.eoscluster.sdfarm.kr       default.66 /jbod/box_15_disk_066     booted             rw      nodrain   online         kisti::gsdc::g03
 8      487 jbod-mgmt-03.eoscluster.sdfarm.kr       default.66 /jbod/box_06_disk_066     booted             rw      nodrain   online         kisti::gsdc::g01
 9      571 jbod-mgmt-04.eoscluster.sdfarm.kr       default.66 /jbod/box_07_disk_066     booted             rw      nodrain   online         kisti::gsdc::g02
 10     151 jbod-mgmt-01.eoscluster.sdfarm.kr       default.66 /jbod/box_02_disk_066     booted             ro      nodrain   online         kisti::gsdc::g01
 11    1327 jbod-mgmt-08.eoscluster.sdfarm.kr       default.66 /jbod/box_16_disk_066     booted             rw      nodrain   online         kisti::gsdc::g03

*******

The version of EOS used here is:
EOS_INSTANCE=gsdc
EOS_SERVER_VERSION=4.7.7 EOS_SERVER_RELEASE=1
EOS_CLIENT_VERSION=4.7.7 EOS_CLIENT_RELEASE=1

And the version of QuarkDB is 0.4.2.

Please just let me know if you need any further information.

Thank you.

Best regards,
Sang-Un

I have tried to look further into “eos cp” by using the “eoscp” command directly, and the output is the following:

sh-4.2# eoscp -d root://jbod-mgmt-01.eoscluster.sdfarm.kr:1094//eos/gsdc/testrain/file2g-jbod-mgmt-01 file2g-cp
[eoscp]: allocate copy buffer with 8388608 bytes
[eoscp] src<0>=root://jbod-mgmt-01.eoscluster.sdfarm.kr:1094//eos/gsdc/testrain/file2g-jbod-mgmt-01 dst<0>=file2g-cp
[eoscp] having PIO_ACCESS for source location=0 size=2147483648
[eoscp] src<0>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-09.eoscluster.sdfarm.kr:1095/]
[eoscp] src<1>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-09.eoscluster.sdfarm.kr:1096/]
[eoscp] src<2>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-06.eoscluster.sdfarm.kr:1096/]
[eoscp] src<3>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-01.eoscluster.sdfarm.kr:1096/]
[eoscp] src<4>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-04.eoscluster.sdfarm.kr:1095/]
[eoscp] src<5>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-02.eoscluster.sdfarm.kr:1096/]
[eoscp] src<6>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-08.eoscluster.sdfarm.kr:1095/]
[eoscp] src<7>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-05.eoscluster.sdfarm.kr:1096/]
[eoscp] src<8>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-01.eoscluster.sdfarm.kr:1095/]
[eoscp] src<9>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-05.eoscluster.sdfarm.kr:1095/]
[eoscp] src<10>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-03.eoscluster.sdfarm.kr:1096/]
[eoscp] src<11>=/eos/gsdc/testrain/file2g-jbod-mgmt-01 [root://jbod-mgmt-07.eoscluster.sdfarm.kr:1095/]
[eoscp]: copy protocol raid:raid:raid:raid:raid:raid:raid:raid:raid:raid:raid:raid:=>file:

[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: XROOT(RAIDIO) is transparent for staging - nothing to check
[eoscp]: doing XROOT(RAID-PIO) open with flags: 0
200526 05:12:03 time=1590469923.863694 func=OpenPio                  level=DEBUG logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:419             tident=<service> sec=      uid=0 gid=0 name= geo="" Read case
200526 05:12:03 time=1590469923.864412 func=fileOpen                 level=ERROR logid=6fe21fee-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=XrdIo:233                      tident=<service> sec=      uid=0 gid=0 name= geo="" error= "open failed url=root://jbod-mgmt-09.eoscluster.sdfarm.kr:1095//eos/gsdc/testrain/file2g-jbod-mgmt-01?cap.msg=dSmf7AbtE6fb5cRw3yl9T9wIBa0AHcT+C1X6mjsNoUEmAn/xQgLgIz50y0W9LgnRXrI3PO4CKIo2APtW4Vk1ndiu77RDZ99bAu8pvod3BdfA7UnO1xgbSB7g0JD8+LlXQhpRsdJt6TtVBQ1qsmnt8ZrccjmDrB7mD3hlOIrlGp7LLPy/rNscvadzLMOvX2M8I5You2A0FP237NT/0W7Zo+9v2PmAzdQPRqVByZBL5WEIskIpnEnpGJNqZ2ylmLbAoRSvMEkdup31ZHd/aH+LoOsK0JkLwqQGhWNxe3Xoag85DRlopARMc43/e8gPwvanFisJGpTVbnCO8GiCb7muIZlTqBI1X4S1jvYMPaCOA2p4Z7ZOt631Nc431coLrJN8IeiF+FtcZCGXTemsXseWuRMNfEpUFEIisIqHFXgiwupuOx2iU27StcbsH2ec6xRfLESlsSjRJM1GJE7wdESY5wQ5OC/RF+WIXt7RJ9vwx2MGt74/cxgjK7gkT81oBb3ajGXKhJehmEYqePeLor2h7cN6a5q9mt4YuyVAV2h9XpGmgliHo1DFK13DhtOuvplss7qXIaqNu8fLuoYz5Pu+QFRP3yuTmSLQJJrrt+WUwkdb7c0uow2RzuPuuherX3qfKHEGXJFNDpklUOSiFJPqZQwswAFTgaw+7+jP+1mJjgUq2ULqT4eP2EZzR2pKRchP5npktmVN7BGVT7ogVLTja0MzSSPMyLUWm0uA5DQp+mu5BfiaHs+ZHJrlBWDCOQoTDRcXlpHDXFl4lx07XfXTqmAYJJ7OXPLBnrJ6Ipknzz3MD5pwiRylvFHaZNukqIQP+PMjf15cgRXk3vJ+sfxIRlOgvFbn+/ZWQXhJd8q1qexatHUgXqhLoYZCiq1Af8zrn44wo2gNujIEJpjHxyIzj0Qo0rB8vhjiHS3OOF2MQqw04qhazdF1e25JsIAbndFU4JwgT6hcAMZdzUUgLXBTPJ7ikYFfQC1+2GcBNi+YmFk2HTw5GcZqSW4pkX4pDV+UeMz+f/TD1xYdjWx+VvqslyoQVb34a6bQMut2oJKFVfG10LDEKbga32nU/cC/JEwQKoyyiSNDN162Z3tiJSreiA9tSkBjynYr4mALgX0FoaJhT3y60kR5gh7FSWqk8cZI77keonQW+0G3YGbI7MM9/FA5ZkdpBUl+8UX/zExwshI0ONHKOygpLc6iFlqb2wEASBM2kP1AXBhdhjhRXZ8ECkPQHoJg3klGldlaKmIcxuZr4n/BE9YLkMrC/PHJMOU4p9kv61XhtsNe/69NLze07q5/PaSLWrjo0ld0KeYM6zqFt8nX4Fesl2dpVbDDFwVtuNkR0SnTwNyyQZKSTVCRY8nfngDMtqjKN5ByIFBQzUxAsrYjOD&cap.sym=GK+jmevjSPwa/N0MaGwsvDmrjrk=&fst.blocksize=1048576&fst.readahead=true&fst.valid=1590469982&mgm.logid=6fdeabac-9f0f-11ea-9687-b8599fa51330&mgm.replicaindex=0, errno=3005, errc=400, msg=[ERROR] Error response: Input/output error"
200526 05:12:03 time=1590469923.864442 func=OpenPio                  level=ERROR logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:436             tident=<service> sec=      uid=0 gid=0 name= geo="" failed to open remote stripes
... (truncated)
200526 05:12:03 time=1590469923.871960 func=fileOpen                 level=ERROR logid=6fe33442-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=XrdIo:233                      tident=<service> sec=      uid=0 gid=0 name= geo="" error= "open failed url=root://jbod-mgmt-07.eoscluster.sdfarm.kr:1095//eos/gsdc/testrain/file2g-jbod-mgmt-01?cap.msg=dSmf7AbtE6fb5cRw3yl9T9wIBa0AHcT+C1X6mjsNoUEmAn/xQgLgIz50y0W9LgnRXrI3PO4CKIo2APtW4Vk1ndiu77RDZ99bAu8pvod3BdfA7UnO1xgbSB7g0JD8+LlXQhpRsdJt6TtVBQ1qsmnt8ZrccjmDrB7mD3hlOIrlGp7LLPy/rNscvadzLMOvX2M8I5You2A0FP237NT/0W7Zo+9v2PmAzdQPRqVByZBL5WEIskIpnEnpGJNqZ2ylmLbAoRSvMEkdup31ZHd/aH+LoOsK0JkLwqQGhWNxe3Xoag85DRlopARMc43/e8gPwvanFisJGpTVbnCO8GiCb7muIZlTqBI1X4S1jvYMPaCOA2p4Z7ZOt631Nc431coLrJN8IeiF+FtcZCGXTemsXseWuRMNfEpUFEIisIqHFXgiwupuOx2iU27StcbsH2ec6xRfLESlsSjRJM1GJE7wdESY5wQ5OC/RF+WIXt7RJ9vwx2MGt74/cxgjK7gkT81oBb3ajGXKhJehmEYqePeLor2h7cN6a5q9mt4YuyVAV2h9XpGmgliHo1DFK13DhtOuvplss7qXIaqNu8fLuoYz5Pu+QFRP3yuTmSLQJJrrt+WUwkdb7c0uow2RzuPuuherX3qfKHEGXJFNDpklUOSiFJPqZQwswAFTgaw+7+jP+1mJjgUq2ULqT4eP2EZzR2pKRchP5npktmVN7BGVT7ogVLTja0MzSSPMyLUWm0uA5DQp+mu5BfiaHs+ZHJrlBWDCOQoTDRcXlpHDXFl4lx07XfXTqmAYJJ7OXPLBnrJ6Ipknzz3MD5pwiRylvFHaZNukqIQP+PMjf15cgRXk3vJ+sfxIRlOgvFbn+/ZWQXhJd8q1qexatHUgXqhLoYZCiq1Af8zrn44wo2gNujIEJpjHxyIzj0Qo0rB8vhjiHS3OOF2MQqw04qhazdF1e25JsIAbndFU4JwgT6hcAMZdzUUgLXBTPJ7ikYFfQC1+2GcBNi+YmFk2HTw5GcZqSW4pkX4pDV+UeMz+f/TD1xYdjWx+VvqslyoQVb34a6bQMut2oJKFVfG10LDEKbga32nU/cC/JEwQKoyyiSNDN162Z3tiJSreiA9tSkBjynYr4mALgX0FoaJhT3y60kR5gh7FSWqk8cZI77keonQW+0G3YGbI7MM9/FA5ZkdpBUl+8UX/zExwshI0ONHKOygpLc6iFlqb2wEASBM2kP1AXBhdhjhRXZ8ECkPQHoJg3klGldlaKmIcxuZr4n/BE9YLkMrC/PHJMOU4p9kv61XhtsNe/69NLze07q5/PaSLWrjo0ld0KeYM6zqFt8nX4Fesl2dpVbDDFwVtuNkR0SnTwNyyQZKSTVCRY8nfngDMtqjKN5ByIFBQzUxAsrYjOD&cap.sym=GK+jmevjSPwa/N0MaGwsvDmrjrk=&fst.blocksize=1048576&fst.readahead=true&fst.valid=1590469982&mgm.logid=6fdeabac-9f0f-11ea-9687-b8599fa51330&mgm.replicaindex=11, errno=3005, errc=400, msg=[ERROR] Error response: Input/output error"
200526 05:12:03 time=1590469923.871981 func=OpenPio                  level=ERROR logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:436             tident=<service> sec=      uid=0 gid=0 name= geo="" failed to open remote stripes
200526 05:12:03 time=1590469923.872023 func=ValidateHeader           level=DEBUG logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:524             tident=<service> sec=      uid=0 gid=0 name= geo="" file is either new or there are no corruptions.
200526 05:12:03 time=1590469923.872033 func=OpenPio                  level=DEBUG logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:484             tident=<service> sec=      uid=0 gid=0 name= geo="" Finished open with size: 0.
[eoscp]: doing POSIX open to write  file2g-cp
200526 05:12:03 time=1590469923.872126 func=Read                     level=DEBUG logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:590             tident=<service> sec=      uid=0 gid=0 name= geo="" offset=0, length=4194304
200526 05:12:03 time=1590469923.872133 func=Read                     level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:615             tident=<service> sec=      uid=0 gid=0 name= geo="" read to big, resizing the read length
[eoscp]                          Total 2048.00 MB       |>...................| 0.00 % [0.0 MB/s]
[eoscp] #################################################################
[eoscp] # Date                     : ( 1590469923 ) Tue May 26 05:12:03 2020
[eoscp] # auth forced=<none> krb5=<none> gsi=<none>
[eoscp] # Source Name [00]         : root://jbod-mgmt-09.eoscluster.sdfarm.kr:1095//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [01]         : root://jbod-mgmt-09.eoscluster.sdfarm.kr:1096//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [02]         : root://jbod-mgmt-06.eoscluster.sdfarm.kr:1096//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [03]         : root://jbod-mgmt-01.eoscluster.sdfarm.kr:1096//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [04]         : root://jbod-mgmt-04.eoscluster.sdfarm.kr:1095//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [05]         : root://jbod-mgmt-02.eoscluster.sdfarm.kr:1096//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [06]         : root://jbod-mgmt-08.eoscluster.sdfarm.kr:1095//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [07]         : root://jbod-mgmt-05.eoscluster.sdfarm.kr:1096//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [08]         : root://jbod-mgmt-01.eoscluster.sdfarm.kr:1095//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [09]         : root://jbod-mgmt-05.eoscluster.sdfarm.kr:1095//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [10]         : root://jbod-mgmt-03.eoscluster.sdfarm.kr:1096//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Source Name [11]         : root://jbod-mgmt-07.eoscluster.sdfarm.kr:1095//eos/gsdc/testrain/file2g-jbod-mgmt-01
[eoscp] # Destination Name [00]    : file2g-cp
[eoscp] # Data Copied [bytes]      : 0
[eoscp] # Realtime [s]             : 0.029000
[eoscp] # Eff.Copy. Rate[MB/s]     : 0.000000
[eoscp] # Write Start Position     : 0
[eoscp] # Write Stop  Position     : 0
200526 05:12:03 time=1590469923.872244 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872250 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872265 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872270 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872273 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872277 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872281 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872284 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872288 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872292 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872296 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1473            tident=<service> sec=      uid=0 gid=0 name= geo="" remote stripe could not be closed as the file is NULL
200526 05:12:03 time=1590469923.872301 func=Close                    level=WARN  logid=6fe215da-9f0f-11ea-8466-b8599fa51330 unit=none tid=00007f3f8bd8c8c0 source=RaidMetaLayout:1485            tident=<service> sec=      uid=0 gid=0 name= geo="" local stripe could not be closed as the file is NULL
[eoscp] # Total read wait time is: 0.000000 milliseconds.
[eoscp] # Total write wait time is: 0.000000 milliseconds.

Xrdcp works as follows:

sh-4.2# xrdcp -f root://jbod-mgmt-01.sdfarm.kr:1094//eos/gsdc/testrain/file2g-$(hostname -s) file2g-xrdcp
[2GB/2GB][100%][==================================================][341.3MB/s]

Why do eoscp and eos cp fail to read the stripes while xrdcp can? How can I dig further into the issue? Where should I look?

Best regards,
Sang-Un

Hi Sang-Un,

I can’t seem to reproduce your issue. Can you make sure that the destination of your “eoscp” command is on the local disk and not again on the xrootd server? By this I mean, try copying to /tmp/test_file.dat for example.

From the first error that you posted, namely “Unable to open - capability illegal”, it seems like there is some mismatch in the sss keys between the MGM and the FSTs. But I don’t see how this would work with xrdcp, unless recovery happens on the fly.

To help better understand what is happening, please run eoscp with the following env variable and post the result: XRD_LOGLEVEL=Dump
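
For example (assuming the dump goes to stderr, so that it can be redirected to a file):

XRD_LOGLEVEL=Dump eoscp -d root://jbod-mgmt-01.eoscluster.sdfarm.kr:1094//eos/gsdc/testrain/file2g-jbod-mgmt-03 /tmp/test_file.dat 2> eoscp_dump.log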

Also post any error messages you might be able to collect from the FSTs during such a failed transfer.

Thanks,
Elvin

Hi Elvin,

I found a difference between xrdcp and eoscp (or eos cp) in the command dumps.

Here is a snippet of eoscp with XRD_LOGLEVEL=Dump:

[2020-06-24 23:44:22.275837 +0000][Debug  ][XRootDTransport   ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Authentication is required: &P=unix&P=sss,0.13:/etc/eos.keytab
[2020-06-24 23:44:22.275853 +0000][Debug  ][XRootDTransport   ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Sending authentication data
[2020-06-24 23:44:22.275873 +0000][Debug  ][XRootDTransport   ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Trying to authenticate using unix
[2020-06-24 23:44:22.276010 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Wrote a message:  (0xd00057f0), 39 bytes
[2020-06-24 23:44:22.276057 +0000][Dump   ][XRootDTransport   ] [msg: 0xd00057f0] Expecting 0 bytes of message body
[2020-06-24 23:44:22.276077 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Received message header, size: 8
[2020-06-24 23:44:22.276092 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Received a message of 8 bytes
[2020-06-24 23:44:22.276103 +0000][Debug  ][XRootDTransport   ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Authenticated with unix.
[2020-06-24 23:44:22.276119 +0000][Debug  ][PostMaster        ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0] Stream 0 connected.
[2020-06-24 23:44:22.276147 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Wrote a message: kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-01, flags: none) (0x2f23e70), 61 bytes
[2020-06-24 23:44:22.276164 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Successfully sent message: kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-01, flags: none) (0x2f23e70).
[2020-06-24 23:44:22.276176 +0000][Dump   ][XRootD            ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096] Message kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-01, flags: none) has been successfully sent.
[2020-06-24 23:44:22.276202 +0000][Debug  ][ExDbgMsg          ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096] Moving MsgHandler: 0x2f2a9e0 (message: kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-01, flags: none) ) from out-queu to in-queue.
[2020-06-24 23:44:22.276215 +0000][Dump   ][PostMaster        ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] All messages consumed, disable uplink
[2020-06-24 23:44:22.276237 +0000][Dump   ][XRootDTransport   ] [msg: 0xd0005420] Expecting 89 bytes of message body
[2020-06-24 23:44:22.276248 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Received message header for 0xd0005420 size: 8
[2020-06-24 23:44:22.276293 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0.0] Received message 0xd0005420 of 97 bytes
[2020-06-24 23:44:22.276311 +0000][Dump   ][PostMaster        ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096 #0] Handling received message: 0xd0005420.
[2020-06-24 23:44:22.276355 +0000][Dump   ][XRootD            ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096] Got a kXR_error response to request kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-01, flags: none) [3011] Unable to stat file /eos/gsdc/rain12/file-1g-jbod-mgmt-01; no such file or directory
[2020-06-24 23:44:22.276373 +0000][Debug  ][XRootD            ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096] Handling error while processing kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-01, flags: none): [ERROR] Error response: No such file or directory.
[2020-06-24 23:44:22.276381 +0000][Debug  ][ExDbgMsg          ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096] Calling MsgHandler: 0x2f2a9e0 (message: kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-01, flags: none) ) with status: [ERROR] Error response: No such file or directory.
[2020-06-24 23:44:22.276391 +0000][Debug  ][ExDbgMsg          ] [jbod-mgmt-08.eoscluster.sdfarm.kr:1096] Destroying MsgHandler: 0x2f2a9e0.

It tries to authenticate with unix instead of sss, while xrdcp authenticates with sss, as follows:

[2020-06-25 00:00:52.818169 +0000][Debug  ][XRootDTransport   ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Authentication is required: &P=sss,0.13:/etc/eos.keytab&P=unix
[2020-06-25 00:00:52.818181 +0000][Debug  ][XRootDTransport   ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Sending authentication data
[2020-06-25 00:00:52.819373 +0000][Debug  ][XRootDTransport   ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Trying to authenticate using sss
[2020-06-25 00:00:52.819466 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Wrote a message:  (0xf0003630), 175 bytes
[2020-06-25 00:00:52.819542 +0000][Dump   ][XRootDTransport   ] [msg: 0xf0000950] Expecting 0 bytes of message body
[2020-06-25 00:00:52.819554 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Received message header, size: 8
[2020-06-25 00:00:52.819580 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Received a message of 8 bytes
[2020-06-25 00:00:52.819609 +0000][Debug  ][XRootDTransport   ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Authenticated with sss.
[2020-06-25 00:00:52.819624 +0000][Debug  ][PostMaster        ] [jbod-mgmt-01.sdfarm.kr:1094 #0] Stream 0 connected.
[2020-06-25 00:00:52.819657 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Wrote a message: kXR_open (file: /eos/gsdc/rain12/file-1g-jbod-mgmt-01?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) (0x1656db0), 62 bytes
[2020-06-25 00:00:52.819674 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Successfully sent message: kXR_open (file: /eos/gsdc/rain12/file-1g-jbod-mgmt-01?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) (0x1656db0).
[2020-06-25 00:00:52.819693 +0000][Dump   ][XRootD            ] [jbod-mgmt-01.sdfarm.kr:1094] Message kXR_open (file: /eos/gsdc/rain12/file-1g-jbod-mgmt-01?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) has been successfully sent.
[2020-06-25 00:00:52.819707 +0000][Debug  ][ExDbgMsg          ] [jbod-mgmt-01.sdfarm.kr:1094] Moving MsgHandler: 0x165cb80 (message: kXR_open (file: /eos/gsdc/rain12/file-1g-jbod-mgmt-01?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) ) from out-queu to in-queue.
[2020-06-25 00:00:52.819739 +0000][Dump   ][PostMaster        ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] All messages consumed, disable uplink
[2020-06-25 00:00:52.820653 +0000][Dump   ][XRootDTransport   ] [msg: 0xf00008c0] Expecting 1797 bytes of message body
[2020-06-25 00:00:52.820671 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Received message header for 0xf00008c0 size: 8
[2020-06-25 00:00:52.820692 +0000][Dump   ][AsyncSock         ] [jbod-mgmt-01.sdfarm.kr:1094 #0.0] Received message 0xf00008c0 of 1805 bytes
[2020-06-25 00:00:52.820708 +0000][Dump   ][PostMaster        ] [jbod-mgmt-01.sdfarm.kr:1094 #0] Handling received message: 0xf00008c0.
[2020-06-25 00:00:52.820759 +0000][Dump   ][XRootD            ] [jbod-mgmt-01.sdfarm.kr:1094] Got kXR_redirect response to message kXR_open (file: /eos/gsdc/rain12/file-1g-jbod-mgmt-01?, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ): jbod-mgmt-09.eoscluster.sdfarm.kr?&cap.sym=vYBdWW/5V3HpmQYotPXG7yT3enk=&cap.msg=AIr/NI3WAsTgDFcjCb0nz+fUb8eGdC8TBxJfNPi2DjtcsexfgftFobw3NDe5sDcuP32FIUPbk/M3VkZXstaai2/xOAXbbHxBwLnH69/eLnE73vqqo2y2aLvWMVEg4heto0N1e2sQ/isPZICs0qzsl1SSq1db29g48EMMLfcASFrn/yr/n/eRIQbzlF+0DfqftBDayN98F40DOxQuE+tiPkqaZeTZNfv62hp5gTHxjly4Dn4L1adxOn5NVrial/e6xr2Y/7HOCkGdxA1BXjQgjN1cDafujPLRJhs2DEbhuyljRbCPmgaX1hZr720sHU986BQ97oNNXEVp4AV1sCXFcBQo5EZ7USD/wMuONNDtJXMHD6yPWG93/Bl8zz9cAaoA6I1YYNKj6mR9bJ3T2A0JxVT4Nx7XGBaoD3i3E4mVlmByEbyUeWD6+C8UcIGMqZTB2LgbYrn0XN9BooOG49lc9Rs4GtNutWY2gOr73FUsr5yKB4DKPTf0h4ZDBbnZ66l0+Wa0ZWMow4r3d9XbUJl4keyx6v7eXcUryoey8ibU5OpNVrbTWmI7lJl8rKQxvfdBsPV8SDwDftw4hfyn/CVcUJ5oVdKiWb25GuYko37+Wh+CVH6f1AA+ERCeJJiQI8rhLdXgNfarSJPKjywFWWVNGfZkurmwmLl7/xnSAXpfAYpzGuJi225bGqNkqT+NATlTdbxfmM8StxpQ1Jcog+pRDPbrDO0WB2NOAt+MTXiX1Oz8dKMFYUMfjQINBrDlecpLzVFrDUNY8n00C9WYHO6WtKB9c3VNTrIFbuD++JXt1mjhA+ydtqZYYl6+nuOEiEKsJoio5Ef7uUcBbOF8g7NhwzaHUh6Zj9BVgwaEDycvRBmEvNhzNutPEC2w4AsMS5IxM2RHy+sP0EjIgmYCCwE0xtiosMfnpEo9jH2nVK8AdrlpcFyJ0I3oqoy0ToCSh44RaWA5dJ5LMUqeMHu7TBXMECyrwEV1gX1Ldr0RIZkZFK998N5VsNh1ET9cFko5JTAP7TEOwbta61r82jEr1delPEXclC1PmGREpfZnTiQ0e6G8H/MlbRD/Jaq56+XopUyBwV0u6uiBftOKZnwG8KjuV7PvAJU6+HBWm3WzsyZh06Mex1sOzHdWDdYUFGgbfYO0Frt55DiUgzOZd+BhqBbfM524dhHcOhCvsdEBXlHDtAms1QeLfaXMx73BEaoLb1uhLOjpxnm6Gr+wJySPYQyxYWB/jollVDBZBeCjcXDCpI4vf1qGmGEQVsqJjy02e+mtEPM2Qkk1LVrgIJyWmlDKrKetZAvGlrcySFCPx9XZgYdbooV/pBoZA51ffWpitgoRfoPLmM1oZWmRNTmtx5H3HU41zrg4r9IQ0FETFY7YJUtDu5/WldON7NtiwGKQ2wRxrCSxHvSCCIyublE2G3DWlD+MX11SxUYFeB1WO/zjcOJZ9kc5YfBE+KdW3CPCra+lu9dbEdIjrYnD/ajSTWS9IcjXdjbPveMhjMejsIdhvCPh81Vvqw8eL1qArFGxKuzY5HjVgMCu80sf7LrxSyPisFMAYKALdtgSe+KVCAX2RQ76s0tXuor5kg==&mgm.logid=ef7766aa-b676-11ea-9711-b8599fa51330&mgm.replicaindex=1&mgm.replicahead=1&mgm.id=00000039&mgm.mtime=1593042252, port 1096

It seems to me that the order of the authentication attempts is different:

For eoscp: Authentication is required: &P=unix&P=sss,0.13:/etc/eos.keytab
For xrdcp: Authentication is required: &P=sss,0.13:/etc/eos.keytab&P=unix

For eoscp, the client authenticates with unix and then fails to stat the target file because, I think, it is not able to access the /eos space.

I wonder why they try the protocols in a different order and what the root cause of this would be. Would it be possible to enforce sss authentication for eoscp?
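
For instance, would forcing the protocol on the client side along these lines be the right approach? This is just a guess based on the generic XrdCl environment variables; I have not verified that eoscp honours them:

XrdSecPROTOCOL=sss XrdSecSSSKT=/etc/eos.keytab eoscp -d root://jbod-mgmt-01.eoscluster.sdfarm.kr:1094//eos/gsdc/testrain/file2g-jbod-mgmt-01 /tmp/test_file.dat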

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

The service running on jbod-mgmt-01.sdfarm.kr:1094 is an MGM, right? Could you post the sec.protbind configuration that you use there?

Also, the service running on jbod-mgmt-08.eoscluster.sdfarm.kr:1096 is an FST? Can you post its config as well?

Can you try re-running the test from a different machine which is running neither an MGM nor an FST?

Maybe post again the full log of the transfers for each case, so that we see exactly where the client goes and what authentication protocol it uses. Normally the client should always use unix authentication when connecting to the FSTs. But in this case I don’t think the authentication is the problem, but rather the fact that the stat failed …

Also post the output of the eos fileinfo for the file you are trying to copy.

Thanks,
Elvin

Hi Elvin,

Thanks.

The sec.protbind configuration on the MGM is the following:

sec.protbind localhost.localdomain sss unix
sec.protbind localhost sss unix
sec.protbind *.eoscluster.sdfarm.kr sss unix
sec.protbind * only krb5 sss unix

And the related configuration on the FST is the following:

sec.protbind * only unix sss
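
Note that the order differs between the two: the MGM lists sss before unix, while the FST lists unix first, which matches the two “Authentication is required” lines above. If the client honours that order, having the FSTs offer sss first would presumably just be a matter of swapping the binding (a sketch only, not applied here):

sec.protbind * only sss unix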

The full log of the transfer with XRD_LOGLEVEL=Dump can be found here: eoscpdump.log (just let me know if it is not accessible).

I have tried several times on different client containers (running on different physical nodes).

And here is the output of eos file info /eos/gsdc/rain12/file-1g-$(hostname -s)

sh-4.2# eos file info /eos/gsdc/rain12/file-1g-$(hostname -s)
  File: '/eos/gsdc/rain12/file-1g-jbod-mgmt-08'  Flags: 0640
  Size: 1073741824
Modify: Thu Jun 25 00:55:23 2020 Timestamp: 1593046523.581740000
Change: Thu Jun 25 00:55:18 2020 Timestamp: 1593046518.147824455
Birth : Thu Jun 25 00:55:18 2020 Timestamp: 1593046518.147824455
  CUid: 2 CGid: 2  Fxid: 0000004d Fid: 77    Pid: 34   Pxid: 00000022
XStype: adler    XS: 2f 66 2f 03    ETAGs: "20669530112:2f662f03"
Layout: qrain Stripes: 12 Blocksize: 1M LayoutId: 40640b52
  #Rep: 12
┌───┬──────┬─────────────────────────────────┬────────────────┬─────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                             host│      schedgroup│                 path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴─────────────────────────────────┴────────────────┴─────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0      442 jbod-mgmt-03.eoscluster.sdfarm.kr       default.21 /jbod/box_06_disk_021     booted             rw      nodrain   online         kisti::gsdc::g01
 1     1366 jbod-mgmt-09.eoscluster.sdfarm.kr       default.21 /jbod/box_17_disk_021     booted             rw      nodrain   online         kisti::gsdc::g03
 2      694 jbod-mgmt-05.eoscluster.sdfarm.kr       default.21 /jbod/box_09_disk_021     booted             rw      nodrain   online         kisti::gsdc::g02
 3     1030 jbod-mgmt-07.eoscluster.sdfarm.kr       default.21 /jbod/box_13_disk_021     booted             rw      nodrain   online         kisti::gsdc::g03
 4      274 jbod-mgmt-02.eoscluster.sdfarm.kr       default.21 /jbod/box_04_disk_021     booted             rw      nodrain   online         kisti::gsdc::g01
 5      778 jbod-mgmt-05.eoscluster.sdfarm.kr       default.21 /jbod/box_10_disk_021     booted             rw      nodrain   online         kisti::gsdc::g02
 6     1198 jbod-mgmt-08.eoscluster.sdfarm.kr       default.21 /jbod/box_15_disk_021     booted             rw      nodrain   online         kisti::gsdc::g03
 7      358 jbod-mgmt-03.eoscluster.sdfarm.kr       default.21 /jbod/box_05_disk_021     booted             rw      nodrain   online         kisti::gsdc::g01
 8      946 jbod-mgmt-06.eoscluster.sdfarm.kr       default.21 /jbod/box_12_disk_021     booted             rw      nodrain   online         kisti::gsdc::g02
 9      610 jbod-mgmt-04.eoscluster.sdfarm.kr       default.21 /jbod/box_08_disk_021     booted             rw      nodrain   online         kisti::gsdc::g02
 10    1114 jbod-mgmt-07.eoscluster.sdfarm.kr       default.21 /jbod/box_14_disk_021     booted             rw      nodrain   online         kisti::gsdc::g03
 11      22 jbod-mgmt-01.eoscluster.sdfarm.kr       default.21 /jbod/box_01_disk_021     booted             rw      nodrain   online         kisti::gsdc::g01

*******

Best regards,
Sang-Un

Hi Sang-Un,

I’m a bit puzzled by your setup. Can you tell me what the <hostname>:<port> values are for the MGM, MQ and the FSTs in this setup? Also post the output of eos node ls and the output of the following commands:

sum /etc/eos.keytab on machine jbod-mgmt-01.eoscluster.sdfarm.kr
sum /etc/eos.keytab on machine jbod-mgmt-09.eoscluster.sdfarm.kr

Can you also try doing the same experiment with a raiddp file? That is, the layout should be raiddp with 6 stripes, not qrain.
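
For example, something like:

eos attr -r set default=raiddp /eos/gsdc/raiddp

This should force a raiddp layout with 6 stripes on files created under that directory.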

Thanks,
Elvin

Hi Elvin,

We are running the MGM, QDB, MQ and FSTs in containers, and they rely on host networking (sharing the same IP addresses) because we are short of IPv4 address space.

To give some details, below is the current setup:
jbod-mgmt-01.eoscluster.sdfarm.kr: mgm-01, krb, fst-0001, fst-0002
jbod-mgmt-02.eoscluster.sdfarm.kr: qdb-01, fst-0003, fst-0004
jbod-mgmt-03.eoscluster.sdfarm.kr: fst-0005, fst-0006
jbod-mgmt-04.eoscluster.sdfarm.kr: mgm-02, fst-0007, fst-0008
jbod-mgmt-05.eoscluster.sdfarm.kr: qdb-02, fst-0009, fst-0010
jbod-mgmt-06.eoscluster.sdfarm.kr: fst-0011, fst-0012
jbod-mgmt-07.eoscluster.sdfarm.kr: mgm-03, fst-0013, fst-0014
jbod-mgmt-08.eoscluster.sdfarm.kr: qdb-03, fst-0015, fst-0016
jbod-mgmt-09.eoscluster.sdfarm.kr: fst-0017, fst-0018

Each FST has a different port, e.g. mgm-01:1094, fst-0001:1095, fst-0002:1096.

The output of node ls is:

EOS Console [root://localhost] |/eos/gsdc/rain12/> node ls
┌──────────┬──────────────────────────────────────┬────────────────┬──────────┬────────────┬──────┬──────────┬────────┬────────┬────────────────┬─────┐
│type      │                              hostport│          geotag│    status│   activated│  txgw│ gw-queued│  gw-ntx│ gw-rate│  heartbeatdelta│ nofs│
└──────────┴──────────────────────────────────────┴────────────────┴──────────┴────────────┴──────┴──────────┴────────┴────────┴────────────────┴─────┘
 nodesview  jbod-mgmt-01.eoscluster.sdfarm.kr:1095 kisti::gsdc::g01     online           on    off          0       10      120                2    84
 nodesview  jbod-mgmt-01.eoscluster.sdfarm.kr:1096 kisti::gsdc::g01     online           on    off          0       10      120                2    84
 nodesview  jbod-mgmt-02.eoscluster.sdfarm.kr:1095 kisti::gsdc::g01     online           on    off          0       10      120                1    84
 nodesview  jbod-mgmt-02.eoscluster.sdfarm.kr:1096 kisti::gsdc::g01     online           on    off          0       10      120                1    84
 nodesview  jbod-mgmt-03.eoscluster.sdfarm.kr:1095 kisti::gsdc::g01     online           on    off          0       10      120                2    84
 nodesview  jbod-mgmt-03.eoscluster.sdfarm.kr:1096 kisti::gsdc::g01     online           on    off          0       10      120                2    84
 nodesview  jbod-mgmt-04.eoscluster.sdfarm.kr:1095 kisti::gsdc::g02     online           on    off          0       10      120                1    84
 nodesview  jbod-mgmt-04.eoscluster.sdfarm.kr:1096 kisti::gsdc::g02     online           on    off          0       10      120                2    84
 nodesview  jbod-mgmt-05.eoscluster.sdfarm.kr:1095              ???    unknown          off    off          0       10      120                ~    84
 nodesview  jbod-mgmt-05.eoscluster.sdfarm.kr:1096              ???    unknown          off    off          0       10      120                ~    84
 nodesview  jbod-mgmt-06.eoscluster.sdfarm.kr:1095 kisti::gsdc::g02     online           on    off          0       10      120                1    84
 nodesview  jbod-mgmt-06.eoscluster.sdfarm.kr:1096 kisti::gsdc::g02     online           on    off          0       10      120                2    84
 nodesview  jbod-mgmt-07.eoscluster.sdfarm.kr:1095 kisti::gsdc::g03     online           on    off          0       10      120                2    84
 nodesview  jbod-mgmt-07.eoscluster.sdfarm.kr:1096 kisti::gsdc::g03     online           on    off          0       10      120                2    84
 nodesview  jbod-mgmt-08.eoscluster.sdfarm.kr:1095 kisti::gsdc::g03     online           on    off          0       10      120                1    84
 nodesview  jbod-mgmt-08.eoscluster.sdfarm.kr:1096 kisti::gsdc::g03     online           on    off          0       10      120                2    84
 nodesview  jbod-mgmt-09.eoscluster.sdfarm.kr:1095 kisti::gsdc::g03     online           on    off          0       10      120                1    84
 nodesview  jbod-mgmt-09.eoscluster.sdfarm.kr:1096 kisti::gsdc::g03     online           on    off          0       10      120                1    84

For now, jbod-mgmt-05 is down for an unknown reason. There was a firmware update for the SAS HBA cards a few days ago, and since then the machine has been stuck in a reboot loop. It is under investigation.

Anyway, I will try the other replication modes: raiddp as you suggested, and raid6 and others as well.

Thanks.

Best regards,
Sang-Un

Hi Elvin,

Just a quick update.

eos cp works with raiddp and raid6 (while jbod-mgmt-05.sdfarm.kr is down), for both copy in and copy out.

The configurations for raiddp and raid6 are the following:

EOS Console [root://localhost] |/eos/gsdc/> attr -r set default=raiddp /eos/gsdc/raiddp
EOS Console [root://localhost] |/eos/gsdc/> attr ls raiddp
sys.eos.btime="1593411145.825163375"
sys.forced.blockchecksum="crc32c"
sys.forced.blocksize="1M"
sys.forced.checksum="adler"
sys.forced.layout="raiddp"
sys.forced.nstripes="6"
sys.forced.space="default"

EOS Console [root://localhost] |/eos/gsdc/> attr -r set default=raid6 /eos/gsdc/raid6
EOS Console [root://localhost] |/eos/gsdc/> attr ls raid6
sys.eos.btime="1593411209.921572416"
sys.forced.blockchecksum="crc32c"
sys.forced.blocksize="1M"
sys.forced.checksum="adler"
sys.forced.layout="raid6"
sys.forced.nstripes="6"
sys.forced.space="default"

However, it sometimes hangs. I tried to catch those hangs together with their full logs, as follows:

Success log: eoscpdump_success.log
Hang log 1: eoscpdump_hang.log
Hang log 2: eoscpdump_hang2.log

These are cases of copying a file into raiddp. Please note that one of the nodes (jbod-mgmt-05) is down.

I have also tried to copy a file into the archive layout (3 parities), but it always hangs… here is the configuration:

EOS Console [root://localhost] |/eos/gsdc/> attr -r set default=archive /eos/gsdc/archive
EOS Console [root://localhost] |/eos/gsdc/> attr ls archive
sys.eos.btime="1593411399.17132519"
sys.forced.blockchecksum="crc32c"
sys.forced.blocksize="1M"
sys.forced.checksum="adler"
sys.forced.layout="archive"
sys.forced.nstripes="8"
sys.forced.space="default"

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

From the logs you posted it seems there was a networking issue with the following machine:
jbod-mgmt-02.eoscluster.sdfarm.kr:1095
as the client did not get a reply to the open request within the default 60-second timeout.
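
To rule out a merely slow reply, you could presumably also raise the client-side timeouts when re-running the transfer, e.g. with the standard XrdCl environment variables (the values here are just examples):

XRD_STREAMTIMEOUT=300 XRD_REQUESTTIMEOUT=300 eoscp -d <source-url> /tmp/test_file.dat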

Can you also provide the following info related to the initial errors that you saw:
sum /etc/eos.keytab on machine jbod-mgmt-01.eoscluster.sdfarm.kr
sum /etc/eos.keytab on machine jbod-mgmt-09.eoscluster.sdfarm.kr

Thanks,
Elvin

Hi Elvin,

The output of the command sum /etc/eos.keytab on those machines is:

<jbod-mgmt-01.eoscluster.sdfarm.kr>
sh-4.2# sum /etc/eos.keytab
49266     1

<jbod-mgmt-09.eoscluster.sdfarm.kr>
sh-4.2# sum /etc/eos.keytab
49266     1

The output of the command is the same on all the other machines (all of them containers). I will keep trying to dig into the network issues.

By the way, I have tried to catch the FST logs when a transfer fails. This time the copy command XRD_LOGLEVEL=Dump eos cp file-1g /eos/gsdc/raiddp/file-1g-jbod-mgmt-08 was run on jbod-mgmt-08; it was redirected at first to jbod-mgmt-09 (and later to jbod-mgmt-04, but nothing happened and it timed out after about 20 minutes).

Redirection log in MGM (jbod-mgmt-01):

200630 01:39:08 078 XrootdXeq: root.2975:438@jbod-mgmt-08.eoscluster pub IPv4 login as daemon
200630 01:39:08 time=1593481148.029016 func=IdMap                    level=INFO  logid=static.............................. unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=Mapping:1033                   tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=sss sec.name="daemon" sec.host="jbod-mgmt-08.eoscluster.sdfarm.kr" sec.vorg="" sec.grps="daemon" sec.role="" sec.info="" sec.app="eoscp" sec.tident="root.2975:438@jbod-mgmt-08.eoscluster" vid.uid=2 vid.gid=2
200630 01:39:08 time=1593481148.029069 func=open                     level=INFO  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=XrdMgmOfsFile:455              tident=root.2975:438@jbod-mgmt-08.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" op=write trunc=512 path=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08 info=eos.app=eoscp&eos.bookingsize=1073741824&eos.targetsize=1073741824&fst.valid=1593481207
200630 01:39:08 time=1593481148.029185 func=open                     level=INFO  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=XrdMgmOfsFile:566              tident=root.2975:438@jbod-mgmt-08.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" msg="rewrote symlinks" sym-path=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08 realpath=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08
200630 01:39:08 time=1593481148.029362 func=open                     level=INFO  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=XrdMgmOfsFile:945              tident=root.2975:438@jbod-mgmt-08.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" acl=0 r=0 w=0 wo=0 egroup=0 shared=0 mutable=1
200630 01:39:08 time=1593481148.029390 func=_rem                     level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=Rm:105                         tident=<single-exec> sec=sss   uid=2 gid=2 name=daemon geo="" path=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08 vid.uid=2 vid.gid=2
200630 01:39:08 time=1593481148.029613 func=_rem                     level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=Rm:275                         tident=<single-exec> sec=sss   uid=2 gid=2 name=daemon geo="" unlinking from view /eos/gsdc/raiddp/file-1g-jbod-mgmt-08
200630 01:39:08 time=1593481148.029863 func=BroadcastDeletionFromExternal level=INFO  logid=static.............................. unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=Caps:314                       tident= sec=(null) uid=99 gid=99 name=- geo="" id=3e name=file-1g-jbod-mgmt-08
200630 01:39:08 time=1593481148.029900 func=PurgeVersion             level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=Version:170                    tident=<single-exec> sec=      uid=0 gid=0 name= geo="" version-dir=/eos/gsdc/raiddp/.sys.v#.file-1g-jbod-mgmt-08/ max-versions=0
200630 01:39:08 time=1593481148.030075 func=_rem                     level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=Rm:429                         tident=<single-exec> sec=sss   uid=2 gid=2 name=daemon geo="" msg="deleted" can-recycle=0 path=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08 owner.uid=2 owner.gid=2 vid.uid=2 vid.gid=2
200630 01:39:08 time=1593481148.030358 func=open                     level=INFO  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=XrdMgmOfsFile:1411             tident=root.2975:438@jbod-mgmt-08.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" blocksize=1048576 lid=20640532
200630 01:39:08 time=1593481148.030801 func=open                     level=INFO  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=XrdMgmOfsFile:2749             tident=root.2975:438@jbod-mgmt-08.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" op=write path=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08 info=eos.app=eoscp&eos.bookingsize=1073741824&eos.targetsize=1073741824&fst.valid=1593481207 target[0]=(jbod-mgmt-09.eoscluster.sdfarm.kr,1458) target[1]=(jbod-mgmt-03.eoscluster.sdfarm.kr,450) target[2]=(jbod-mgmt-04.eoscluster.sdfarm.kr,618) target[3]=(jbod-mgmt-04.eoscluster.sdfarm.kr,534) target[4]=(jbod-mgmt-01.eoscluster.sdfarm.kr,30) target[5]=(jbod-mgmt-08.eoscluster.sdfarm.kr,1206)  redirection=jbod-mgmt-09.eoscluster.sdfarm.kr?&cap.sym=<...>&cap.msg=<...>&mgm.logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=0000016e:1096
200630 01:39:08 time=1593481148.030822 func=open                     level=INFO  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97018700 source=XrdMgmOfsFile:2757             tident=root.2975:438@jbod-mgmt-08.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" info="redirection" hostport=jbod-mgmt-09.eoscluster.sdfarm.kr?&cap.sym=<...>&cap.msg=<...>&mgm.logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=0000016e:1096

FST Log in jbod-mgmt-09:1096

200630 01:43:58 time=1593481438.128257 func=Open                     level=WARN  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1096 tid=00007f2ec162a700 source=RaidMetaLayout:311             tident=root.2975:441@jbod-mgmt-08.eoscluster sec=unix  uid=0 gid=0 name=root geo="" msg="open failed on remote stripe: root://jbod-mgmt-04.eoscluster.sdfarm.kr:1096///eos/gsdc/raiddp/file-1g-jbod-mgmt-08?&cap.msg=AIr/NI3WAsSA0m0dLyyCE2x2R1wWxofwzLCHblwSA8/HradzlphOCxupkLWUrgmqj1h5P/3+HeRZSUCzd13ZLp7LvPj29Sanl7kGJyzpHjKtRMIqJ8y2Y9vHQAWwuthsjynbl2hrUdl8Ogb7cDWIi+RYidOSuSrsgz35IQU087/kf1Y1hDpbXiUkToFcnDTNIe9cz+qD6el+MYlKyE6hV4efOxwkEbFApTsd4bQR2y9oCA+E3Zn/A8VjQiRlUuJHgbxS67oxZ2/ieT4RqPJiCX6DLHiVH47QFSd4FzrdVLMDoNvposLziX6b91fVgx5I0x2a2XhOHvhvaROL6QEPxlMmyOUvdI3+G38iHn3k1UuP8VrIJ4q/Gq+zyxYw28EJ00rQPOXDztSGRiek7Sn4LDrJf6pc6ZP3lgyKaIr2a7ibS4a8/6R/0MYn2PoYQ6jatvc05JmJkiJ9Pvk8r3tMZyfb5yxO5PbkQBffqIFOQ0EBLZCSENKWNezYhgTgqt9PXymxsf0ZdbbMkWuX21wZ0us5aSBRY6NVdIF8H9WMX7KC8pUIDVPR5xvufRPmbcpPwNy8Ds++FybSRCUYu7y92dOQ+a6XYbewlPPJ49eX8KpcKeW6sL4ovWEffPZaoE1kMkVFNHyyTKkt9ti6c7zwxvAjMXQ8qvJpDLP/Ce5TYikVWPP4Q1e8MgJtOnOE3PvxxmjaKFA08QZkb6Y0PpW1uVTxUzm7ViMGlDJEgHztScttdu9fiwpQpmz2yxPAA2SDujjlAXeiwEkH/SxWsejyF8yVaNDo22FhUV8px63aLWepEbAPzh0IRM9djobuwHs2e3BtOfWqMfXtlPkWEdDmmsE2rlpjSL2ffZFlsz0+zDadSfec5apSXd3FkaJUvaGrVXB4BX17QFmO/TVCAeuSsrzGCWbXvdtcs3AQQAj3+p4B8pKbtWMWCo6eWBTHZmKJzss20eDeoFNPIlpZgqkX2SoIAIP8x5DXAhXx47g/jAAMKhpr24jZYPOB8hwKSGeK6+ijalFlpGRVhwEM8Q0BiF0Mfek2xstb&cap.sym=vYBdWW/5V3HpmQYotPXG7yT3enk=&eos.app=eoscp&eos.bookingsize=1073741824&eos.targetsize=1073741824&fst.valid=1593481207&mgm.id=0000016e&mgm.logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330&mgm.replicahead=0&mgm.replicaindex=2&mgm.path=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08"
200630 01:43:58 time=1593481438.128312 func=Open                     level=ERROR logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1096 tid=00007f2ec162a700 source=RaidMetaLayout:317             tident=root.2975:441@jbod-mgmt-08.eoscluster sec=unix  uid=0 gid=0 name=root geo="" msg="open failure is fatal is RW mode" stripe=root://jbod-mgmt-04.eoscluster.sdfarm.kr:1096///eos/gsdc/raiddp/file-1g-jbod-mgmt-08?&cap.msg=AIr/NI3WAsSA0m0dLyyCE2x2R1wWxofwzLCHblwSA8/HradzlphOCxupkLWUrgmqj1h5P/3+HeRZSUCzd13ZLp7LvPj29Sanl7kGJyzpHjKtRMIqJ8y2Y9vHQAWwuthsjynbl2hrUdl8Ogb7cDWIi+RYidOSuSrsgz35IQU087/kf1Y1hDpbXiUkToFcnDTNIe9cz+qD6el+MYlKyE6hV4efOxwkEbFApTsd4bQR2y9oCA+E3Zn/A8VjQiRlUuJHgbxS67oxZ2/ieT4RqPJiCX6DLHiVH47QFSd4FzrdVLMDoNvposLziX6b91fVgx5I0x2a2XhOHvhvaROL6QEPxlMmyOUvdI3+G38iHn3k1UuP8VrIJ4q/Gq+zyxYw28EJ00rQPOXDztSGRiek7Sn4LDrJf6pc6ZP3lgyKaIr2a7ibS4a8/6R/0MYn2PoYQ6jatvc05JmJkiJ9Pvk8r3tMZyfb5yxO5PbkQBffqIFOQ0EBLZCSENKWNezYhgTgqt9PXymxsf0ZdbbMkWuX21wZ0us5aSBRY6NVdIF8H9WMX7KC8pUIDVPR5xvufRPmbcpPwNy8Ds++FybSRCUYu7y92dOQ+a6XYbewlPPJ49eX8KpcKeW6sL4ovWEffPZaoE1kMkVFNHyyTKkt9ti6c7zwxvAjMXQ8qvJpDLP/Ce5TYikVWPP4Q1e8MgJtOnOE3PvxxmjaKFA08QZkb6Y0PpW1uVTxUzm7ViMGlDJEgHztScttdu9fiwpQpmz2yxPAA2SDujjlAXeiwEkH/SxWsejyF8yVaNDo22FhUV8px63aLWepEbAPzh0IRM9djobuwHs2e3BtOfWqMfXtlPkWEdDmmsE2rlpjSL2ffZFlsz0+zDadSfec5apSXd3FkaJUvaGrVXB4BX17QFmO/TVCAeuSsrzGCWbXvdtcs3AQQAj3+p4B8pKbtWMWCo6eWBTHZmKJzss20eDeoFNPIlpZgqkX2SoIAIP8x5DXAhXx47g/jAAMKhpr24jZYPOB8hwKSGeK6+ijalFlpGRVhwEM8Q0BiF0Mfek2xstb&cap.sym=vYBdWW/5V3HpmQYotPXG7yT3enk=&eos.app=eoscp&eos.bookingsize=1073741824&eos.targetsize=1073741824&fst.valid=1593481207&mgm.id=0000016e&mgm.logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330&mgm.replicahead=0&mgm.replicaindex=2&mgm.path=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08
200630 01:43:58 time=1593481438.128338 func=open                     level=WARN  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1096 tid=00007f2ec162a700 source=XrdFstOfsFile:460              tident=root.2975:441@jbod-mgmt-08.eoscluster sec=(null) uid=2 gid=2 name=nobody geo="" msg="open error return recoverable error EIO(kXR_IOError)" fid=0000016e
200630 01:43:58 021 FstOfs_open: root.2975:441@jbod-mgmt-08.eoscluster Unable to open - failed open ; input/output error
200630 01:43:58 021 XrdLink: Unable to send to root.2975:441@jbod-mgmt-08.eoscluster; connection timed out
200630 01:43:58 time=1593481438.128456 func=_close                   level=INFO  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1096 tid=00007f2ec162a700 source=XrdFstOfsFile:1792             tident=root.2975:441@jbod-mgmt-08.eoscluster sec=      uid=2 gid=2 name=nobody geo="" msg="done close" rc=0 errc=0

The log in FST@jbod-mgmt-04:1096 says:

200630 01:43:58 time=1593481438.128423 func=open                     level=INFO  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=fst@jbod-mgmt-04.eoscluster.sdfarm.kr:1096 tid=00007f85c46fc700 source=XrdFstOfsFile:119              tident=1.6:33@jbod-mgmt-09.eoscluster sec=sss   uid=0 gid=0 name=daemon geo="" path=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08 info=cap.msg=<...>&cap.sym=<...>&eos.app=eoscp&eos.bookingsize=1073741824&eos.targetsize=1073741824&fst.blocksize=1048576&fst.readahead=true&fst.valid=1593481207&mgm.bookingsize=1073741824&mgm.id=0000016e&mgm.lid=543425842&mgm.logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330&mgm.path=/eos/gsdc/raiddp/file-1g-jbod-mgmt-08&mgm.replicahead=0&mgm.replicaindex=2 open_mode=200
200630 01:43:58 time=1593481438.128521 func=ProcessTpcOpaque         level=ERROR logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=fst@jbod-mgmt-04.eoscluster.sdfarm.kr:1096 tid=00007f85c46fc700 source=XrdFstOfsFile:2661             tident=1.6:33@jbod-mgmt-09.eoscluster sec=sss   uid=0 gid=0 name=daemon geo="" msg="FST still misses the required capability key"
200630 01:43:58 time=1593481438.128561 func=_close                   level=INFO  logid=7d59b1d6-ba72-11ea-a6ff-b8599fa51330 unit=fst@jbod-mgmt-04.eoscluster.sdfarm.kr:1096 tid=00007f85c46fc700 source=XrdFstOfsFile:1792             tident=1.6:33@jbod-mgmt-09.eoscluster sec=sss   uid=0 gid=0 name=daemon geo="" msg="done close" rc=0 errc=0

I have tried to find a hint in the source code about the message “FST still misses the required capability key”, but it seems I cannot.

In fact, when the request was redirected to jbod-mgmt-04 (with a different logid given by the MGM), the error message “FST still misses the required capability key” was shown repeatedly until the command timed out. It looks like an authorization problem…

What would you suggest?

Thank you.

Best regards,
Sang-Un

Hi Elvin,

For the root cause of the hanging, I think there is a misconfiguration on the FST running at jbod-mgmt-04.eoscluster.sdfarm.kr:1096. It claims “FST still misses …” whenever it receives a request, and the copy fails every time it gets redirected to this node with this error message. What could be the root cause of this? Anyway, once this node is disabled, all copy operations work fine.

By the way, the eos cp failure on a file from qrain with stripes=12 still persists… I have tested different layouts such as plain, replica, raiddp, raid6 and archive. For all of those, both eos cp and xrdcp work just fine (cp on the FUSE mount works as well). Only eos cp of a file from rain12 (qrain with stripes=12) fails. Currently 3 FST nodes out of 18 are off (the problematic one mentioned above, jbod-mgmt-04.eoscluster.sdfarm.kr:1096, and the two FSTs on jbod-mgmt-05, which is down now), but since 12 stripes with 4 parities means 8 data stripes, a read should tolerate up to 4 missing stripes, so I think it should still work, correct? I definitely confirmed that 16 stripes does not work.
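
To see which of the file’s stripes are actually reachable, I suppose something like the following might help (assuming eos file check reports on RAIN stripes the way it does on replicas):

eos file check /eos/gsdc/rain12/file-1g-jbod-mgmt-04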

For your information, here is the output of attr ls of rain12:

EOS Console [root://localhost] |/eos/gsdc/> attr ls /eos/gsdc/rain12
sys.eos.btime="1593041284.268289744"
sys.forced.blockchecksum="crc32c"
sys.forced.blocksize="1M"
sys.forced.checksum="adler"
sys.forced.layout="qrain"
sys.forced.nstripes="12"
sys.forced.space="default"
sys.recycle="/eos/gsdc/proc/recycle/"

And the full dump of eos cp can be found here: eoscpdump_full_rain12.log.

I could also catch a snippet of the MGM and FST logs related to this eos cp operation:

MGM log:

200630 06:55:04 069 XrootdXeq: root.2534:444@jbod-mgmt-04.eoscluster pub IPv4 login as daemon
200630 06:55:04 time=1593500104.629105 func=IdMap                    level=INFO  logid=static.............................. unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97921700 source=Mapping:1033                   tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=sss sec.name="daemon" sec.host="jbod-mgmt-04.eoscluster.sdfarm.kr" sec.vorg="" sec.grps="daemon" sec.role="" sec.info="" sec.app="eoscp" sec.tident="root.2534:444@jbod-mgmt-04.eoscluster" vid.uid=2 vid.gid=2
200630 06:55:04 time=1593500104.629153 func=open                     level=INFO  logid=a05d5116-ba9e-11ea-acda-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97921700 source=XrdMgmOfsFile:457              tident=root.2534:444@jbod-mgmt-04.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" op=read path=/eos/gsdc/rain12/file-1g-jbod-mgmt-04 info=eos.app=eoscp&mgm.pcmd=open&eos.cli.access=pio
200630 06:55:04 time=1593500104.629366 func=open                     level=INFO  logid=a05d5116-ba9e-11ea-acda-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97921700 source=XrdMgmOfsFile:945              tident=root.2534:444@jbod-mgmt-04.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" acl=0 r=0 w=0 wo=0 egroup=0 shared=0 mutable=1
200630 06:55:04 time=1593500104.629846 func=open                     level=INFO  logid=a05d5116-ba9e-11ea-acda-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97921700 source=XrdMgmOfsFile:2753             tident=root.2534:444@jbod-mgmt-04.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" op=read  path=/eos/gsdc/rain12/file-1g-jbod-mgmt-04 info=eos.app=eoscp&mgm.pcmd=open&eos.cli.access=pio target[0]=(jbod-mgmt-03.eoscluster.sdfarm.kr,387) target[1]=(jbod-mgmt-09.eoscluster.sdfarm.kr,1479) target[2]=(jbod-mgmt-01.eoscluster.sdfarm.kr,135) target[3]=(jbod-mgmt-06.eoscluster.sdfarm.kr,891) target[4]=(jbod-mgmt-06.eoscluster.sdfarm.kr,975) target[5]=(jbod-mgmt-07.eoscluster.sdfarm.kr,1059) target[6]=(jbod-mgmt-04.eoscluster.sdfarm.kr,555) target[7]=(jbod-mgmt-08.eoscluster.sdfarm.kr,1227) target[8]=(jbod-mgmt-01.eoscluster.sdfarm.kr,51) target[9]=(jbod-mgmt-03.eoscluster.sdfarm.kr,471) target[10]=(jbod-mgmt-08.eoscluster.sdfarm.kr,1311) target[11]=(jbod-mgmt-02.eoscluster.sdfarm.kr,303)  redirection=pio.0=jbod-mgmt-03.eoscluster.sdfarm.kr:1095&pio.1=jbod-mgmt-09.eoscluster.sdfarm.kr:1096&pio.2=jbod-mgmt-01.eoscluster.sdfarm.kr:1096&pio.3=jbod-mgmt-06.eoscluster.sdfarm.kr:1095&pio.4=jbod-mgmt-06.eoscluster.sdfarm.kr:1096&pio.5=jbod-mgmt-07.eoscluster.sdfarm.kr:1095&pio.6=jbod-mgmt-04.eoscluster.sdfarm.kr:1095&pio.7=jbod-mgmt-08.eoscluster.sdfarm.kr:1095&pio.8=jbod-mgmt-01.eoscluster.sdfarm.kr:1095&pio.9=jbod-mgmt-03.eoscluster.sdfarm.kr:1096&pio.10=jbod-mgmt-08.eoscluster.sdfarm.kr:1096&pio.11=jbod-mgmt-02.eoscluster.sdfarm.kr:1096&mgm.lid=1080298322&mgm.logid=a05d5116-ba9e-11ea-acda-b8599fa51330&cap.sym=<...>&cap.msg=<...>&mgm.id=0000022b&mgm.mtime=1593497491:1095
200630 06:55:04 time=1593500104.629878 func=open                     level=INFO  logid=a05d5116-ba9e-11ea-acda-b8599fa51330 unit=mgm@jbod-mgmt-01.eoscluster.sdfarm.kr:1094 tid=00007fdb97921700 source=XrdMgmOfsFile:2757             tident=root.2534:444@jbod-mgmt-04.eoscluster sec=sss   uid=2 gid=2 name=daemon geo="" info="redirection" hostport=pio.0=jbod-mgmt-03.eoscluster.sdfarm.kr:1095&pio.1=jbod-mgmt-09.eoscluster.sdfarm.kr:1096&pio.2=jbod-mgmt-01.eoscluster.sdfarm.kr:1096&pio.3=jbod-mgmt-06.eoscluster.sdfarm.kr:1095&pio.4=jbod-mgmt-06.eoscluster.sdfarm.kr:1096&pio.5=jbod-mgmt-07.eoscluster.sdfarm.kr:1095&pio.6=jbod-mgmt-04.eoscluster.sdfarm.kr:1095&pio.7=jbod-mgmt-08.eoscluster.sdfarm.kr:1095&pio.8=jbod-mgmt-01.eoscluster.sdfarm.kr:1095&pio.9=jbod-mgmt-03.eoscluster.sdfarm.kr:1096&pio.10=jbod-mgmt-08.eoscluster.sdfarm.kr:1096&pio.11=jbod-mgmt-02.eoscluster.sdfarm.kr:1096&mgm.lid=1080298322&mgm.logid=a05d5116-ba9e-11ea-acda-b8599fa51330&cap.sym=<...>&cap.msg=<...>&mgm.id=0000022b&mgm.mtime=1593497491:1095
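
(Side note: the per-stripe pio.N endpoints in such a redirection string can be listed one per line with a quick one-liner; mgm_open.log here is a hypothetical file containing the pasted log line:)

sh-4.2# grep -o 'pio\.[0-9]*=[^&]*' mgm_open.log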

FST log (jbod-mgmt-03.eoscluster.sdfarm.kr:1095) similar to the one reported in the first thread:

200630 06:55:04 020 XrootdXeq: root.2534:30@jbod-mgmt-04.eoscluster pub IPv4 login as root
200630 06:55:04 020 FstOfs_stat: root.2534:30@jbod-mgmt-04.eoscluster Unable to stat file /eos/gsdc/rain12/file-1g-jbod-mgmt-04; no such file or directory
200630 06:55:04 time=1593500104.661040 func=open                     level=INFO  logid=a05d5116-ba9e-11ea-acda-b8599fa51330 unit=fst@jbod-mgmt-03.eoscluster.sdfarm.kr:1095 tid=00007f507332b700 source=XrdFstOfsFile:119              tident=root.2534:30@jbod-mgmt-04.eoscluster sec=unix  uid=0 gid=0 name=root geo="" path=/eos/gsdc/rain12/file-1g-jbod-mgmt-04 info=cap.msg=<...>&cap.sym=<...>&eos.app=eoscp&fst.blocksize=1048576&fst.readahead=true&fst.valid=1593500163&mgm.logid=a05d5116-ba9e-11ea-acda-b8599fa51330&mgm.replicaindex=0 open_mode=0
200630 06:55:04 020 FstOfs_ProcessTpcOpaque: root.2534:30@jbod-mgmt-04.eoscluster Unable to open - capability illegal /eos/gsdc/rain12/file-1g-jbod-mgmt-04; key was rejected by service
200630 06:55:04 time=1593500104.661397 func=open                     level=ERROR logid=a05d5116-ba9e-11ea-acda-b8599fa51330 unit=fst@jbod-mgmt-03.eoscluster.sdfarm.kr:1095 tid=00007f507332b700 source=XrdFstOfsFile:134              tident=root.2534:30@jbod-mgmt-04.eoscluster sec=unix  uid=0 gid=0 name=root geo="" msg="failed while processing TPC/open opaque"
200630 06:55:04 time=1593500104.661451 func=_close                   level=INFO  logid=a05d5116-ba9e-11ea-acda-b8599fa51330 unit=fst@jbod-mgmt-03.eoscluster.sdfarm.kr:1095 tid=00007f507332b700 source=XrdFstOfsFile:1792             tident=root.2534:30@jbod-mgmt-04.eoscluster sec=unix  uid=0 gid=0 name=root geo="" msg="done close" rc=0 errc=0
200630 06:55:04 020 XrootdXeq: root.2534:30@jbod-mgmt-04.eoscluster disc 0:00:00

Do you have any idea about this? Please let me know if you need anything for further investigation.

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

The problem with the missing “capability key” comes from a connectivity issue between the FST and the MGM: the FST receives this information from the MGM when it boots. You should see a log line similar to the following at boot time:

200630 09:36:28 time=1593502588.502418 func=ProcessFstConfigChange   level=INFO  logid=static.............................. unit=fst@esdss000.cern.ch:2001 tid=00007f4a0d1eb700 source=Communicator:163               tident= sec=(null) uid=99 gid=99 name=- geo="" FST configuration change - key=symkey, value=2U+B8jO/OYwdfsdadfUIbrCAJ50Fv5Y=

If this does not happen, then the FST has a problem communicating with the MGM.

The subsequent error message that you see in the FST logs, “Unable to open - capability illegal”, is tightly coupled to the sss key validation mentioned just above.
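
Since every FST publishes a checksum of its keytab (stat.sys.keytab), a quick cross-check is to compare that value across your nodes; a sketch, to be repeated for each FST host and port, where all nodes should report the same value:

$ eos node status /eos/jbod-mgmt-04.eoscluster.sdfarm.kr:1096/fst | grep stat.sys.keytab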

Cheers,
Elvin

Hi Elvin,

It seems that the FST @ jbod-mgmt-04.eoscluster.sdfarm.kr:1096 does not communicate properly with the MGM, because there is no log line like the one you mentioned (containing “key=symkey”), while the other FSTs definitely have similar lines (with a different “value”).

How can I make this FST re-bootstrap so that it starts communicating with the MGM? The eos fst boot command with --syncmgm does not seem to help, and all filesystems are already booted. I have also tried turning the node jbod-mgmt-04.eoscluster.sdfarm.kr:1096 off and on.

As it runs in a container, I also tried docker restart <container_id>, but that does not help either (I confirmed that the process restarted, but it does not appear to re-initiate communication with the MGM).

It seems to me that there was a misconfiguration at the beginning… Is it then necessary to set the FST up from scratch after cleaning it up? Do you have any idea?

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

A restart of the FST process should be enough in general, unless the FST is indeed misconfigured; a restart of the container should also be good enough. If it still doesn’t connect, then I would test the network connectivity and then check in /etc/sysconfig/eos_env which endpoint the FST is supposed to connect to.
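
For example, something along these lines, with EOS_MGM_URL and EOS_BROKER_URL being the usual variables defined in that file:

$ grep -E "EOS_MGM_URL|EOS_BROKER_URL" /etc/sysconfig/eos_env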

Also, all the symkey values that you see on the FSTs should match; otherwise the MGM and the FSTs use different sss keys, and that is why you get errors like “Unable to open - capability illegal”. For example, in my setup, just restarted, I have:

$ for i in {1..7}; do tail -100 /var/log/eos/fst$i/xrdlog.fst$i | grep key=symkey; done
200703 12:38:37 time=1593772717.960307 func=ProcessFstConfigChange   level=INFO  logid=static.............................. unit=fst@esdss000.cern.ch:2001 tid=00007f8d5503f700 source=Communicator:163               tident= sec=(null) uid=99 gid=99 name=- geo="" FST configuration change - key=symkey, value=2U+B8jO/OYwptDUIbrCAJ50Fv5Y=
200703 12:38:37 time=1593772717.918915 func=ProcessFstConfigChange   level=INFO  logid=static.............................. unit=fst@esdss000.cern.ch:2002 tid=00007fb806ef2700 source=Communicator:163               tident= sec=(null) uid=99 gid=99 name=- geo="" FST configuration change - key=symkey, value=2U+B8jO/OYwptDUIbrCAJ50Fv5Y=
200703 12:38:37 time=1593772717.903869 func=ProcessFstConfigChange   level=INFO  logid=static.............................. unit=fst@esdss000.cern.ch:2003 tid=00007f984530b700 source=Communicator:163               tident= sec=(null) uid=99 gid=99 name=- geo="" FST configuration change - key=symkey, value=2U+B8jO/OYwptDUIbrCAJ50Fv5Y=
200703 12:38:37 time=1593772717.909841 func=ProcessFstConfigChange   level=INFO  logid=static.............................. unit=fst@esdss000.cern.ch:2004 tid=00007f855cf1b700 source=Communicator:163               tident= sec=(null) uid=99 gid=99 name=- geo="" FST configuration change - key=symkey, value=2U+B8jO/OYwptDUIbrCAJ50Fv5Y=
200703 12:38:38 time=1593772718.037272 func=ProcessFstConfigChange   level=INFO  logid=static.............................. unit=fst@esdss000.cern.ch:2005 tid=00007f9849575700 source=Communicator:163               tident= sec=(null) uid=99 gid=99 name=- geo="" FST configuration change - key=symkey, value=2U+B8jO/OYwptDUIbrCAJ50Fv5Y=
200703 12:38:37 time=1593772717.972703 func=ProcessFstConfigChange   level=INFO  logid=static.............................. unit=fst@esdss000.cern.ch:2006 tid=00007f33567f3700 source=Communicator:163               tident= sec=(null) uid=99 gid=99 name=- geo="" FST configuration change - key=symkey, value=2U+B8jO/OYwptDUIbrCAJ50Fv5Y=
200703 12:38:37 time=1593772717.983751 func=ProcessFstConfigChange   level=INFO  logid=static.............................. unit=fst@esdss000.cern.ch:2007 tid=00007ff6c1cee700 source=Communicator:163               tident= sec=(null) uid=99 gid=99 name=- geo="" FST configuration change - key=symkey, value=2U+B8jO/OYwptDUIbrCAJ50Fv5Y=
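
To verify at a glance that they all agree, the same loop can be piped through sort; a single output line means all FSTs carry the same key:

$ for i in {1..7}; do tail -100 /var/log/eos/fst$i/xrdlog.fst$i | grep -o 'value=.*'; done | sort -u
value=2U+B8jO/OYwptDUIbrCAJ50Fv5Y=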

Cheers,
Elvin

Hi Elvin,

There isn’t a network issue between the MGM and the FST @ jbod-mgmt-04.eoscluster.sdfarm.kr:1096, since connections to the relevant ports succeed, as shown below:

From MGM (jbod-mgmt-01.eoscluster.sdfarm.kr)

sh-4.2# nc -v -w 2 jbod-mgmt-04.eoscluster.sdfarm.kr 1096
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 134.75.125.59:1096.

From FST (jbod-mgmt-04.eoscluster.sdfarm.kr)

sh-4.2# nc -v -w 2 jbod-mgmt-01.eoscluster.sdfarm.kr 1094
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 134.75.125.53:1094.
^C
sh-4.2# nc -v -w 2 jbod-mgmt-01.eoscluster.sdfarm.kr 1097
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 134.75.125.53:1097.
^C

When the FST is restarted, the symkey is not retrieved or configured on jbod-mgmt-04.eoscluster.sdfarm.kr:1096. All other FSTs do have symkey values in their status, for example:

EOS Console [root://localhost] |/eos/gsdc/> node status /eos/jbod-mgmt-03.eoscluster.sdfarm.kr:1096/fst
# ------------------------------------------------------------------------------------
# Node Variables
# ....................................................................................
debug.level                      := info
debug.state                      := debug
domain                           := MGM
gw.ntx                           := 10
gw.rate                          := 120
http.port                        := 8002
manager                          := jbod-mgmt-01.eoscluster.sdfarm.kr:1094
stat.active                      := online
stat.balance.ntx                 := 2
stat.balance.rate                := 25
stat.geotag                      := kisti::gsdc::g01
stat.hostport                    := jbod-mgmt-03.eoscluster.sdfarm.kr:1096
stat.net.ethratemib              := 119
stat.net.inratemib               := 0.00237768
stat.net.outratemib              := 0.0174825
stat.publishtimestamp            := 1594001898701
stat.sys.eos.start               := Thu Jun 25 23:35:32 2020
stat.sys.eos.version             := 4.7.7-1
stat.sys.kernel                  := 3.10.0-1127.8.2.el7.x86_64
stat.sys.keytab                  := 08ca24f0
stat.sys.rss                     := 647479296
stat.sys.sockets                 := 31
stat.sys.threads                 := 377
stat.sys.uptime                  :=  02:18:18 up 10 days, 21:19,  0 users,  load average: 0.47, 0.49, 0.51
stat.sys.vsize                   := 4082995200
stat.sys.xrootd.version          := v4.11.3
status                           := on
symkey                           := F5igzjdI+pFyJ49/45e9kA14sCc=
txgw                             := off

So I tried to set “symkey” by hand, as below, but it did not help:

node config jbod-mgmt-04.eoscluster.sdfarm.kr:1096 symkey=F5igzjdI+pFyJ49/45e9kA14sCc=
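
(Whether the value was applied can be re-checked with the same node status query as above:)

sh-4.2# eos node status /eos/jbod-mgmt-04.eoscluster.sdfarm.kr:1096/fst | grep symkey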

The MGM URL and BROKER URL are set correctly (the same configuration file is shared with the other FSTs).
Should I destroy the affected FST (remove its filesystems, and the node from the group and space) and then add it again as a new FST? What else could I try?

Thank you.

Best regards,
Sang-Un

Hi Elvin,

I have managed to get the FST @ jbod-mgmt-04.eoscluster.sdfarm.kr:1096 working again. The steps I took are the following (a consolidated sketch of the commands appears after the list):

  1. node rm jbod-mgmt-04.eoscluster.sdfarm.kr:1096 (all filesystems must be set to empty before running this command: node config jbod-mgmt-04.eoscluster.sdfarm.kr:1096 configstatus=empty)
  2. Stop and start the affected FST container
  3. Confirm that the symkey is properly stored on the node
  4. Register all filesystems belonging to the node by running eosfstregister -i jbod-mgmt-04.eoscluster.sdfarm.kr --port 1096 /jbod/ default:84 (the -i option imports the existing fsids)
  5. Move all filesystems to their proper scheduling groups
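
Put together, the sequence looked roughly like this (a sketch from memory; <container_id> and the per-filesystem moves are placeholders, and eos fs mv is only my shorthand for step 5):

# on the MGM; all filesystems on the node must be set to empty first
sh-4.2# eos node config jbod-mgmt-04.eoscluster.sdfarm.kr:1096 configstatus=empty
sh-4.2# eos node rm jbod-mgmt-04.eoscluster.sdfarm.kr:1096
# on the FST host
sh-4.2# docker stop <container_id> && docker start <container_id>
sh-4.2# eosfstregister -i jbod-mgmt-04.eoscluster.sdfarm.kr --port 1096 /jbod/ default:84
# back on the MGM: confirm the symkey, then regroup the filesystems
sh-4.2# eos node status /eos/jbod-mgmt-04.eoscluster.sdfarm.kr:1096/fst | grep symkey
sh-4.2# eos fs mv <fsid> <schedgroup>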

Some steps might be missing here due to my imperfect memory; in any case, the node started communicating with the MGM after being re-attached.

So I performed the copying tests again, and everything looks fine, at least for the plain, replica, raiddp, raid6 and archive layouts: copying both in and out works just fine.

However, eos cp still cannot read files from qrain with the 12-stripe layout. For qrain with 8 or 10 stripes, eos cp works.

As noted above in the first thread, xrdcp does work when reading files from qrain with 12 stripes. FYI, the layout information is below, and the full dump of eos cp can be found here.

EOS Console [root://localhost] |/eos/gsdc/> attr ls rain12
sys.eos.btime="1594253011.336590387"
sys.forced.blockchecksum="crc32c"
sys.forced.blocksize="1M"
sys.forced.checksum="adler"
sys.forced.layout="qrain"
sys.forced.nstripes="12"
sys.forced.space="default"
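
(For reference, the 8- and 10-stripe comparison directories carry the same attributes apart from the stripe count; such per-directory settings are applied with attr set, e.g. for the rain10 directory referenced below:)

EOS Console [root://localhost] |/eos/gsdc/> attr set sys.forced.layout=qrain /eos/gsdc/rain10
EOS Console [root://localhost] |/eos/gsdc/> attr set sys.forced.nstripes=10 /eos/gsdc/rain10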

Thank you.

Best regards,
Sang-Un

Hi Elvin,

FYI, please find the successful full dump of xrdcp here.

As you already noted above, it looks like a stat error for the requested file.

[2020-07-09 00:09:31.671683 +0000][Dump   ][XRootD            ] [jbod-mgmt-06.eoscluster.sdfarm.kr:1096] Got a kXR_error response to request kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-04, flags: none) [3011] Unable to stat file /eos/gsdc/rain12/file-1g-jbod-mgmt-04; no such file or directory
[2020-07-09 00:09:31.671693 +0000][Debug  ][XRootD            ] [jbod-mgmt-06.eoscluster.sdfarm.kr:1096] Handling error while processing kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-04, flags: none): [ERROR] Error response: No such file or directory.
[2020-07-09 00:09:31.671699 +0000][Debug  ][ExDbgMsg          ] [jbod-mgmt-06.eoscluster.sdfarm.kr:1096] Calling MsgHandler: 0x252ff60 (message: kXR_stat (path: /eos/gsdc/rain12/file-1g-jbod-mgmt-04, flags: none) ) with status: [ERROR] Error response: No such file or directory.
[2020-07-09 00:09:31.671706 +0000][Debug  ][ExDbgMsg          ] [jbod-mgmt-06.eoscluster.sdfarm.kr:1096] Destroying MsgHandler: 0x252ff60.
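
(In case it helps with reproducing: such client-side dumps can be obtained by raising the XRootD client log level, e.g.:)

sh-4.2# XRD_LOGLEVEL=Dump xrdcp -f root://jbod-mgmt-01.eoscluster.sdfarm.kr:1094//eos/gsdc/rain12/file-1g-jbod-mgmt-04 /tmp/file-1g 2> xrdcp_dump.log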

For eos stat, it returns

sh-4.2# eos stat /eos/gsdc/rain12/file-1g-jbod-mgmt-04
  File: '/eos/gsdc/rain12/file-1g-jbod-mgmt-04'  Size: 1073741824            1.07 GB regular file

And for eos file info,

sh-4.2# eos file info /eos/gsdc/rain12/file-1g-jbod-mgmt-04
  File: '/eos/gsdc/rain12/file-1g-jbod-mgmt-04'  Flags: 0640
  Size: 1073741824
Modify: Thu Jul  9 00:04:14 2020 Timestamp: 1594253054.423574000
Change: Thu Jul  9 00:04:10 2020 Timestamp: 1594253050.303429710
Birth : Thu Jul  9 00:04:10 2020 Timestamp: 1594253050.303429710
  CUid: 2 CGid: 2  Fxid: 000002eb Fid: 747    Pid: 78   Pxid: 0000004e
XStype: adler    XS: 42 cc a7 27    ETAGs: "200521285632:42cca727"
Layout: qrain Stripes: 12 Blocksize: 1M LayoutId: 40640b52
  #Rep: 12
┌───┬──────┬─────────────────────────────────┬────────────────┬─────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                             host│      schedgroup│                 path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴─────────────────────────────────┴────────────────┴─────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0     1025 jbod-mgmt-07.eoscluster.sdfarm.kr       default.16 /jbod/box_13_disk_016     booted             rw      nodrain   online         kisti::gsdc::g03
 1     1445 jbod-mgmt-09.eoscluster.sdfarm.kr       default.16 /jbod/box_18_disk_016     booted             rw      nodrain   online         kisti::gsdc::g03
 2      941 jbod-mgmt-06.eoscluster.sdfarm.kr       default.16 /jbod/box_12_disk_016     booted             rw      nodrain   online         kisti::gsdc::g02
 3      101 jbod-mgmt-01.eoscluster.sdfarm.kr       default.16 /jbod/box_02_disk_016     booted             rw      nodrain   online         kisti::gsdc::g01
 4     1277 jbod-mgmt-08.eoscluster.sdfarm.kr       default.16 /jbod/box_16_disk_016     booted             rw      nodrain   online         kisti::gsdc::g03
 5      269 jbod-mgmt-02.eoscluster.sdfarm.kr       default.16 /jbod/box_04_disk_016     booted             rw      nodrain   online         kisti::gsdc::g01
 6      521 jbod-mgmt-04.eoscluster.sdfarm.kr       default.16 /jbod/box_07_disk_016     booted             rw      nodrain   online         kisti::gsdc::g02
 7     1361 jbod-mgmt-09.eoscluster.sdfarm.kr       default.16 /jbod/box_17_disk_016     booted             rw      nodrain   online         kisti::gsdc::g03
 8      185 jbod-mgmt-02.eoscluster.sdfarm.kr       default.16 /jbod/box_03_disk_016     booted             rw      nodrain   online         kisti::gsdc::g01
 9      857 jbod-mgmt-06.eoscluster.sdfarm.kr       default.16 /jbod/box_11_disk_016     booted             rw      nodrain   online         kisti::gsdc::g02
 10      17 jbod-mgmt-01.eoscluster.sdfarm.kr       default.16 /jbod/box_01_disk_016     booted             rw      nodrain   online         kisti::gsdc::g01
 11     605 jbod-mgmt-04.eoscluster.sdfarm.kr       default.16 /jbod/box_08_disk_016     booted             rw      nodrain   online         kisti::gsdc::g02

*******

Also, the successful full eos cp dump for reading a file stored in qrain with 10 stripes is here, and the file info is the following:

sh-4.2# eos file info /eos/gsdc/rain10/file-1g-jbod-mgmt-04
  File: '/eos/gsdc/rain10/file-1g-jbod-mgmt-04'  Flags: 0640
  Size: 1073741824
Modify: Thu Jul  9 00:02:46 2020 Timestamp: 1594252966.532386000
Change: Thu Jul  9 00:02:40 2020 Timestamp: 1594252960.273673763
Birth : Thu Jul  9 00:02:40 2020 Timestamp: 1594252960.273673763
  CUid: 2 CGid: 2  Fxid: 000002e3 Fid: 739    Pid: 74   Pxid: 0000004a
XStype: adler    XS: 42 cc a7 27    ETAGs: "198373801984:42cca727"
Layout: qrain Stripes: 10 Blocksize: 1M LayoutId: 40640952
  #Rep: 10
┌───┬──────┬─────────────────────────────────┬────────────────┬─────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                             host│      schedgroup│                 path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴─────────────────────────────────┴────────────────┴─────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0     1017 jbod-mgmt-07.eoscluster.sdfarm.kr        default.8 /jbod/box_13_disk_008     booted             rw      nodrain   online         kisti::gsdc::g03
 1     1353 jbod-mgmt-09.eoscluster.sdfarm.kr        default.8 /jbod/box_17_disk_008     booted             rw      nodrain   online         kisti::gsdc::g03
 2      849 jbod-mgmt-06.eoscluster.sdfarm.kr        default.8 /jbod/box_11_disk_008     booted             rw      nodrain   online         kisti::gsdc::g02
 3     1269 jbod-mgmt-08.eoscluster.sdfarm.kr        default.8 /jbod/box_16_disk_008     booted             rw      nodrain   online         kisti::gsdc::g03
 4      345 jbod-mgmt-03.eoscluster.sdfarm.kr        default.8 /jbod/box_05_disk_008     booted             rw      nodrain   online         kisti::gsdc::g01
 5      261 jbod-mgmt-02.eoscluster.sdfarm.kr        default.8 /jbod/box_04_disk_008     booted             rw      nodrain   online         kisti::gsdc::g01
 6     1101 jbod-mgmt-07.eoscluster.sdfarm.kr        default.8 /jbod/box_14_disk_008     booted             rw      nodrain   online         kisti::gsdc::g03
 7      933 jbod-mgmt-06.eoscluster.sdfarm.kr        default.8 /jbod/box_12_disk_008     booted             rw      nodrain   online         kisti::gsdc::g02
 8      177 jbod-mgmt-02.eoscluster.sdfarm.kr        default.8 /jbod/box_03_disk_008     booted             rw      nodrain   online         kisti::gsdc::g01
 9      513 jbod-mgmt-04.eoscluster.sdfarm.kr        default.8 /jbod/box_07_disk_008     booted             rw      nodrain   online         kisti::gsdc::g02

*******

Please just let me know if you need any further information.

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

Glad to hear there is some progress. Could you please paste the following information from the MGM container?
grep "symkey" /var/eos/md/so.mgm.dump.<fqdn_mgm>:<port>

Looking at the log from the failed transfer, it looks like the symkey that the MGM uses and the ones the FSTs have still don’t match.

The stat error is not relevant; the problem is still the “capability illegal” messages.

Thanks,
Elvin

Hi Elvin,

The output of grep "symkey" /var/eos/md/so.mgm.dump.jbod-mgmt-01.eoscluster.sdfarm.kr\:1094 is following:

sh-4.2# grep "symkey" /var/eos/md/so.mgm.dump.jbod-mgmt-01.eoscluster.sdfarm.kr\:1094
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.02 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.02 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.01 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193000.71 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.01 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.01 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.01 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:411753.27 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193000.34 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193000.28 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193000.22 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.02 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.01 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.01 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1192999.97 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1193001.02 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1192999.85 changeid:0
key=symkey                   value:F5igzjdI+pFyJ49/45e9kA14sCc=     age:1192999.78 changeid:0

It looks like the symkey value is identical for all FSTs. Is there anything else to check?

Thank you.

Best regards,
Sang-Un