Dear EOS Experts and Administrators,
Three days ago we restarted Kolkata::EOS2 after migrating the namespace from in-memory to QuarkDB.
At first we found permission errors in xrdlog.mgm on both the master and the slave: we were able to read files from EOS, but unable to write. Before the migration, both reads and writes worked.
Today we enabled "eos fsck" and then checked the xrdlog.mgm log, and found many replication errors, e.g. "could not place new replica" and "[3009] Unable to schedule stripes for reconstruction".
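For reference, fsck was switched on via the usual CLI, roughly like this on our release (the sub-command names differ between EOS versions, e.g. older releases use "eos fsck enable", so please read this as a sketch rather than the verbatim commands):
[root@eos-mgm ~]# eos fsck toggle-collect   # start collecting inconsistencies (newer fsck interface)
[root@eos-mgm ~]# eos fsck toggle-repair    # start the repair thread
[root@eos-mgm ~]# eos fsck stat             # check that collection/repair are running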
======================
[root@eos-mgm ~]# tail -25 /var/log/eos/mgm/xrdlog.mgm
210514 13:15:05 time=1620978305.070644 func=Repair level=INFO logid=static… unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ef7f6700 source=FsckEntry:794 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="fsck repair" fxid=02cf6484 err_type=6
210514 13:15:05 time=1620978305.070660 func=RepairReplicaInconsistencies level=INFO logid=4bfd2332-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ef7f6700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf6484 fsid=91
210514 13:15:05 time=1620978305.070667 func=open level=INFO logid=4c20ca80-b488-11eb-9eab-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc915bfc700 source=XrdMgmOfsFile:2490 tident=daemon.5023:568@eos-mgm sec=sss uid=2 gid=2 name=daemon geo="" msg="nominal stripes:7 reconstructed stripes=2 group_idx=12"
210514 13:15:05 time=1620978305.070677 func=open level=INFO logid=4c20ca80-b488-11eb-9eab-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc915bfc700 source=XrdMgmOfsFile:2499 tident=daemon.5023:568@eos-mgm sec=sss uid=2 gid=2 name=daemon geo="" msg="plain booking size is 5246976"
210514 13:15:05 time=1620978305.070675 func=RepairReplicaInconsistencies level=INFO logid=4bfd2332-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ef7f6700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf6484 fsid=88
210514 13:15:05 time=1620978305.070687 func=RepairReplicaInconsistencies level=INFO logid=4bfd2332-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ef7f6700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf6484 fsid=86
210514 13:15:05 time=1620978305.070696 func=RepairReplicaInconsistencies level=INFO logid=4bfd2332-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ef7f6700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf6484 fsid=89
210514 13:15:05 time=1620978305.070705 func=RepairReplicaInconsistencies level=INFO logid=4bfd2332-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ef7f6700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf6484 fsid=90
210514 13:15:05 time=1620978305.070713 func=RepairReplicaInconsistencies level=INFO logid=4bfd2332-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ef7f6700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf6484 fsid=85
210514 13:15:05 time=1620978305.070718 func=Emsg level=ERROR logid=4c20ca80-b488-11eb-9eab-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc915bfc700 source=XrdMgmOfsFile:3231 tident=daemon.5023:568@eos-mgm sec=sss uid=2 gid=2 name=daemon geo="" Unable to schedule stripes for reconstruction /eos/alicekolkata/grid/15/40831/7d17b8b4-5a75-11eb-9b25-c36162373fca; No space left on device
210514 13:15:05 time=1620978305.070794 func=IdMap level=INFO logid=static… unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc915bfc700 source=Mapping:993 tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=sss sec.name="daemon" sec.host="eos-mgm.tier2-kol.res.in" sec.vorg="" sec.grps="daemon" sec.role="" sec.info="" sec.app="" sec.tident="daemon.5023:568@eos-mgm" vid.uid=2 vid.gid=2
210514 13:15:05 time=1620978305.070825 func=open level=INFO logid=4c20e2e0-b488-11eb-9eab-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc915bfc700 source=XrdMgmOfsFile:500 tident=daemon.5023:568@eos-mgm sec=sss uid=2 gid=2 name=daemon geo="" op=read path=/eos/alicekolkata/grid/11/14854/93066332-5a75-11eb-af56-37d7d0319c85 info=cap.msg=<…>&cap.sym=<…>&eos.encodepath=curl&eos.pio.action=reconstruct&eos.pio.recfs=91&mgm.logid=4c1fc2f2-b488-11eb-a3ff-e4434b664554&tpc.stage=placement
210514 13:15:05 time=1620978305.070831 func=DoIt level=ERROR logid=4c1fb3ca-b488-11eb-8caf-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ecff1700 source=DrainTransferJob:154 tident= sec= uid=0 gid=0 name= geo="" src=root://eoskolkata.tier2-kol.res.in:1094//#curl#/eos/alicekolkata/grid/15/40831/7d17b8b4-5a75-11eb-9b25-c36162373fca dst=root://eos06.tier2-kol.res.in:1095//replicate:0 logid=4c1fbffa-b488-11eb-8caf-e4434b664554 tpc_err=[ERROR] Server responded with an error: [3009] Unable to schedule stripes for reconstruction /eos/alicekolkata/grid/15/40831/7d17b8b4-5a75-11eb-9b25-c36162373fca; No space left on device
210514 13:15:05 time=1620978305.070982 func=DoIt level=INFO logid=4c20d1e2-b488-11eb-b3d1-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5f17fa700 source=DrainTransferJob:145 tident= sec= uid=0 gid=0 name= geo="" [tpc]: app=fsck logid=4c20dbce-b488-11eb-b3d1-e4434b664554 src_url=root://eoskolkata.tier2-kol.res.in:1094//#curl#/eos/alicekolkata/grid/00/19018/f2fc4c7a-5a75-11eb-b976-87f2ec6129f1 => dst_url=root://eos10.tier2-kol.res.in:1095//replicate:0 prepare_msg=[SUCCESS]
210514 13:15:05 time=1620978305.070983 func=open level=INFO logid=4c20e2e0-b488-11eb-9eab-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc915bfc700 source=XrdMgmOfsFile:1037 tident=daemon.5023:568@eos-mgm sec=sss uid=2 gid=2 name=daemon geo="" acl=0 r=0 w=0 wo=0 egroup=0 shared=0 mutable=1 facl=0
210514 13:15:05 time=1620978305.071061 func=SelectDstFs level=ERROR logid=4c1fb3ca-b488-11eb-8caf-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ecff1700 source=DrainTransferJob:546 tident= sec= uid=0 gid=0 name= geo="" msg="fxid=02cf6164 could not place new replica"
210514 13:15:05 time=1620978305.071082 func=ReportError level=ERROR logid=4c1fb3ca-b488-11eb-8caf-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ecff1700 source=DrainTransferJob:45 tident= sec= uid=0 gid=0 name= geo="" msg="failed to select destination file system" fxid=02cf6164
210514 13:15:05 time=1620978305.071100 func=RepairReplicaInconsistencies level=ERROR logid=4bfd193c-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5ecff1700 source=FsckEntry:666 tident= sec= uid=0 gid=0 name= geo="" msg="replica inconsistency repair failed fxid=02cf6164 src_fsid=87"
210514 13:15:05 time=1620978305.071177 func=Repair level=INFO logid=static… unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5eb7ee700 source=FsckEntry:794 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="fsck repair" fxid=02cf64c0 err_type=6
210514 13:15:05 time=1620978305.071204 func=RepairReplicaInconsistencies level=INFO logid=4bfd23c8-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5eb7ee700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf64c0 fsid=91
210514 13:15:05 time=1620978305.071221 func=RepairReplicaInconsistencies level=INFO logid=4bfd23c8-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5eb7ee700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf64c0 fsid=90
210514 13:15:05 time=1620978305.071232 func=RepairReplicaInconsistencies level=INFO logid=4bfd23c8-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5eb7ee700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf64c0 fsid=87
210514 13:15:05 time=1620978305.071243 func=RepairReplicaInconsistencies level=INFO logid=4bfd23c8-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5eb7ee700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf64c0 fsid=88
210514 13:15:05 time=1620978305.071253 func=open level=INFO logid=4c20e2e0-b488-11eb-9eab-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc915bfc700 source=XrdMgmOfsFile:2490 tident=daemon.5023:568@eos-mgm sec=sss uid=2 gid=2 name=daemon geo="" msg="nominal stripes:7 reconstructed stripes=2 group_idx=12"
210514 13:15:05 time=1620978305.071257 func=RepairReplicaInconsistencies level=INFO logid=4bfd23c8-b488-11eb-8294-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5eb7ee700 source=FsckEntry:507 tident= sec= uid=0 gid=0 name= geo="" fxid=02cf64c0 fsid=89
[root@eos-mgm ~]#
We checked "eos group ls" and it seems that balancing is going on in four groups (a per-group filesystem listing is sketched after the table below). We also grepped xrdlog.mgm for "Unable to schedule stripes for reconstruction" for deeper investigation.
[root@eos-mgm ~]# eos group ls | sort -k6h
type       name         status  N(fs)  dev(filled)  avg(filled)  sig(filled)  balancing  bal-shd
groupview default.5 on 7 11.60 14.65 4.95 balancing 11
groupview default.12 on 7 1.19 34.70 0.53 balancing 8
groupview default.3 on 7 1.17 34.72 0.51 balancing 14
groupview default.14 on 7 0.27 35.48 0.16 idle 0
groupview default.7 on 7 13.98 35.75 5.73 balancing 7
groupview default.13 on 7 0.85 35.97 0.40 idle 0
groupview default.11 on 7 0.84 36.07 0.39 idle 0
groupview default.15 on 7 0.97 36.15 0.44 idle 0
groupview default.10 on 7 0.80 36.18 0.36 idle 0
groupview default.1 on 7 0.79 36.23 0.37 idle 0
groupview default.4 on 7 0.59 36.24 0.28 idle 0
groupview default.8 on 7 0.78 36.26 0.36 idle 0
groupview default.2 on 7 0.62 36.33 0.31 idle 0
groupview default.9 on 7 0.92 36.34 0.42 idle 0
groupview default.6 on 7 0.75 36.36 0.34 idle 0
groupview default.0 on 7 0.75 36.67 0.35 idle 0
[root@eos-mgm ~]#
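To see which individual filesystems inside a balancing group drive the fill spread, the listing can simply be filtered per group; a trivial example for default.5:
[root@eos-mgm ~]# eos -b fs ls | grep default.5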
[root@eos-mgm ~]# tail -5 /var/log/eos/mgm/xrdlog.mgm | grep "Unable to schedule stripes for reconstruction"
210514 12:39:42 time=1620976182.125798 func=DoIt level=ERROR logid=5abf4170-b483-11eb-b3d1-e4434b664554 unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007fc5f17fa700 source=DrainTransferJob:154 tident= sec= uid=0 gid=0 name= geo="" src=root://eoskolkata.tier2-kol.res.in:1094//#curl#/eos/alicekolkata/grid/08/25677/29079d00-579e-11e5-96ba-732cceb30bbe dst=root://eos05.tier2-kol.res.in:1095//replicate:0 logid=5abf5610-b483-11eb-b3d1-e4434b664554 tpc_err=[ERROR] Server responded with an error: [3009] Unable to schedule stripes for reconstruction /eos/alicekolkata/grid/08/25677/29079d00-579e-11e5-96ba-732cceb30bbe; No space left on device
[root@eos-mgm ~]# eos file info /eos/alicekolkata/grid/08/25677/29079d00-579e-11e5-96ba-732cceb30bbe
File: '/eos/alicekolkata/grid/08/25677/29079d00-579e-11e5-96ba-732cceb30bbe' Flags: 0600
Size: 101415327
Modify: Fri Mar 29 20:36:52 2019 Timestamp: 1553872012.447777000
Change: Fri Mar 29 20:36:47 2019 Timestamp: 1553872007.671813569
Birth: Thu Jan 1 05:30:00 1970 Timestamp: 0.000000000
CUid: 10367 CGid: 1395 Fxid: 0028751f Fid: 2651423 Pid: 22691 Pxid: 000058a3
XStype: adler XS: 69 91 75 39 ETAGs: "711735942053888:69917539"
Layout: raid6 Stripes: 7 Blocksize: 1M LayoutId: 20640642 Redundancy: d2::t0
#Rep: 6
no.  fs-id  host                     schedgroup  path     boot    configstatus  drain    active  geotag
0 91 eos09.tier2-kol.res.in default.12 /xdata6 booted rw nodrain online Kolkata::EOS2
1 86 eos08.tier2-kol.res.in default.12 /xdata6 booted rw nodrain online Kolkata::EOS2
2 89 eos06.tier2-kol.res.in default.12 /xdata6 booted rw nodrain online Kolkata::EOS2
3 88 eos07.tier2-kol.res.in default.12 /xdata6 booted rw nodrain online Kolkata::EOS2
4 90 eos10.tier2-kol.res.in default.12 /xdata6 booted rw nodrain online Kolkata::EOS2
5 85 eos04.tier2-kol.res.in default.12 /xdata6 booted rw nodrain online Kolkata::EOS2
[root@eos-mgm ~]#
In the above output of "eos file info", one replica, the one on eos05, is missing. There are a great many files in Kolkata::EOS2 with one or two replicas missing, all in groups 5, 12, 3 and 7, the groups where balancing is going on.
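To cross-check what the FSTs actually report for such a file, the stripe-level information can be queried with "eos file check" (shown as an example on the first affected path above):
[root@eos-mgm ~]# eos file check /eos/alicekolkata/grid/08/25677/29079d00-579e-11e5-96ba-732cceb30bbe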
[root@eos-mgm ~]# eos file info /eos/alicekolkata/grid/12/34178/92cf7dd2-3f30-11eb-997d-4bf1c2cd6a34
File: '/eos/alicekolkata/grid/12/34178/92cf7dd2-3f30-11eb-997d-4bf1c2cd6a34' Flags: 0664
Size: 11998913
Modify: Wed Dec 16 05:22:22 2020 Timestamp: 1608076342.468447000
Change: Wed Dec 16 05:22:22 2020 Timestamp: 1608076342.179614875
Birth: Wed Dec 16 05:22:22 2020 Timestamp: 1608076342.179614875
CUid: 10367 CGid: 1395 Fxid: 02904406 Fid: 43009030 Pid: 7597 Pxid: 00001dad
XStype: adler XS: 47 a3 39 79 ETAGs: "11545148580167680:47a33979"
Layout: raid6 Stripes: 7 Blocksize: 1M LayoutId: 20640642 Redundancy: d2::t0
#Rep: 6
no.  fs-id  host                     schedgroup  path      boot    configstatus  drain    active  geotag
0 55 eos10.tier2-kol.res.in default.7 /xdata15 booted rw nodrain online Kolkata::EOS2
1 50 eos04.tier2-kol.res.in default.7 /xdata15 booted rw nodrain online Kolkata::EOS2
2 53 eos07.tier2-kol.res.in default.7 /xdata15 booted rw nodrain online Kolkata::EOS2
3 56 eos09.tier2-kol.res.in default.7 /xdata15 booted rw nodrain online Kolkata::EOS2
4 52 eos05.tier2-kol.res.in default.7 /xdata15 booted rw nodrain online Kolkata::EOS2
5 54 eos06.tier2-kol.res.in default.7 /xdata15 booted rw nodrain online Kolkata::EOS2
[root@eos-mgm ~]#
[root@eos-mgm ~]# eos file info /eos/alicekolkata/grid/12/23882/13ddb16a-2927-11e7-8d56-8fb51abc3cf0
File: '/eos/alicekolkata/grid/12/23882/13ddb16a-2927-11e7-8d56-8fb51abc3cf0' Flags: 0644
Size: 276749879
Modify: Tue Mar 26 11:05:58 2019 Timestamp: 1553578558.297332000
Change: Tue Mar 26 11:05:19 2019 Timestamp: 1553578519.076744161
Birth: Thu Jan 1 05:30:00 1970 Timestamp: 0.000000000
CUid: 10367 CGid: 1395 Fxid: 00033d22 Fid: 212258 Pid: 3326 Pxid: 00000cfe
XStype: adler XS: 57 d2 af ce ETAGs: "56977573019648:57d2afce"
Layout: raid6 Stripes: 7 Blocksize: 1M LayoutId: 20640642 Redundancy: d1::t0
#Rep: 5
no.  fs-id  host                     schedgroup  path      boot    configstatus  drain    active  geotag
0 36 eos04.tier2-kol.res.in default.5 /xdata13 booted rw nodrain online Kolkata::EOS2
1 38 eos05.tier2-kol.res.in default.5 /xdata13 booted rw nodrain online Kolkata::EOS2
2 41 eos10.tier2-kol.res.in default.5 /xdata13 booted rw nodrain online Kolkata::EOS2
3 37 eos08.tier2-kol.res.in default.5 /xdata13 booted rw nodrain online Kolkata::EOS2
4 39 eos07.tier2-kol.res.in default.5 /xdata13 booted rw nodrain online Kolkata::EOS2
[root@eos-mgm ~]#
In the above output of "eos file info", two replicas, those on eos06 and eos09, are missing. But when we check that mount point, xdata13, in "eos fs ls", it exists (and it also exists on the FSTs):
[root@eos-mgm ~]# eos -b fs ls | grep xdata13
eos04.tier2-kol.res.in 1095 36 /xdata13 default.5 Kolkata::EOS2 booted rw nodrain online N/A
eos08.tier2-kol.res.in 1095 37 /xdata13 default.5 Kolkata::EOS2 booted rw nodrain online N/A
eos05.tier2-kol.res.in 1095 38 /xdata13 default.5 Kolkata::EOS2 booted rw nodrain online N/A
eos07.tier2-kol.res.in 1095 39 /xdata13 default.5 Kolkata::EOS2 booted rw nodrain online N/A
eos06.tier2-kol.res.in 1095 40 /xdata13 default.5 Kolkata::EOS2 booted rw nodrain online N/A
eos10.tier2-kol.res.in 1095 41 /xdata13 default.5 Kolkata::EOS2 booted rw nodrain online N/A
eos09.tier2-kol.res.in 1095 42 /xdata13 default.5 Kolkata::EOS2 booted rw nodrain online N/A
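For the two filesystems that should hold the missing replicas, the detailed status can also be queried with "eos fs status", using the fsids from the listing above:
[root@eos-mgm ~]# eos fs status 40   # eos06:/xdata13
[root@eos-mgm ~]# eos fs status 42   # eos09:/xdata13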
[root@eos-mgm ~]# ssh eos06 df -kh |grep xdata13
/dev/sdo1 9.0T 279G 8.7T 4% /xdata13
[root@eos-mgm ~]# ssh eos05 df -kh |grep xdata13
/dev/sdo1 9.0T 1.7T 7.3T 19% /xdata13
[root@eos-mgm ~]# ssh eos04 df -kh |grep xdata13
/dev/sdo1 9.0T 1.4T 7.6T 16% /xdata13
[root@eos-mgm ~]#
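To compare the usage of xdata13 across all FSTs in one go, a small shell loop (hostnames as in our cluster) shows the imbalance at a glance:
[root@eos-mgm ~]# for h in eos04 eos05 eos06 eos07 eos08 eos09 eos10; do echo -n "$h: "; ssh $h df -kh | grep xdata13; done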
However, the space occupied on xdata13 differs from FST to FST.
The fsck report is below for information:
[root@eos-mgm ~]# eos fsck report
timestamp=1620979518 tag="blockxs_err" count=576
timestamp=1620979518 tag="d_mem_sz_diff" count=2113
timestamp=1620979518 tag="orphans_n" count=432097
timestamp=1620979518 tag="rep_diff_n" count=1692234
timestamp=1620979518 tag="rep_missing_n" count=3529158
timestamp=1620979518 tag="unreg_n" count=160310
[root@eos-mgm ~]#
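To drill into a single error class, the report can be restricted per tag, e.g. for the missing replicas (flag names as printed by "eos fsck report -h" on our version; they may differ on other releases):
[root@eos-mgm ~]# eos fsck report -a --error rep_missing_n | head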
So, kindly suggest what we should do. How can we repair those missing and faulty replicas safely?
Regards
Prasun