Hi,
I have several questions about how to set up my filesystems (FS) and the group balancer when deploying EOS.
- I have two sets of JBOD disks on two different machines. Is it better to put them in two different groups like default.0 and default.1, or just to put them in the same group like default.0?
- When using a RAIN layout like raid6 or qrain on JBODs, does the stripe number affect the IO performance? For example, do more stripes mean more disks reading and writing simultaneously? What would be the best number of stripes? Does the replica layout benefit in a similar way from simultaneous reads and writes?
- Besides JBODs, I may also have a hardware RAID array. Is it possible, and is it good practice, to mix JBODs and hardware RAID? Should I put the JBODs and the hardware RAID array in different groups, or can they even share a group?
- Which layout does the layout conversion of the group balancer choose when balancing between groups? It seems that layout settings live in the space config or the folder attr config, and groups don’t have a layout setting of their own. For example, if I have two groups of JBODs, one using a RAIN layout, and the other group does not have enough FS for the same RAIN layout, what layout will the group balancer use when converting files for balancing?
- The last question is about a specific error when using the group balancer:
When I try to enable the group balancer
eos space config default space.groupbalancer=on
with the following FS config (the config looks a bit strange because I was trying to work out the answer to question 4 above, but only got errors)
➜ ~ eos fs ls
host               port  id  path              schedgroup  geotag      boot    configstatus  drain    usage  active  health
mgm.localdomain    1095   1  /data/fst/01/eos  default.0   local::geo  booted  rw            nodrain  45.31  online  no smartctl
mgm.localdomain    1095   2  /data/fst/03/eos  default.0   local::geo  booted  rw            nodrain  37.53  online  no smartctl
mgm.localdomain    1095   3  /data/fst/04/eos  default.0   local::geo  booted  rw            nodrain  47.31  online  no smartctl
mgm.localdomain    1095   4  /data/fst/05/eos  default.0   local::geo  booted  rw            nodrain  46.33  online  no smartctl
mgm.localdomain    1095   5  /data/fst/06/eos  default.0   local::geo  booted  rw            nodrain  46.34  online  no smartctl
mgm.localdomain    1095   6  /data/fst/07/eos  default.0   local::geo  booted  rw            nodrain  26.34  online  no smartctl
node1.localdomain  1095   7  /data/fst/01/eos  default.0   local::geo  booted  rw            failed   22.17  online  no smartctl
node1.localdomain  1095   8  /data/fst/02/eos  default.1   local::geo  booted  rw            nodrain   0.81  online  no smartctl
the balancing keeps failing; eos convert list gives errors like the following:
➜ ~ eos convert list
┌──────────────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│Conversion string │Failure │
├──────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│0000000000000053:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000053:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000053:default.1#10640562^groupbalancer^│
│000000000000010c:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/000000000000010c:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test2.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/000000000000010c:default.1#10640562^groupbalancer^│
│0000000000000111:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000111:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test3.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000111:default.1#10640562^groupbalancer^│
│0000000000000113:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000113:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test4.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000113:default.1#10640562^groupbalancer^│
│0000000000000115:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000115:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test5.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000115:default.1#10640562^groupbalancer^│
│0000000000000117:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000117:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test6.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000117:default.1#10640562^groupbalancer^│
│0000000000000119:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test7.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^│
│000000000000011b:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/000000000000011b:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test8.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/000000000000011b:default.1#10640562^groupbalancer^│
│000000000000011d:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/000000000000011d:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test9.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/000000000000011d:default.1#10640562^groupbalancer^│
│000000000000011f:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/000000000000011f:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
| tpc_src=root://mgm.localdomain:1094//eos/user/laf/test10.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/000000000000011f:default.1#10640562^groupbalancer^│
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
and /var/log/eos/mgm/xrdlog.mgm gives error messages like:
231101 00:46:53 time=1698770813.964260 func=Emsg level=ERROR logid=18b74174-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 sour
ce=XrdMgmOfsFile:3427 tident=root.85499:460@mgm sec=sss uid=99 gid=99 name=eosnobody geo="" Unable to open file /eos/dev/proc/conversion/0000000000000113:default.1#1
0640562^groupbalancer^; Operation not permitted
231101 00:46:53 time=1698770813.964414 func=IdMap level=INFO logid=static.............................. unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 sour
ce=Mapping:994 tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=sss sec.name="eosnobody" sec.host="mgm.localdomain" sec.vorg="" sec.grps="eosnobody" sec.
role="" sec.info="" sec.app="groupbalancer" sec.tident="root.85499:460@mgm" vid.uid=99 vid.gid=99 sudo=0 gateway=0
231101 00:46:53 time=1698770813.964482 func=open level=INFO logid=18b76348-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 sour
ce=XrdMgmOfsFile:549 tident=root.85499:460@mgm sec=sss uid=99 gid=99 name=eosnobody geo="" op=write trunc=512 path=/eos/dev/proc/conversion/0000000000000119:default
.1#10640562^groupbalancer^ info=eos.app=groupbalancer&eos.checksum=c02d0001&eos.excludefsid=1,6,5,4,3,2&eos.group=1&eos.layout.blockchecksum=crc32c&eos.layout.blocksize=1M&eos.lay
out.checksum=adler&eos.layout.nstripes=6&eos.layout.type=raid5&eos.rgid=2&eos.ruid=2&eos.space=default&eos.targetsize=1073741824&oss.asize=1073741824&tpc.dlg=root@mgm.localdomain:
1094&tpc.dlgon=0&tpc.key=396e907e00014dfb65412f7d&tpc.lfn=/eos/user/laf/test7.bin&tpc.spr=root&tpc.src=root@mgm.localdomain:1095&tpc.stage=copy&tpc.str=1&tpc.tpr=root
231101 00:46:53 time=1698770813.964562 func=open level=INFO logid=18b76348-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 sour
ce=XrdMgmOfsFile:698 tident=root.85499:460@mgm sec=sss uid=99 gid=99 name=eosnobody geo="" msg="rewrote symlinks" sym-path=/eos/dev/proc/conversion/0000000000000119
:default.1#10640562^groupbalancer^ realpath=/eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^
231101 00:46:53 time=1698770813.964629 func=HandleError level=ERROR logid=static.............................. unit=mgm@mgm.localdomain:1094 tid=00007f5f023ec700 sour
ce=ConversionJob:379 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="[ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0
000000000000113:default.1#10640562^groupbalancer^; Operation not permitted" tpc_src=root://mgm.localdomain:1094//eos/user/laf/test4.bin tpc_dst=root://mgm.localdomain:1094//eos/de
v/proc/conversion/0000000000000113:default.1#10640562^groupbalancer^ conversion_id=0000000000000113:default.1#10640562^groupbalancer^
231101 00:46:53 time=1698770813.964814 func=open level=INFO logid=18b76348-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 sour
ce=XrdMgmOfsFile:1095 tident=root.85499:460@mgm sec=sss uid=99 gid=99 name=eosnobody geo="" acl=0 r=0 w=0 wo=0 egroup=0 shared=0 mutable=1 facl=0
231101 00:46:53 time=1698770813.964940 func=Emsg level=ERROR logid=18b76348-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 sour
ce=XrdMgmOfsFile:3427 tident=root.85499:460@mgm sec=sss uid=99 gid=99 name=eosnobody geo="" Unable to open file /eos/dev/proc/conversion/0000000000000119:default.1#1
0640562^groupbalancer^; Operation not permitted
231101 00:46:53 time=1698770813.965284 func=HandleError level=ERROR logid=static.............................. unit=mgm@mgm.localdomain:1094 tid=00007f5f013ea700 sour
ce=ConversionJob:379 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="[ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0
000000000000119:default.1#10640562^groupbalancer^; Operation not permitted" tpc_src=root://mgm.localdomain:1094//eos/user/laf/test7.bin tpc_dst=root://mgm.localdomain:1094//eos/de
v/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^ conversion_id=0000000000000119:default.1#10640562^groupbalancer^
231101 00:46:58 time=1698770818.918137 func=AcquireLease level=INFO logid=c05793f0-759e-11ee-a1e7-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f953f9700 sour
ce=QdbMaster:510 tident=<service> sec= uid=0 gid=0 name= geo="" msg="qclient acquire lease call took 2ms"
231101 00:46:58 time=1698770818.949174 func=_rem level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm.localdomain:1094 tid=00007f5eff3e6700 sour
ce=Rm:109 tident=<single-exec> sec=local uid=0 gid=0 name=root geo="" path=/eos/dev/proc/conversion/000000000000011d:default.1#10640562^groupbalancer^ vid.
uid=0 vid.gid=0
231101 00:46:58 time=1698770818.949384 func=Emsg level=ERROR logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm.localdomain:1094 tid=00007f5eff3e6700 sour
ce=XrdMgmOfs:844 tident=<single-exec> sec= uid=0 gid=0 name= geo="" Unable to remove /eos/dev/proc/conversion/000000000000011d:default.1#10640562^groupbalanc
er^; No such file or directory
231101 00:46:58 time=1698770818.951953 func=_rem level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm.localdomain:1094 tid=00007f5efd3e2700 sour
ce=Rm:109 tident=<single-exec> sec=local uid=0 gid=0 name=root geo="" path=/eos/dev/proc/conversion/0000000000000053:default.1#10640562^groupbalancer^ vid.
uid=0 vid.gid=0
231101 00:46:58 time=1698770818.952177 func=Emsg level=ERROR logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm.localdomain:1094 tid=00007f5efd3e2700 sour
ce=XrdMgmOfs:844 tident=<single-exec> sec= uid=0 gid=0 name= geo="" Unable to remove /eos/dev/proc/conversion/0000000000000053:default.1#10640562^groupbalanc
er^; No such file or directory
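One detail in the func=open line above may be relevant to question 4: the conversion file is opened with eos.layout.type=raid5 and eos.layout.nstripes=6 for eos.group=1, while default.1 holds a single FS in the eos fs ls output. The Python sketch below decodes that part of the info= string (terminal line wraps joined) and compares the stripe count with the number of FS in the target group. Two assumptions in it are not taken from the EOS documentation: that eos.space and eos.group together name the scheduling group, and that a RAIN layout needs at least as many FS in the group as it has stripes. GROUPS and target_summary are illustrative names, not EOS identifiers.

```python
from urllib.parse import parse_qs

# Part of the info= string from the func=open log line above,
# with the terminal line wraps joined back together.
INFO = (
    "eos.app=groupbalancer&eos.checksum=c02d0001&eos.excludefsid=1,6,5,4,3,2"
    "&eos.group=1&eos.layout.blockchecksum=crc32c&eos.layout.blocksize=1M"
    "&eos.layout.checksum=adler&eos.layout.nstripes=6&eos.layout.type=raid5"
    "&eos.rgid=2&eos.ruid=2&eos.space=default&eos.targetsize=1073741824"
)

# schedgroup -> filesystem ids, copied from the `eos fs ls` output above
GROUPS = {"default.0": [1, 2, 3, 4, 5, 6, 7], "default.1": [8]}


def target_summary(info, groups):
    """Summarise the requested target layout and the FS count of the target group."""
    params = {key: values[0] for key, values in parse_qs(info).items()}
    # Assumption: the scheduling group is named "<eos.space>.<eos.group>"
    group = f"{params['eos.space']}.{params['eos.group']}"
    nstripes = int(params["eos.layout.nstripes"])
    fs_in_group = len(groups[group])
    return {
        "group": group,
        "layout": params["eos.layout.type"],
        "nstripes": nstripes,
        "fs_in_group": fs_in_group,
        # Assumption: a RAIN layout needs one FS per stripe within the group
        "enough_fs": fs_in_group >= nstripes,
    }


print(target_summary(INFO, GROUPS))
```

For the values in this post it prints {'group': 'default.1', 'layout': 'raid5', 'nstripes': 6, 'fs_in_group': 1, 'enough_fs': False}. Whether this mismatch is related to the "Operation not permitted" error is not something the sketch can tell.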
The error seems to be an authentication problem, but my VID config and eos whoami output are:
➜ ~ eos vid ls
publicaccesslevel: => 1024
sudoer => uids()
tident:"unix@localhost":gid => root
tident:"unix@localhost":uid => root
tident:"unix@localhost.localdomain":gid => root
tident:"unix@localhost.localdomain":uid => root
tident:"unix@mgm":gid => root
tident:"unix@mgm":uid => root
tident:"unix@mgm.localdomain":gid => root
tident:"unix@mgm.localdomain":uid => root
tident:"unix@node1":gid => root
tident:"unix@node1":uid => root
tident:"unix@node1.localdomain":gid => root
tident:"unix@node1.localdomain":uid => root
tokensudo => always
➜ ~ eos whoami
Virtual Identity: uid=0 (0,3,99) gid=0 (0,4,99) [authz:sss] sudo* host=localhost domain=localdomain
Is this because of some misconfiguration of the authentication?
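To make the identity mismatch concrete: the failing opens are logged with sec=sss uid=99 gid=99 name=eosnobody, whereas eos whoami reports uid=0 for the interactive shell. The following Python sketch extracts those fields from one of the ERROR lines above (terminal line wraps joined). FIELD_RE, parse_fields and identity are illustrative helpers, not part of EOS, and the sketch only surfaces what the log already says; it does not establish why the open is refused.

```python
import re

# One ERROR line from the MGM log above, with the terminal line wraps joined.
SAMPLE = (
    '231101 00:46:53 time=1698770813.964940 func=Emsg level=ERROR '
    'logid=18b76348-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 '
    'tid=00007f5f8ebfb700 source=XrdMgmOfsFile:3427 tident=root.85499:460@mgm '
    'sec=sss uid=99 gid=99 name=eosnobody geo="" Unable to open file '
    '/eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^; '
    'Operation not permitted'
)

# key=value pairs; a value is either double-quoted or runs up to the next space
FIELD_RE = re.compile(r'(\w[\w.]*)=("[^"]*"|\S*)')


def parse_fields(line):
    """Return the key=value fields of one log line as a dict (quotes stripped)."""
    return {key: value.strip('"') for key, value in FIELD_RE.findall(line)}


def identity(line):
    """Pick out the fields that describe who the request was mapped to."""
    fields = parse_fields(line)
    return {key: fields.get(key) for key in ("level", "sec", "uid", "gid", "name")}


print(identity(SAMPLE))
```

For the sample line this prints {'level': 'ERROR', 'sec': 'sss', 'uid': '99', 'gid': '99', 'name': 'eosnobody'}, so the group balancer's conversion requests arrive as eosnobody rather than as the root identity shown by eos whoami.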