About filesystem configuration and group balancer

Hi,

I have several questions about how to set up my filesystems (FS) and the group balancer when deploying EOS.

  1. I have two sets of JBOD disks on two different machines. Is it better to put them in two different groups, such as default.0 and default.1, or to put them all in the same group, such as default.0?
  2. When using a RAIN layout such as raid6 or qrain on JBODs, does the number of stripes affect I/O performance? For example, do more stripes mean more disks reading and writing in parallel? What would be the best number of stripes? Does the replica layout benefit from parallel reads and writes in a similar way?
  3. Besides JBODs, I may also have a hardware RAID array. Is it possible, and is it good practice, to mix JBODs and hardware RAID? Should the JBODs and the hardware RAID array go in different groups, or can they share a group?
  4. Which layout does the group balancer's layout conversion choose when balancing between groups? Layout settings seem to live in the space config or in directory attributes, and groups don’t have a layout setting of their own. For example, if I have two groups of JBODs, one holding files with a RAIN layout and the other with too few FS for that same RAIN layout, which layout will the group balancer use when converting files for balancing?
  5. The last question is about a specific error when using the group balancer:

When I try to enable the group balancer

eos space config default space.groupbalancer=on

with the following FS config (it looks a bit strange because I was experimenting to find the answer to question 4 above, but only got errors :rofl:)

➜  ~ eos fs ls
┌────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬──────┬────────┬────────────────┐
│host                    │port│    id│                            path│      schedgroup│          geotag│        boot│  configstatus│       drain│ usage│  active│          health│
└────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴──────┴────────┴────────────────┘
 mgm.localdomain          1095      1                 /data/fst/01/eos        default.0       local::geo       booted             rw      nodrain  45.31   online      no smartctl
 mgm.localdomain          1095      2                 /data/fst/03/eos        default.0       local::geo       booted             rw      nodrain  37.53   online      no smartctl
 mgm.localdomain          1095      3                 /data/fst/04/eos        default.0       local::geo       booted             rw      nodrain  47.31   online      no smartctl
 mgm.localdomain          1095      4                 /data/fst/05/eos        default.0       local::geo       booted             rw      nodrain  46.33   online      no smartctl
 mgm.localdomain          1095      5                 /data/fst/06/eos        default.0       local::geo       booted             rw      nodrain  46.34   online      no smartctl
 mgm.localdomain          1095      6                 /data/fst/07/eos        default.0       local::geo       booted             rw      nodrain  26.34   online      no smartctl
 node1.localdomain        1095      7                 /data/fst/01/eos        default.0       local::geo       booted             rw       failed  22.17   online      no smartctl
 node1.localdomain        1095      8                 /data/fst/02/eos        default.1       local::geo       booted             rw      nodrain   0.81   online      no smartctl

the balancing keeps failing; eos convert list shows the following errors:

➜  ~ eos convert list
┌──────────────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│Conversion string                                 │Failure                                                                                                                                                                                                                                                                                                                                                                                      │
├──────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│0000000000000053:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000053:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000053:default.1#10640562^groupbalancer^│
│000000000000010c:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/000000000000010c:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test2.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/000000000000010c:default.1#10640562^groupbalancer^│
│0000000000000111:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000111:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test3.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000111:default.1#10640562^groupbalancer^│
│0000000000000113:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000113:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test4.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000113:default.1#10640562^groupbalancer^│
│0000000000000115:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000115:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test5.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000115:default.1#10640562^groupbalancer^│
│0000000000000117:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000117:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test6.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000117:default.1#10640562^groupbalancer^│
│0000000000000119:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test7.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^│
│000000000000011b:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/000000000000011b:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test8.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/000000000000011b:default.1#10640562^groupbalancer^│
│000000000000011d:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/000000000000011d:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test9.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/000000000000011d:default.1#10640562^groupbalancer^│
│000000000000011f:default.1#10640562^groupbalancer^ 2023-10-31T16:39:22Z | [ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/000000000000011f:default.1#10640562^groupbalancer^; Operation not permitted; (destination)
 | tpc_src=root://mgm.localdomain:1094//eos/user/laf/test10.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/000000000000011f:default.1#10640562^groupbalancer^│
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

and /var/log/eos/mgm/xrdlog.mgm contains error messages like:

231101 00:46:53 time=1698770813.964260 func=Emsg                     level=ERROR logid=18b74174-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 source=XrdMgmOfsFile:3427             tident=root.85499:460@mgm sec=sss   uid=99 gid=99 name=eosnobody geo="" Unable to open file /eos/dev/proc/conversion/0000000000000113:default.1#10640562^groupbalancer^; Operation not permitted
231101 00:46:53 time=1698770813.964414 func=IdMap                    level=INFO  logid=static.............................. unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 source=Mapping:994                    tident= sec=(null) uid=99 gid=99 name=- geo="" sec.prot=sss sec.name="eosnobody" sec.host="mgm.localdomain" sec.vorg="" sec.grps="eosnobody" sec.role="" sec.info="" sec.app="groupbalancer" sec.tident="root.85499:460@mgm" vid.uid=99 vid.gid=99 sudo=0 gateway=0
231101 00:46:53 time=1698770813.964482 func=open                     level=INFO  logid=18b76348-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 source=XrdMgmOfsFile:549              tident=root.85499:460@mgm sec=sss   uid=99 gid=99 name=eosnobody geo="" op=write trunc=512 path=/eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^ info=eos.app=groupbalancer&eos.checksum=c02d0001&eos.excludefsid=1,6,5,4,3,2&eos.group=1&eos.layout.blockchecksum=crc32c&eos.layout.blocksize=1M&eos.layout.checksum=adler&eos.layout.nstripes=6&eos.layout.type=raid5&eos.rgid=2&eos.ruid=2&eos.space=default&eos.targetsize=1073741824&oss.asize=1073741824&tpc.dlg=root@mgm.localdomain:1094&tpc.dlgon=0&tpc.key=396e907e00014dfb65412f7d&tpc.lfn=/eos/user/laf/test7.bin&tpc.spr=root&tpc.src=root@mgm.localdomain:1095&tpc.stage=copy&tpc.str=1&tpc.tpr=root
231101 00:46:53 time=1698770813.964562 func=open                     level=INFO  logid=18b76348-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 source=XrdMgmOfsFile:698              tident=root.85499:460@mgm sec=sss   uid=99 gid=99 name=eosnobody geo="" msg="rewrote symlinks" sym-path=/eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^ realpath=/eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^
231101 00:46:53 time=1698770813.964629 func=HandleError              level=ERROR logid=static.............................. unit=mgm@mgm.localdomain:1094 tid=00007f5f023ec700 source=ConversionJob:379              tident= sec=(null) uid=99 gid=99 name=- geo="" msg="[ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000113:default.1#10640562^groupbalancer^; Operation not permitted" tpc_src=root://mgm.localdomain:1094//eos/user/laf/test4.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000113:default.1#10640562^groupbalancer^ conversion_id=0000000000000113:default.1#10640562^groupbalancer^
231101 00:46:53 time=1698770813.964814 func=open                     level=INFO  logid=18b76348-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 source=XrdMgmOfsFile:1095             tident=root.85499:460@mgm sec=sss   uid=99 gid=99 name=eosnobody geo="" acl=0 r=0 w=0 wo=0 egroup=0 shared=0 mutable=1 facl=0
231101 00:46:53 time=1698770813.964940 func=Emsg                     level=ERROR logid=18b76348-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f8ebfb700 source=XrdMgmOfsFile:3427             tident=root.85499:460@mgm sec=sss   uid=99 gid=99 name=eosnobody geo="" Unable to open file /eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^; Operation not permitted
231101 00:46:53 time=1698770813.965284 func=HandleError              level=ERROR logid=static.............................. unit=mgm@mgm.localdomain:1094 tid=00007f5f013ea700 source=ConversionJob:379              tident= sec=(null) uid=99 gid=99 name=- geo="" msg="[ERROR] Server responded with an error: [3010] Unable to open file /eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^; Operation not permitted" tpc_src=root://mgm.localdomain:1094//eos/user/laf/test7.bin tpc_dst=root://mgm.localdomain:1094//eos/dev/proc/conversion/0000000000000119:default.1#10640562^groupbalancer^ conversion_id=0000000000000119:default.1#10640562^groupbalancer^
231101 00:46:58 time=1698770818.918137 func=AcquireLease             level=INFO  logid=c05793f0-759e-11ee-a1e7-000c29a8e890 unit=mgm@mgm.localdomain:1094 tid=00007f5f953f9700 source=QdbMaster:510                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="qclient acquire lease call took 2ms"
231101 00:46:58 time=1698770818.949174 func=_rem                     level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm.localdomain:1094 tid=00007f5eff3e6700 source=Rm:109                         tident=<single-exec> sec=local uid=0 gid=0 name=root geo="" path=/eos/dev/proc/conversion/000000000000011d:default.1#10640562^groupbalancer^ vid.uid=0 vid.gid=0
231101 00:46:58 time=1698770818.949384 func=Emsg                     level=ERROR logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm.localdomain:1094 tid=00007f5eff3e6700 source=XrdMgmOfs:844                  tident=<single-exec> sec=      uid=0 gid=0 name= geo="" Unable to remove /eos/dev/proc/conversion/000000000000011d:default.1#10640562^groupbalancer^; No such file or directory
231101 00:46:58 time=1698770818.951953 func=_rem                     level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm.localdomain:1094 tid=00007f5efd3e2700 source=Rm:109                         tident=<single-exec> sec=local uid=0 gid=0 name=root geo="" path=/eos/dev/proc/conversion/0000000000000053:default.1#10640562^groupbalancer^ vid.uid=0 vid.gid=0
231101 00:46:58 time=1698770818.952177 func=Emsg                     level=ERROR logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm.localdomain:1094 tid=00007f5efd3e2700 source=XrdMgmOfs:844                  tident=<single-exec> sec=      uid=0 gid=0 name= geo="" Unable to remove /eos/dev/proc/conversion/0000000000000053:default.1#10640562^groupbalancer^; No such file or directory
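
Decoding the info= string of the op=write line in the log above makes the requested conversion target explicit. Below is a minimal Python sketch. The helper name summarize_target and the hand-copied fsid-to-group table are my own illustration, not part of EOS, and the info string is shortened to its eos.* keys. I do not know whether the result is related to the "Operation not permitted" error.

```python
from urllib.parse import parse_qs

# eos.* part of the info= string, copied from the op=write log line above
INFO = (
    "eos.app=groupbalancer&eos.checksum=c02d0001&eos.excludefsid=1,6,5,4,3,2"
    "&eos.group=1&eos.layout.blockchecksum=crc32c&eos.layout.blocksize=1M"
    "&eos.layout.checksum=adler&eos.layout.nstripes=6&eos.layout.type=raid5"
    "&eos.rgid=2&eos.ruid=2&eos.space=default&eos.targetsize=1073741824"
)

# fsid -> scheduling group, copied by hand from the eos fs ls output above
FS_GROUPS = {
    1: "default.0", 2: "default.0", 3: "default.0", 4: "default.0",
    5: "default.0", 6: "default.0", 7: "default.0", 8: "default.1",
}


def summarize_target(info, fs_groups):
    """Hypothetical helper: summarize the layout and group a conversion asks for."""
    params = {key: values[0] for key, values in parse_qs(info).items()}
    group = f"{params['eos.space']}.{params['eos.group']}"
    excluded = {int(fsid) for fsid in params["eos.excludefsid"].split(",")}
    # Filesystems in the target group that are not on the exclude list
    candidates = sorted(
        fsid for fsid, grp in fs_groups.items()
        if grp == group and fsid not in excluded
    )
    return {
        "layout": params["eos.layout.type"],
        "nstripes": int(params["eos.layout.nstripes"]),
        "group": group,
        "candidates": candidates,
    }


if __name__ == "__main__":
    print(summarize_target(INFO, FS_GROUPS))
```

With my current setup this reports a raid5 layout with 6 stripes targeted at default.1, where only fsid 8 is available after the exclusions.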

It looks like an authentication problem, but my VID config and eos whoami output are as follows:

➜  ~ eos vid ls
publicaccesslevel: => 1024
sudoer                 => uids()
tident:"unix@localhost":gid => root
tident:"unix@localhost":uid => root
tident:"unix@localhost.localdomain":gid => root
tident:"unix@localhost.localdomain":uid => root
tident:"unix@mgm":gid => root
tident:"unix@mgm":uid => root
tident:"unix@mgm.localdomain":gid => root
tident:"unix@mgm.localdomain":uid => root
tident:"unix@node1":gid => root
tident:"unix@node1":uid => root
tident:"unix@node1.localdomain":gid => root
tident:"unix@node1.localdomain":uid => root
tokensudo              => always
➜  ~ eos whoami
Virtual Identity: uid=0 (0,3,99) gid=0 (0,4,99) [authz:sss] sudo* host=localhost domain=localdomain
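
For comparison with the eos whoami output above, the identity under which the failing open was executed can be pulled out of the MGM log. This is a rough Python sketch that assumes the log fields are whitespace-separated key=value pairs. The function name request_identity is hypothetical, and the sample line is the first Emsg record from the log with its wrapped pieces re-joined and padding collapsed.

```python
import re

# First Emsg record from xrdlog.mgm above, re-joined into one line
LINE = (
    '231101 00:46:53 time=1698770813.964260 func=Emsg level=ERROR '
    'logid=18b74174-780d-11ee-ab52-000c29a8e890 unit=mgm@mgm.localdomain:1094 '
    'tid=00007f5f8ebfb700 source=XrdMgmOfsFile:3427 tident=root.85499:460@mgm '
    'sec=sss uid=99 gid=99 name=eosnobody geo="" '
    'Unable to open file /eos/dev/proc/conversion/'
    '0000000000000113:default.1#10640562^groupbalancer^; Operation not permitted'
)

# key=value where the value is either a quoted string or a run of non-spaces
FIELD = re.compile(r'(\w+)=("[^"]*"|\S*)')


def request_identity(line):
    """Hypothetical helper: return the sec/uid/gid/name fields of one log record."""
    fields = dict(FIELD.findall(line))
    return {key: fields[key].strip('"') for key in ("sec", "uid", "gid", "name")}


if __name__ == "__main__":
    print(request_identity(LINE))
```

For the record above this gives sec=sss, uid=99, gid=99, name=eosnobody, whereas my interactive eos whoami maps to uid=0.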

Is this caused by a misconfiguration of the authentication?