File-based manual conversion issue

Dear Experts,

I have encountered an issue with manual file conversion. A new conversion job for a file whose previous conversions failed or were left pending cannot be pushed to QuarkDB, even after clearing those failed or pending jobs.

The converter is enabled on the space:

sh-4.2# eos space status default | grep convert
converter                        := on
converter.ntx                    := 10

This is the target file's information. It is stored with a raid6 RAIN layout of 6 stripes, as you can see below:

sh-4.2# eos file info /eos/gsdc/user/s/sahn/file-1g.raid6.3
  File: '/eos/gsdc/user/s/sahn/file-1g.raid6.3'  Flags: 0640
  Size: 1073741824
Modify: Fri Sep 18 00:56:25 2020 Timestamp: 1600390585.738257000
Change: Fri Sep 18 00:56:21 2020 Timestamp: 1600390581.977760760
 Birth: Fri Sep 18 00:56:21 2020 Timestamp: 1600390581.977760760
  CUid: 556800006 CGid: 556800006 Fxid: 00000068 Fid: 104 Pid: 31 Pxid: 0000001f
XStype: adler    XS: 33 34 ab 4b    ETAGs: "27917287424:3334ab4b"
Layout: raid6 Stripes: 6 Blocksize: 1M LayoutId: 20640542 Redundancy: d3::t0
  #Rep: 6
┌───┬──────┬────────────────────────┬────────────────┬─────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│                 path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴─────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0     1083   jbod-mgmt-07.sdfarm.kr       default.74 /jbod/box_13_disk_074     booted             rw      nodrain   online         kisti::gsdc::g03
 1     1419   jbod-mgmt-09.sdfarm.kr       default.74 /jbod/box_17_disk_074     booted             rw      nodrain   online         kisti::gsdc::g03
 2      495   jbod-mgmt-03.sdfarm.kr       default.74 /jbod/box_06_disk_074     booted             rw      nodrain   online         kisti::gsdc::g01
 3      663   jbod-mgmt-04.sdfarm.kr       default.74 /jbod/box_08_disk_074     booted             rw      nodrain   online         kisti::gsdc::g02
 4      411   jbod-mgmt-03.sdfarm.kr       default.74 /jbod/box_05_disk_074     booted             rw      nodrain   online         kisti::gsdc::g01
 5      915   jbod-mgmt-06.sdfarm.kr       default.74 /jbod/box_11_disk_074     booted             rw      nodrain   online         kisti::gsdc::g02

*******

And this is the output of attr ls for the parent directory.

sh-4.2# eos attr ls /eos/gsdc/user/s/sahn/
sys.eos.btime="1600303554.621731370"
sys.forced.blockchecksum="crc32c"
sys.forced.blocksize="1M"
sys.forced.checksum="adler"
sys.forced.layout="raid6"
sys.forced.nstripes="6"
sys.forced.space="default"
sys.recycle="/eos/gsdc/proc/recycle/"
user.acl=""

This is the command that I run to convert the file to a replica layout:

sh-4.2# eos file convert /eos/gsdc/user/s/sahn/file-1g.raid6.3 replica:2
info: conversion based layout+stripe arguments
success: pushed conversion job '0000000000000068:default#02650112' to QuarkDB

However, for some reason the job went to pending and eventually failed.

sh-4.2# eos convert list
┌────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────┐
│Conversion string                   │Failure                                                                         │
├────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────┤
│0000000000000065:default.48#00650112 converted file replica number mismatch -- expected=2 actual=0                 │
│0000000000000068:default.0#02650112  converted file replica number mismatch -- expected=2 actual=0                 │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Then I cleared the failed jobs and tried to run another conversion job, but it could not be submitted.

sh-4.2# eos convert clear --failed
info: list cleared
sh-4.2# eos convert list
info: no failed conversions
sh-4.2# eos file convert /eos/gsdc/user/s/sahn/file-1g.raid6.3 raid6:8
info: conversion based layout+stripe arguments
error: unable to push conversion job '0000000000000068:default#22650742' to QuarkDB (errc=0) (Success)

Repeated tries showed that conversion jobs can be pushed to QuarkDB only for newly created files. It seems that the cleared jobs are still tracked somewhere else… Indeed, I found files in /eos/gsdc/proc/conversion:

sh-4.2# eos ls -al /eos/gsdc/proc/conversion
drwxr-xr-+   1 daemon   daemon     2147483648 Sep 18 00:57 .
drwxr-xr-x   1 daemon   daemon     2147483648 Sep 18 00:46 ..
-rw-------   2 daemon   daemon     1073741824 Sep 18 00:46 0000000000000065:default.48#00650112
-rw-------   2 daemon   daemon     1073741824 Sep 18 00:57 0000000000000068:default.0#02650112

By the way, the conversion jobs were initially not able to run because the operation was not permitted on the directory /eos/gsdc/proc, which was owned only by root. After changing its ownership so that daemon can access it, the jobs were able to run, but the actual task, i.e. the conversion itself, does not work…

And one last point: when access to a user home directory is allowed only for that user, the conversion job (probably run as daemon) does not have access to the target file, e.g.

sh-4.2# eos convert list
┌───────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│Conversion string                  │Failure                                                                                                                                                                                                                                                                                                                         │
├───────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│000000000000006a:default.0#02650112 [ERROR] Server responded with an error: [3010] Unable to open file /eos/gsdc/user/s/sahn/file-1g.raid6.4; Operation not permitted;
 -- tpc_src=root://jbod-mgmt-01.sdfarm.kr:1094//eos/gsdc/user/s/sahn/file-1g.raid6.4 tpc_dst=root://jbod-mgmt-01.sdfarm.kr:1094//eos/gsdc/proc/conversion/000000000000006a:default.0#02650112│
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

It would be very much appreciated if someone could point out where and what I have missed. Please just let me know if you need any further information.

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

Thank you for the report! This was a bug in the thread pool implementation, which was still holding a reference to the last task submitted. In the case of conversion, the cleanup of the file in the proc directory and the removal of the fid from the tracker are done in the destructor. But since the reference to the task was never dropped, this was not happening - so you could not submit a new conversion for the same fid.
As a silly workaround you can submit a new conversion for a different fid and then retry the initial file conversion.
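
A minimal sketch of that workaround, using the paths from this thread (the first file is hypothetical; any file with a different fid will do):

# hypothetical file with a different fid -- submitting it makes the pool drop the held task reference
eos file convert /eos/gsdc/user/s/sahn/some-other-file replica:2
# now the original file's conversion can be pushed again
eos file convert /eos/gsdc/user/s/sahn/file-1g.raid6.3 replica:2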

The commit which fixes it is below:
https://gitlab.cern.ch/dss/eos/-/commit/9c923586150f34d85cd0ddf08aede9fc499b22db

This will be released in 4.8.19.

Thanks again,
Elvin

Dear Elvin,

Thanks a lot for the fix. I will wait for the update.

By the way, do you have any idea about the permission issue? I wonder whether all directories (including user home space) must be accessible by daemon.

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

The /eos/<instance>/proc/ directory should indeed be owned by root, and it should have mode 755 - this mode allows the daemon user to traverse it and create entries in the conversion directory.
https://gitlab.cern.ch/dss/eos/-/blob/master/mgm/XrdMgmOfsConfigure.cc#L1707
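
If the ownership of proc was changed by hand earlier, a sketch of restoring the expected state, assuming the eos console chown/chmod commands (exact syntax may vary by version):

# proc itself: owned by root, mode 755, so daemon can traverse it
eos chown root:root /eos/gsdc/proc
eos chmod 755 /eos/gsdc/proc
# the conversion subdirectory is owned by daemon
eos chown daemon:daemon /eos/gsdc/proc/conversion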

Just for your reference, this is how the listing of the proc directory should look:

[esindril@esdss000 build_clang_ninja]$ sudo eos ls -lrta /eos/dev/proc/
drwxr-xr-+   1 root     root             9942 May 22  2018 .
drwxrwsrw+   1 root     root        217288459 Sep 18 09:30 ..
drwxrwx--+   1 daemon   daemon              0 Jan  1  1970 archive
drwxr-xr-x   1 root     root                0 Jan  1  1970 clone
drwxrwx---   1 daemon   daemon           6628 Sep 18 10:51 conversion
drwx-----+   1 daemon   root                0 Jan  1  1970 delegation
drwx-----+   1 daemon   root                0 Jan  1  1970 lock
-rw-r--r--   0 root     root             4096 May 18  2018 master
-rw-r--r--   0 root     root             4096 May 18  2018 quota
-rw-r--r--   0 root     root             4096 May 18  2018 reconnect
drwx-----+   1 root     root             3314 Sep 16 09:33 recycle
drwx------   1 root     root                0 Jan  1  1970 token
drwx------   1 daemon   root                0 Jan  1  1970 tracker
-rw-r--r--   0 root     root             4096 May 18  2018 who
-rw-r--r--   0 root     root             4096 May 18  2018 whoami
drwx-----+   1 daemon   root                0 Oct  3  2019 workflow

The conversion entries are written to the /eos/<instance>/proc/conversion/ directory, which is owned by daemon, so there should be no problems related to this.

Concerning the last point on permissions: the conversion is run with admin privileges, so there should never be any issues with the conversion job not having enough privileges to access the converted entries.

Can you grep the MGM logs for the relevant section where access to the converted entry is attempted, and send it to me?
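
For instance, something like this (assuming the default MGM log location) would pull out the relevant lines:

grep 'file-1g.raid6.4' /var/log/eos/mgm/xrdlog.mgm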

Thanks,
Elvin

Hi all,
Running EOS v4.8.40.
I seem to have a very similar issue; my conversions end up in error, like this:

│00000000011be2c9:default.0#00650102 converted file replica number mismatch -- expected=2 actual=7                 │
│00000000011be2c9:default.0#22650542 converted file replica number mismatch -- expected=6 actual=7                 │

I tried to convert a file with raid6:7 layout:

[root@mgm-1 ~]# eos fileinfo /eos/vbc/user/erich.birngruber/plain.txt
  File: '/eos/vbc/user/erich.birngruber/plain.txt'  Flags: 0777
  Size: 11
  Modify: Mon Mar  1 17:56:41 2021 Timestamp: 1614617801.987816000
Change: Wed Mar  3 14:40:37 2021 Timestamp: 1614778837.128168702
 Birth: Mon Mar  1 17:56:41 2021 Timestamp: 1614617801.947152352
  CUid: 10661 CGid: 0 Fxid: 011be2c9 Fid: 18604745 Pid: 18 Pxid: 00000012
XStype: adler    XS: 18 3c 03 ba    ETAGs: "4994173207838720:183c03ba"
Layout: raid6 Stripes: 7 Blocksize: 1M LayoutId: 20640642 Redundancy: d3::t0 
  #Rep: 7
┌───┬──────┬────────────────────────┬────────────────┬─────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│             path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴─────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0      137 fst-5.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod2 
 1       25 fst-1.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod1 
 2       81 fst-3.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod1 
 3      165 fst-6.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod2 
 4      221 fst-8.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod3 
 5      193 fst-7.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod3 
 6      249 fst-9.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod3 

I’m wondering: is the problem here the sys.forced layout on the directory? I’ve also tried removing the sys.forced.layout and sys.forced.nstripes attributes (a sketch of that follows the listing below).

[root@mgm-1 ~]# eos attr ls /eos/vbc/user/erich.birngruber
sys.eos.btime="1592934175.805353799"
sys.forced.checksum="adler"
sys.forced.layout="raid6"
sys.forced.nstripes="7"
sys.recycle="/eos/vbc/proc/recycle/"
user.acl=""

eos convert status says

[root@mgm-1 ~]# eos convert status
Status: enabled
Config: maxthreads=100
Threadpool: thread_pool=converter min=64 max=100 size=64 queue_size=0
Running jobs: 0
Pending jobs: 0
Total failed jobs : 1

On the MGM, I see the following in the logs:

210309 10:23:51 time=1615281831.014400 func=DoIt                     level=INFO  logid=static.............................. unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fad877f8700 source=ConversionJob:235              tident= sec=(null) uid=99 gid=99 name=- geo="" [tpc]: root@eos.grid.vbc.ac.at:1094@root://eos.grid.vbc.ac.at:1094//eos/vbc/user/erich.birngruber/plain.txt => root@eos.grid.vbc.ac.at:1094@root://eos.grid.vbc.ac.at:1094//eos/vbc/proc/conversion/00000000011be2c9:default.0#00650112 prepare_msg=[SUCCESS] 
210309 10:23:51 time=1615281831.014983 func=open                     level=INFO  logid=2900b1ee-80b9-11eb-9522-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fadbebfd700 source=XrdMgmOfsFile:500              tident=root.43731:274@mgm-1 sec=sss   uid=99 gid=99 name=daemon geo="vbc" op=read path=/eos/vbc/user/erich.birngruber/plain.txt info=eos.app=eos/converter&eos.rgid=0&eos.ruid=0&tpc.stage=placement
210309 10:23:51 time=1615281831.015192 func=open                     level=INFO  logid=2900b1ee-80b9-11eb-9522-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fadbebfd700 source=XrdMgmOfsFile:1037             tident=root.43731:274@mgm-1 sec=sss   uid=99 gid=99 name=daemon geo="vbc" acl=0 r=0 w=0 wo=0 egroup=0 shared=0 mutable=1 facl=0
210309 10:23:51 time=1615281831.015624 func=open                     level=INFO  logid=2900b1ee-80b9-11eb-9522-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fadbebfd700 source=XrdMgmOfsFile:2930             tident=root.43731:274@mgm-1 sec=sss   uid=99 gid=99 name=daemon geo="vbc" op=read path=/eos/vbc/user/erich.birngruber/plain.txt info=eos.app=eos/converter&eos.rgid=0&eos.ruid=0&tpc.stage=placement target[0]=(fst-5.eos.grid.vbc.ac.at,137) target[1]=(fst-1.eos.grid.vbc.ac.at,25) target[2]=(fst-3.eos.grid.vbc.ac.at,81) target[3]=(fst-6.eos.grid.vbc.ac.at,165) target[4]=(fst-8.eos.grid.vbc.ac.at,221) target[5]=(fst-7.eos.grid.vbc.ac.at,193) target[6]=(fst-9.eos.grid.vbc.ac.at,249)  redirection=fst-5.eos.grid.vbc.ac.at?&cap.sym=<...>&cap.msg=<...>&mgm.logid=2900b1ee-80b9-11eb-9522-3868dd28d0c0&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=011be2c9&mgm.mtime=1614617801 xrd_port=1095 http_port=9001
210309 10:23:51 time=1615281831.015643 func=open                     level=INFO  logid=2900b1ee-80b9-11eb-9522-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fadbebfd700 source=XrdMgmOfsFile:2934             tident=root.43731:274@mgm-1 sec=sss   uid=99 gid=99 name=daemon geo="vbc" info="redirection" hostport=fst-5.eos.grid.vbc.ac.at?&cap.sym=<...>&cap.msg=<...>&mgm.logid=2900b1ee-80b9-11eb-9522-3868dd28d0c0&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=011be2c9&mgm.mtime=1614617801:1095
210309 10:23:51 time=1615281831.020056 func=open                     level=INFO  logid=290174c6-80b9-11eb-9ad3-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fae0befd700 source=XrdMgmOfsFile:498              tident=root.43731:274@mgm-1 sec=sss   uid=2 gid=2 name=daemon geo="vbc" op=write trunc=512 path=/eos/vbc/proc/conversion/00000000011be2c9:default.0#00650112 info=eos.app=eos/converter&eos.checksum=183c03ba&eos.excludefsid=137,25,81,165,221,193,249&eos.layout.blockchecksum=none&eos.layout.blocksize=4M&eos.layout.checksum=adler&eos.layout.nstripes=2&eos.layout.type=replica&eos.rgid=2&eos.ruid=2&eos.space=default&eos.targetsize=11&oss.asize=11&tpc.dlg=root@eos.grid.vbc.ac.at:1094&tpc.dlgon=0&tpc.key=0119ea0c0001aad360473ea7&tpc.lfn=/eos/vbc/user/erich.birngruber/plain.txt&tpc.spr=root&tpc.src=root@fst-5.eos.grid.vbc.ac.at:1095&tpc.stage=copy&tpc.str=1&tpc.tpr=root
210309 10:23:51 time=1615281831.022565 func=open                     level=INFO  logid=290174c6-80b9-11eb-9ad3-3868dd28d0c0 unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fae0befd700 source=XrdMgmOfsFile:2926             tident=root.43731:274@mgm-1 sec=sss   uid=2 gid=2 name=daemon geo="vbc" op=write path=/eos/vbc/proc/conversion/00000000011be2c9:default.0#00650112 info=eos.app=eos/converter&eos.checksum=183c03ba&eos.excludefsid=137,25,81,165,221,193,249&eos.layout.blockchecksum=none&eos.layout.blocksize=4M&eos.layout.checksum=adler&eos.layout.nstripes=2&eos.layout.type=replica&eos.rgid=2&eos.ruid=2&eos.space=default&eos.targetsize=11&oss.asize=11&tpc.dlg=root@eos.grid.vbc.ac.at:1094&tpc.dlgon=0&tpc.key=0119ea0c0001aad360473ea7&tpc.lfn=/eos/vbc/user/erich.birngruber/plain.txt&tpc.spr=root&tpc.src=root@fst-5.eos.grid.vbc.ac.at:1095&tpc.stage=copy&tpc.str=1&tpc.tpr=root target[0]=(fst-1.eos.grid.vbc.ac.at,11) target[1]=(fst-9.eos.grid.vbc.ac.at,235) target[2]=(fst-4.eos.grid.vbc.ac.at,95) target[3]=(fst-5.eos.grid.vbc.ac.at,123) target[4]=(fst-7.eos.grid.vbc.ac.at,179) target[5]=(fst-3.eos.grid.vbc.ac.at,67) target[6]=(fst-6.eos.grid.vbc.ac.at,151)  redirection=fst-7.eos.grid.vbc.ac.at?&cap.sym=<...>&cap.msg=<...>&mgm.logid=290174c6-80b9-11eb-9ad3-3868dd28d0c0&mgm.checksum=183c03ba&mgm.replicaindex=4&mgm.replicahead=4&mgm.id=0120dbd3 xrd_port=1095 http_port=9001
210309 10:24:02 time=1615281842.705995 func=DoIt                     level=INFO  logid=static.............................. unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fad877f8700 source=ConversionJob:255              tident= sec=(null) uid=99 gid=99 name=- geo="" [tpc]: root://eos.grid.vbc.ac.at:1094//eos/vbc/user/erich.birngruber/plain.txt => root://eos.grid.vbc.ac.at:1094//eos/vbc/proc/conversion/00000000011be2c9:default.0#00650112 status=success tpc_msg=[SUCCESS] 
210309 10:24:02 time=1615281842.706197 func=HandleError              level=ERROR logid=static.............................. unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fad877f8700 source=ConversionJob:378              tident= sec=(null) uid=99 gid=99 name=- geo="" msg="converted file replica number mismatch" expected=2 actual=7 conversion_id=00000000011be2c9:default.0#00650112
210309 10:24:06 time=1615281846.017469 func=_rem                     level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fad6cfc3700 source=Rm:108                         tident=<single-exec> sec=local uid=0 gid=0 name=root geo="" path=/eos/vbc/proc/conversion/00000000011be2c9:default.0#00650112 vid.uid=0 vid.gid=0
210309 10:24:06 time=1615281846.018088 func=_rem                     level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fad6cfc3700 source=Rm:271                         tident=<single-exec> sec=local uid=0 gid=0 name=root geo="" unlinking from view /eos/vbc/proc/conversion/00000000011be2c9:default.0#00650112
210309 10:24:06 time=1615281846.019095 func=BroadcastDeletionFromExternal level=INFO  logid=static.............................. unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fad6cfc3700 source=Caps:324                       tident= sec=(null) uid=99 gid=99 name=- geo="" id=6 name=00000000011be2c9:default.0#00650112
210309 10:24:06 time=1615281846.019173 func=PurgeVersion             level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fad6cfc3700 source=Version:170                    tident=<single-exec> sec=      uid=0 gid=0 name= geo="" version-dir=/eos/vbc/proc/conversion/.sys.v#.00000000011be2c9:default.0#00650112/ max-versions=0
210309 10:24:06 time=1615281846.019552 func=_rem                     level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@mgm-1.eos.grid.vbc.ac.at:1094 tid=00007fad6cfc3700 source=Rm:433                         tident=<single-exec> sec=local uid=0 gid=0 name=root geo="" msg="deleted" can-recycle=0 path=/eos/vbc/proc/conversion/00000000011be2c9:default.0#00650112 owner.uid=2 owner.gid=2 vid.uid=0 vid.gid=0

Any hints on what’s not right here?
Best,
Erich

Hi Erich,

I tried exactly the same command on my dev instance and things work as expected. Do you have enough file systems in the group to perform the conversion? What command did you use to trigger the conversion?

Below is the sequence of commands that I ran:

EOS Console [root://localhost] |/eos/dev/rain7/> fileinfo 28B9D1FB-8B31-E711-AA4E-0025905B85B2.root
  File: '/eos/dev/rain7/28B9D1FB-8B31-E711-AA4E-0025905B85B2.root'  Flags: 0644
  Size: 1185995054
Modify: Tue Feb 16 12:00:39 2021 Timestamp: 1613473239.047406000
Change: Tue Feb 16 11:59:50 2021 Timestamp: 1613473190.046056579
 Birth: Tue Feb 16 11:59:50 2021 Timestamp: 1613473190.046056579
  CUid: 58602 CGid: 1028 Fxid: 0001c94c Fid: 117068 Pid: 18315 Pxid: 0000478b
XStype: adler    XS: 78 f8 eb 29    ETAGs: "31425201963008:78f8eb29"
Layout: raid6 Stripes: 7 Blocksize: 1M LayoutId: 20640642 Redundancy: d3::t0 
  #Rep: 7
┌───┬──────┬────────────────────────┬────────────────┬─────────────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│                         path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴─────────────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0        5         esdss000.cern.ch        default.0  /home/esindril/Eos_data/fst5     booted             rw      nodrain   online                    elvin 
 1        1         esdss000.cern.ch        default.0  /home/esindril/Eos_data/fst1     booted             rw      nodrain   online                    elvin 
 2        2         esdss000.cern.ch        default.0  /home/esindril/Eos_data/fst2     booted             rw      nodrain   online                    elvin 
 3        6         esdss000.cern.ch        default.0  /home/esindril/Eos_data/fst6     booted             rw      nodrain   online                    elvin 
 4       10         esdss000.cern.ch        default.0 /home/esindril/Eos_data/fst10     booted             rw      nodrain   online                    elvin 
 5        7         esdss000.cern.ch        default.0  /home/esindril/Eos_data/fst7     booted             rw      nodrain   online                    elvin 
 6        3         esdss000.cern.ch        default.0  /home/esindril/Eos_data/fst3     booted             rw      nodrain   online                    elvin 

*******
EOS Console [root://localhost] |/eos/dev/rain7/> convert file fxid:0001c94c replica:2 default                                                                                                                                                                                             
Scheduled conversion job: 000000000001c94c:default#00650112
EOS Console [root://localhost] |/eos/dev/rain7/> convert list --pending
┌──────────┬─────────────────────────────────┐
│Fxid      │Conversion string                │
├──────────┴─────────────────────────────────┤
│0001c94c   000000000001c94c:default#00650112│
└────────────────────────────────────────────┘
EOS Console [root://localhost] |/eos/dev/rain7/> convert status
Status: enabled
Config: maxthreads=100
Threadpool: thread_pool=converter min=8 max=100 size=8 queue_size=0
Running jobs: 1
Pending jobs: 0
Total failed jobs : 0
EOS Console [root://localhost] |/eos/dev/rain7/> attr ls .
sys.eos.btime="1611782652.138704892"
sys.forced.blockchecksum="crc32c"
sys.forced.blocksize="1M"
sys.forced.checksum="adler"
sys.forced.layout="raid6"
sys.forced.nstripes="7"
sys.forced.space="default"
sys.recycle="/eos/dev/proc/recycle/"
EOS Console [root://localhost] |/eos/dev/rain7/> convert status
Status: enabled
Config: maxthreads=100
Threadpool: thread_pool=converter min=8 max=100 size=8 queue_size=0
Running jobs: 0
Pending jobs: 0
Total failed jobs : 0
EOS Console [root://localhost] |/eos/dev/rain7/> fileinfo 28B9D1FB-8B31-E711-AA4E-0025905B85B2.root                                                                                                                                                                                       
  File: '/eos/dev/rain7/28B9D1FB-8B31-E711-AA4E-0025905B85B2.root'  Flags: 0644
  Size: 1185995054
Modify: Tue Feb 16 12:00:39 2021 Timestamp: 1613473239.047406000
Change: Tue Feb 16 11:59:50 2021 Timestamp: 1613473190.046056579
 Birth: Tue Feb 16 11:59:50 2021 Timestamp: 1613473190.046056579
  CUid: 58602 CGid: 1028 Fxid: 0001c94c Fid: 117068 Pid: 18315 Pxid: 0000478b
XStype: adler    XS: 78 f8 eb 29    ETAGs: "31425201963008:78f8eb29"
Layout: replica Stripes: 2 Blocksize: 4M LayoutId: 00650112 Redundancy: d2::t0 
  #Rep: 2
┌───┬──────┬────────────────────────┬────────────────┬────────────────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│                        path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴────────────────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0        4         esdss000.cern.ch        default.0 /home/esindril/Eos_data/fst4     booted             rw      nodrain   online                    elvin 
 1        9         esdss000.cern.ch        default.0 /home/esindril/Eos_data/fst9     booted             rw      nodrain   online                    elvin 

*******

Cheers,
Elvin

Hi Elvin,

There is definitely enough space, on the FST disks and on the MGM. Except for the conversion error, I don’t see anything suspicious in the logs:

msg="converted file replica number mismatch" expected=2 actual=7 conversion_id=00000000011be2c9:default.0#00650112

All my conversions fail, always with the replica mismatch error.

Best,
Erich

Hi Erich,

It’s not only a matter of space but also a matter of available file systems. How many file systems in rw mode do you have in the default.0 group?

What is the command that you run to convert these files?

Cheers,
Elvin

Hi Elvin,
In the default.0 group we have 9 file systems in rw:

[root@mgm-1 mgm]# eos fs ls | grep default.0
 fst-1.eos.grid.vbc.ac.at 1095      1                /srv/data/data.00        default.0 vbc::rack1::pod1       booted             rw      nodrain   online        no mdstat 
 fst-2.eos.grid.vbc.ac.at 1095     29                /srv/data/data.00        default.0 vbc::rack1::pod1       booted             rw      nodrain   online        no mdstat 
 fst-3.eos.grid.vbc.ac.at 1095     57                /srv/data/data.00        default.0 vbc::rack1::pod1       booted             rw      nodrain   online        no mdstat 
 fst-4.eos.grid.vbc.ac.at 1095     85                /srv/data/data.00        default.0 vbc::rack1::pod2       booted             rw      nodrain   online        no mdstat 
 fst-5.eos.grid.vbc.ac.at 1095    113                /srv/data/data.00        default.0 vbc::rack1::pod2       booted             rw      nodrain   online        no mdstat 
 fst-6.eos.grid.vbc.ac.at 1095    141                /srv/data/data.00        default.0 vbc::rack1::pod2       booted             rw      nodrain   online        no mdstat 
 fst-7.eos.grid.vbc.ac.at 1095    169                /srv/data/data.00        default.0 vbc::rack1::pod3       booted             rw      nodrain   online        no mdstat 
 fst-8.eos.grid.vbc.ac.at 1095    197                /srv/data/data.00        default.0 vbc::rack1::pod3       booted             rw      nodrain   online        no mdstat 
 fst-9.eos.grid.vbc.ac.at 1095    225                /srv/data/data.00        default.0 vbc::rack1::pod3       booted             rw      nodrain   online        no mdstat 

The group filling looks ok:

[root@mgm-1 mgm]# eos group ls default.0
┌──────────┬────────────────┬────────────┬──────┬────────────┬────────────┬────────────┬──────────┬──────────┐
│type      │            name│      status│ N(fs)│ dev(filled)│ avg(filled)│ sig(filled)│ balancing│   bal-shd│
└──────────┴────────────────┴────────────┴──────┴────────────┴────────────┴────────────┴──────────┴──────────┘
 groupview         default.0           on      9         1.56        37.07         0.78       idle          0 

This is how I start the conversion:

[root@mgm-1 mgm]# eos convert file /eos/vbc/user/erich.birngruber/plain.txt replica:2 default
Scheduled conversion job: 00000000011be2c9:default#00650112
[root@mgm-1 mgm]# eos convert list --pending
info: no pending conversions
[root@mgm-1 mgm]# eos convert list --failed
[...]
│00000000011be2c9:default.0#00650112 converted file replica number mismatch -- expected=2 actual=7                                                                                                                                                                                                                                                                          │

[root@mgm-1 mgm]# eos fileinfo /eos/vbc/user/erich.birngruber/plain.txt
  File: '/eos/vbc/user/erich.birngruber/plain.txt'  Flags: 0777
  Size: 11
Modify: Mon Mar  1 17:56:41 2021 Timestamp: 1614617801.987816000
Change: Wed Mar  3 14:40:37 2021 Timestamp: 1614778837.128168702
 Birth: Mon Mar  1 17:56:41 2021 Timestamp: 1614617801.947152352
  CUid: 10661 CGid: 0 Fxid: 011be2c9 Fid: 18604745 Pid: 18 Pxid: 00000012
XStype: adler    XS: 18 3c 03 ba    ETAGs: "4994173207838720:183c03ba"
Layout: raid6 Stripes: 7 Blocksize: 1M LayoutId: 20640642 Redundancy: d3::t0 
  #Rep: 7
┌───┬──────┬────────────────────────┬────────────────┬─────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│             path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴─────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0      137 fst-5.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod2 
 1       25 fst-1.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod1 
 2       81 fst-3.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod1 
 3      165 fst-6.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod2 
 4      221 fst-8.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod3 
 5      193 fst-7.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod3 
 6      249 fst-9.eos.grid.vbc.ac.at       default.24 /srv/data/data.24     booted             rw      nodrain   online         vbc::rack1::pod3 

*******

(The file systems in sched group default.24, and the group default.24 itself, look similar to group default.0.)

Best,
Erich

Hi Erich,

Can you send me the output of eos attr ls /eos/<instance_name>/proc/conversion/?

Thanks,
Elvin

Hi Elvin,
There are no attributes set:

[root@mgm-1 ~]# eos attr ls /eos/vbc/proc
[root@mgm-1 ~]# 
[root@mgm-1 ~]# eos attr -r ls /eos/vbc/proc/conversions
[root@mgm-1 ~]# 

Best,
Erich

What is the output of eos space status default?

Thanks,
Elvin

Hi Elvin,

[root@mgm-1 ~]# eos space status default
# ------------------------------------------------------------------------------------
# Space Variables
# ....................................................................................
autorepair                       := on
balancer                         := on
balancer.node.ntx                := 2
balancer.node.rate               := 25
balancer.threshold               := 20
converter                        := on
converter.ntx                    := 2
drainer.node.nfs                 := 5
drainer.node.ntx                 := 2
drainer.node.rate                := 25
drainperiod                      := 86400
filearchivedgc                   := off
fusex.hbi                        := 10
fusex.qti                        := 10
geobalancer                      := off
geobalancer.ntx                  := 10
geobalancer.threshold            := 5
graceperiod                      := 86400
groupbalancer                    := on
groupbalancer.ntx                := 10
groupbalancer.threshold          := 5
groupmod                         := 28
groupsize                        := 9
lru                              := on
lru.interval                     := 86400
nominalsize                      := 2.70 PB
policy.blockchecksum             := crc32c
policy.blocksize                 := 1M
policy.checksum                  := adler
policy.layout                    := raid6
policy.nstripes                  := 7
quota                            := on
scan_disk_interval               := 14400
scan_ns_interval                 := 259200
scan_ns_rate                     := 50
scaninterval                     := 604800
scanrate                         := 100
tgc.availbytes                   := 0
tgc.qryperiodsecs                := 320
tgc.totalbytes                   := 1000000000000000000
tracker                          := on
wfe                              := off
wfe.interval                     := 10
wfe.ntx                          := 1

Best,
Erich

Hi Erich,

Is there any particular reason why you enabled the space policy?
This is actually the reason why your conversion fails. The space policy was added explicitly for the TAPE backend setup, and there is indeed some interference with normal operations. We will add a workaround for this in the future, but for the moment, if you want to get things unstuck, you need to remove the space policy attributes.
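
For reference, a sketch of removing them, following the eos space config help syntax quoted in a later reply (value='remove' deletes the policy):

eos space config default space.policy.layout=remove
eos space config default space.policy.nstripes=remove
eos space config default space.policy.checksum=remove
eos space config default space.policy.blockchecksum=remove
eos space config default space.policy.blocksize=remove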

Cheers,
Elvin

Hi Elvin,

Thanks for the hint! This was not clear to me from the help text:

space config <space-name> space.policy.[layout|nstripes|checksum|blockchecksum|blocksize]=<value>      
                                                                  : configure default file layout creation settings as a space policy - a value='remove' deletes the space policy

I understood this as the general policy for file creation if nothing else is defined on the directory; that it is meant for the tape backend only was not clear to me. I’ve removed all policy entries in the space config.

Now conversion is working as expected, and fast!
On a side note: could this also be related to the fsck issues that we saw (i.e. should I try turning the repair back on)?

Best,
Erich

Hi Erich,

This is a situation where some functionality was added with a certain use case in mind and was then re-purposed for other things. Note that you should make sure you have the proper attributes at the directory level so that files get stored with the desired layout.
I don’t know if you were relying on the space policy for that … Once a directory has some attributes, all new directories created inside it will inherit the parent’s attributes.
I don’t think this is related to fsck, since fsck does not use the conversion mechanism for its repair stage.
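
Coming back to the directory-level attributes: a sketch of pinning the layout there instead of via the space policy (attribute names as used earlier in this thread; the target directory is just an example):

eos attr set sys.forced.layout=raid6 /eos/vbc/user/erich.birngruber
eos attr set sys.forced.nstripes=7 /eos/vbc/user/erich.birngruber
eos attr set sys.forced.checksum=adler /eos/vbc/user/erich.birngruber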

Cheers,
Elvin

Hi Erich, this is not only for the tape backend,
but the idea is that you set the layout for the whole space and don’t fiddle around by hand with files, forcing layouts against the policy. That’s why it is called a space policy: all files in that space should have this layout. We will make the two work together for convenience, but it is simply a flaw that we never tested this ‘against’ manual conversion.

Cheers Andreas.

Hi,
In fact we have the attributes set everywhere on the directories, so it’s OK to remove the space policy.

This seems to make conversion work for us now.
Thanks,
Erich