EOS replicas auto-repair?

Hello,

I am trying out different replica settings on a test server. I have set the layout attribute of the directory /eos/testarea/testmirror to "replica":

EOS Console [root://localhost] |/> attr ls /eos/testarea/testmirror
sys.forced.blocksize="4k"
sys.forced.checksum="adler"
sys.forced.layout="replica"
sys.forced.nstripes="2"
sys.forced.space="default"

… and copied a file there:

EOS Console [root://localhost] |/> file check /eos/testarea/testmirror/DSET-Report-for-nanwn57.in2p3.fr-68T7H32.zip
path="/eos/testarea/testmirror/DSET-Report-for-nanwn57.in2p3.fr-68T7H32.zip" fid="00000050" size="3105776" nrep="2" checksumtype="adler" checksum="2729314300000000000000000000000000000000"
nrep="00" fsid="1" host="nanxrd15.in2p3.fr:1095" fstpath="/data01/00000000/00000050" size="3105776" statsize="3105776" checksum="2729314300000000000000000000000000000000"
nrep="01" fsid="4" host="nanxrd17.in2p3.fr:1095" fstpath="/data02/00000000/00000050" size="3105776" statsize="3105776" checksum="2729314300000000000000000000000000000000"

Now if I delete one of the replicas on disk (to simulate a disk failure), the statsize of that replica changes to 18446744073709551615, and a file check detects it:
EOS Console [root://localhost] |/> file check /eos/testarea/testmirror/DSET-Report-for-nanwn57.in2p3.fr-68T7H32.zip %output
path="/eos/testarea/testmirror/DSET-Report-for-nanwn57.in2p3.fr-68T7H32.zip" fid="00000050" size="3105776" nrep="2" checksumtype="adler" checksum="2729314300000000000000000000000000000000"
nrep="00" fsid="1" host="nanxrd15.in2p3.fr:1095" fstpath="/data01/00000000/00000050" size="3105776" statsize="18446744073709551615" checksum="2729314300000000000000000000000000000000"
nrep="01" fsid="4" host="nanxrd17.in2p3.fr:1095" fstpath="/data02/00000000/00000050" size="3105776" statsize="3105776" checksum="2729314300000000000000000000000000000000"
INCONSISTENCY STATFAILED path=/eos/testarea/testmirror/DSET-Report-for-nanwn57.in2p3.fr-68T7H32.zip fid=00000050 size=3105776 stripes=2 nrep=2 nrepstored=2 nreponline=2 checksumtype=adler checksum=2729314300000000000000000000000000000000
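
As an aside, the statsize of 18446744073709551615 is 2^64 - 1, which is what -1 looks like when stored in an unsigned 64-bit integer. Together with the STATFAILED message, this suggests a failed stat rather than a real size. Below is a minimal Python sketch for spotting such replicas, assuming the `file check` output format shown above; the function names and the shortened sample are illustrative only and not part of EOS.

```python
import re

# Assumption: 2**64 - 1 (i.e. -1 as an unsigned 64-bit value) marks a failed stat.
STAT_FAILED = 2**64 - 1

# key="value" pairs; curly quotes are accepted too, since forums often convert them.
PAIR_RE = re.compile(r'(\w+)=["“”]([^"“”]*)["“”]')


def parse_pairs(line):
    """Turn one line of key="value" pairs into a dict of strings."""
    return dict(PAIR_RE.findall(line))


def broken_replicas(output):
    """Return (fsid, host, reason) for each replica whose statsize looks wrong."""
    problems = []
    for line in output.splitlines():
        fields = parse_pairs(line)
        if "fsid" not in fields:
            continue  # file header or INCONSISTENCY line, not a replica line
        size = int(fields["size"])
        statsize = int(fields["statsize"])
        if statsize == STAT_FAILED:
            problems.append((fields["fsid"], fields["host"], "stat failed"))
        elif statsize != size:
            problems.append((fields["fsid"], fields["host"], "size mismatch"))
    return problems


# Shortened sample modelled on the output above (checksums omitted).
sample = '''path="/eos/testarea/testmirror/file.zip" fid="00000050" size="3105776" nrep="2"
nrep="00" fsid="1" host="nanxrd15.in2p3.fr:1095" fstpath="/data01/00000000/00000050" size="3105776" statsize="18446744073709551615"
nrep="01" fsid="4" host="nanxrd17.in2p3.fr:1095" fstpath="/data02/00000000/00000050" size="3105776" statsize="3105776"
INCONSISTENCY STATFAILED path=/eos/testarea/testmirror/file.zip fid=00000050 size=3105776'''

for fsid, host, reason in broken_replicas(sample):
    print(f"fsid={fsid} host={host}: {reason}")
```

On the sample, only the replica on fsid 1 is flagged, which matches the replica that was deleted from /data01.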

but not with fsck:
EOS Console [root://localhost] |/> fsck stat
181219 09:32:10 1545208330.686749 started check
181219 09:32:10 1545208330.686804 Filesystems to check: 6
181219 09:32:20 1545208340.690951 d_mem_sz_diff : 1 (1)
181219 09:32:20 1545208340.690977 rep_diff_n : 0 (0)
181219 09:32:20 1545208340.690990 rep_offline : 0 (0)
181219 09:32:20 1545208340.691010 stopping check
181219 09:32:20 1545208340.691020 => next run in 30 minutes

And an fsck repair does nothing, although the missing replica is reported in the FST log:

181219 09:33:24 18118 FstOfs_stat: root.29897:59@nanxrd15 Unable to stat file /data01/00000000/00000050; no such file or directory
181219 09:33:37 13982 FstOfs_stat: root.29897:59@nanxrd15 Unable to stat file /data01/00000000/00000050; no such file or directory

=> How is this supposed to work? And how can I test it?

Thank you

JM

This can only be found by fsck if you boot a filesystem from scratch with the syncmgm option. In that case it scans the whole disk and marks missing files for fsck. Scanning the whole tree on a hard disk is very slow, so this is only done if there is no FST database or if the admin asks for it.

Strange, even a fs boot --syncmgm does not notice that the file is missing:
181219 11:14:04 time=1545214444.824611 func=Boot level=INFO logid=FstOfsStorage unit=fst@nanxrd15.in2p3.fr:1095 tid=00007f0a69af5700 source=Storage:434 tident= sec= uid=0 gid=0 name= geo="" msg="start disk synchronisation" fsid=1
181219 11:14:04 time=1545214444.825174 func=Boot level=INFO logid=FstOfsStorage unit=fst@nanxrd15.in2p3.fr:1095 tid=00007f0a69af5700 source=Storage:452 tident= sec= uid=0 gid=0 name= geo="" msg="finished disk synchronisation" fsid=1
181219 11:14:04 time=1545214444.825404 func=Boot level=INFO logid=FstOfsStorage unit=fst@nanxrd15.in2p3.fr:1095 tid=00007f0a69af5700 source=Storage:462 tident= sec= uid=0 gid=0 name= geo="" msg="start mgm synchronisation" fsid=1
181219 11:14:04 time=1545214444.848029 func=Boot level=INFO logid=FstOfsStorage unit=fst@nanxrd15.in2p3.fr:1095 tid=00007f0a69af5700 source=Storage:482 tident= sec= uid=0 gid=0 name= geo="" msg="finished mgm synchronization" fsid=1

EOS Console [root://localhost] |/> fsck stat
181219 11:10:36 1545214236.836829 started check
181219 11:10:36 1545214236.836886 Filesystems to check: 6
181219 11:10:46 1545214246.840597 d_mem_sz_diff : 2 (2)
181219 11:10:46 1545214246.840623 rep_diff_n : 0 (0)
181219 11:10:46 1545214246.840640 rep_offline : 0 (0)
181219 11:10:46 1545214246.840660 stopping check
181219 11:10:46 1545214246.840671 => next run in 30 minutes

The fsck report is from 11:10, while you booted at 11:14 … you have to wait …
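
The timing comparison can be made mechanical. The following is a rough Python sketch, assuming the log formats pasted in this thread: the epoch time is the third field of the "started check" line in `fsck stat`, and the `time=` field of the FST line containing "finished mgm synchronization". The helper names are hypothetical and not part of EOS.

```python
import re

# "181219 11:10:36 1545214236.836829 started check" -> epoch seconds in group 1
FSCK_START_RE = re.compile(r'^\d{6} \d{2}:\d{2}:\d{2} (\d+\.\d+) started check')
# "... time=1545214444.848029 func=Boot ..." -> epoch seconds in group 1
TIME_FIELD_RE = re.compile(r'\btime=(\d+\.\d+)')


def fsck_start(fsck_output):
    """Epoch time at which the reported fsck run started, or None if not found."""
    for line in fsck_output.splitlines():
        match = FSCK_START_RE.match(line.strip())
        if match:
            return float(match.group(1))
    return None


def boot_finished(fst_log):
    """Epoch time of the latest 'finished mgm synchronization' line, or None."""
    times = []
    for line in fst_log.splitlines():
        if "finished mgm synchroni" not in line:
            continue
        match = TIME_FIELD_RE.search(line)
        if match:
            times.append(float(match.group(1)))
    return max(times) if times else None


def report_is_stale(fsck_output, fst_log):
    """True if the fsck run started before the boot finished."""
    started = fsck_start(fsck_output)
    booted = boot_finished(fst_log)
    if started is None or booted is None:
        raise ValueError("could not find both timestamps")
    return started < booted


# Shortened samples modelled on the pasted logs.
boot_log = '181219 11:14:04 time=1545214444.848029 func=Boot level=INFO msg="finished mgm synchronization" fsid=1'
old_fsck = "181219 11:10:36 1545214236.836829 started check"
new_fsck = "181219 11:35:58 1545215758.264447 started check"

print("11:10 report stale:", report_is_stale(old_fsck, boot_log))
print("11:35 report stale:", report_is_stale(new_fsck, boot_log))
```

A report flagged as stale says nothing about the state after the boot; only a run that started after the boot finished can reflect it.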

Ah sorry, I did not paste the right log extract; I ran the boot --syncmgm twice. Anyway, a more recent fsck says:

EOS Console [root://localhost] |/> fsck stat
181219 11:35:58 1545215758.264447 started check
181219 11:35:58 1545215758.264501 Filesystems to check: 6
181219 11:36:08 1545215768.268278 d_mem_sz_diff : 2 (2)
181219 11:36:08 1545215768.268304 rep_diff_n : 0 (0)
181219 11:36:08 1545215768.268322 rep_offline : 0 (0)
181219 11:36:08 1545215768.268342 stopping check
181219 11:36:08 1545215768.268352 => next run in 30 minutes

JM