Enable inline/auto repair when updating files while one FS is down

Hello,

We have encountered a case where users (on FUSE mounts) tried to modify files during an FST upgrade campaign and couldn't, because one of the replicas was offline and the update of the file was refused. The inline-repair feature is enabled on the client side (default values), but maybe the option also needs to be set on the MGM side with the space.autorepair parameter?
However, while testing this with the space.autorepair parameter, the update is still not possible on a file when a FS holding a replica is in read-only mode, or off, or the node is offline.
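
For reference, this is roughly what we set (assuming the space is named "default"; the client-side variable name is from memory, so please correct me if it's wrong):

# on the MGM: enable auto repair for the space
eos space config default space.autorepair=on
eos space status default | grep autorepair

# on the FUSE client, in /etc/sysconfig/eos (inline repair otherwise at its defaults)
export EOS_FUSE_INLINE_REPAIR=1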

How can we correctly enable this behaviour?

Files are replica 2, the MGM is 4.2.17, the FSTs are 4.2.18.

Hi Franck,

I looked over the logs you provided on this issue. This functionality works as advertised when the FST is in read-only mode, but it will fail if the FST is offline. There is already a ticket open for the underlying issue here:
https://its.cern.ch/jira/browse/EOS-2410

I’ve run a test with the FST in ro mode and it works as expected.

Hi Elvin,

Thank you for looking into this. OK for the case where the FST is offline (it seems I no longer have access to the JIRA issues with my lightweight account).

But I can confirm that with my setup I get an Input/Output error even when modifying a file that has one replica on a FS with configstatus=ro.
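
For reference, the replica's filesystem was put in read-only mode with the usual command (the fsid here is illustrative):

eos fs config <fsid> configstatus=ro
eos fs ls | grep <fsid>    # verify the FS now shows configstatus ro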

So there might be some parameter causing this behaviour on our instances (or in the FUSE client?). What happens, in fact, is that when the FS is read-only and we try to open the file read-write, the MGM sends the valid node a request carrying an __offline_ prefix on the hostname of the node with the read-only FS, and it doesn't seem to try to repair it, even though the client logs inline-repair=true:

180416 17:15:51 t=1523891751.821256 f=MakeOpen         l=DEBUG tid=00007eff49ffc700 s=LayoutWrapper:126        opening file root://AAAAAAAQ@eos.acme.org///#curl#/eos/acme/new/test-repair2, lazy open is 0 flags=2 **inline-repair=true** async-open=0
180416 17:15:51 t=1523891751.821270 f=Open             l=DEBUG tid=00007eff49ffc700 s=LayoutWrapper:653        Sync-open path=root://AAAAAAAQ@eos.acme.org///#curl#/eos/acme/new/test-repair2 opaque=eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&fst.readahead=true&fst.blocksize=262144&xrd.k5ccname=/tmp/krb5cc_61928_GLIO4hk6wT&xrd.wantprot=krb5,unix&xrdcl.secuid=61928&xrdcl.secgid=40507&eos.lfn=fxid:0001a74e
180416 17:15:51 t=1523891751.821278 f=DumpConnectionPool l=DEBUG tid=00007eff49ffc700 s=XrdIo:1670               [connection-pool-dump]
180416 17:15:51 t=1523891751.827112 f=fileOpen         l=ERROR tid=00007eff49ffc700 s=XrdIo:226                error= "open failed url=root://AAAAAAAQ@eos.acme.org///#curl#/eos/acme/new/test-repair2?fst.valid=1523891810&eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&fst.readahead=true&fst.blocksize=262144&xrd.k5ccname=/tmp/krb5cc_61928_GLIO4hk6wT&xrd.wantprot=krb5,unix&xrdcl.secuid=61928&xrdcl.secgid=40507&eos.lfn=fxid:0001a74e, errno=3014, errc=400, msg=[ERROR] Error response: No route to host"
180416 17:15:51 t=1523891751.827129 f=Open             l=ERROR tid=00007eff49ffc700 s=LayoutWrapper:660        Sync-open got errNo=3014 errCode=400
180416 17:15:51 t=1523891751.827140 f=Open             l=DEBUG tid=00007eff49ffc700 s=LayoutWrapper:675        not retrying
180416 17:15:51 t=1523891751.827183 f=Open             l=DEBUG tid=00007eff49ffc700 s=LayoutWrapper:704        LastErrNo=3014  _lasturl=  LastUrl=root://AAAAAAAQ@eos-eos.acme.org:1094///#curl#/eos/acme/new/test-repair2?cap.msg=9WO5zmhtYe0R6S93G34/e3JfkZV9eo1apBDB0QwDMBPhXN2XLQkq0YkjmvE/rfzNHkZsArE/2qLtr87bMl7g5IgPCX4qFYt7Ni8SN5hTCg9JYRmk4Ose853QOdA08Yct/zlii3ZN5+405NCfsFctJyb8TVMRXHz6+EXXaGnsJttFwDgu+jQ8EQA038lJH1Brgbbq+5rB+j0yiVt5SEcVhqYtWfalUo/pNoLMOjhwKNmlGpwkTi6lFIYWp4bWblEg1CeSKju9j7ZeJ8YpqOYI48RfhFTZsC2b3ElJkVzDL4imreUycn+bipWIfm7yZDGCIUluxXxnFp623nY1WV+np6KF6CAhLPWWtmROgbk8wlqz2K4PnAfeBu/UCyJktpBp2zGqWWSbCQJgTe66T4Mfg4eyYliOzLgfD2HWEW4QEr0pCWBr5RrnOeR57CJKP8cJdv9BHysZDFdGXKMnkm3U+jasZlyo+2gYS8GxwtDAwjD9lFoMbEzm8nYB+5t4xoeLL8CEDooz2Hy+lnPHWa4FHIFdLcDUww+SxzPkf442MALSG0nj042zyfnoScSj6xkGOgaK6kOq+TvavD2L4ScuGt+2F/ok4frA&cap.sym=CcBJAd/dHQ4HZTjYp4kbSukpfQA=&eos.app=fuse&eos.bookingsize=0&eos.encodepath=1&eos.lfn=fxid:0001a74e&fst.blocksize=262144&fst.readahead=true&fst.valid=1523891810&mgm.id=0001a74e&mgm.logid=0cf09cb6-4189-11e8-ba3e-62cb3bd524b1&mgm.mtime=0&mgm.replicahead=1&mgm.replicaindex=1&xrd.k5ccname=/tmp/krb5cc_61928_GLIO4hk6wT&xrd.wantprot=krb5,unix&xrdcl.secgid=40507&xrdcl.secuid=61928  _path=root://AAAAAAAQ@eos.acme.org///#curl#/eos/acme/new/test-repair2

I’m surely missing something, and probably an obvious one… :thinking:

Hi Franck,

I had another look at the logs you provided and there is something I don't understand: a node that is apparently offline is still being selected to serve the file. Can you retry with a fresh file and send me the logs again?
Also, please append the output of these commands:

eos node ls
eos fs ls

Thanks,
Elvin

Hi Elvin,

Thank you, I sent you the logs & information via email.

No node is offline, just one FS in RO mode.

Dear Elvin,

I have news: when I upgrade the MGM to version 4.2.20, it works correctly, and I see this kind of message:

180420 18:38:17 time=1524242297.570604 func=open                     level=INFO  logid=3a790c30-44b9-11e8-90cb-9a9ab1a9ef91 unit=mgm@xxxxxs91v.acme.org:1094 tid=00007f5d7a692700 source=XrdMgmOfsFile:1265             tident=AAAAAAAU.5066:84@xxxx87v sec=krb5  uid=61928 gid=40507 name=franck geo="" location 8 is excluded as an unavailable filesystem - returning ENETUNREACH

Then a new version of the file is created and updated.

It seems that a recent change fixed it for our situation. Could that be? Up to 4.2.19 I can see the issue of files not being updated (I downgraded back to 4.2.19 to check).
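
For completeness, this is roughly the test I repeat on each version (path and fsid illustrative):

# create a fresh 2-replica file through the FUSE mount
echo v1 > /eos/acme/new/test-repair2
# locate the replicas, then set one of their filesystems read-only
eos file info /eos/acme/new/test-repair2
eos fs config <fsid> configstatus=ro
# update the file through the FUSE mount
echo v2 > /eos/acme/new/test-repair2   # I/O error up to 4.2.19, works on 4.2.20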

Hi Franck,

This was added in a recent commit:
https://gitlab.cern.ch/dss/eos/commit/a3c688bff4e8283456f02bedd0ea6101bd43fe91

The strange thing is that I also tried with 4.2.19 but couldn't reproduce this. I use 4.2.19 everywhere: on the server, for the FUSE mount, and on the FSTs. Maybe if you use different versions there is a corner case that isn't covered. But from the logs you sent me, it seems that the selection of the new FST was the problem, and this is clearly addressed by the above commit.

Cheers,
Elvin

Ciao Elvin,

I have also tested this with everything running 4.2.19, on our test instance, without any success, until I tried the new release.

Our instances were upgraded from Aquamarine, so maybe some leftover state is creating our issue. Otherwise, I agree that it is strange.

We are planning to update the production instance to at least 4.2.20 (I see 4.2.21 is among the tagged releases, but not announced, so we will probably leave it aside for now), so we should be clear on this side (I saw that updates work with either a read-only or an offline FS, so everything is good!)