We ran into a case where users (on fuse mounts) tried to modify files during an FST upgrade campaign and couldn’t: one of the replicas was offline, so the update of the file was refused. The inline-repair feature is enabled on the client side (default values), but maybe the option also needs to be enabled on the MGM side with the space.autorepair parameter?
However, when testing this with the space.autorepair parameter set, the update is still not possible on a file when an FS holding a replica is in read-only mode, or off, or when the node is offline.
I looked over the logs that you provided on this issue. This functionality works as advertised when the FST is in read-only mode but will fail if it’s offline. There is already a ticket opened for the underlying issue here: https://its.cern.ch/jira/browse/EOS-2410
I’ve run a test with the FST in ro mode and it works as expected.
Thank you for having a look. OK for the case when the FST is offline (it seems I no longer have access to the JIRA issues with my lightweight account).
But I can confirm that with my set-up I get an Input/output error even when changing a file which has one replica on an FS with configstatus=ro.
So there might be some parameter that causes this behaviour on our instances (or in the fuse client?). What I observe is that when the FS is read-only and we try to access the file in rw, the MGM sends the valid node a request with the __offline_ prefix on the hostname of the node holding the read-only FS, and it doesn’t seem to try to repair it, although the client writes inline-repair=true:
I had another look at the logs you provided and there is something I can’t understand. There is a node which is apparently offline but is still selected to serve the file. Can you retry with a fresh file and send me the logs again?
Also please append the result for the commands:
I have news: after upgrading the MGM to version 4.2.20, it works correctly. I see this kind of message:
180420 18:38:17 time=1524242297.570604 func=open level=INFO logid=3a790c30-44b9-11e8-90cb-9a9ab1a9ef91 unit=mgm@xxxxxs91v.acme.org:1094 tid=00007f5d7a692700 source=XrdMgmOfsFile:1265 tident=AAAAAAAU.5066:84@xxxx87v sec=krb5 uid=61928 gid=40507 name=franck geo="" location 8 is excluded as an unavailable filesystem - returning ENETUNREACH
Then, a new version of the file is created, and updated.
It seems that a recent change fixed it for our situation. Could that be? Up to 4.2.19 I can see the issue of files not being updated (I downgraded back to 4.2.19 to check).
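For anyone who wants to check their own MGM logs for the same message, here is a minimal Python sketch that splits a log line like the one above into its key=value fields and pulls out the excluded location id. The function names (`parse_line`, `excluded_location`) are my own, and the parsing assumes the free-text message comes after the last key=value pair and contains no `=` itself, which holds for this sample but is not a documented guarantee of the log format.

```python
import re

# Sample line copied from the post above.
LINE = ('180420 18:38:17 time=1524242297.570604 func=open level=INFO '
        'logid=3a790c30-44b9-11e8-90cb-9a9ab1a9ef91 '
        'unit=mgm@xxxxxs91v.acme.org:1094 tid=00007f5d7a692700 '
        'source=XrdMgmOfsFile:1265 tident=AAAAAAAU.5066:84@xxxx87v sec=krb5 '
        'uid=61928 gid=40507 name=franck geo="" '
        'location 8 is excluded as an unavailable filesystem - returning ENETUNREACH')

# key=value pairs; the value is either a quoted string or a run of non-spaces.
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')
# Message text taken from the sample line (assumed stable across releases).
EXCLUDED = re.compile(r'location (\d+) is excluded as an unavailable filesystem')


def parse_line(line):
    """Return (fields, message): the key=value pairs and the trailing text."""
    fields = {}
    end = 0
    for match in FIELD.finditer(line):
        fields[match.group(1)] = match.group(2).strip('"')
        end = match.end()
    # Assumption: everything after the last key=value pair is the message.
    return fields, line[end:].strip()


def excluded_location(line):
    """Return the excluded location id as an int, or None if not present."""
    _, message = parse_line(line)
    match = EXCLUDED.search(message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    fields, message = parse_line(LINE)
    print(fields["func"], fields["source"], excluded_location(LINE))
```

Running `excluded_location` over each line of a log file and counting the non-None results should show whether the MGM is excluding the unavailable filesystem during opens.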
This was added in a recent commit: https://gitlab.cern.ch/dss/eos/commit/a3c688bff4e8283456f02bedd0ea6101bd43fe91
The strange thing is that I also tried with 4.2.19 but couldn’t reproduce this. I use 4.2.19 everywhere, in the server, for the fuse mount and the fsts. Maybe if you use different versions there is a corner case not covered. But from the logs that you sent me, it seems that the selection of the new FST was the problem - and this is clearly addressed by the above commit.
I have also tested this with everything running 4.2.19, on our test instance, without any success, until I tested the new release.
Our instances were upgraded from aquamarine, so maybe some leftover state is causing our issue. Otherwise, I agree that this is strange.
We are planning to update the production instance to at least 4.2.20 (I see 4.2.21 is in the tagged releases, but not announced, so we will probably leave it aside for now), so we should be in the clear on this side (I saw that the update works with either a r/o or an offline FS, so everything is good!)
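In case it helps others deciding whether to upgrade, here is a small Python sketch for checking a version string against the first release where the update worked for us. The 4.2.20 threshold comes only from what I observed in this thread, not from any official statement, and `has_observed_fix` is a made-up helper name.

```python
def parse_version(text):
    """Turn a dotted version such as '4.2.20' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Assumption: first MGM release where the update worked in our tests.
FIRST_WORKING = parse_version("4.2.20")


def has_observed_fix(version):
    """True if the version is at or above the release where it worked for us."""
    return parse_version(version) >= FIRST_WORKING


if __name__ == "__main__":
    for candidate in ("4.2.19", "4.2.20", "4.2.21"):
        print(candidate, has_observed_fix(candidate))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "4.10.0" sorting before "4.2.20".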