Unable to store file - file has been cleaned because of a client disconnect

Hi All,

On one of our newly set up EOS instances for the JUNO experiment, we have repeatedly hit the error "Unable to store file - file has been cleaned because of a client disconnect", hundreds of times. From the error messages, we can see that this error happens on file close. When we hit this error, the affected file is removed from the system.

Here is one episode of these errors:

201208 01:24:12 time=1607361852.712424 func=CallManager level=ERROR logid=static… unit=fst@junoeos01.ihep.ac.cn:1095 tid=00007fcf81afc700 source=XrdFstOfs:1084 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="MGM query failed" opaque="/?&mgm.access=create&mgm.ruid=12328&mgm.rgid=1004&mgm.uid=99&mgm.gid=99&mgm.path=/eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root&mgm.manager:1094=junoeos01.ihep.ac.cn&mgm.fid=0014f0d5&mgm.cid=149830&mgm.sec=unix|jiangw|hxmt079.ihep.ac.cn||juno|||&mgm.lid=1048850&mgm.bookingsize=1024&mgm.fsid=1&mgm.url0=root://junoeos01.ihep.ac.cn:1095//&mgm.fsid0=1&mgm.url1=root://junoeos04.ihep.ac.cn:1095//&mgm.fsid1=4&cap.valid=1607357213&mgm.pcmd=commit&mgm.size=4373399630&mgm.checksum=6af67882&mgm.mtime=1607361852&mgm.mtime_ns=711088000&mgm.modified=1&mgm.add.fsid=1&mgm.commit.size=1&mgm.commit.checksum=1&mgm.logid=d73797b4-389d-11eb-9c29-0c42a15d0b00"
201208 01:24:27 time=1607361867.924783 func=Close level=ERROR logid=d73797b4-389d-11eb-9c29-0c42a15d0b00 unit=fst@junoeos01.ihep.ac.cn:1095 tid=00007fcf81afc700 source=ReplicaParLayout:500 tident=jiangw.46748:78@hxmt079 sec=unix uid=0 gid=0 name=jiangw geo="" error=failed to close replica root://junoeos04.ihep.ac.cn:1095///eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root?&cap.msg=Z/kidSdAyI/juId7KxH2ZUMODZmbFjsNfcvCTgIQyotktKxo+wfYdOxqnjcpgltS1HNrSBMT6rZ/tSKimB1MPQYoiK4bqzQ71XZZCxUCvktiTVGZO/LajkGIhe2vD87+Xm8xJCBQhrj6QSbAwg8yznYm0bvo2aWqCoHeKGhTxHGlQuX0cTOymQwN8p6sFOK3V1ZeNLvIoxVZkCXqGwwjstxOQw+SgpfdiJO+wNqTTRAsioJfI+slXE7uz+sU1wlIy7lwrG+5U7oofnGBfJ/RYSLmmoG/jGObap/iWo9WTgy2g0imi3WhfjtRU6oPPFFWR++7YZRzc8nnDgxBLUKPigHX1TYsZ2/bQTNQVtHKbA8LUGqrnJZgRVsljd6XfFyl3zbE0kLmsGcaTdKyXQUCPAIFGbw+SFD7xjCIUMoVHX2ZMc1+mcgQFa2Cvd60Q3g82og9Aeips7zfoa2gySgHpE5ntL8LN8RiN6xGlgjr2jgnZ4bOdglwv0HesPemywV7yLtILHbgwHUnQ8WlGNYVJuFI4dap94QWZO16ZmbQRNMXUsFAQGYA56K0sbSJiZG2/DNcw6koE+adzKZAc6s/9eLIH+t/u7hcZGDGYOZceLBOGMwKSCPRcQ==&cap.sym=iV1RwnSEr1+FpZdsw5F5WM8r538=&mgm.id=0014f0d5&mgm.logid=d73797b4-389d-11eb-9c29-0c42a15d0b00&mgm.replicahead=0&mgm.replicaindex=1&mgm.path=/eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root
201208 01:24:27 217972 FstOfs_ReplicaParClose: jiangw.46748:78@hxmt079 Unable to close failed ; remote I/O error
201208 01:24:28 217972 FstOfs__close: jiangw.46748:78@hxmt079 Unable to store file - file has been cleaned because of a client disconnect /eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root; input/output error

eos fileinfo /eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root
error: cannot stat '/eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root'
(errc=2) (No such file or directory)

Here is the configuration:
Client: XRootD 4.10 (users open files with root:// directly, without any FUSE mount)
ROOT: /cvmfs/juno.ihep.ac.cn/centos7_amd64_gcc830/Pre-Release/J20v2r0-Pre0/ExternalLibs/ROOT/6.20.02/bin/root

Server: 4.7.7

We started 3 FST daemons on each FST server; the first 3 FST servers are shared with the QDB servers, and the first FST host also runs the MGM.

Any suggestions?

Hi Lu,

Does this issue happen systematically or just for some transfers?
Can you also have a look at the MGM logs and check for the corresponding error that you see on the FST (fst@junoeos01.ihep.ac.cn:1095)? Or just send me the logs for the FSTs and MGM during that hour.
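
If it helps, something along these lines on the MGM node should find the matching entries (the log path below is a typical default location, which is an assumption; adjust it if your installation logs elsewhere):

# search the MGM log for the logid seen in the FST error above
grep 'd73797b4-389d-11eb-9c29-0c42a15d0b00' /var/log/eos/mgm/xrdlog.mgm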

If these errors are sporadic, they might indicate a problem on the client side and not necessarily an issue with your EOS setup.

Also, I see you run quite an old setup - especially the XRootD version (4.10) is worrying, since there were quite a few nasty bugs in that release. It would be really great if you could run something more recent, like XRootD 4.12.5 and EOS >= 4.8.23.

Cheers,
Elvin

Hi Elvin,

  1. According to our statistics, this error is spread over all the clients and FST instances.
  2. The larger the output file, the higher the failure rate. For an output directory with an average file size of 5 GB, the failure rate is >50%: only 156 of 400 output files were successfully written to EOS.
  3. Here are the related lines I grepped from the MGM for the file "elecsim-200.root" in the previous post:

#grep "elecsim-200.root" error.log-20201208

201207 23:45:28 time=1607355928.117759 func=Close level=ERROR logid=d73797b4-389d-11eb-9c29-0c42a15d0b00 unit=fst@junoeos01.ihep.ac.cn:1095 tid=00007fcf07afe700 source=ReplicaParLayout:500 tident=jiangw.46748:83@hxmt079 sec=unix uid=0 gid=0 name=jiangw geo="" error=failed to close replica root://junoeos04.ihep.ac.cn:1095///eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root?&cap.msg=Z/kidSdAyI/juId7KxH2ZUMODZmbFjsNfcvCTgIQyotktKxo+wfYdOxqnjcpgltS1HNrSBMT6rZ/tSKimB1MPQYoiK4bqzQ71XZZCxUCvktiTVGZO/LajkGIhe2vD87+Xm8xJCBQhrj6QSbAwg8yznYm0bvo2aWqCoHeKGhTxHGlQuX0cTOymQwN8p6sFOK3V1ZeNLvIoxVZkCXqGwwjstxOQw+SgpfdiJO+wNqTTRAsioJfI+slXE7uz+sU1wlIy7lwrG+5U7oofnGBfJ/RYSLmmoG/jGObap/iWo9WTgy2g0imi3WhfjtRU6oPPFFWR++7YZRzc8nnDgxBLUKPigHX1TYsZ2/bQTNQVtHKbA8LUGqrnJZgRVsljd6XfFyl3zbE0kLmsGcaTdKyXQUCPAIFGbw+SFD7xjCIUMoVHX2ZMc1+mcgQFa2Cvd60Q3g82og9Aeips7zfoa2gySgHpE5ntL8LN8RiN6xGlgjr2jgnZ4bOdglwv0HesPemywV7yLtILHbgwHUnQ8WlGNYVJuFI4dap94QWZO16ZmbQRNMXUsFAQGYA56K0sbSJiZG2/DNcw6koE+adzKZAc6s/9eLIH+t/u7hcZGDGYOZceLBOGMwKSCPRcQ==&cap.sym=iV1RwnSEr1+FpZdsw5F5WM8r538=&mgm.id=0014f0d5&mgm.logid=d73797b4-389d-11eb-9c29-0c42a15d0b00&mgm.replicahead=0&mgm.replicaindex=1&mgm.path=/eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root
201208 01:24:12 time=1607361852.712424 func=CallManager level=ERROR logid=static… unit=fst@junoeos01.ihep.ac.cn:1095 tid=00007fcf81afc700 source=XrdFstOfs:1084 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="MGM query failed" opaque="/?&mgm.access=create&mgm.ruid=12328&mgm.rgid=1004&mgm.uid=99&mgm.gid=99&mgm.path=/eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root&mgm.manager:1094=junoeos01.ihep.ac.cn&mgm.fid=0014f0d5&mgm.cid=149830&mgm.sec=unix|jiangw|hxmt079.ihep.ac.cn||juno|||&mgm.lid=1048850&mgm.bookingsize=1024&mgm.fsid=1&mgm.url0=root://junoeos01.ihep.ac.cn:1095//&mgm.fsid0=1&mgm.url1=root://junoeos04.ihep.ac.cn:1095//&mgm.fsid1=4&cap.valid=1607357213&mgm.pcmd=commit&mgm.size=4373399630&mgm.checksum=6af67882&mgm.mtime=1607361852&mgm.mtime_ns=711088000&mgm.modified=1&mgm.add.fsid=1&mgm.commit.size=1&mgm.commit.checksum=1&mgm.logid=d73797b4-389d-11eb-9c29-0c42a15d0b00"
201208 01:24:27 time=1607361867.196393 func=CallManager level=ERROR logid=static… unit=fst@junoeos04.ihep.ac.cn:1095 tid=00007fb6ba6f8700 source=XrdFstOfs:1084 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="MGM query failed" opaque="/?&mgm.access=create&mgm.ruid=12328&mgm.rgid=1004&mgm.uid=99&mgm.gid=99&mgm.path=/eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root&mgm.manager:1094=junoeos01.ihep.ac.cn&mgm.fid=0014f0d5&mgm.cid=149830&mgm.sec=unix|jiangw|hxmt079.ihep.ac.cn||juno|||&mgm.lid=1048850&mgm.bookingsize=1024&mgm.fsid=1&mgm.url0=root://junoeos01.ihep.ac.cn:1095//&mgm.fsid0=1&mgm.url1=root://junoeos04.ihep.ac.cn:1095//&mgm.fsid1=4&cap.valid=1607357213&mgm.pcmd=commit&mgm.size=4373399630&mgm.checksum=6af67882&mgm.mtime=1607361867&mgm.mtime_ns=194833000&mgm.modified=1&mgm.add.fsid=4&mgm.replication=1&mgm.verify.checksum=1&mgm.logid=d73797b4-389d-11eb-9c29-0c42a15d0b00"
201208 01:24:27 time=1607361867.924783 func=Close level=ERROR logid=d73797b4-389d-11eb-9c29-0c42a15d0b00 unit=fst@junoeos01.ihep.ac.cn:1095 tid=00007fcf81afc700 source=ReplicaParLayout:500 tident=jiangw.46748:78@hxmt079 sec=unix uid=0 gid=0 name=jiangw geo="" error=failed to close replica root://junoeos04.ihep.ac.cn:1095///eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root?&cap.msg=Z/kidSdAyI/juId7KxH2ZUMODZmbFjsNfcvCTgIQyotktKxo+wfYdOxqnjcpgltS1HNrSBMT6rZ/tSKimB1MPQYoiK4bqzQ71XZZCxUCvktiTVGZO/LajkGIhe2vD87+Xm8xJCBQhrj6QSbAwg8yznYm0bvo2aWqCoHeKGhTxHGlQuX0cTOymQwN8p6sFOK3V1ZeNLvIoxVZkCXqGwwjstxOQw+SgpfdiJO+wNqTTRAsioJfI+slXE7uz+sU1wlIy7lwrG+5U7oofnGBfJ/RYSLmmoG/jGObap/iWo9WTgy2g0imi3WhfjtRU6oPPFFWR++7YZRzc8nnDgxBLUKPigHX1TYsZ2/bQTNQVtHKbA8LUGqrnJZgRVsljd6XfFyl3zbE0kLmsGcaTdKyXQUCPAIFGbw+SFD7xjCIUMoVHX2ZMc1+mcgQFa2Cvd60Q3g82og9Aeips7zfoa2gySgHpE5ntL8LN8RiN6xGlgjr2jgnZ4bOdglwv0HesPemywV7yLtILHbgwHUnQ8WlGNYVJuFI4dap94QWZO16ZmbQRNMXUsFAQGYA56K0sbSJiZG2/DNcw6koE+adzKZAc6s/9eLIH+t/u7hcZGDGYOZceLBOGMwKSCPRcQ==&cap.sym=iV1RwnSEr1+FpZdsw5F5WM8r538=&mgm.id=0014f0d5&mgm.logid=d73797b4-389d-11eb-9c29-0c42a15d0b00&mgm.replicahead=0&mgm.replicaindex=1&mgm.path=/eos/juno/user/jiangw/Production/J20v2r0-pre1/TestEOS/AmC/elecsim/root/elecsim-200.root

There are hundreds of similar failure traces in the logs.

Many thanks for your help!

Cheers,
Lu

I am sorry, I have not found a way to upload log files, so I attached the logs of the two replicas as two pictures.

Also, the MGM errors are in one picture.

Hi Lu,

Based on the things that you mentioned, I would say the problem comes from the computation of the checksum in the close method. If this takes more than 60 seconds, the client will time out and you will get the error message "file has been cleaned because of a client disconnect". This happens particularly when you are transferring big files, and the 5GB size that you mention fits in this category.

Now we have two options:

  • if you can control the client, I would add the following environment variable when transferring files to your instance: XRD_STREAMTIMEOUT=600. This allows the close method to take up to 10 minutes without the client disconnecting (see the sketch after this list).
  • the other option is to deploy a new version of EOS that supports asynchronous notification of clients once the close operation is done. Unfortunately, this is not yet released in any version, but if you are interested, I can do a new release with this feature available.
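
To illustrate the first option, here is a minimal sketch of a transfer with the increased timeout (the file name and destination path are placeholders taken from the logs above; xrdcp is just one example of an XRootD client, and the same variable applies to ROOT jobs writing via root://):

# raise the XRootD client stream timeout from the default 60 s to 10 minutes
export XRD_STREAMTIMEOUT=600
# any transfer started from this shell now tolerates a slow close, e.g.:
xrdcp elecsim-200.root root://junoeos01.ihep.ac.cn//eos/juno/user/jiangw/elecsim-200.root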

Let me know what you prefer.

Cheers,
Elvin

Hi Elvin,

After setting XRD_STREAMTIMEOUT=600, all the outputs have been successfully written to EOS. It seems the problem has been solved without upgrading XRootD or EOS.

After that, for the long-term stability of the whole system, we have upgraded the servers to 4.8.25. I think the XRootD client can also be upgraded with the next version of the JUNO software.

Cheers,
Lu

Hi Lu,

Glad to hear that this works. This also confirms where the problem is coming from. For future reference, you need to install EOS >= 4.8.31 and set the following env variable to enable the proper fix for this issue on the server side. Once this is done, you can remove the client timeout value that you currently set.

EOS_FST_ASYNC_CLOSE=1

This is the corresponding commit:
https://gitlab.cern.ch/dss/eos/-/commit/dc1bc4e6a752650fbace51056ac9e822fd17b35a
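
As a rough sketch of how to apply this, assuming the usual /etc/sysconfig/eos_env environment file and systemd-managed daemons (adjust file and unit names to your setup):

# on each FST node, add the flag to the EOS environment file
echo 'EOS_FST_ASYNC_CLOSE=1' >> /etc/sysconfig/eos_env
# restart the FST daemon so the new environment is picked up
systemctl restart eos@fst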

Cheers,
Elvin

Hello Elvin,

I got the same issue while converting a file.

210330 19:41:55 ERROR ConversionJob:378              msg="[ERROR] Server responded with an error: [3007] Unable to store file - file has been cleaned because of a client disconnect /eos/test/proc/conversion/0000000000000997:default.0#00650112; input/output error" tpc_src=root://nas-ad.intranet:1094//eos/test/data/tia/004-test tpc_dst=root://nas-ad.intranet:1094//eos/test/proc/conversion/0000000000000997:default.0#00650112 conversion_id=0000000000000997:default.0#00650112 
210330 19:41:56 ERROR ConverterDriver:123            msg="Failed to remove conversion job from QuarkDB" fid=2455 

It was a large (10 GB) file conversion, and I got this error on every conversion attempt.

EOS_FST_ASYNC_CLOSE=1 is in eos_env on the MGM and FST nodes.
The versions are:

  • xrootd 4.12.8-1.el8.x86_64 on the MGM (we have QuarkDB on the same node) and 5.0.3-2.el8.x86_64 on the FSTs
  • eos-server-4.8.31-1.el8.x86_64 on all nodes

Hi Igor,

First of all, you must not use XRootD 5 with any of the EOS releases. These were not validated, and in principle the daemon should not even be able to start. Secondly, there was a bug in that version and a mix-up with the EOS_FST_ASYNC_CLOSE flag. Therefore, I would strongly suggest updating to 4.8.40 and retrying your tests.

You can get this release from the following repo:
https://storage-ci.web.cern.ch/storage-ci/eos/citrine/tag/testing/el-7/x86_64/

Please let me know if you still have issues after the update. Note that the env variable EOS_FST_ASYNC_CLOSE needs to be removed before restarting the services.
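
As a rough sketch of the update on one node, assuming a yum-based system with the repo above already configured, and assuming the usual eos_env file and systemd unit names (adapt them to your setup):

# drop the now-obsolete flag from the EOS environment file
sed -i '/EOS_FST_ASYNC_CLOSE/d' /etc/sysconfig/eos_env
# update the EOS packages from the configured repo
yum update 'eos-*'
# restart the daemons so they pick up the new binaries and environment
systemctl restart eos@mgm eos@fst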

Cheers,
Elvin