New citrine release - 4.8.104?

Hello,

Can you please confirm whether there will be a new citrine release, i.e. 4.8.104? The latest citrine version in our mirror repo, 4.8.103, has the timestamp 2023-06-16 13:42.

We would very much like to benefit from the change "MGM/HTTP: Don't mask ENODEV errors as this leads to creation of 0-size files..." (commit 7c04411e in dss/eos on GitLab), which resolves the issue tracked in https://its.cern.ch/jira/browse/EOS-5771. That issue has an operational impact on the ATLAS WebDAV recalls at RAL.

Also, the new release is expected to contain this change kindly made by Elvin in response to

Thanks,

George

Hi George,

We would need to coordinate on this with the CTA team, but currently there are many people absent. Therefore, I will try to give you a better estimate on this next week.

Cheers,
Elvin

Thanks for this, Elvin. For my reference, is the reason you would like to coordinate with the CTA team that the above ENODEV change also requires a change in the CTA code?

George

Hi George,

No, I don’t think there are any changes required on the CTA side. I just want to make sure there is nothing else that CTA would like in this (last) EOS 4 release. :wink:

Cheers,
Elvin

Hi Elvin,

Sorry for the hassle. Do you have any update on this issue?

Thanks,

George

Hi George,

Yes, we will have a new EOS 4 release with the two fixes that you are interested in by the end of the week.

Cheers,
Elvin

Hi George,

We just released EOS 4.8.104 that includes the two fixes you are interested in. You can get the packages from the usual location:
https://storage-ci.web.cern.ch/storage-ci/eos/citrine/tag/testing/el-7/x86_64/eos-server-4.8.104-1.el7.cern.x86_64.rpm

Cheers,
Elvin

Hi Elvin,

Thank you so much for this! The rpms will be picked up by our repo server tonight, and we will test this version and push it to production next week.

Best

George

Hi Elvin,

We installed EOS 4.8.104 on our preprod instance. Everything works as expected except WebDAV reads: when I try to copy a file that has been staged from tape out of EOS, I get the following error:

-bash-4.2$ gfal-copy https://antares-preprod.stfc.ac.uk:9000/eos/antarespreprodtier1/dteam/random_400MB random_400MB_local
Copying https://antares-preprod.stfc.ac.uk:9000/eos/antarespreprodtier1/dteam/random_400MB [FAILED] after 0s
gfal-copy error: 112 (Host is down) - Result (Neon): Invalid Content-Length in response after 1 attempts

George

Hi George,

Can you please post the fileinfo about this particular file you are trying to read out?

eos fileinfo /eos/antarespreprodtier1/dteam/random_400MB

Also, please retry the copy operation and send the MGM logs for the corresponding time interval.

Thanks,
Elvin

Hi Elvin,

[root@antares-eos94 ~]# eos fileinfo /eos/antarespreprodtier1/dteam/random_400MB
File: '/eos/antarespreprodtier1/dteam/random_400MB' Flags: 0644
Size: 419430400
Status: healthy
Modify: Mon Sep 4 16:46:41 2023 Timestamp: 1693842401.548262000
Change: Wed Sep 6 10:29:46 2023 Timestamp: 1693992586.373337113
Birth: Mon Sep 4 16:46:39 2023 Timestamp: 1693842399.643205390
CUid: 36300 CGid: 24311 Fxid: 2faf080c Fid: 800000012 Pid: 800000004 Pxid: 2faf0804
XStype: adler XS: eb 8b 07 de ETAGs: "214748368021225472:eb8b07de"
Layout: replica Stripes: 1 Blocksize: 4k LayoutId: 00100012 Redundancy: d1::t1
#Rep: 2
TapeID: 4294967377 StorageClass: dteam_tapetest
┌───┬──────┬──────────────────────────┬────────────────┬────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                      host│      schedgroup│            path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴──────────────────────────┴────────────────┴────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
   1     27 antares-eos96.scd.rl.ac.uk       retrieve.0    /eos/data-sdk     booted             rw      nodrain   online                    undef


The MGM logs for the interval when the copy operation was attempted are here:

http://www-public.gridpp.rl.ac.uk/filelists/random_400MB_gfalcopy_mgmlog.txt

Also, in case it is of any use, here is the MGM log for the successful bring-online operation:

http://www-public.gridpp.rl.ac.uk/filelists/random_400MB_bringonline_mgmlog.txt

Best,
George

Hi George,

Things look fine at the MGM. Could you send me the logs for the same transfer (around 10:40:29) from the following FST daemon: antares-eos96.scd.rl.ac.uk, HTTP port 8001?

Thanks,
Elvin

Hi Elvin,

Here it is: http://www-public.gridpp.rl.ac.uk/filelists/random_400MB_gfalcopy_fstlog.txt

I can see some errors but can't understand what they mean.

Best,

George

Hi George,

Looking over the FST logs, I still don’t see anything wrong there. The open request arrives at the FST, the open is done and a reply is sent to the client, but then the client disconnects.

Also, in 4.8.104 there is no code change to the FST part of the EOS setup, so I don’t think this is a regression from the previous version.

Can you send me the logs from the command when running with the following options?
gfal-copy -vvv --log-file=gfal2.log

Are you actually able to copy the file out with a simple xrdcp?

Thanks,
Elvin

Hi Elvin,

I noticed the following line in the FST log

230906 10:40:29 time=1693993229.238428 func=FileClose level=ERROR logid=static… unit=fst@antares-eos96.scd.rl.ac.uk:1095 tid=00007fa3c05f5700 source=HttpServer:230 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="clean-up interrupted or IO error related PUT/GET request" path="/eos/antarespreprodtier1/dteam/random_400MB"
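
For reference, FST log lines like the one above are a sequence of key=value fields, so they can be split mechanically when scanning for errors. The following is only a minimal Python sketch, assuming values are either wrapped in straight double quotes (as in the raw log file) or contain no spaces. The function name `parse_log_fields` and the abridged sample line are illustrative and not part of EOS.

```python
import re

# One key=value field: the value is either "double quoted" or runs to the next space.
FIELD = re.compile(r'(\w+)=(?:"([^"]*)"|(\S*))')


def parse_log_fields(line):
    """Return a dict of the key=value fields found in one log line."""
    fields = {}
    for key, quoted, bare in FIELD.findall(line):
        # Exactly one of the two alternatives matched; the other is empty.
        fields[key] = quoted if quoted else bare
    return fields


# Abridged copy of the log line quoted above (some fields left out for brevity).
sample = (
    '230906 10:40:29 time=1693993229.238428 func=FileClose level=ERROR '
    'source=HttpServer:230 tident= sec=(null) uid=99 gid=99 geo="" '
    'msg="clean-up interrupted or IO error related PUT/GET request" '
    'path="/eos/antarespreprodtier1/dteam/random_400MB"'
)

fields = parse_log_fields(sample)
print(fields["level"], fields["source"], fields["msg"])
```

Filtering on `fields["level"] == "ERROR"` and `fields["path"]` is one way to pull out only the lines that concern a given transfer.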

Please see the gfal2 log in http://www-public.gridpp.rl.ac.uk/filelists/gfal2_random400MB.log

Yes, I can copy the file out with xrdcp (and also with gfal-copy root://…), i.e. using XRootD as the protocol. It is the WebDAV protocol that generates the above error.

Just to mention that after the above error - using gfal-copy https://… - the destination local file (random_400MB_local) is a stub, i.e. it has zero size.

George
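
A failed copy that leaves a zero-size stub behind can be caught by comparing the local file size with the size reported by eos fileinfo (419430400 bytes for this file). A minimal Python sketch of that check follows; the helper name `copy_is_complete` is hypothetical and the check is independent of gfal.

```python
import os

# "Size" reported by eos fileinfo for random_400MB earlier in the thread.
EXPECTED_SIZE = 419430400


def copy_is_complete(local_path, expected_size=EXPECTED_SIZE):
    """Return True only if the local file exists and has the expected byte count."""
    return os.path.isfile(local_path) and os.path.getsize(local_path) == expected_size


if __name__ == "__main__":
    # A zero-size stub such as the one left by the failed gfal-copy yields False.
    print(copy_is_complete("random_400MB_local"))
```

Comparing checksums would be a stronger test, but a size comparison is enough to spot the stub case described here.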

Hi George,

I see that the FST reply printed by the gfal client contains the Content-Length field twice.

  • Was this kind of request working before the EOS upgrade to 4.8.104?
  • If so, which version of EOS were you using before?
  • Do you have the same version on both the MGM and FSTs? 4.8.104?
  • What is the version of gfal that you are using?

I will be on holidays until Tue the 12th of September, so I will follow up on this when I am back.

Cheers,
Elvin
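
A repeated Content-Length field is consistent with the "Invalid Content-Length in response" error reported by gfal-copy: the HTTP specification allows a recipient to reject a response that carries the field more than once, and requires rejection when the values differ. The check can be sketched in Python as below. The sample response is hypothetical, modelled on the symptom described in this thread rather than copied from the gfal2 log, and the function name is illustrative.

```python
def content_length_values(raw_response):
    """Return every Content-Length value found in a raw HTTP header block."""
    values = []
    for line in raw_response.splitlines()[1:]:  # skip the status line
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-length":
            values.append(value.strip())
    return values


# Hypothetical FST reply with the field repeated (not taken from the real log).
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Length: 419430400\r\n"
    "Content-Type: application/octet-stream\r\n"
    "Content-Length: 419430400\r\n"
    "\r\n"
)

values = content_length_values(sample)
duplicated = len(values) > 1
print(values, "duplicated" if duplicated else "ok")
```

Header names are compared case-insensitively because HTTP field names are not case-sensitive.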

Hi Elvin,

Thanks for this - to answer your questions:

  • Yes, gfal-copy of a staged file out of EOS was working before the upgrade to 4.8.104
  • 4.8.98 (the version we currently run in production)
  • All EOS nodes - MGM and FSTs - in our preprod cluster have the same version, 4.8.104
  • The versions of all the gfal rpms on the machine where I run gfal-copy are pasted below

rpm -q --whatprovides /usr/bin/gfal-copy
gfal2-util-scripts-1.8.0-1.el7.noarch

-bash-4.2$ rpm -qa | grep gfal2
gfal2-plugin-http-2.21.2-1.el7.x86_64
gfal2-plugin-sftp-2.21.2-1.el7.x86_64
gfal2-plugin-lfc-2.21.2-1.el7.x86_64
gfal2-all-2.21.2-1.el7.x86_64
python2-gfal2-1.12.0-1.el7.x86_64
gfal2-2.21.2-1.el7.x86_64
gfal2-devel-2.21.2-1.el7.x86_64
gfal2-plugin-gridftp-2.21.2-1.el7.x86_64
gfal2-plugin-dcap-2.21.2-1.el7.x86_64
gfal2-plugin-srm-2.21.2-1.el7.x86_64
gfal2-plugin-mock-2.21.2-1.el7.x86_64
gfal2-util-scripts-1.8.0-1.el7.noarch
gfal2-plugin-xrootd-2.21.2-1.el7.x86_64
python3-gfal2-util-1.8.0-1.el7.noarch
python2-gfal2-util-1.8.0-1.el7.noarch
python3-gfal2-1.12.0-1.el7.x86_64
gfal2-plugin-rfio-2.21.2-1.el7.x86_64
gfal2-plugin-file-2.21.2-1.el7.x86_64

Best,

George

Hi George,

There was indeed a regression in the last version. This is now fixed and a new release, eos-4.8.105, is building as we speak.

This issue affects only the FST nodes, so if you are in a hurry you can run a mixed setup with the MGM on eos-4.8.104 and the FSTs on eos-4.8.98. Otherwise, you can install the upcoming 4.8.105 everywhere.

I will let you know once it’s available in the usual yum repositories. Thank you for the bug report!

Cheers,
Elvin
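
When auditing which nodes still run the affected release, EOS versions are better compared as integer tuples than as strings, since "4.8.98" sorts after "4.8.104" in a plain string comparison. A minimal Python sketch follows, assuming package names of the form shown in this thread. The helper names are hypothetical, and only 4.8.104 is treated as affected because that is the only release identified above.

```python
import re

# FST release identified in this thread as carrying the WebDAV read regression.
AFFECTED_FST = (4, 8, 104)


def eos_version(package):
    """Extract (major, minor, patch) from e.g. 'eos-server-4.8.104-1.el7.cern.x86_64'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", package)
    if match is None:
        raise ValueError("no version found in %r" % package)
    return tuple(int(part) for part in match.groups())


def fst_is_affected(package):
    """Return True if the package installed on an FST is the affected release."""
    return eos_version(package) == AFFECTED_FST


for pkg in ("eos-4.8.98", "eos-server-4.8.104-1.el7.cern.x86_64",
            "eos-server-4.8.105-1.el7.cern.x86_64"):
    print(pkg, eos_version(pkg), fst_is_affected(pkg))
```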

Hi George,

The new packages are available in the usual location:
https://storage-ci.web.cern.ch/storage-ci/eos/citrine/tag/testing/el-7/x86_64/eos-server-4.8.105-1.el7.cern.x86_64.rpm

Cheers,
Elvin

Thanks for this, Elvin. Our repos will be synced tonight and we will test this version tomorrow.

Best,
George