New citrine release- 4.8.104?

georgep · August 14, 2023, 8:26am

Hello,

Can you please confirm whether there will be a new citrine release, i.e. 4.8.104? The latest citrine version, 4.8.103, has a timestamp of 2023-06-16 13:42 in our mirror repo.

We would very much like to benefit from the change "MGM/HTTP: Don't mask ENODEV errors as this leads to creation of 0-size files..." (commit 7c04411e in dss/eos on GitLab). It resolves the issue tracked in https://its.cern.ch/jira/browse/EOS-5771, which has an operational impact on the ATLAS WebDAV recalls at RAL.

Also, the new release is expected to contain this change kindly made by Elvin in response to

Thanks,

George

esindril · August 15, 2023, 7:08am

Hi George,

We would need to coordinate on this with the CTA team, but currently there are many people absent. Therefore, I will try to give you a better estimate on this next week.

Cheers,
Elvin

georgep · August 17, 2023, 2:04pm

Thanks for this Elvin. For my reference, the reason you would like to coordinate with the CTA team is that the above ENODEV change requires a change also in the CTA code?

George

esindril · August 17, 2023, 2:53pm

Hi George,

No, I don’t think there are any changes required on the CTA side. I just want to make sure there is nothing else that CTA would like in this (last) EOS 4 release.

Cheers,
Elvin

georgep · August 30, 2023, 2:20pm

Hi Elvin,

Sorry for the hassle. Do you have any update on this issue?

Thanks,

George

esindril · August 30, 2023, 2:33pm

Hi George,

Yes, we will have a new EOS 4 release with the two fixes that you are interested in by the end of the week.

Cheers,
Elvin

esindril · August 31, 2023, 9:50am

Hi George,

We just released EOS 4.8.104 that includes the two fixes you are interested in. You can get the packages from the usual location:
https://storage-ci.web.cern.ch/storage-ci/eos/citrine/tag/testing/el-7/x86_64/eos-server-4.8.104-1.el7.cern.x86_64.rpm

Cheers,
Elvin

georgep · August 31, 2023, 11:10am

Hi Elvin,

Thank you so much for this! The rpms will be picked up by our repo server tonight, and we will test this version and push it to production next week.

Best

George

georgep · September 5, 2023, 2:51pm

Hi Elvin,

We installed EOS 4.8.104 on our preprod instance. Everything works as expected except WebDAV reads: when I try to copy out of EOS a file that has been staged from tape, I get the following error:

-bash-4.2$ gfal-copy https://antares-preprod.stfc.ac.uk:9000/eos/antarespreprodtier1/dteam/random_400MB random_400MB_local
Copying https://antares-preprod.stfc.ac.uk:9000/eos/antarespreprodtier1/dteam/random_400MB [FAILED] after 0s
gfal-copy error: 112 (Host is down) - Result (Neon): Invalid Content-Length in response after 1 attempts

George

esindril · September 6, 2023, 6:57am

Hi George,

Can you please post the fileinfo about this particular file you are trying to read out?

eos fileinfo /eos/antarespreprodtier1/dteam/random_400MB

Also, please retry the copy operation and send the MGM logs for the corresponding time interval.

Thanks,
Elvin

georgep · September 6, 2023, 9:58am

Hi Elvin,

[root@antares-eos94 ~]#
[root@antares-eos94 ~]#
[root@antares-eos94 ~]# eos fileinfo /eos/antarespreprodtier1/dteam/random_400MB
File: '/eos/antarespreprodtier1/dteam/random_400MB' Flags: 0644
Size: 419430400
Status: healthy
Modify: Mon Sep 4 16:46:41 2023 Timestamp: 1693842401.548262000
Change: Wed Sep 6 10:29:46 2023 Timestamp: 1693992586.373337113
Birth: Mon Sep 4 16:46:39 2023 Timestamp: 1693842399.643205390
CUid: 36300 CGid: 24311 Fxid: 2faf080c Fid: 800000012 Pid: 800000004 Pxid: 2faf0804
XStype: adler XS: eb 8b 07 de ETAGs: "214748368021225472:eb8b07de"
Layout: replica Stripes: 1 Blocksize: 4k LayoutId: 00100012 Redundancy: d1::t1
#Rep: 2
TapeID: 4294967377 StorageClass: dteam_tapetest
┌───┬──────┬──────────────────────────┬────────────────┬────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                      host│      schedgroup│            path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴──────────────────────────┴────────────────┴────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
   1     27 antares-eos96.scd.rl.ac.uk       retrieve.0    /eos/data-sdk     booted             rw      nodrain   online                    undef

The MGM log for the interval when the copy operation was attempted is here:

http://www-public.gridpp.rl.ac.uk/filelists/random_400MB_gfalcopy_mgmlog.txt

Also, in case it is of any use, here is the MGM log for the successful bring-online operation:

http://www-public.gridpp.rl.ac.uk/filelists/random_400MB_bringonline_mgmlog.txt

Best,
George

esindril · September 6, 2023, 11:33am

Hi George,

Things look fine at the MGM, could you send me the logs for the same transfer (around 10:40:29) from the following FST daemon: antares-eos96.scd.rl.ac.uk with the HTTP port 8001?

Thanks,
Elvin

georgep · September 6, 2023, 11:43am

Hi Elvin,

Here it is http://www-public.gridpp.rl.ac.uk/filelists/random_400MB_gfalcopy_fstlog.txt

I can see some errors but can't understand what they mean.

Best,

George

esindril · September 6, 2023, 1:24pm

Hi George,

Looking over the FST logs, I still don't see anything wrong there. The open request arrives at the FST, the open is done, and a reply is sent to the client, but then the client disconnects.

Also in 4.8.104 there is no code change to the FST part of the EOS setup so I don’t think this is a regression from the previous version.

Can you send me the logs from the command when running with the following options?
gfal-copy -vvv --log-file=gfal2.log

Are you actually able to copy out the file with simple xrdcp?

Thanks,
Elvin

georgep · September 6, 2023, 2:29pm

Hi Elvin,

I noticed the following line in the FST log

230906 10:40:29 time=1693993229.238428 func=FileClose level=ERROR logid=static… unit=fst@antares-eos96.scd.rl.ac.uk:1095 tid=00007fa3c05f5700 source=HttpServer:230 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="clean-up interrupted or IO error related PUT/GET request" path="/eos/antarespreprodtier1/dteam/random_400MB"

Please see the gfal2 log in http://www-public.gridpp.rl.ac.uk/filelists/gfal2_random400MB.log

Yes, I can copy out the file with xrdcp (and also with gfal-copy root://…), i.e. using XRootD as the protocol. It is the WebDAV protocol that generates the above error.

Just to mention that after the above error (using gfal-copy https://…) the destination local file (random_400MB_local) is a stub, i.e. it has zero size.
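
A minimal sketch of how a local copy could be checked against the values reported by eos fileinfo earlier in this thread (size 419430400, adler checksum eb8b07de). The helper names `adler32_hex` and `check_local_copy` are hypothetical and not part of EOS or gfal2; only the Python standard library is used.

```python
import os
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Compute the adler32 checksum of a file as 8 lower-case hex digits."""
    checksum = 1  # adler32 starts from 1
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def check_local_copy(path, expected_size, expected_adler32):
    """Return a list of problems; an empty list means size and checksum match."""
    problems = []
    size = os.path.getsize(path)
    if size == 0:
        problems.append("zero-size stub")
    elif size != expected_size:
        problems.append(f"size mismatch: {size} != {expected_size}")
    # Only checksum non-empty files; a stub is already reported above.
    if size and adler32_hex(path) != expected_adler32.lower():
        problems.append("adler32 mismatch")
    return problems


if __name__ == "__main__":
    # Expected values are taken from the fileinfo output in this thread.
    local = "random_400MB_local"
    if os.path.exists(local):
        print(check_local_copy(local, 419430400, "eb8b07de"))
```

For the failed WebDAV copy described above, this would be expected to report the zero-size stub; for a successful xrdcp copy it should return an empty list.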

George

esindril · September 6, 2023, 2:50pm

Hi George,

I see that the FST reply that the gfal client prints out has the Content-Length field displayed twice.
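
A minimal sketch of how a duplicated header could be spotted in a raw HTTP response head, such as the one printed in a verbose gfal2 log. The function name `duplicated_headers` and the sample response are hypothetical; the sample is not copied from the actual logs.

```python
from collections import Counter
from typing import Dict


def duplicated_headers(raw_response_head: str) -> Dict[str, int]:
    """Return header names (lower-cased) that occur more than once."""
    lines = raw_response_head.replace("\r\n", "\n").split("\n")
    counts = Counter()
    for line in lines[1:]:  # skip the status line
        if not line.strip():
            break  # a blank line ends the header section
        name, sep, _value = line.partition(":")
        if sep:
            counts[name.strip().lower()] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical response head with Content-Length repeated.
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Length: 419430400\r\n"
    "Content-Type: application/octet-stream\r\n"
    "Content-Length: 419430400\r\n"
    "\r\n"
)

if __name__ == "__main__":
    print(duplicated_headers(sample))  # {'content-length': 2}
```

A repeated Content-Length is consistent with the "Invalid Content-Length in response" error reported by the client, since an HTTP client may reject a response whose length is stated more than once.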

Were these kinds of requests working before the EOS upgrade to 4.8.104?
If they were, what version of EOS were you using before?
Do you have the same version on both the MGM and FSTs? 4.8.104?
What is the version of gfal that you are using?

I will be on holidays until Tue the 12th of September, so I will follow up on this when I am back.

Cheers,
Elvin

georgep · September 11, 2023, 2:37pm

Hi Elvin,

Thanks for this - to answer your questions:

Yes, gfal-copy of a staged file out of EOS was working before the upgrade to 4.8.104
4.8.98 (the version we currently run in production)
All EOS nodes - MGM and FSTs - in our preprod cluster have the same version, 4.8.104
I paste the versions of all the gfal rpms on the machine where I run gfal-copy:

rpm -q --whatprovides /usr/bin/gfal-copy
gfal2-util-scripts-1.8.0-1.el7.noarch

-bash-4.2$ rpm -qa | grep gfal2
gfal2-plugin-http-2.21.2-1.el7.x86_64
gfal2-plugin-sftp-2.21.2-1.el7.x86_64
gfal2-plugin-lfc-2.21.2-1.el7.x86_64
gfal2-all-2.21.2-1.el7.x86_64
python2-gfal2-1.12.0-1.el7.x86_64
gfal2-2.21.2-1.el7.x86_64
gfal2-devel-2.21.2-1.el7.x86_64
gfal2-plugin-gridftp-2.21.2-1.el7.x86_64
gfal2-plugin-dcap-2.21.2-1.el7.x86_64
gfal2-plugin-srm-2.21.2-1.el7.x86_64
gfal2-plugin-mock-2.21.2-1.el7.x86_64
gfal2-util-scripts-1.8.0-1.el7.noarch
gfal2-plugin-xrootd-2.21.2-1.el7.x86_64
python3-gfal2-util-1.8.0-1.el7.noarch
python2-gfal2-util-1.8.0-1.el7.noarch
python3-gfal2-1.12.0-1.el7.x86_64
gfal2-plugin-rfio-2.21.2-1.el7.x86_64
gfal2-plugin-file-2.21.2-1.el7.x86_64

Best,

George

esindril · September 12, 2023, 7:36am

Hi George,

There was indeed a regression in the last version. This is now fixed and a new release is building as we speak. This will be eos-4.8.105.

This issue affects only the FST nodes so if you are in a hurry you can run in a mixed setup with the MGM eos-4.8.104 and the FSTs eos-4.8.98. Otherwise, you can install the upcoming 4.8.105 everywhere.
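
A minimal sketch of how a node inventory could be checked against the advice above. It assumes only what is stated in this thread: 4.8.104 on an FST is affected, while the MGM can stay on 4.8.104. The inventory layout, the `extra-fst` host and the helper name `affected_fsts` are hypothetical.

```python
# The only release identified in this thread as carrying the regression.
AFFECTED_FST_VERSION = "4.8.104"


def affected_fsts(inventory):
    """Return hosts whose role is 'fst' and that run the affected version.

    `inventory` maps host name -> (role, eos_version); this layout is an
    assumption made for the example, not an EOS data structure.
    """
    return sorted(
        host
        for host, (role, version) in inventory.items()
        if role == "fst" and version == AFFECTED_FST_VERSION
    )


# Hypothetical inventory; roles and versions are illustrative only.
example_inventory = {
    "mgm-node": ("mgm", "4.8.104"),
    "antares-eos96.scd.rl.ac.uk": ("fst", "4.8.104"),
    "extra-fst": ("fst", "4.8.98"),
}

if __name__ == "__main__":
    print(affected_fsts(example_inventory))
```

Any host listed would need either the older FST packages (4.8.98) or the upcoming 4.8.105, as described above.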

I will let you know once it’s available in the usual yum repositories. Thank you for the bug report!

Cheers,
Elvin

esindril · September 12, 2023, 10:00am

Hi George,

The new packages are available in the usual location:
https://storage-ci.web.cern.ch/storage-ci/eos/citrine/tag/testing/el-7/x86_64/eos-server-4.8.105-1.el7.cern.x86_64.rpm

Cheers,
Elvin

georgep · September 12, 2023, 11:52am

Thanks for this Elvin. Our repos will be synced tonight and we will test this version tomorrow.

Best,
George
