Too many warning logs about "outside of collection time window"

Hello, everyone.

Currently, our EOS system is generating too many warning messages, which makes it difficult to investigate other issues.

These messages appear so frequently that it is hard to find the essential log entries, so we are contacting you to resolve this issue first. The message is as follows.

250824 15:06:54 time=1756048014.548668 func=Add level=WARN logid=static… unit=mgm@alice-t1-eos02.sdfarm.kr:1094 tid=00007fabc6bff640 source=Iostat:99 tident= sec=(null) uid=0 gid=0 name=- geo="" xt="" ob="" msg="failed report digest, transfer stopped outside of collection time window"

It seems that dozens of these occur every second. I think there is definitely a problem, but I have no idea what the cause is, so I am contacting you.
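
For reference, this is roughly how I counted them (a quick sketch; it assumes the MGM log is at the default /var/log/eos/mgm/xrdlog.mgm location):

# count the warning occurrences per second in the MGM log
grep "outside of collection time window" /var/log/eos/mgm/xrdlog.mgm | awk '{print $1, $2}' | sort | uniq -c | tail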

Regards,

– Geonmo

Hi Geonmo,

These warnings appear when the transfer summary messages from the FSTs arrive at the MGM with a considerable delay - they are then discarded and not processed for the "io stat" display information.

If you are still using the MQ daemon, then I would check its logs and maybe restart it to clear the backlog; otherwise, pick one FST and check whether there are any particular error or warning messages in its logs when it sends the report information.
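
For example, something along these lines (assuming the default systemd unit names and log locations - adjust them to your setup):

# restart the MQ daemon to clear a possible backlog and watch its log
systemctl restart eos@mq
tail -f /var/log/eos/mq/xrdlog.mq

# on one FST, look for warning/error messages around the report sending
grep -iE "warn|error" /var/log/eos/fst/xrdlog.fst | grep -i report | tail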

Cheers,

Elvin

Hello, Elvin.

I am not sure if this is related to the issue, but I would like to confirm one thing.

On our site, we have configured firewalls between each EOS host to allow connections only on specific ports.

However, today I found logs where connections were attempted from the FST endpoint ports (1095 or 1096) to random ports on the MGM or other FSTs, and these attempts were blocked.

I am wondering whether this kind of connection is normal and whether we have been blocking it until now.

Normally, connections are established from ephemeral source ports to specific destination ports, so I had never considered this scenario. Please check whether this connection behavior is expected.

If this is normal connection behavior, then our firewall settings may have been blocking internal transfers.

Regards,

– Geonmo

Hello, Elvin.

I collected logs from an arbitrary FST (jbod-mgmt-01.sdfarm.kr:1095), and there were many different error patterns. I picked out a few and am posting them here.

250824 15:10:06 20729 FstOfs_ProcessTpcOpaque: 53.7:582@jbod-mgmt-05 Unable to open - capability illegal /eos/gsdc/grid/02/56768/c7d097f0-4a07-11e6-9717-0a38a10abeef; timer expired

250825 11:46:59 time=1756122419.989590 func=open level=ERROR logid=0e4e7bfa-819b-11f0-9121-c81f66db3ddd unit=fst@jbod-mgmt-01.sdfarm.kr:1095 tid=00007f6be92ff640 source=XrdFstOfsFile:194 tident=105.7:574@jbod-mgmt-05 sec=unix uid=0 gid=0 name=daemon geo="" xt="" ob="" msg="failed while processing TPC/open opaque" path="/eos/gsdc/grid/13/21040/74ee2e20-42b6-11e5-9717-0939a10abeef"

250825 11:49:55 time=1756122595.648507 func=open level=ERROR logid=58b7d76e-81a9-11f0-bb22-c81f66db3ddd unit=fst@jbod-mgmt-01.sdfarm.kr:1095 tid=00007f6b559ed640 source=XrdFstOfsFile:202 tident=109.7:581@jbod-mgmt-04.sdfarm.kr sec=unix uid=0 gid=0 name=daemon geo="" xt="" ob="" msg="failed while processing open opaque info" path="/eos/gsdc/grid/05/29240/374454e5-827b-11ef-800d-02425bf098e1"

250825 11:49:56 time=1756122596.074828 func=ProcessCapOpaque level=WARN logid=b9aa715e-81a8-11f0-b960-c81f66db61df unit=fst@jbod-mgmt-01.sdfarm.kr:1095 tid=00007f6b62032640 source=XrdFstOfsFile:2648 tident=54.7:1345@jbod-mgmt-04 sec=(null) uid=65534 gid=65534 name=(null) geo="" xt="" ob="" msg="MGM master seems to have changed - adjusting global config" old-manager="alice-t1-eos02.sdfarm.kr:1094" new-manager="alice-t1-eos03.sdfarm.kr:1094"

It seems like the situation is getting really bad.

For reference, we are using the MQ-on-QDB configuration, so we do not operate a separate MQ daemon.

In that case, I do not understand why the MGM value on the FST switches to another machine and then comes back. I thought that using QDB would prevent this behavior.

Just in case, I set EOS_HA_REDIRECT_READS=1 on the MGM, but the value is still fluctuating.
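
For reference, this is how I set it (assuming the standard environment file location and systemd unit name on our nodes):

# /etc/sysconfig/eos_env (assumed default EOS environment file)
EOS_HA_REDIRECT_READS=1

# restart the MGM so the new environment is picked up
systemctl restart eos@mgm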

If you need detailed logs, I will send them to you by email.

Regards,

– Geonmo

Hi Geonmo,

Indeed, the ports on which connections are established are always the same and depend on your configuration. Usually the MGM runs on 1094 and the FSTs on 1095 for the XRootD protocol; for HTTP it really depends on what you have written in your config.
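
Just as an illustration, a minimal firewalld sketch using the default XRootD ports above plus the 1095/1096 FST instances visible in your logs (the exact list depends on your configuration):

firewall-cmd --permanent --add-port=1094/tcp        # MGM XRootD
firewall-cmd --permanent --add-port=1095-1096/tcp   # FST XRootD instances
firewall-cmd --reload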

Looking at the error messages that you posted, these look like timeout errors due to slow clients that take a long time to perform the redirection step from the MGM to the FSTs.

The MGM value at the FST might change when there is a master-slave transition in your instance. Just from these logs I cannot really tell what the problem is. I would try doing some simple read, write and third-party-copy operations to make sure the basic workflows work as expected.
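
For example, something like this (the /eos/gsdc/test/... path and the copy target are just placeholders - use whatever path suits your instance):

# write a test file through the MGM, then read it back
xrdcp /tmp/testfile root://alice-t1-eos02.sdfarm.kr//eos/gsdc/test/probe
xrdcp root://alice-t1-eos02.sdfarm.kr//eos/gsdc/test/probe /tmp/probe.back

# third-party copy between two files in the instance
xrdcp --tpc only root://alice-t1-eos02.sdfarm.kr//eos/gsdc/test/probe root://alice-t1-eos02.sdfarm.kr//eos/gsdc/test/probe.copy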

Cheers,

Elvin

Hello, Elvin.

During the ALICE test, an "ADD"-related error occurred, while "GET" was successful. After tracking the "ADD"-related activity, we found the following log on the FST. (On the MGM, only INFO-level logs appear for the same logid.)

As you suggested, if timeouts are the biggest factor, then the first error message below looks like the most important one. Among the individual timings, 'layout::opened' takes by far the most time.

In our RAIN setup with many disk arrays (21 FSTs in QRAIN), we assumed that layout-related operations would naturally take a long time, but we did not expect the error message to appear after just one second. If you could provide any suggestions on how to optimise this, we would greatly appreciate it.

Regards,

– Geonmo

250825 12:52:51 time=1756126371.773699 func=open level=ERROR logid=67cfd6a8-81b2-11f0-b4dc-c81f66db3ddd unit=fst@jbod-mgmt-02.sdfarm.kr:1096 tid=00007f6cf4b3a640 source=XrdFstOfsFile:813 tident=monalisa.512032:193@pcalimonitor.cern.ch sec=(null) uid=10367 gid=1395 name=nobody geo="" xt="" ob="" msg="slow open operation" open-duration=1022.122ms fxid=06202535 path="/02/12000/0b014523-81b2-11f0-bd8e-00620ba19d1e" path::print=5.659ms creation::barrier=20.021ms layout::exists=0.029ms clone::fst=0.079ms layout::open=0.018ms layout::opened=995.678ms get::localfmd=0.007ms resync::localfmd=0.368ms layout::stat=0.014ms full::mutex=0.003ms layout::fallocate=0.007ms layout::fallocated=0.133ms fileio::object=0.080ms open::accounting=0.013ms end=0.013ms open=1022.122ms

250825 12:56:25 time=1756126585.024610 func=Close level=ERROR logid=67cfd6a8-81b2-11f0-b4dc-c81f66db3ddd unit=fst@jbod-mgmt-02.sdfarm.kr:1096 tid=00007f6de2753640 source=RainMetaLayout:1841 tident=monalisa.512032:193@pcalimonitor.cern.ch sec=unix uid=0 gid=0 name=monalisa geo="" xt="" ob="" msg="failed remote file close" stripe_id=1
….

source=RainMetaLayout:1860 tident=monalisa.512032:193@pcalimonitor.cern.ch sec=unix uid=0 gid=0 name=monalisa geo="" xt="" ob="" msg="error setting stripe checksum"
250825 12:58:31 time=1756126711.690312 func=Close level=ERROR logid=67cfd6a8-81b2-11f0-b4dc-c81f66db3ddd unit=fst@jbod-mgmt-02.sdfarm.kr:1096 tid=00007f6de2753640 source=RainMetaLayout:1867 tident=monalisa.512032:193@pcalimonitor.cern.ch sec=unix uid=0 gid=0 name=monalisa geo="" xt="" ob="" msg="failed to close local file"

250825 12:58:31 time=1756126711.690436 func=_close_wr level=ERROR logid=67cfd6a8-81b2-11f0-b4dc-c81f66db3ddd unit=fst@jbod-mgmt-02.sdfarm.kr:1096 tid=00007f6de2753640 source=XrdFstOfsFile:1927 tident=monalisa.512032:193@pcalimonitor.cern.ch sec= uid=10367 gid=1395 name=nobody geo="" xt="" ob="" msg="delete on close" fxid=06202535 ns_path="/02/12000/0b014523-81b2-11f0-bd8e-00620ba19d1e"

250825 12:58:31 time=1756126711.690611 func=Remove level=ERROR logid=67cfd6a8-81b2-11f0-b4dc-c81f66db3ddd unit=fst@jbod-mgmt-02.sdfarm.kr:1096 tid=00007f6de2753640 source=RainMetaLayout:1556 tident=monalisa.512032:193@pcalimonitor.cern.ch sec=unix uid=0 gid=0 name=monalisa geo="" xt="" ob="" msg="failed to remove remote stripe" stripe_id=1

250825 12:58:31 time=1756126711.697138 func=_close_wr level=CRIT logid=67cfd6a8-81b2-11f0-b4dc-c81f66db3ddd unit=fst@jbod-mgmt-02.sdfarm.kr:1096 tid=00007f6de2753640 source=XrdFstOfsFile:2031 tident=monalisa.512032:193@pcalimonitor.cern.ch sec= uid=10367 gid=1395 name=nobody geo="" xt="" ob="" info="deleting on close" fn=/02/12000/0b014523-81b2-11f0-bd8e-00620ba19d1e fstpath=/jbod/box_04_disk_052/00002824/06202535 reason="failed layout close"

Hi Geonmo,

This case is different from the one before (the capability-illegal / timer-expired error). Here, the transfer fails because closing one of the stripes of the RAIN layout failed.

But, looking at the error messages, I think you are probably using EOS version 5.3.15 or 5.3.16, which both introduced the stripe-level checksums. This introduced a performance penalty that would explain the timeout errors during the close operations. Therefore, my recommendation is to either downgrade to 5.3.14 or move to a release >= 5.3.17 where this functionality has been disabled until we properly address the performance issues.

Cheers,

Elvin

Hello, Elvin.

We are currently using v5.3.14.

It was working fine, but over the weekend an "ADD" error appeared, so we are investigating it.

As you mentioned, none of the disks on which the stripe operation was executed reported a failure, but we suspect that some disks may not be functioning normally, so we are checking them.
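
For reference, we are checking them roughly like this (assuming the eos CLI is available on the MGM and smartmontools on the FST nodes; the device name is only a placeholder):

# list filesystems currently flagged in an error state
eos fs ls -e

# check the SMART health of a suspicious drive on the FST
smartctl -a /dev/sdX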

If you have any additional advice, please feel free to share it with me.

Regards,

– Geonmo

Hi Geonmo,

I made a mistake in my previous post: the versions to avoid are anything between >= 5.3.12 and < 5.3.16. Upgrading to >= 5.3.16 would at least eliminate one known issue with RAIN file performance.
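
For example, a rough sketch assuming the standard eos-server/eos-client packages and systemd unit names on an EL-based node (adapt it to the exact package set and version you use):

# upgrade the EOS packages to a fixed release, e.g. 5.3.16
dnf upgrade eos-server-5.3.16 eos-client-5.3.16

# restart the affected services afterwards
systemctl restart eos@mgm   # on the MGM node
systemctl restart eos@fst   # on each FST node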

Cheers,

Elvin