Important notes for the eos-5.2.0 release

The EOS 5.2.0 release contains a series of changes that have long been announced, so please review the following points carefully before deciding to update.

  1. We no longer support CentOS Stream. Server and client packages are available for CentOS 7 (el-7), Alma 8 (el-8), Alma 9 (el-9), and opportunistically for some Fedora releases.
  2. There’s a strong dependency on XRootD 5.6.2 / eos-xrootd 5.6.2, providing critical client fixes and ZTN support over XRoot protocol.
  3. EOS now requires eos-grpc-1.56.1, which replaces eos-grpc-1.41.0. Versions <= 5.1.26 should stay locked to eos-grpc-1.41.0.
  4. eos-grpc-1.56.1 includes grpc, protobuf, and abseil, making eos-protobuf obsolete.
  5. Support for libmicrohttpd is deprecated and will be removed; XrdHttp is the alternative.
  6. libmicrohttpd is no longer started by default, but you can enable it using env variables.
  7. eos-nginx service is deprecated, with no new updates or releases.
  8. Migrate FMD from LevelDB to extended attributes before upgrading to eos-5.2.0; the conversion is NOT possible in 5.2.0 (see the sketch after this list).
  9. LevelDB dependency and internal SQLite implementation have been dropped.
  10. The eos find command has been redesigned for better performance and memory usage and is now equivalent to the old eos newfind command.
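
For point 8, a minimal sketch of checking an FST for LevelDB FMD and converting one filesystem, based on the eos-fmd-tool invocation shown later in this thread (the /data/disk9/ path is only an example; adjust it to your FST mounts and run the conversion with the FST stopped):

ls -d /var/eos/md/fmd.*.LevelDB                                   # LevelDB FMD is still in use if these directories exist
eos-fmd-tool --log-level debug convert --fst-path /data/disk9/    # converts the FMD of that filesystem to extended attributes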

Feel free to ask questions or share concerns in the comments.

Thanks,
Elvin
on behalf of the EOS team

We’re very much looking forward to this release; some of our users are eagerly awaiting some of the fixes. Do you have a rough timeline for the release?
Best, Erich

Hi Erich,

Glad to hear you are willing to test it! There are still some things to address for this release since it contains many new features, so bear with me. Rather than pushing something out the door quickly, we’re testing things to make sure you won’t need a new release (immediately) after upgrading to this one. :wink: Though that will quite likely be the case anyway …

Cheers,
Elvin

ETA … in the next few days! :slight_smile:

Dear @esindril ,

Before upgrading the FSTs to 5.2.x, I would like to confirm whether they are still using LevelDB. The FSTs are currently running 5.1.22 and they have /var/eos/md/fmd.00XX.LevelDB directories, which causes quite a lot of confusion for me.

The output of eos-leveldb-inspect against one of them looks like:

# eos-leveldb-inspect --dbpath /var/eos/md/fmd.0005.LevelDB --fsck
Num. entries in DB[mem_n]:                     3397579
Num. files synced from disk[d_sync_n]:         3397162
Num, files synced from MGM[m_sync_n]:          3397359
Disk/referece size missmatch[d_mem_sz_diff]:   64
MGM/reference size missmatch[m_mem_sz_diff]:   754
Disk/reference checksum missmatch[d_cx_diff]:  0
MGM/reference checksum missmatch[m_cx_diff]:   0
Num. of orphans[orphans_n]:                    1080
Num. of unregistered replicas[unreg_n]:        1
Files with num. replica missmatch[rep_diff_n]: 1
Files missing on disk[rep_missing_n]:          0

Does this indicate that the FSTs are running on LevelDB? I found that there is an eos-fmd-tool which converts from LevelDB, but so far I could not find any tool that converts FMD to LevelDB.

Would you kindly point me to something I could reference?

Thank you.

Best regards,
Sang-Un

Ah, I completely misunderstood what is written in the banner… :slightly_frowning_face: it is meant to convert LevelDB to FMD, is that right?

Now the existence of eos-fmd-tool makes sense.

Sorry for my ignorance and noise.

I got the following output when trying to run eos-fmd-tool:

# eos-fmd-tool --log-level debug convert --fst-path /data/disk9/
240216 18:21:10 time=1708075270.896798 func=main                     level=INFO  logid=static.............................. unit=EOSFileMD tid=00007f2d7bddbac0 source=ConvertFileMD:104              tident= sec=(null) uid=99 gid=99 name=- geo="" msg="got FSID from .eosfsid" fsid=9
240216 18:21:10 time=1708075270.896962 func=SetDBFile                level=INFO  logid=CommonFmdDbMapHandler unit=EOSFileMD tid=00007f2d7bddbac0 source=FmdDbMap:80                    tident=<service> sec=      uid=0 gid=0 name= geo="" LevelDB DB is now /var/eos/md/fmd.0009.LevelDB
Aborted

I was just wondering whether the command and options are correct and whether it worked properly.

EDIT

First of all, I am very sorry for making noise here. I managed to run eos-fmd-tool to convert from LevelDB to FMD and then updated to EOS v5.2.16 successfully. Just for the record, I would like to leave the steps I followed below:

  1. Stop EOS FST service
  2. Run the eos-fmd-tool command, e.g. eos-fmd-tool --log-level debug convert --fst-path /data/disk9/ --log-file fmd-convert-disk9.log (this prints no output to the console, so I recommend using tee instead: eos-fmd-tool --log-level debug convert --fst-path /data/disk9/ 2>&1 | tee fmd-convert-disk9.log); the whole sequence is sketched after this list
  3. Wait until the work finishes; it took approximately 25~30 minutes for 100 TB of files
  4. Update EOS packages
  5. Start the EOS FST service (carefully check the logs for any messages such as "conversion failed")
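
For reference, a minimal sketch of the whole sequence on one disk, assuming a typical setup where the FST runs as the systemd unit eos@fst, packages come from yum, and the FST log sits in the default location (all of these are assumptions; adjust to your deployment):

systemctl stop eos@fst                                            # assumed FST unit name
eos-fmd-tool --log-level debug convert --fst-path /data/disk9/ 2>&1 | tee fmd-convert-disk9.log
yum update eos-server eos-client eos-xrootd                       # assumed package set on the FST
systemctl start eos@fst
grep -i convert /var/log/eos/fst/xrdlog.fst                       # check for any "conversion failed" messages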

That’s it!

Thank you.

Best regards,
Sang-Un

We are currently planning the upgrade from version 5.1.22 to version 5.2.x.

We have already converted the FMD to extended attributes.

I have two questions about this:

  • Can the MGM and FSTs work with mixed versions 5.1.x and 5.2.x? If yes, is there a preferred order for upgrading one or the other first? If we can upgrade the MGM first and then do a rolling upgrade of the FSTs, this will help us minimise the maintenance period
  • Will the CentOS 7 packages remain available from the current repository? We are not yet ready to migrate the EOS hosts to Alma 9, although we hope to do so before the end of the year

Hi Franck,

Yes, the MGM can work with mixed versions on the FSTs. You can start with the upgrade of the MGM and then do the FSTs one by one. One thing you can do is disable FSCK repair and collection until the transition is done - it’s not mandatory, but it avoids some useless checks.
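
If you want to do that, a minimal sketch assuming the 5.x fsck interface with its toggle subcommands (please verify against eos fsck --help on your version):

eos fsck stat              # shows whether collection/repair are currently enabled
eos fsck toggle-repair     # disable the repair thread for the duration of the upgrade
eos fsck toggle-collect    # disable error collection; run both toggles again afterwards to re-enable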

Concerning the repositories, these will stay the same - nothing will change, although with the 5.3.0 release there will probably be no more CentOS 7 packages for EOS.

Cheers,
Elvin

Dear @esindril,

Maybe you can help. After the upgrade to version 5.2.24 two days ago, we have had an issue this morning with the MGM, and it seems linked to Kerberos authentication. Accesses time out very easily, like this:

franck@s-xxx57v$ XrdSecDEBUG=1 eos whoami
sec_Client: protocol request for host eos-jeodpp.cidsn.jrc.it token='&P=krb5,eosstorage@CIDSN.JRC.IT&P=sss,0.+13:/etc/eos.
client.keytab&P=unixoninfo'
sec_PM: Loaded krb5 protocol object from libXrdSeckrb5.so
sec_PM: Using krb5 protocol, args='eosstorage@CIDSN.JRC.IT'
Seckrb5: getCredentials
Seckrb5: context lock
Seckrb5: context locked
Seckrb5: /tmp/krb5cc_61928
Seckrb5: init context
Seckrb5: cc set default name
Seckrb5: cc default
Seckrb5: Returned 627 bytes of creds; p=eosstorage@CIDSN.JRC.IT
error: MGM root://eos-jeodpp.cidsn.jrc.it not online/reachable

Fusex clients are also affected: many clients that access using Kerberos get locked, and the fusex client log is full of

240906 11:29:18 t=1725614958.619618 f=fetchResponse    l=ERROR tid=00007fe5c83e6700 s=backend:338              error=status is NOT ok : [FATAL] Socket timeout 103 0

It seems that for the fusex clients too, it happens at the time of the first connection and ticket exchange.

I was wondering if there could be some limitation, such as the number of open files linked to Kerberos authentication. The MGM process has 14000 open files (the limit seems to be 65000, so we are below it), of which 6200 are files like /var/tmp/krb5_RCK4Z6FG (deleted), and some of them point to the file /var/tmp/eosstorage_2 (eosstorage@CIDSN.JRC.IT is the name of the Kerberos principal used to get the ticket).
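
(For reference, a quick way to count those descriptors on the MGM host; the pgrep pattern is an assumption about how the MGM xrootd process is started, so adjust it to your setup:)

MGM_PID=$(pgrep -f "xrootd.*mgm" | head -n1)   # assumed way to find the MGM xrootd process
lsof -p "$MGM_PID" | grep -c krb5_RC           # number of open Kerberos replay-cache files
ls /proc/"$MGM_PID"/fd | wc -l                 # total open file descriptors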

Or maybe there is some other problem with an external resource (Kerberos or the network, but both seem fine) that could explain this behaviour?

I noticed that the number of connections opened towards the MGM by some fuse clients (XrootdXeq: AAAAACSQ.2182:10837@xxx-015 pub IPv4 login as xxxx) has increased a lot, to something like 5 per second. They were due to some processing jobs that ran for only a few seconds on dozens of servers, and each of them caused a new connection. This seems to be the cause of the malfunction, because when we stop the jobs, authentication works correctly on the other clients. It may or may not be linked to the upgrade, but the user tells me he has launched such jobs in the past and we did not observe such a problem.

Could it be that the version upgrade either introduced congestion in Kerberos authentication, or forces the fusex clients to open a new connection at each process start, which would cause the increased number of authentications?

After some further analysis of the weeks before the upgrade, it appears that the same situation had already been observed: a high Kerberos authentication rate and some general latency for Kerberos accesses; it had just been happening during the night, or with a shorter disturbance period.

So our issue is not linked to this new version.

So in general, is there a way to control the number of Kerberos authentications, or to increase the capacity to handle them?

Hi @franck-jrc ,

Sorry for the late reply, I was off the previous days. There were no modifications to the Kerberos authentication in the 5.2.* releases. Also, we haven’t seen any issues with Krb5 in our instances, either before or after the upgrade. Is the MGM running on Alma 9 or still CentOS 7? Can you paste the krb5 configuration that you are using on the MGM?

Thanks,
Elvin

Just to make sure, you still have this in /etc/sysconfig/eos_env:

KRB5RCACHETYPE="none"

Thank you @apeters for your answer.

This line is currently commented out in our configuration file (#KRB5RCACHETYPE=none). Should it be uncommented?

If yes, it might be the reason for our issue; if not, then we need to understand why we have such congestion (in that case I would open a different ticket). We managed to make the processing jobs gentler to avoid the previous lockup, but there is still some latency from time to time while they run.

To answer Elvin’s question, we are still on CentOS 7. For the Kerberos configuration, do you mean krb5.conf or the configuration part in xrd.cf.mgm?

You have to set this, otherwise you get extremely low authentication rates due to replay cache congestion!

That should fix your problem!
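
In practice this is a one-line change, a minimal sketch assuming the standard /etc/sysconfig/eos_env file and an eos@mgm systemd unit (the unit name is an assumption):

# in /etc/sysconfig/eos_env: disable the Kerberos replay cache
KRB5RCACHETYPE=none
# then restart the MGM so the environment change is picked up
systemctl restart eos@mgm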

Hi Franck,

I was looking for something that had re-enabled the krb5 replay caching, as already mentioned by Andreas. The fact that you had these /var/tmp/krb5_RCK4Z6FG files in the MGM was suspicious.
Putting this env variable back will surely fix your issue.

Cheers,
Elvin

Thank you very much @apeters and @esindril. We had already changed this option to none a long time ago; indeed, this strange behaviour reminded me of something, but I couldn’t remember what.

It seems that the revert to the current incorrect value occurred by mistake a year and a half ago, when we refactored the configuration file to make it more similar to the packaged one and this line was forgotten. Fortunately, or not, we had no issue with this until these past days… So sorry for having brought up an old issue, but without your expert eye we wouldn’t have found the cause so easily.

One thing that has probably changed in this version is how the balancing system works.

Although it seems that balancing is acting, we no longer see the balancing state in the output of eos group ls. All groups are idle, although the groups that are above the threshold have a non-zero value in the bal-shd column.

Example with a threshold set to 10:

[root@ ~]# eos group ls
┌──────────┬────────────────┬────────────┬──────┬────────────┬────────────┬────────────┬──────────┬──────────┐
│type      │            name│      status│ N(fs)│ dev(filled)│ avg(filled)│ sig(filled)│ balancing│   bal-shd│
└──────────┴────────────────┴────────────┴──────┴────────────┴────────────┴────────────┴──────────┴──────────┘
 groupview         default.0           on      3        11.84        53.58         9.49       idle          2
 groupview         default.1           on      3        10.79        51.58         8.24       idle          3
 groupview         default.2           on      3         4.67        53.62         3.77       idle          0

Another observation is that in eos ns stat, the counters are now named BalanceStarted, BalanceSuccessful, etc. instead of Schedule2Balance, etc.

Are there also some settings that might have changed and need to be adapted to this new version, to keep the same rate as before? We currently have something like:

balancer                         := on
balancer.node.ntx                := 25
balancer.node.rate               := 25
balancer.threshold               := 10
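
For reference, these values are normally set per space with eos space config; a minimal sketch for the default space, using the same keys as in the dump above (the 5.2 central balancer may expose additional tuning keys, so please check the eos space help and the 5.2 documentation before relying on these):

eos space config default space.balancer=on
eos space config default space.balancer.threshold=10
eos space config default space.balancer.node.rate=25
eos space config default space.balancer.node.ntx=25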