MGM not online/reachable

georgep · September 5, 2024, 1:43pm

Hello,

We have set up an EOS client machine (eos-client-5.2.23-1.el9.x86_64), and although all of the following variables have been exported:

export EOS_MGM_URL=root://antares-eos01.scd.rl.ac.uk
export XrdSecPROTOCOL=sss
export XrdSecSSSKT=/etc/eos.keytab

We get the error

[root@cms-rucio-services1 ~]# eos ns
error: MGM root://antares-eos01.scd.rl.ac.uk not online/reachable
[root@cms-rucio-services1 ~]#

ping/traceroute/tracepath/ssh work both ways

I don't think we have missed anything, but can you please confirm?

Thanks,

George

georgep · September 5, 2024, 2:04pm

Strangely, pointing to the MGM of another EOS cluster works…

[root@cms-rucio-services1 ~]# EOS_MGM_URL=root://antares-eos15.scd.rl.ac.uk eos node ls
┌──────────┬────────────────────────────────┬────────────────┬──────────┬────────────┬────────────────┬─────┐
│type      │                        hostport│          geotag│    status│   activated│  heartbeatdelta│ nofs│
└──────────┴────────────────────────────────┴────────────────┴──────────┴────────────┴────────────────┴─────┘
 nodesview   antares-eos15.scd.rl.ac.uk:1095            undef     online           on                2    15
 nodesview   antares-eos16.scd.rl.ac.uk:1095            undef     online           on                1    15
 nodesview   antares-eos17.scd.rl.ac.uk:1095            undef     online           on                2    23
 nodesview   antares-eos18.scd.rl.ac.uk:1095            undef     online           on                2    23
 nodesview   antares-eos19.scd.rl.ac.uk:1095            undef     online           on                3    23
 nodesview   antares-eos20.scd.rl.ac.uk:1095            undef     online           on                2    23
 nodesview   antares-eos21.scd.rl.ac.uk:1095            undef     online           on                3    23
 nodesview   antares-eos22.scd.rl.ac.uk:1095            undef     online           on                1    23
 nodesview   antares-eos23.scd.rl.ac.uk:1095            undef     online           on                2    23

[root@cms-rucio-services1 ~]#

rptaylor · September 5, 2024, 10:06pm

Running a client command with the XRD_LOGLEVEL=Dump env var can be useful.

You could also try e.g. nc -vz antares-eos01.scd.rl.ac.uk 1094 to check the network connection.
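
The two checks suggested above can be wrapped in small helpers so they are easy to repeat. This is only a sketch: the function names check_port and auth_errors are made up for illustration, and it assumes an nc build that supports the -z and -w options.

```shell
# Return 0 if host:port accepts a TCP connection within 5 seconds.
# Same idea as "nc -vz host port", but quiet so it can be used in scripts.
check_port() {
    nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}

# Keep only the authentication failures from an XRootD client log on stdin.
auth_errors() {
    grep -i 'auth failed'
}

# Example usage (hostname and port taken from the posts above):
#   check_port antares-eos01.scd.rl.ac.uk 1094 && echo "port reachable"
#   XRD_LOGLEVEL=Dump eos whoami 2>&1 | auth_errors
```

A reachable port only shows that the network path is fine; authentication problems will show up in the client log instead.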

georgep · September 6, 2024, 10:35am

Thanks for these suggestions.

nc shows a successful connection to the MGM.

Running an eos client command with XRD_LOGLEVEL=Dump set does show some fatal auth errors, which I really can't understand, as /etc/eos.keytab is in place.

[2024-09-06 11:30:33.005701 +0100][Error ][AsyncSock ] [antares-eos01.scd.rl.ac.uk:1094.0] Socket error while handshaking: [FATAL] Auth failed
[2024-09-06 11:30:33.005801 +0100][Error ][PostMaster ] [antares-eos01.scd.rl.ac.uk:1094] Unable to recover: [FATAL] Auth failed.
[2024-09-06 11:30:33.005824 +0100][Debug ][XRootD ] [antares-eos01.scd.rl.ac.uk:1094] Handling error while processing kXR_ping (): [FATAL] Auth failed.
[2024-09-06 11:30:33.005898 +0100][Debug ][ExDbgMsg ] [antares-eos01.scd.rl.ac.uk:1094] Calling MsgHandler: 0xc778e0 (message: kXR_ping () ) with status: [FATAL] Auth failed.

George

georgep · September 9, 2024, 3:17pm

Sorry for the hassle.

Any thoughts on this issue?

esindril · September 10, 2024, 6:37am

Hi @georgep ,

As Ryan already pointed out, the full output of the eos whoami command with XRD_LOGLEVEL=Dump is very useful in this situation. By the looks of it (though not all the relevant info is present in your snippet), either the sss keytab on the client does not match the one on the server, or the client's sss key is not in the list of sss keys accepted by the server. A quick checksum of the sss keytab files concerned should clear up the mystery.

Cheers,
Elvin
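
One possible way to do the checksum comparison is sketched below. The helper names keytab_digest and same_keytab are hypothetical, and sha256sum is just one choice of checksum tool. Note that differing digests are not conclusive on their own if the server keytab holds several keys.

```shell
# Print only the SHA-256 digest of a file.
keytab_digest() {
    sha256sum "$1" | cut -d' ' -f1
}

# Return 0 if two keytab copies are byte-identical,
# 1 if they differ, 2 if either file cannot be read.
same_keytab() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    [ "$(keytab_digest "$1")" = "$(keytab_digest "$2")" ]
}

# Example usage:
#   keytab_digest /etc/eos.keytab     # run on the client and on the MGM, then compare
#   same_keytab ./client.keytab ./server.keytab && echo "keytabs match"
```

The file names in the second usage line stand for local copies and are placeholders only.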

georgep · September 10, 2024, 8:14am

Hi Elvin,

I had forgotten to update the “sec.protbind” line in /etc/xrd.cf.mgm with the client’s new hostname. Because our first (and overriding) binding is “sec.protbind * only gsi unix”, the client's sss authentication failed.

Apologies for this!

George
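
For anyone hitting the same symptom, a quick way to review the bindings on the MGM is to list every active sec.protbind line in the config file. The sketch below is an illustration only; list_protbind is a made-up helper name, and it shows nothing more than which bindings are present and on which lines.

```shell
# Print the active sec.protbind directives of an XRootD config file,
# prefixed with their line numbers. Commented-out lines are skipped.
list_protbind() {
    grep -nE '^[[:space:]]*sec\.protbind[[:space:]]' "$1"
}

# Example usage (config path taken from the post above):
#   list_protbind /etc/xrd.cf.mgm
```

If the client's hostname does not appear in any of the listed bindings, the wildcard binding is the one to look at.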
