MGM not online/reachable

Hello,

Occasionally, we can’t issue eos client commands on the MGM node; we get the following error message:

error: MGM root://antares-eos01.scd.rl.ac.uk not online/reachable

Not sure if this poses a problem for the FSTs (if they rely on the MGM being up), but the FST GC service that we have set up to run with CTA cannot run, e.g.

2021/11/17 09:55:32.723000 antares-eos02.scd.rl.ac.uk ERROR cta-fst-gcd:LVL="ERROR" PID="70182" TID="70182" MSG="fsls: Failed to execute eos -r 0 0 root://antares-eos01.scd.rl.ac.uk fs ls -m: returncode=64 returncodestr='Machine is not on the network' stderr='error: MGM root://antares-eos01.scd.rl.ac.uk not online/reachable'"

Restarting eos@mgm temporarily fixes this, but it is troubling that it is happening. Do you have any ideas how to find out what is happening and how to resolve it permanently?

Thanks,

George

Hi George,

This usually points to some networking problem with your instance. The client command first tries to ping the MGM, and if it gets no reply within 5 seconds, it declares the MGM not online.

So unless your MGM is crashing regularly and getting restarted by systemd, which you can easily check by connecting to the MGM machine, this means there is some problem with the networking.

Cheers,
Elvin
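
One way to tell the two cases apart is to probe the MGM at plain TCP level while the eos client is failing. The sketch below is only an illustration, not the xrootd ping that the client performs: it opens a TCP connection with the same 5-second budget and reports how long the connect took. The function name is made up for this example, and port 1094 is assumed to be the xrootd port the MGM listens on. If the TCP connect keeps succeeding quickly while the eos command reports "not online/reachable", that would hint at the daemon not answering rather than at the network.

```python
import socket
import time
from typing import Optional


def tcp_connect_time(host: str, port: int = 1094, timeout: float = 5.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if the connect fails.

    Hypothetical helper: this only checks that the port accepts connections
    within `timeout`; it does not speak the xrootd protocol.
    """
    start = time.monotonic()
    try:
        # create_connection resolves the name and applies the timeout to connect()
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # covers timeouts, refused connections and name resolution failures
        return None
```

Calling `tcp_connect_time("antares-eos01.scd.rl.ac.uk")` from the node that runs cta-fst-gcd at regular intervals, and logging the result next to the eos return code, would give a rough timeline to compare against the MGM log.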

Hi Elvin,

Thanks for this. It is not very likely that this is happening because of a network problem. We often see that the MGM becomes gradually slower in its responses to EOS client commands (which eventually succeed) until it becomes unresponsive, as described above.

Is there any error in the MGM log indicating that the MGM somehow “hangs”?

For info, we are running 4.8.45-1.

George

Hi George,

It’s highly unlikely that the MGM cannot deal with the number of requests. Unfortunately, given the available amount of information, it’s very hard to say exactly what the problem is. Running eos ns stat should already give you a pretty good view of the requests that it’s handling and the average response time per request. Hope this helps.

Cheers,
Elvin

Hi George,
the error you get is an xrootd ping message; this has nothing to do with the internal plug-in on the MGM, unless all xrootd threads are locked up with something. If you update to a newer version, it shows how many threads are currently used by whom at the end of the ‘eos ns stat’ output.

┌────────┬───────┬────────┬───────┬──────┬─────────┬────────────────┐
│     uid│threads│sessions│  limit│stalls│stalltime│          status│
└────────┴───────┴────────┴───────┴──────┴─────────┴────────────────┘
        0       1        4 65.54 K      0         1          user-OK 

Hi Elvin, Andreas

Thanks for the replies. In case it helps, here are the requests with the highest totals (the sum column, judged roughly by eye) as I see them in the output of "eos ns stat":

all AttrLs 187.18 K 16.50 8.81 3.81 14.18 0.11 ± 0.03
all Eosxd::int::MonitorCaps 13.04 K 0.75 0.98 1.00 1.00 0.02 ± 0.00
all Exists 192.02 K 6.50 7.59 4.19 14.58 0.29 ± 0.04
all HashGet 10.83 M 921.50 991.27 834.03 828.36 -NA- -NA-
all HashSet 6.29 M 519.00 516.07 488.33 483.00 -NA- -NA-
all IdMap 8.16 K 15.50 2.00 0.57 0.59 0.07 ± 0.04
all NsLockR 761.52 K 46.00 32.53 16.24 57.77 -NA- -NA-
all NsLockW 9.17 K 0.50 0.68 0.70 0.70 -NA- -NA-
all Open 3.39 K 10.00 0.68 0.13 0.23 0.91 ± 0.27
all OpenDir 5 0.00 0.00 0.00 0.00 7.17 ± 13.94
all OpenRead 3.39 K 10.00 0.68 0.13 0.23 -NA- -NA-
all QueryPrepare 180.73 K 6.50 6.12 3.25 13.72 6.55 ± 4.70
all Stat 182.48 K 11.50 7.12 3.45 13.84 0.13 ± 0.02
all ViewLockR 359.76 K 38.00 30.05 27.72 27.57 -NA- -NA-

George

Yeah, that all looks like ‘peanuts’ … the last two columns show you the average execution time and its sigma … maybe you can post the full listing here so we can see if anything sticks out.

Thanks, here is the full listing

│who│command │ sum│ 5s│ 1min│ 5min│ 1h│exec(ms)│sigma(ms)│
└───┴────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┴─────────┘
all Access 2.78 K 0.00 0.00 0.00 0.13 -NA- -NA-
all AdjustReplica 0 0.00 0.00 0.00 0.00 -NA- -NA-
all AttrGet 0 0.00 0.00 0.00 0.00 -NA- -NA-
all AttrLs 305.79 K 0.00 2.03 24.80 14.21 0.12 ± 0.01
all AttrRm 0 0.00 0.00 0.00 0.00 -NA- -NA-
all AttrSet 5 0.00 0.00 0.00 0.00 0.37 ± 0.04
all Cd 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Checksum 7 0.00 0.00 0.00 0.00 -NA- -NA-
all Chmod 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Chown 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Commit 2 0.00 0.00 0.00 0.00 0.68 ± 0.03
all CommitFailedFid 0 0.00 0.00 0.00 0.00 -NA- -NA-
all CommitFailedNamespace 0 0.00 0.00 0.00 0.00 -NA- -NA-
all CommitFailedParameters 0 0.00 0.00 0.00 0.00 -NA- -NA-
all CommitFailedUnlinked 0 0.00 0.00 0.00 0.00 -NA- -NA-
all ConversionDone 0 0.00 0.00 0.00 0.00 -NA- -NA-
all ConversionFailed 0 0.00 0.00 0.00 0.00 -NA- -NA-
all CopyStripe 0 0.00 0.00 0.00 0.00 -NA- -NA-
all DrainCentralFailed 0 0.00 0.00 0.00 0.00 -NA- -NA-
all DrainCentralStarted 0 0.00 0.00 0.00 0.00 -NA- -NA-
all DrainCentralSuccessful 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Drop 6 0.00 0.00 0.00 0.00 0.47 ± 0.10
all DropAllStripes 6 0.00 0.00 0.00 0.00 1.03 ± 1.27
all DropStripe 0 0.00 0.00 0.00 0.00 -NA- -NA-
all DumpMd 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::BEGINFLUSH 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::CREATE 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::CREATELNK 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::DELETE 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::DELETELNK 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::ENDFLUSH 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::GET 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::GETCAP 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::GETLK 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::LS 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::LS-Entry 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::MKDIR 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::MV 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::RENAME 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::RMDIR 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::SET 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::SETLK 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::SETLKW 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::SETLNK 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::ext::UPDATE 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::AuthRevocation 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::BcConfig 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::BcDeletion 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::BcDeletionExt 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::BcDropAll 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::BcMD 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::BcRefresh 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::BcRefreshExt 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::BcRelease 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::BcReleaseExt 5 0.00 0.00 0.00 0.00 0.01 ± 0.00
all Eosxd::int::DeleteEntry 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::FillContainerCAP 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::FillContainerMD 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::FillFileMD 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::Heartbeat 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::MonitorCaps 21.28 K 1.00 1.00 1.00 1.00 0.02 ± 0.00
all Eosxd::int::RefreshEntry 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::ReleaseCap 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::SendCAP 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::SendMD 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::Store 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::int::ValidatePERM 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::prot::LS 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::prot::SET 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::prot::STAT 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::prot::evicted 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::prot::mount 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::prot::offline 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Eosxd::prot::umount 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Event 1.39 K 0.00 0.00 0.00 0.07 0.46 ± 0.06
all Exists 313.70 K 0.00 2.81 25.47 14.58 0.26 ± 0.07
all FileInfo 11 0.00 0.00 0.00 0.00 -NA- -NA-
all Find 0 0.00 0.00 0.00 0.00 -NA- -NA-
all FindEntries 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse-Access 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse-Checksum 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse-Chmod 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse-Chown 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse-Mkdir 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse-Stat 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse-Statvfs 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse-Utimes 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Fuse-XAttr 0 0.00 0.00 0.00 0.00 -NA- -NA-
all GetFusex 0 0.00 0.00 0.00 0.00 -NA- -NA-
all GetMd 2 0.00 0.00 0.00 0.00 -NA- -NA-
all GetMdLocation 0 0.00 0.00 0.00 0.00 -NA- -NA-
all HashGet 17.67 M 847.75 700.02 830.54 830.42 -NA- -NA-
all HashSet 10.28 M 692.00 469.15 486.02 484.34 -NA- -NA-
all HashSetNoLock 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-COPY 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-DELETE 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-GET 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-HEAD 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-LOCK 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-MKCOL 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-MOVE 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-OPTIONS 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-POST 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-PROPFIND 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-PROPPATCH 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-PUT 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-TRACE 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Http-UNLOCK 0 0.00 0.00 0.00 0.00 -NA- -NA-
all IdMap 13.34 K 0.25 0.20 0.34 0.63 0.09 ± 0.02
all LRUFind 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Ls 7 0.00 0.00 0.00 0.00 -NA- -NA-
all MarkClean 0 0.00 0.00 0.00 0.00 -NA- -NA-
all MarkDirty 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Mkdir 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Motd 0 0.00 0.00 0.00 0.00 -NA- -NA-
all MoveStripe 0 0.00 0.00 0.00 0.00 -NA- -NA-
all NsLockR 1.24 M 0.00 10.07 100.89 57.81 -NA- -NA-
all NsLockW 14.94 K 0.50 0.68 0.70 0.70 -NA- -NA-
all Open 5.58 K 0.00 0.00 0.09 0.26 0.89 ± 0.28
all OpenDir 8 0.00 0.00 0.00 0.00 16.17 ± 30.57
all OpenDir-Entry 1.03 K 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenFailedCreate 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenFailedENOENT 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenFailedExists 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenFailedNoUpdate 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenFailedPermission 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenFailedQuota 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenFailedReconstruct 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenFileOffline 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenProc 879 0.25 0.03 0.05 0.04 -NA- -NA-
all OpenRead 5.58 K 0.00 0.00 0.09 0.26 -NA- -NA-
all OpenShared 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenStalled 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenWrite 2 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenWriteCreate 0 0.00 0.00 0.00 0.00 -NA- -NA-
all OpenWriteTruncate 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Prepare 1.39 K 0.00 0.00 0.00 0.07 14.54 ± 9.06
all Proto::Close 2 0.00 0.00 0.00 0.00 1.40 ± 1.20
all Proto::EvictPrepare 6 0.00 0.00 0.00 0.00 3.29 ± 3.26
all Proto::Prepare 1.38 K 0.00 0.00 0.00 0.07 0.18 ± 0.02
all Proto::Prepare::Abort 1 0.00 0.00 0.00 0.00 11.77 -NA-
all Proto::Send::sync::abort_prepare 1 0.00 0.00 0.00 0.00 8.82 -NA-
all Proto::Send::sync::prepare 4 0.00 0.00 0.00 0.00 21.72 ± 4.03
all QueryPrepare 295.32 K 0.00 2.00 24.67 13.71 11.92 ± 16.83
all QueryResync 0 0.00 0.00 0.00 0.00 -NA- -NA-
all QuotaLockR 6 0.00 0.00 0.01 0.00 -NA- -NA-
all QuotaLockW 2 0.00 0.00 0.00 0.00 -NA- -NA-
all ReadLink 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Recycle 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Redirect 0 0.00 0.00 0.00 0.00 -NA- -NA-
all RedirectENOENT 0 0.00 0.00 0.00 0.00 -NA- -NA-
all RedirectENONET 0 0.00 0.00 0.00 0.00 -NA- -NA-
all RedirectR 0 0.00 0.00 0.00 0.00 -NA- -NA-
all RedirectR-Master 0 0.00 0.00 0.00 0.00 -NA- -NA-
all RedirectW 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Rename 0 0.00 0.00 0.00 0.00 -NA- -NA-
all ReplicaFailedChecksum 0 0.00 0.00 0.00 0.00 -NA- -NA-
all ReplicaFailedSize 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Rm 0 0.00 0.00 0.00 0.00 -NA- -NA-
all RmDir 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Schedule2Balance 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Schedule2Delete 3.85 K 0.00 0.19 0.18 0.18 -NA- -NA-
all Scheduled2Balance 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Scheduled2Delete 6 0.00 0.00 0.00 0.00 0.75 ± 0.04
all Scheduled2Drain 0 0.00 0.00 0.00 0.00 -NA- -NA-
all SchedulingFailedBalance 0 0.00 0.00 0.00 0.00 -NA- -NA-
all SchedulingFailedDrain 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Stall 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Stat 298.15 K 0.00 2.00 24.69 13.85 0.12 ± 0.01
all Symlink 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Touch 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Truncate 0 0.00 0.00 0.00 0.00 -NA- -NA-
all TxState 0 0.00 0.00 0.00 0.00 -NA- -NA-
all VerifyStripe 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Version 0 0.00 0.00 0.00 0.00 -NA- -NA-
all Versioning 0 0.00 0.00 0.00 0.00 -NA- -NA-
all ViewLockR 587.29 K 32.75 26.86 27.39 27.63 -NA- -NA-
all ViewLockW 112 0.00 0.00 0.00 0.00 -NA- -NA-
all WFEFind 2.12 K 0.00 0.10 0.10 0.10 2.27 ± 0.40
all WhoAmI 0 0.00 0.00 0.00 0.00 -NA- -NA-
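
To spot what sticks out in a listing like this without reading it by eye, the rows can be sorted by the exec(ms) column. The following is a minimal sketch that assumes the plain-text row layout pasted above (who, command, sum, four rate columns, then exec and sigma, with -NA- for missing values); the function and class names are made up for this example, and other EOS versions may format the output differently.

```python
from typing import List, NamedTuple, Optional


class NsStatRow(NamedTuple):
    who: str
    command: str
    total: str                  # kept as text, e.g. "295.32 K"
    exec_ms: Optional[float]    # None when the column shows -NA-
    sigma_ms: Optional[float]   # None when the column shows -NA-


def _num(token: str) -> Optional[float]:
    """Convert a column value to float, mapping the -NA- placeholder to None."""
    return None if token == "-NA-" else float(token)


def parse_row(line: str) -> Optional[NsStatRow]:
    """Parse one data row; return None for headers, borders and other lines."""
    tokens = line.split()
    if len(tokens) < 9:
        return None
    try:
        if tokens[-2] == "±":
            # row ends with "<exec> ± <sigma>"
            sigma, exec_ms, rest = float(tokens[-1]), _num(tokens[-3]), tokens[:-3]
        else:
            # row ends with "<exec> <sigma>", either of which may be -NA-
            sigma, exec_ms, rest = _num(tokens[-1]), _num(tokens[-2]), tokens[:-2]
        for rate in rest[-4:]:
            float(rate)  # the 5s/1min/5min/1h columns must be numeric
    except ValueError:
        return None
    head = rest[:-4]
    if len(head) < 3:
        return None
    # the sum column may be one token ("8") or two ("295.32 K")
    return NsStatRow(head[0], head[1], " ".join(head[2:]), exec_ms, sigma)


def slowest(text: str, top: int = 5) -> List[NsStatRow]:
    """Return the rows with the highest average execution time."""
    rows = [parse_row(line) for line in text.splitlines()]
    timed = [r for r in rows if r is not None and r.exec_ms is not None]
    return sorted(timed, key=lambda r: r.exec_ms, reverse=True)[:top]
```

Applied to the listing above, the highest averages belong to Proto::Send::sync::prepare (21.72 ms), OpenDir (16.17 ms) and Prepare (14.54 ms), so nothing in this snapshot is slower than a few tens of milliseconds on average.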