
FST does not connect to MGM (msg="failed to contact broker")

Hello,
I am stuck with a two-node setup (MGM+MQ on one node, FST on the other), similar to the one described on the "EOS admin configuration" page of the EOS CITRINE documentation. The MGM+MQ server seems to run, but the FST server fails to connect to it.
In the following logs, grid05 is MGM+MQ and nfs10 is FST.

Each node can reach the other node’s EOS ports (1094 and 1097 on grid05, 1095 on nfs10) when EOS is up:

[grid05 ~]$ nc -z nfs10.aligrid.hiroshima-u.ac.jp 1095; echo $?
0

[nfs10 ~]$ nc -z grid05.aligrid.hiroshima-u.ac.jp 1094; echo $?
0
[nfs10 ~]$ nc -z grid05.aligrid.hiroshima-u.ac.jp 1097; echo $?
0

The content of /etc/sysconfig/eos_dev is based on the example in the configuration page:

DAEMON_COREFILE_LIMIT=unlimited
XRD_ROLES="fst"
LD_PRELOAD=/usr/lib64/libjemalloc.so.1
EOS_BROKER_URL=root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/

# Not mentioned in https://eos-docs.web.cern.ch/quickstart/admin/configure.html#setup-fst
# At least EOS_MGM_ALIAS and EOS_GEOTAG seem mandatory 
EOS_INSTANCE_NAME=eosalice
EOS_GEOTAG="::EOS"
EOS_MGM_ALIAS=grid05.aligrid.hiroshima-u.ac.jp
EOS_MAIL_CC="***@***" # a mailing list address, actually
EOS_NOTIFY="mail -s `date +%s`-`hostname`-eos-notify $EOS_MAIL_CC"
EOS_NS_ACCOUNTING=1
EOS_SYNCTIME_ACCOUNTING=1
EOS_USE_SHARED_MUTEX=1
#EOS_FST_NO_SSS_ENFORCEMENT=1
EOS_HTTP_THREADPOOL="epoll"
EOS_HTTP_THREADPOOL_SIZE=16
EOS_HTTP_CONNECTION_MEMORY_LIMIT=4194304
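
To make sure the port probed with nc is the one the FST is actually configured to use, the host and port can be taken from the EOS_BROKER_URL value itself. This is only a minimal sketch: broker_hostport is a hypothetical helper name (not part of EOS), and it assumes the URL has the root://host:port/... form shown above.

```shell
#!/bin/sh
# Hypothetical helper: print "host port" for a root://host:port/... URL.
# Prints nothing if the URL does not have that form.
broker_hostport() {
    printf '%s\n' "$1" | sed -n 's|^root://\([^:/]*\):\([0-9][0-9]*\)/.*|\1 \2|p'
}

# Example use (assumes EOS_BROKER_URL is already set in this shell):
#   set -- $(broker_hostport "$EOS_BROKER_URL")
#   nc -z "$1" "$2"; echo $?
```

Note that a successful nc -z only shows that the TCP port accepts connections; it says nothing about whether authentication against the MQ succeeds.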

The content of /etc/xrd.cf.fst is unchanged from the original.

The content of /var/log/eos/xrdlog.fst:
210402 11:42:43 9531 Starting on Linux 3.10.0-1160.21.1.el7.x86_64
Copr.  2004-2012 Stanford University, xrd version v4.12.5
++++++ xrootd fst@nfs10.aligrid.hiroshima-u.ac.jp initialization started.
Config using configuration file /etc/xrd.cf.fst
=====> xrd.network keepalive
=====> xrd.port 1095
Config maximum number of connections restricted to 65000
Copr.  2012 Stanford University, xrootd protocol 4.0.0 version v4.12.5
++++++ xrootd protocol initialization started.
=====> xrootd.fslib -2 libXrdEosFst.so
=====> xrootd.async off nosf
=====> xrootd.redirect grid05.aligrid.hiroshima-u.ac.jp:1094 chksum
=====> xrootd.seclib libXrdSec.so
=====> all.export / nolock
Config exporting /
Plugin loaded
++++++ Authentication system initialization started.
Plugin loaded
=====> sec.protocol unix
Plugin loaded
=====> sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
=====> sec.protbind * only unix sss
Config 3 authentication directives processed in /etc/xrd.cf.fst
------ Authentication system initialization completed.
++++++ Protection system initialization started.
Config warning: Security level is set to none; request protection disabled!
Config Local  protection level: none
Config Remote protection level: none
------ Protection system initialization completed.
Config Routing for nfs10.aligrid.hiroshima-u.ac.jp: local pub4 prv4
Config Route all4: nfs10.aligrid.hiroshima-u.ac.jp Dest=[::133.41.115.112]:1095
Plugin loaded
++++++ (c) 2010 CERN/IT-DSS FstOfs (Object Storage File System) 4.8.31
++++++ File system initialization started.
=====> ofs.persist off
=====> ofs.osslib libEosFstOss.so
=====> ofs.tpc pgm /usr/bin/xrdcp
Plugin No such file or directory loading osslib libEosFstOss-4.so
Config Falling back to using libEosFstOss.so
Plugin loaded
210402 11:42:44 time=1617331364.036562 func=Configure                level=INFO  logid=19e127f0-935d-11eb-8081-90e2ba9a1550 unit=fstoss@localhost tid=00007f78bde91780 source=XrdFstOss:170                  tident= sec=      uid=0 gid=0 name= geo="" preread depth=0, queue_size=0 and bytes=0
Config effective /etc/xrd.cf.fst ofs configuration:
       all.role server
       ofs.maxdelay   60
       ofs.persist    off hold 600
       ofs.trace      0
       ofs.osslib libEosFstOss.so
------ File system server initialization completed.
=====> fstofs enforces SSS authentication for XROOT clients
=====> fstofs.autoboot : true
=====> fstofs.broker : root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst
=====> eoscp-log : /var/log/eos/fst/eoscp.log
=====> fstofs.defaultreceiverqueue : /eos/*/mgm
=====> fstofs.authdir : /var/eos/auth/
210402 11:42:44 time=1617331364.073819 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:110                    tident= sec=      uid=0 gid=0 name= geo="" starting scrubbing thread
210402 11:42:44 time=1617331364.073946 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:121                    tident= sec=      uid=0 gid=0 name= geo="" starting trim thread
210402 11:42:44 time=1617331364.073998 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:131                    tident= sec=      uid=0 gid=0 name= geo="" starting deletion thread
210402 11:42:44 time=1617331364.074081 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:141                    tident= sec=      uid=0 gid=0 name= geo="" starting report thread
210402 11:42:44 time=1617331364.074962 func=Scrub                    level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a3bff700 source=Scrub:39                       tident= sec=      uid=0 gid=0 name= geo="" msg="create scrubbing pattern ..."
210402 11:42:44 time=1617331364.075034 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:151                    tident= sec=      uid=0 gid=0 name= geo="" starting error report thread
210402 11:42:44 time=1617331364.075916 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a39fd700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Remover ..."
210402 11:42:44 time=1617331364.076634 func=Scrub                    level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a3bff700 source=Scrub:48                       tident= sec=      uid=0 gid=0 name= geo="" msg="start scrubbing"
210402 11:42:44 time=1617331364.076681 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:161                    tident= sec=      uid=0 gid=0 name= geo="" starting verification thread
210402 11:42:44 time=1617331364.077307 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:171                    tident= sec=      uid=0 gid=0 name= geo="" starting filesystem communication thread
210402 11:42:44 time=1617331364.077858 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:182                    tident= sec=      uid=0 gid=0 name= geo="" starting daemon supervisor thread
210402 11:42:44 time=1617331364.077925 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:192                    tident= sec=      uid=0 gid=0 name= geo="" starting filesystem publishing thread
210402 11:42:44 time=1617331364.078032 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:196                    tident= sec=      uid=0 gid=0 name= geo="" starting filesystem balancer thread
210402 11:42:44 time=1617331364.078674 func=Communicator             level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a1fff700 source=Communicator:400               tident= sec=(null) uid=99 gid=99 name=- geo="" msg="starting communicator thread"
210402 11:42:44 time=1617331364.079065 func=Supervisor               level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a2ffc700 source=Supervisor:39                  tident= sec=(null) uid=99 gid=99 name=- geo="" Supervisor activated ...
210402 11:42:44 time=1617331364.079228 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:207                    tident= sec=      uid=0 gid=0 name= geo="" starting mgm synchronization thread
210402 11:42:44 time=1617331364.079286 func=StartNotifyCurrentThread level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a1fff700 source=XrdMqSharedObject:1883         tident= sec=(null) uid=99 gid=99 name=- geo="" Starting notification
210402 11:42:44 time=1617331364.079926 func=Publish                  level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a17fe700 source=Publish:416                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="publisher activated"
210402 11:42:44 time=1617331364.080715 func=Balancer                 level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a2efb700 source=Balancer:290                   tident= sec=(null) uid=99 gid=99 name=- geo="" Start Balancer ...
210402 11:42:44 time=1617331364.080789 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:218                    tident= sec=      uid=0 gid=0 name= geo="" starting /var/ partition monitor thread ...
210402 11:42:44 time=1617331364.080871 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a2efb700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Balancer ..."
210402 11:42:44 time=1617331364.081595 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:228                    tident= sec=      uid=0 gid=0 name= geo="" enabling net/io load monitor
210402 11:42:44 time=1617331364.081684 func=Storage                  level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=Storage:230                    tident= sec=      uid=0 gid=0 name= geo="" enabling local disk S.M.A.R.T attribute monitor
210402 11:42:44 time=1617331364.085755 func=Monitor                  level=INFO  logid=19e83fc2-935d-11eb-9a60-90e2ba9a1550 unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a0ffd700 source=MonitorVarPartition:81         tident= sec=      uid=0 gid=0 name= geo="" msg="fst partition monitor activated"
=====> fstofs.metalogdir : /var/eos/md/
210402 11:42:44 time=1617331364.087989 func=AddBroker                level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=XrdMqClient:173                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="add broker" url="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst?xmqclient.advisory.status=0&xmqclient.advisory.query=0&xmqclient.advisory.flushbacklog=0"
210402 11:42:44 time=1617331364.088738 func=SomListener              level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789e3ff700 source=XrdMqSharedObject:2029         tident= sec=(null) uid=99 gid=99 name=- geo="" mgm="starting SOM listener"
210402 11:42:44 time=1617331364.089671 func=Communicator             level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a1fff700 source=Communicator:452               tident= sec=(null) uid=99 gid=99 name=- geo="" msg="shared object notification" type=0 subject="/eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst/gw/txqueue/txq"
210402 11:42:44 time=1617331364.107569 func=GetNetSpeed              level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a17fe700 source=Publish:97                     tident= sec=(null) uid=99 gid=99 name=- geo="" ethtool:networkspeed=10.00 GB/s
210402 11:42:44 time=1617331364.107819 func=Publish                  level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a17fe700 source=Publish:426                    tident= sec=(null) uid=99 gid=99 name=- geo="" msg="publish networkspeed=10.00 GB/s"
210402 11:42:44 time=1617331364.107845 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a17fe700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Publish ..."
210402 11:42:44 time=1617331364.111174 func=Subscribe                level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=XrdMqClient:621                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="failed to subscribe to broker" url="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst?xmqclient.advisory.status=0&xmqclient.advisory.query=0&xmqclient.advisory.flushbacklog=0"
###### mq messaging: starting thread
210402 11:42:44 time=1617331364.111370 func=RequestBroadcasts        level=NOTE  logid=19e0bce8-935d-11eb-8081-90e2ba9a1550 unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=XrdFstOfs:1810                 tident= sec=      uid=0 gid=0 name= geo="" msg="requesting broadcasts"
210402 11:42:44 time=1617331364.111524 func=Communicator             level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a1fff700 source=Communicator:452               tident= sec=(null) uid=99 gid=99 name=- geo="" msg="shared object notification" type=0 subject="*/nfs10.aligrid.hiroshima-u.ac.jp:1095"
210402 11:42:44 time=1617331364.111581 func=Communicator             level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a1fff700 source=Communicator:476               tident= sec=(null) uid=99 gid=99 name=- geo="" msg="no action on subject creation" qpath="*/nfs10.aligrid.hiroshima-u.ac.jp:1095" own_id="/eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst"
210402 11:42:44 time=1617331364.114830 func=Communicator             level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a1fff700 source=Communicator:452               tident= sec=(null) uid=99 gid=99 name=- geo="" msg="shared object notification" type=0 subject="*/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst/gw/txqueue/txq"
210402 11:42:44 time=1617331364.114955 func=Communicator             level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a1fff700 source=Communicator:452               tident= sec=(null) uid=99 gid=99 name=- geo="" msg="shared object notification" type=0 subject="/eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst/*"
210402 11:42:44 time=1617331364.121722 func=RefreshBrokersEndpoints  level=ERROR logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789d3ff700 source=XrdMqClient:495                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="failed to contact broker" url="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst_mq_test?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0"
210402 11:42:44 time=1617331364.124077 func=Configure                level=NOTE  logid=19e0bce8-935d-11eb-8081-90e2ba9a1550 unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78bde91780 source=XrdFstOfs:870                  tident= sec=      uid=0 gid=0 name= geo="" FST_HOST=nfs10.aligrid.hiroshima-u.ac.jp FST_PORT=1095 FST_HTTP_PORT=8001 VERSION=4.8.31 RELEASE=1 KEYTABADLER=ba732327
Config warning: asynchronous I/O has been disabled!
Config warning: sendfile I/O has been disabled!
Config warning: 'xrootd.prepare logdir' not specified; prepare tracking disabled.
------ xrootd protocol initialization completed.
------ xrootd fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 initialization completed.
210402 11:42:45 time=1617331365.088079 func=SendMessage              level=ERROR logid=19e0140a-935d-11eb-8081-90e2ba9a1550 unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a33ff700 source=XrdMqClient:269                tident= sec=      uid=0 gid=0 name= geo="" msg="failed to send message" dst="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst?xmqclient.advisory.status=0&xmqclient.advisory.query=0&xmqclient.advisory.flushbacklog=0" msg="/eos/*/errorreport?xrdmqmessage.header=1a8032a0-935d-11eb-a535-90e2ba9a1550^^/eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst^^^/eos/*/errorreport^errorreport^1617331365^78302000^0^0^0^0^^^^0^0^&xrdmqmessage.body=210402 11:42:44 time=1617331364.121722 func=RefreshBrokersEndpoints  level=ERROR logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789d3ff700 source=XrdMqClient:495                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="failed to contact broker" url="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst_mq_test?xmqclient.advisory.flushbacklog=0#and#xmqclient.advisory.query=0#and#xmqclient.advisory.status=0"&xrdmqmessage.mon=1"
210402 11:42:45 9576 FstOfs_SendMessage: Unable to ; success
210402 11:42:45 time=1617331365.098053 func=RefreshBrokersEndpoints  level=ERROR logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a33ff700 source=XrdMqClient:495                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="failed to contact broker" url="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst_mq_test?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0"
210402 11:42:45 time=1617331365.098119 func=ErrorReport              level=ERROR logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a33ff700 source=ErrorReport:91                 tident= sec=      uid=0 gid=0 name= geo="" msg="cannot send errorreport broadcast"
210402 11:42:45 time=1617331365.124989 func=Run                      level=NOTE  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789a3ff700 source=HttpServer:113                 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="starting http server" mode="epoll" threads=16
210402 11:42:45 time=1617331365.126792 func=Run                      level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789a3ff700 source=HttpServer:157                 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="start of micro httpd succeeded [port=8001]"
210402 11:42:46 time=1617331366.077562 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a39fd700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Remover ..."
210402 11:42:46 time=1617331366.081858 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a2efb700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Balancer ..."
210402 11:42:46 time=1617331366.107979 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a17fe700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Publish ..."
210402 11:42:46 time=1617331366.131734 func=RefreshBrokersEndpoints  level=ERROR logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789d3ff700 source=XrdMqClient:495                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="failed to contact broker" url="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst_mq_test?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0"
210402 11:42:48 time=1617331368.077740 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a39fd700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Remover ..."
210402 11:42:48 time=1617331368.082040 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a2efb700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Balancer ..."
210402 11:42:48 time=1617331368.108143 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a17fe700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Publish ..."
210402 11:42:48 time=1617331368.141637 func=RefreshBrokersEndpoints  level=ERROR logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789d3ff700 source=XrdMqClient:495                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="failed to contact broker" url="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst_mq_test?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0"
210402 11:42:49 time=1617331369.082583 func=MgmSyncer                level=INFO  logid=FstOfsStorage unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a2dfa700 source=MgmSyncer:63                   tident= sec=      uid=0 gid=0 name= geo="" msg="waiting to know manager"
210402 11:42:50 time=1617331370.077898 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a39fd700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Remover ..."
210402 11:42:50 time=1617331370.082205 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a2efb700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Balancer ..."
210402 11:42:50 time=1617331370.108315 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a17fe700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Publish ..."
210402 11:42:50 time=1617331370.151491 func=RefreshBrokersEndpoints  level=ERROR logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789d3ff700 source=XrdMqClient:495                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="failed to contact broker" url="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst_mq_test?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0"
210402 11:42:52 time=1617331372.078055 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a39fd700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Remover ..."
210402 11:42:52 time=1617331372.082374 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a2efb700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Balancer ..."
210402 11:42:52 time=1617331372.108487 func=getFstNodeConfigQueue    level=INFO  logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f78a17fe700 source=Config:44                      tident= sec=(null) uid=99 gid=99 name=- geo="" msg="waiting for config queue in Publish ..."
210402 11:42:52 time=1617331372.161358 func=RefreshBrokersEndpoints  level=ERROR logid=static.............................. unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789d3ff700 source=XrdMqClient:495                tident= sec=(null) uid=99 gid=99 name=- geo="" msg="failed to contact broker" url="root://grid05.aligrid.hiroshima-u.ac.jp:1097//eos/nfs10.aligrid.hiroshima-u.ac.jp:1095/fst_mq_test?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0"
210402 11:42:53 time=1617331373.266713 func=RequestBroadcasts        level=NOTE  logid=19e0bce8-935d-11eb-8081-90e2ba9a1550 unit=fst@nfs10.aligrid.hiroshima-u.ac.jp:1095 tid=00007f789d3ff700 source=XrdFstOfs:1810                 tident= sec=      uid=0 gid=0 name= geo="" msg="requesting broadcasts"
@@@@@@ 00:00:00 op=shutdown msg="shutdown timedout after 0 seconds, signal=15
@@@@@@ 00:00:00 op=shutdown status=forced-complete

On the MGM+MQ node, there are plenty of logs in /var/log/eos/mgm, so I am not sure which ones are relevant, but I can of course provide them if requested.

Where should I look in this situation? Any suggestions would be highly appreciated.

Best,
Masanori

I don’t know if it is relevant to your installation, but I had lots of problems because CentOS 8 ships a newer libjemalloc, so you have to update LD_PRELOAD to point to that shared library.

Hello Joseph,
Thank you for sharing your experience! My setup uses CentOS 7 (sorry for omitting this information), but I will check whether it is affected.

Hi Masanori,

Looking at the error logs, it seems the FSTs cannot connect to the MQ daemon running on port 1097. Make sure you use the same sss keys on the FSTs and on the MQ daemon, and double-check that the ports are open. If you still can’t understand what is going on, you can enable XRootD debugging by setting the following environment variable in /etc/sysconfig/eos.env:
XRD_LOGLEVEL=Dump

Then attach the FST log here and we can try to understand what is going on.
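
Before attaching a full log, it can help to condense it to the distinct errors. The following is a minimal sketch, not an EOS tool: summarize_errors is a hypothetical helper name, and it assumes the func=... level=... msg="..." line layout seen in the xrdlog.fst excerpt in this thread.

```shell
#!/bin/sh
# Hypothetical helper: count level=ERROR entries per "func: msg" pair.
# Reads an EOS-style log on stdin and prints "count func: msg", most frequent first.
summarize_errors() {
    awk '
        # Look at the first level= field only, so ERROR text quoted inside
        # another message body is not counted as a separate error.
        match($0, /level=[A-Z]+/) && substr($0, RSTART, RLENGTH) == "level=ERROR" {
            f = "?"; m = "?"
            if (match($0, /func=[A-Za-z_]+/)) f = substr($0, RSTART + 5, RLENGTH - 5)
            if (match($0, /msg="[^"]*"/))     m = substr($0, RSTART + 5, RLENGTH - 6)
            count[f ": " m]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn
}

# Example use:
#   summarize_errors < /var/log/eos/xrdlog.fst
```

For the log posted above, this kind of summary would be dominated by the repeated "failed to contact broker" entries from RefreshBrokersEndpoints, which is what points at the MQ connection.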

Cheers,
Elvin


Hello Elvin,

This was what I had missed. The MGM+MQ node recognizes the FST after I copied eos.keytab from grid05 to nfs10 and restarted EOS on both machines.

$ sudo eos node ls
┌──────────┬────────────────────────────────────┬────────────────┬──────────┬────────────┬──────┬──────────┬────────┬────────┬────────────────┬─────┐
│type      │                            hostport│          geotag│    status│   activated│  txgw│ gw-queued│  gw-ntx│ gw-rate│  heartbeatdelta│ nofs│
└──────────┴────────────────────────────────────┴────────────────┴──────────┴────────────┴──────┴──────────┴────────┴────────┴────────────────┴─────┘
 nodesview  nfs10.aligrid.hiroshima-u.ac.jp:1095              EOS     online          off    off          0       10      120                1     0
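
For anyone hitting the same error, one way to confirm that the keytab really is identical on both nodes is to compare file digests. This is only a sketch: file_digest and same_keytab are hypothetical helper names, /etc/eos.keytab is the path shown in the sec.protocol line of the FST log, and it assumes sha256sum (coreutils) is available on both machines.

```shell
#!/bin/sh
# Hypothetical helper: print only the SHA-256 digest of a file.
file_digest() {
    sha256sum "$1" | awk '{print $1}'
}

# Hypothetical helper: succeed only if the file's digest equals the given one.
same_keytab() {
    [ "$(file_digest "$1")" = "$2" ]
}

# Example use: run file_digest /etc/eos.keytab on grid05, then on nfs10:
#   same_keytab /etc/eos.keytab "<digest printed on grid05>" && echo match
```

Comparing digests avoids copying the key material around just to diff it; the keytab itself should still only be transferred over a trusted channel.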

Thank you so much for your advice!