peby
(Pete Eby)
December 17, 2018, 7:54pm
1
EOS on the ornl::tmp instance is failing to start fully, specifically the FST service.
This SE is a combined MGM+FST - the MGM starts fine.
FST service initially starts, but subsequently fails - similar to FST Service Silently Fails – however, previously the FST service would (periodically) fail but the SE would otherwise start and be available.
Currently:
FST service starts, but dies later “xrootd for role: fst dead but subsys locked” – no errors logged
eos health -a reports “full” (erroneous, SE has plenty of room).
eos-log-repair neither found nor fixed any issues in files for dir md files.
fs ls shows all fsids “offline” (even when FST service initially starts), and fsid status never changes.
The single disk volumes providing storage to the fsids are available, mounted, and showing no issue.
Plenty of resources: 128G, low load, no other services running, no OOM invocations, etc.
Any hints or suggestions on what may be causing this?
Cheers,
Pete
Env info and logs:
eos 4.3.12-1
root@ornl-eos-xfer.ornl.gov :~
20:32:27 # service eos status
xrootd for role: mq (pid 29511) is running (as slave) …
xrootd for role: sync (pid 29559) is running …
xrootd for role: mgm (pid 29607 29752) is running (as master) …
xrootd for role: fst dead but subsys locked
20:32:31 # eos health -a
┌────────────────────────────────┬────────┐
│hostport │ status│
└────────────────────────────────┴────────┘
warp-ornl-cern-06.ornl.gov:1095 unknown
┌────────────┬────────────┬────────────┬────────┐
│group │offline used│ online free│ status│
└────────────┴────────────┴────────────┴────────┘
default.0 0 B 0 B full
root@ornl-eos-xfer.ornl.gov :~
20:33:31 # eos node ls
┌──────────┬────────────────────────────────┬────────────────┬──────────┬────────────┬──────┬──────────┬────────┬────────┬────────────────┬─────┐
│type │ hostport│ geotag│ status│ status│ txgw│ gw-queued│ gw-ntx│ gw-rate│ heartbeatdelta│ nofs│
└──────────┴────────────────────────────────┴────────────────┴──────────┴────────────┴──────┴──────────┴────────┴────────┴────────────────┴─────┘
nodesview warp-ornl-cern-06.ornl.gov:1095 unknown on off 0 10 120 ~ 60
root@ornl-eos-xfer.ornl.gov :~
20:34:42 # tail -n 5 /var/log/eos/mgm/xrdlog.mgm
181217 20:34:47 time=1545075287.168083 func=PurgeVersion level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@warp-ornl-cern-06.ornl.gov:1094 tid=00007fc45cdfd700 source=Version:178 tident= sec= uid=0 gid=0 name= geo=“” version-dir=/eos/aliceornlxfer/grid/13/05602/.sys.v#.10c8f152-e29d-11de-b434-0025b3e7369c/ max-versions=0
181217 20:34:47 time=1545075287.168211 func=_open level=INFO logid=cfef0d3c-0232-11e9-b2e0-001b2161cbb4 unit=mgm@warp-ornl-cern-06.ornl.gov:1094 tid=00007fc45cdfd700 source=XrdMgmOfsDirectory:164 tident= sec=local uid=0 gid=0 name=root geo=“” name=opendir path=/eos/aliceornlxfer/grid/13/05602/.sys.v#.10c8f152-e29d-11de-b434-0025b3e7369c
181217 20:34:47 time=1545075287.168296 func=PurgeVersion level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@warp-ornl-cern-06.ornl.gov:1094 tid=00007fc45cdfd700 source=Version:207 tident= sec= uid=0 gid=0 name= geo=“” listrc=-1 max-version=0
181217 20:34:47 time=1545075287.168345 func=_rem level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@warp-ornl-cern-06.ornl.gov:1094 tid=00007fc45cdfd700 source=Rm:423 tident= sec=local uid=0 gid=0 name=root geo=“” msg=“deleted” can-recycle=0 path=/eos/aliceornlxfer/grid/13/05602/10c8f152-e29d-11de-b434-0025b3e7369c owner.uid=10367 owner.gid=1395 vid.uid=0 vid.gid=0
181217 20:34:47 time=1545075287.168454 func=Emsg level=ERROR logid=cfeec692-0232-11e9-b2e0-001b2161cbb4 unit=mgm@warp-ornl-cern-06.ornl.gov:1094 tid=00007fc45cdfd700 source=XrdMgmOfsFile:2874 tident=jalien.33716:123@aliendb06f.cern.ch sec=unix uid=10367 gid=1395 name=jalien geo=“” Unable to get free physical space /eos/aliceornlxfer/grid/13/05602/10c8f152-e29d-11de-b434-0025b3e7369c; No space left on device
root@ornl-eos-xfer.ornl.gov :~
20:35:26 # tail -n 5 /var/log/eos/fst/xrdlog.fst
181217 20:09:59 time=1545073799.419624 func=getFstNodeConfigQueue level=INFO logid=static… unit=fst@warp-ornl-cern-06.ornl.gov:1095 tid=00007f1b10efe700 source=Config:40 tident= sec=(null) uid=99 gid=99 name=- geo=“” Waiting for config queue in …
181217 20:09:59 time=1545073799.419915 func=getFstNodeConfigQueue level=INFO logid=static… unit=fst@warp-ornl-cern-06.ornl.gov:1095 tid=00007f1b10dfd700 source=Config:40 tident= sec=(null) uid=99 gid=99 name=- geo=“” Waiting for config queue in Cleaner …
181217 20:09:59 time=1545073799.440152 func=getFstNodeConfigQueue level=INFO logid=static… unit=fst@warp-ornl-cern-06.ornl.gov:1095 tid=00007f1b11bff700 source=Config:40 tident= sec=(null) uid=99 gid=99 name=- geo=“” Waiting for config queue in Publish …
@@@@@@ 00:00:00 op=shutdown msg="shutdown timedout after 0 seconds, signal=1
@@@@@@ 00:00:00 op=shutdown status=forced-complete
apeters
(Andreas Joachim Peters)
December 18, 2018, 2:58pm
2
Hi Pete,
attach gdb to the process of the FST xrootd and see where it exits.
That is the simplest option to try:
gdb xrootd -p
continue
… see what happens
apeters
(Andreas Joachim Peters)
December 18, 2018, 2:58pm
3
There are two FST processes , you need to attach to the first one …
apeters
(Andreas Joachim Peters)
December 18, 2018, 3:33pm
5
OK,
normally this means, there is some mess with the host name or the sss authentication.
If you do "
env XrdSecPROTOCOL=sss eos whoami
What does it show?
Is the HOST variable defined, what is it?
apeters
(Andreas Joachim Peters)
December 18, 2018, 3:43pm
7
Can you grep the FST log file for a line like:
FST_HOST=… FST_PORT=…
And can you paste/var/log/eos/mq/xrdlog.mq after you tried to start the FST …
peby
(Pete Eby)
December 18, 2018, 4:25pm
8
Hi Andreas,
Fwiw, DNS is as since the SE was deployed, with warp-ornl-cern-06 a CNAME for ornl-eos-xfer.ornl.gov
FST_PORT references:
/var/log/eos/mq/xrdlog.mq
17:19:50 # cat /var/log/eos/mq/xrdlog.mq
181218 15:10:24 161657 XrdLink: Unable to peek on ?:25@ec2-54-219-188-148.us-west-1.compute.amazonaws.com; connection reset by peer
181218 15:10:24 161657 XrdProtocol: ?:25@ec2-54-219-188-148.us-west-1.compute.amazonaws.com terminated handshake not received
181218 16:09:28 105502 Starting on Linux 2.6.32-696.30.1.el6.x86_64
Copr. 2004-2012 Stanford University, xrd version v4.8.3
++++++ xrootd mq@ornl-eos-xfer.ornl.gov initialization started.
Config using configuration file /etc/xrd.cf.mq
=====> xrd.sched mint 16 maxt 1024 idle 128
=====> xrd.port 1097
=====> xrd.network keepalive
=====> xrd.timeout idle 120
Config maximum number of connections restricted to 65000
Copr. 2012 Stanford University, xrootd protocol 3.1.0 version v4.8.3
++++++ xrootd protocol initialization started.
=====> xrootd.fslib libXrdMqOfs.so
=====> all.export /eos/ nolock
=====> xrootd.async off nosf
=====> xrootd.seclib libXrdSec.so
Config exporting /eos/
Plugin loaded
++++++ Authentication system initialization started.
Plugin loaded
=====> sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
=====> sec.protbind * only sss
Config 2 authentication directives processed in /etc/xrd.cf.mq
------ Authentication system initialization completed.
++++++ Protection system initialization started.
Config warning: Security level is set to none; request protection disabled!
Config Local protection level: none
Config Remote protection level: none
------ Protection system initialization completed.
Config Routing for warp-ornl-cern-06.ornl.gov : local pub4 prv4
Config Route all4: warp-ornl-cern-06.ornl.gov Dest=[::192.188.182.24]:1097
Plugin No such file or directory loading fslib libXrdMqOfs-4.so
Config Falling back to using libXrdMqOfs.so
Plugin loaded
++++++ (c) 2012 CERN/IT-DSS 4.3.12
Addr::QueueOutMutex 0xd8
Addr::MessageMutex 0x130
=====> mq.hostname: warp-ornl-cern-06.ornl.gov
=====> mq.hostpref: warp-ornl-cern-06
=====> mq.managerid: warp-ornl-cern-06.ornl.gov:1097
=====> mq.queue: /eos/
=====> mq.brokerid: root://warp-ornl-cern-06.ornl.gov:1097//eos/
Config warning: asynchronous I/O has been disabled!
Config warning: sendfile I/O has been disabled!
Config warning: ‘xrootd.prepare logdir’ not specified; prepare tracking disabled.
------ xrootd protocol initialization completed.
------ xrootd mq@ornl-eos-xfer.ornl.gov:1097 initialization completed.
181218 16:09:42 105506 XrootdXeq: daemon.105606:7@localhost pvt IPv4 login as daemon
181218 16:09:42 105507 XrootdXeq: daemon.105606:17@warp-ornl-cern-06 pub IPv4 login as daemon
181218 16:09:42 105508 XrootdXeq: daemon.105606:19@localhost pvt IPv4 login as daemon
181218 16:09:42 105509 XrootdXeq: daemon.105606:20@warp-ornl-cern-06 pub IPv4 login as daemon
181218 16:09:42 105510 XrootdXeq: root.105958:21@localhost pvt IPv4 login as daemon
181218 16:09:42 105953 XrootdXeq: root.105958:22@warp-ornl-cern-06 pub IPv4 login as daemon
181218 16:09:44 105971 XrootdXeq: daemon.106017:23@localhost pvt IPv4 login as daemon
181218 16:09:44 105973 XrootdXeq: daemon.106017:24@warp-ornl-cern-06 pub IPv4 login as daemon
181218 16:11:39 105973 XrootdXeq: daemon.106017:24@warp-ornl-cern-06 disc 0:01:55
181218 16:11:39 105971 XrootdXeq: daemon.106017:23@localhost disc 0:01:55
181218 16:14:06 122555 Starting on Linux 2.6.32-696.30.1.el6.x86_64
Copr. 2004-2012 Stanford University, xrd version v4.8.3
++++++ xrootd mq@ornl-eos-xfer.ornl.gov initialization started.
Config using configuration file /etc/xrd.cf.mq
=====> xrd.sched mint 16 maxt 1024 idle 128
=====> xrd.port 1097
=====> xrd.network keepalive
=====> xrd.timeout idle 120
Config maximum number of connections restricted to 65000
Copr. 2012 Stanford University, xrootd protocol 3.1.0 version v4.8.3
++++++ xrootd protocol initialization started.
=====> xrootd.fslib libXrdMqOfs.so
=====> all.export /eos/ nolock
=====> xrootd.async off nosf
=====> xrootd.seclib libXrdSec.so
Config exporting /eos/
Plugin loaded
++++++ Authentication system initialization started.
Plugin loaded
=====> sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
=====> sec.protbind * only sss
Config 2 authentication directives processed in /etc/xrd.cf.mq
------ Authentication system initialization completed.
++++++ Protection system initialization started.
Config warning: Security level is set to none; request protection disabled!
Config Local protection level: none
Config Remote protection level: none
------ Protection system initialization completed.
Config Routing for warp-ornl-cern-06.ornl.gov : local pub4 prv4
Config Route all4: warp-ornl-cern-06.ornl.gov Dest=[::192.188.182.24]:1097
Plugin No such file or directory loading fslib libXrdMqOfs-4.so
Config Falling back to using libXrdMqOfs.so
Plugin loaded
++++++ (c) 2012 CERN/IT-DSS 4.3.12
Addr::QueueOutMutex 0xd8
Addr::MessageMutex 0x130
=====> mq.hostname: warp-ornl-cern-06.ornl.gov
=====> mq.hostpref: warp-ornl-cern-06
=====> mq.managerid: warp-ornl-cern-06.ornl.gov:1097
=====> mq.queue: /eos/
=====> mq.brokerid: root://warp-ornl-cern-06.ornl.gov:1097//eos/
Config warning: asynchronous I/O has been disabled!
Config warning: sendfile I/O has been disabled!
Config warning: ‘xrootd.prepare logdir’ not specified; prepare tracking disabled.
------ xrootd protocol initialization completed.
------ xrootd mq@ornl-eos-xfer.ornl.gov:1097 initialization completed.
181218 16:14:19 122559 XrootdXeq: daemon.122651:7@localhost pvt IPv4 login as daemon
181218 16:14:19 122560 XrootdXeq: daemon.122651:17@warp-ornl-cern-06 pub IPv4 login as daemon
181218 16:14:19 122561 XrootdXeq: daemon.122651:19@localhost pvt IPv4 login as daemon
181218 16:14:19 122562 XrootdXeq: daemon.122651:20@warp-ornl-cern-06 pub IPv4 login as daemon
181218 16:14:19 122563 XrootdXeq: root.122833:21@localhost pvt IPv4 login as daemon
181218 16:14:19 122829 XrootdXeq: root.122833:22@warp-ornl-cern-06 pub IPv4 login as daemon
181218 16:14:21 122846 XrootdXeq: daemon.122898:23@localhost pvt IPv4 login as daemon
181218 16:14:21 122848 XrootdXeq: daemon.122898:24@warp-ornl-cern-06 pub IPv4 login as daemon
181218 16:16:16 122923 XrootdXeq: daemon.122898:23@localhost disc 0:01:55
181218 16:16:16 123185 XrootdXeq: daemon.122898:24@warp-ornl-cern-06 disc 0:01:55
apeters
(Andreas Joachim Peters)
December 19, 2018, 9:22am
9
Ah,
can it be, that you didn’t configure the MQ to be a master. Because the queues are never opened.
You should have an empty file under /var/eos/eos.mq.master
peby
(Pete Eby)
December 19, 2018, 2:27pm
10
On both ORNL::EOS and ORNL::TEMP SEs:
[quote]
root@ornl-eos-xfer.ornl.gov :~
15:26:22 # file /var/eos/eos.mq.master
/var/eos/eos.mq.master: cannot open `/var/eos/eos.mq.master’ (No such file or directory)
[/quote/
peby
(Pete Eby)
December 19, 2018, 3:49pm
11
The resolution of this issue was to create:
touch /var/eos/eos.mq.master
The root cause of why the mq was (apparently) put into slave mode is undetermined. There were no changes to the SE config, environment, or service scripts. Creating the empty file above immeadiatly resolved the issue with no other modifications.
Thanks for the help Andreas.