EOS Unavailable, Erroneously Reported as Full, FST svc dies

EOS on the ornl::tmp instance is failing to start fully, specifically the FST service.

This SE is a combined MGM+FST - the MGM starts fine.

The FST service initially starts but subsequently fails, similar to "FST Service Silently Fails". Previously, however, the FST service would fail periodically but the SE would otherwise start and be available.

Currently:

FST service starts, but dies later “xrootd for role: fst dead but subsys locked” – no errors logged

eos health -a reports “full” (erroneous, SE has plenty of room).

eos-log-repair neither found nor fixed any issues in the file or directory md files.

fs ls shows all fsids “offline” (even when FST service initially starts), and fsid status never changes.

The single disk volumes providing storage to the fsids are available, mounted, and showing no issue.

Plenty of resources: 128G, low load, no other services running, no OOM invocations, etc.

Any hints or suggestions on what may be causing this?

Cheers,
Pete

Hi Pete,
attach gdb to the process of the FST xrootd and see where it exits.

That is the simplest option to try:

gdb xrootd -p <pid>
continue
… see what happens

There are two FST processes; you need to attach to the first one …

Hi Andreas,

gdb results:

OK,
normally this means there is some problem with the host name or the sss authentication.
If you run:
env XrdSecPROTOCOL=sss eos whoami

What does it show?

Is the HOST variable defined, what is it?
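The HOST question above can be answered with a small helper like the one below. This is only a sketch: `report_host_identity` is a made-up name, not an EOS tool, and printing the fully qualified name next to HOST is an assumption about what a host name mismatch would look like here.

```shell
# Hypothetical helper (not part of EOS): print the HOST variable next to the
# system's fully qualified name so a mismatch is easy to spot.
report_host_identity() {
    if [ -z "${HOST:-}" ]; then
        echo "HOST is not set"
    else
        echo "HOST=${HOST}"
    fi
    # hostname -f prints the FQDN on Linux; fall back to the node name.
    echo "fqdn=$(hostname -f 2>/dev/null || uname -n)"
}
```

Run `report_host_identity` in the same environment the FST service starts in, then run the `env XrdSecPROTOCOL=sss eos whoami` command from above and compare the names.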

Hi Andreas,

Can you grep the FST log file for a line like:
FST_HOST=… FST_PORT=…

And can you paste /var/log/eos/mq/xrdlog.mq after you tried to start the FST …
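A minimal sketch of that grep, for reference. The default log path `/var/log/eos/fst/xrdlog.fst` is an assumption modelled on the MQ log location mentioned here, so pass the real FST log file if it lives elsewhere. `last_fst_endpoint` is a made-up name.

```shell
# Hypothetical helper: show the most recent log line that announces
# FST_HOST=... FST_PORT=...
# The default path is an assumption; pass the actual FST log as argument 1.
last_fst_endpoint() {
    grep -E 'FST_HOST=.*FST_PORT=' "${1:-/var/log/eos/fst/xrdlog.fst}" | tail -n 1
}
```

Only the last match is printed, since the line is typically written again on each start attempt and the latest one is the one of interest.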

Hi Andreas,

Fwiw, DNS is unchanged since the SE was deployed, with warp-ornl-cern-06 a CNAME for ornl-eos-xfer.ornl.gov

FST_PORT references:

/var/log/eos/mq/xrdlog.mq

Ah,
could it be that you didn’t configure the MQ to be a master? The queues are never opened.
You should have an empty file at /var/eos/eos.mq.master

On both ORNL::EOS and ORNL::TEMP SEs:

[quote]
root@ornl-eos-xfer.ornl.gov:~
15:26:22 # file /var/eos/eos.mq.master
/var/eos/eos.mq.master: cannot open `/var/eos/eos.mq.master' (No such file or directory)
[/quote]

The resolution of this issue was to create:

touch /var/eos/eos.mq.master

Why the MQ was (apparently) put into slave mode is undetermined. There were no changes to the SE config, environment, or service scripts. Creating the empty file above immediately resolved the issue with no other modifications.
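For anyone hitting the same symptoms, the check and the fix can be scripted roughly as below. It is a sketch only: `ensure_mq_master_marker` is a made-up name, the path defaults to the one used in this thread, and nothing here restarts services or changes ownership, since neither was needed in this case.

```shell
# Hypothetical helper: make sure the empty MQ master marker file exists.
# Takes the marker path as an optional argument (default: path from this thread).
ensure_mq_master_marker() {
    marker="${1:-/var/eos/eos.mq.master}"
    if [ -e "$marker" ]; then
        echo "present: $marker"
    else
        # Same fix as in this thread: create an empty file.
        touch "$marker" && echo "created: $marker"
    fi
}
```

Calling `ensure_mq_master_marker` with no argument as root checks the default location. It is safe to run repeatedly, because an existing file is left untouched.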

Thanks for the help Andreas.