eos@mgm.service and eos@mq.service failed because a timeout was exceeded

Hello Everyone,

Today, 10/07/2021, our EOS instance stopped working. We checked the MGM and the FSTs and found that /var had filled up on all 8 FSTs. The most recent log file, xrdlog.fst, was taking up most of /var.

So I removed the most recent xrdlog.fst file and restarted the eos@fst daemon on all 8 FSTs. After that, the EOS daemon ran fine on all FSTs.
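To confirm which files are filling a partition before deleting anything, something like the following can help. This is only a minimal sketch and not part of EOS: the function name largest_files is made up for illustration, and it assumes a standard find, du, sort and head are available on the node.

```shell
#!/bin/sh
# Hypothetical helper (not an EOS tool): list the N largest files below a
# directory by disk usage. Output is "size-in-KiB<TAB>path", largest first.
largest_files() {
    dir=$1
    count=${2:-10}   # default to the top 10 when no count is given
    # -xdev keeps find on the same filesystem as the starting directory.
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}

# Example, run on an FST node:
#   df -h /var
#   largest_files /var 5
```

Sorting by disk usage rather than apparent size is deliberate here, since disk usage is what actually fills the partition.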

We then restarted the EOS daemon, i.e. eos (eos@mq and eos@mgm), on the master and slave machines, but it does not start. The error shown when starting the EOS daemon is below:

=========================

[root@eos-mgm ~]# systemctl status eos
● eos.service - EOS All Services
Loaded: loaded (/usr/lib/systemd/system/eos.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sun 2021-07-11 01:21:09 IST; 2min 23s ago
Process: 5196 ExecStartPre=/bin/sh -c /usr/sbin/eos_start_pre.sh eos-all (code=exited, status=1/FAILURE)

Jul 11 01:19:39 eos-mgm.tier2-kol.res.in systemd[1]: Starting EOS All Services…
Jul 11 01:19:39 eos-mgm.tier2-kol.res.in sh[5196]: Waiting for 5202 …
Jul 11 01:21:09 eos-mgm.tier2-kol.res.in sh[5196]: Job for eos@mgm.service failed because a timeout was exceeded. See "systemctl status eos@mgm.servi…details.
Jul 11 01:21:09 eos-mgm.tier2-kol.res.in sh[5196]: Job for eos@mq.service failed because a timeout was exceeded. See "systemctl status eos@mq.service…details.
Jul 11 01:21:09 eos-mgm.tier2-kol.res.in sh[5196]: Waiting for 5203 …
Jul 11 01:21:09 eos-mgm.tier2-kol.res.in systemd[1]: eos.service: control process exited, code=exited status=1
Jul 11 01:21:09 eos-mgm.tier2-kol.res.in systemd[1]: Failed to start EOS All Services.
Jul 11 01:21:09 eos-mgm.tier2-kol.res.in systemd[1]: Unit eos.service entered failed state.
Jul 11 01:21:09 eos-mgm.tier2-kol.res.in systemd[1]: eos.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@eos-mgm ~]#

=============

However, no new log entries have been written to /var/log/eos/mgm/xrdlog.mgm or /var/log/eos/mq/xrdlog.mq since 12:06:13 today. The last line of xrdlog.mgm is below:

===================

210710 12:06:14 time=1625898974.257189 func=xrdmgmofs_shutdown level=ALERT logid=static… unit=mgm@eos-mgm.tier2-kol.res.in:1094 tid=00007ff551bc0780 source=Shutdown:59 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="shutdown complete"

====================

The log in /var/log/messages shows:

Jul 11 01:57:51 eos-mgm systemd: Stopped EOS mgm.
Jul 11 01:57:51 eos-mgm systemd: Starting EOS mgm…
Jul 11 01:57:51 eos-mgm systemd: Stopped EOS mq.
Jul 11 01:57:51 eos-mgm systemd: Starting EOS mq…
Jul 11 01:59:21 eos-mgm systemd: eos@mq.service start-pre operation timed out. Terminating.
Jul 11 01:59:21 eos-mgm systemd: eos@mgm.service start-pre operation timed out. Terminating.
Jul 11 01:59:21 eos-mgm systemd: Failed to start EOS mgm.
Jul 11 01:59:21 eos-mgm systemd: Unit eos@mgm.service entered failed state.
Jul 11 01:59:21 eos-mgm systemd: eos@mgm.service failed.
Jul 11 01:59:21 eos-mgm systemd: Failed to start EOS mq.
Jul 11 01:59:21 eos-mgm systemd: Unit eos@mq.service entered failed state.
Jul 11 01:59:21 eos-mgm systemd: eos@mq.service failed.
Jul 11 01:59:26 eos-mgm systemd: eos@mq.service holdoff time over, scheduling restart.
Jul 11 01:59:26 eos-mgm systemd: eos@mgm.service holdoff time over, scheduling restart.
Jul 11 01:59:26 eos-mgm systemd: Stopped EOS mgm.
Jul 11 01:59:26 eos-mgm systemd: Starting EOS mgm…
Jul 11 01:59:26 eos-mgm systemd: Stopped EOS mq.
Jul 11 01:59:26 eos-mgm systemd: Starting EOS mq…
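
Reading the timestamps in the logs, each start attempt is abandoned a fixed interval after it begins. The arithmetic can be sketched as below. The helper names to_seconds and elapsed are made up for illustration, and the sketch assumes both timestamps fall on the same day.

```shell
#!/bin/sh
# Hypothetical helpers for reading syslog timestamps; not part of EOS.

# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds between two HH:MM:SS timestamps on the same day.
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# "Starting EOS mgm" at 01:57:51, "start-pre operation timed out" at 01:59:21.
elapsed 01:57:51 01:59:21   # prints 90
```

The earlier attempt (01:19:39 to 01:21:09) also comes out at 90 seconds. That is systemd's default start timeout, unless the unit file overrides it, so the start-pre step appears to be hitting the timeout rather than failing on its own.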

Any hints or suggestions on what may be causing this?

Regards
Prasun

Hi Prasun,

This usually means that systemd timed out while starting the service (the logs above show it giving up 90 seconds after each start attempt). This in turn can happen if you have many files inside /var/log/eos/ or /var/eos/. If you have a look at the /usr/sbin/eos_start_pre.sh script, which is called by systemd when starting up the daemon, you will see that it runs some commands to change the ownership and permissions of these directories. It looks like something blocks while this script is running, so it might help to add some comments to it in order to understand what exactly the problem is.
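
One way to check whether the sheer number of files is a plausible cause is to count the entries under those directories and time a full walk over them. The sketch below is only an illustration and does not reproduce what eos_start_pre.sh actually runs. The names count_entries and seconds_for are made up, and a plain find is used as a rough stand-in for a recursive ownership change, which has to visit at least as many entries.

```shell
#!/bin/sh
# Hypothetical helpers (not part of EOS).

# Count every file and directory below a path (the path itself included).
count_entries() {
    find "$1" -xdev 2>/dev/null | wc -l | tr -d ' '
}

# Run a command, discard its output, and print how many whole seconds it took.
seconds_for() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example, run as root on the MGM node:
#   count_entries /var/eos
#   count_entries /var/log/eos
#   seconds_for find /var/eos -xdev -user root
```

If the walk alone takes a large share of the start timeout, the start-pre step is likely to hit that timeout as well.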

Cheers,
Elvin

Dear Elvin,

Thank you for your suggestion on how to fix the issue.

I moved the log files (/var/log/eos/*), the metadata (md) folder (/var/eos/md), and the report folder (/var/eos/report) to another, safe partition. The report folder is around 19 GB in total and contains a huge number of files. The md folder contains 4 compacted and non-compacted mdlog files (75 GB in total), which were last updated on 4th May 2021. They may also no longer be in use after the migration to QuarkDB.

Now the EOS daemons, i.e. eos@mq and eos@mgm, start fine on the master and slave machines.

++++++++++++++++++++
[root@eos-mgm ~]# systemctl status eos@*| grep -E "Active|service -EOS"
Active: active (running) since Wed 2021-07-14 16:32:45 IST; 19h ago
Active: active (running) since Wed 2021-07-14 16:32:45 IST; 19h ago
[root@eos-mgm ~]#
+++++++++++++++++++++++++

Regards
Prasun