Hybrid SL6/C7 dual manager sync issues

barbet · April 11, 2019, 8:56am

Seems OK:

[root@naneosmgr01(EOSMASTER) ~]#grep MGM /etc/sysconfig/eos_env
# The fully qualified hostname of MGM master1
EOS_MGM_MASTER1=naneosmgr01.in2p3.fr
# The fully qualified hostname of MGM master2
EOS_MGM_MASTER2=naneosmgr02.in2p3.fr
EOS_MGM_ALIAS=naneosmgr.in2p3.fr
# The fully qualified hostname of current MGM
EOS_MGM_HOST=naneosmgr01.in2p3.fr
# The fully qualified hostname of target MGM
EOS_MGM_HOST_TARGET=naneosmgr01.in2p3.fr
# The hostname[:port] of the EOS MGM service
EOS_PSS_MGM=$EOS_MGM_ALIAS:1094

JM

apeters · April 11, 2019, 9:09am

It seems the systemd implementation uses:

EOS_MGM_HOST_TARGET

=> this has to point to naneosmgr02 on naneosmgr01

barbet · April 11, 2019, 9:39am

OK, this variable was introduced at some point. At the moment, it is not defined on the slave, which is still on SL6.
I restarted all daemons on both machines and there is still no sync.

On the master naneosmgr01, I do see accesses in the sync log, though:

[root@naneosmgr01(EOSMASTER) ~]#tail -100 /var/log/eos/sync/xrdlog.sync 
190411 11:33:13 143755 daemon.128217:22@naneosmgr02 ofs_close: use=1 fn=/var/eos/md/directories.naneosmgr02.in2p3.fr.mdlog
190411 11:33:13 143789 XrootdXeq: daemon.128237:20@naneosmgr02 disc 0:04:29
190411 11:33:13 143789 daemon.128237:20@naneosmgr02 ofs_close: use=1 fn=/var/eos/md/iostat.naneosmgr02.in2p3.fr.dump
190411 11:33:13 143767 XrootdXeq: daemon.128603:24@naneosmgr02 pub IP64 login as daemon
190411 11:33:13 143767 daemon.128603:24@naneosmgr02 ofs_open: 2-664 fn=/var/eos/md/files.naneosmgr02.in2p3.fr.mdlog
190411 11:33:13 143788 XrootdXeq: daemon.128623:20@naneosmgr02 pub IP64 login as daemon
190411 11:33:13 143788 daemon.128623:20@naneosmgr02 ofs_open: 2-664 fn=/var/eos/md/directories.naneosmgr02.in2p3.fr.mdlog
190411 11:33:13 143756 XrootdXeq: daemon.128643:22@naneosmgr02 pub IP64 login as daemon
190411 11:33:13 143756 daemon.128643:22@naneosmgr02 ofs_open: 2-664 fn=/var/eos/md/iostat.naneosmgr02.in2p3.fr.dump

But the files do not change:

[root@naneosmgr01(EOSMASTER) ~]#date;ls -alsrt /var/eos/md
Thu Apr 11 11:38:35 CEST 2019
total 10426680
2593448 -rw-r--r--   1 daemon daemon 2655689636 Apr  8 10:16 files.naneosmgr02.in2p3.fr.mdlog.1554714224
  20300 -rw-r--r--   1 daemon daemon   20785176 Apr  8 16:51 directories.naneosmgr02.in2p3.fr.mdlog
2590240 -rw-r--r--   1 daemon daemon 2652405648 Apr  8 16:52 files.naneosmgr02.in2p3.fr.mdlog
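
The listing above shows changelog files that have not been touched for days. A minimal sketch of how this could be spotted automatically, assuming that a changelog left unmodified for a while indicates a stalled sync (this may not hold on an idle instance). `stale_mdlogs` is a hypothetical helper name, not an EOS command:

```shell
# stale_mdlogs is a hypothetical helper, not part of EOS.
# Print the *.mdlog files directly under DIR whose last modification
# is more than MINUTES minutes old (relies on find's -mmin test).
stale_mdlogs() {
    dir=$1
    minutes=$2
    find "$dir" -maxdepth 1 -type f -name '*.mdlog' -mmin +"$minutes"
}
```

For example, `stale_mdlogs /var/eos/md 60` would list the changelogs that have not changed in the last hour. Archived copies with a timestamp suffix are not matched by the `*.mdlog` pattern.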

Do I have to define the variables on the slave? How?

Thanks

JM

apeters · April 11, 2019, 9:41am

Hi Jean Michel,
I think on mgr2 you define EOS_MGM_HOST_TARGET to point to mgr1, and on mgr1 you define EOS_MGM_HOST_TARGET to point to mgr2.

That is sort of uncomfortable to me, but …
In your file on mgr1 you point this variable to mgr1, which is wrong!

barbet · April 11, 2019, 9:57am

OK, I am going to try this after lunch. I understand that EOS_MGM_HOST_TARGET has to indicate the other manager (whatever its role is?), but what is the variable EOS_MGM_HOST for?

Do you think that I should try to make it work on this hybrid cluster, or should I move directly to reinstalling the slave manager in CentOS7?

JM

barbet · April 11, 2019, 12:13pm

Andreas,

I am sorry but I think that the bad settings with sync have damaged the NS files.

This morning at 9:10:

[root@naneosmgr01(EOSMASTER) ~]#date;ls -alsrt /var/eos/md
Thu Apr 11 09:10:32 CEST 2019
[...]
2597744 -rw-r--r--   1 daemon daemon 2660086592 Apr 11 08:53 files.naneosmgr01.in2p3.fr.mdlog
  24300 -rw-r--r--   1 daemon daemon   24880484 Apr 11 08:53 directories.naneosmgr01.in2p3.fr.mdlog

After I tried to restart the daemons, I got this (and did not notice):

[root@naneosmgr01(EOSMASTER) ~]#date;ls -alsrt /var/eos/md
Thu Apr 11 11:27:38 CEST 2019
[...]
      8 -rw-r--r--   1 daemon daemon       8064 Apr 11 11:13 directories.naneosmgr01.in2p3.fr.mdlog
      8 -rw-r--r--   1 daemon daemon       4744 Apr 11 11:16 files.naneosmgr01.in2p3.fr.mdlog

I have a backup that was performed during the night.

JM

barbet · April 15, 2019, 7:08am

This is not the end yet (see below) but the probable origin of the problem was found:

With C7, systemd is used and there are two new variables in /etc/sysconfig/eos_env that have to be set correctly:

# The fully qualified hostname of current MGM
EOS_MGM_HOST=naneosmgr01.in2p3.fr
# The fully qualified hostname of target MGM
EOS_MGM_HOST_TARGET=naneosmgr02.in2p3.fr

I did not know this and did not pay attention: both variables were set to naneosmgr01, so the master was basically synchronizing to itself, corrupting the NS files.
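
A minimal sketch of a sanity check for this misconfiguration, assuming the two variables appear as plain `KEY=value` lines as in the file shown in this thread. `check_mgm_env` is a hypothetical helper name, not something shipped with EOS:

```shell
# check_mgm_env is a hypothetical helper, not part of EOS.
# Usage: check_mgm_env ENV_FILE LOCAL_FQDN
# Returns 0 if EOS_MGM_HOST names this host and EOS_MGM_HOST_TARGET names
# a different one, 1 on a mismatch, 2 if either variable is missing.
check_mgm_env() {
    env_file=$1
    local_host=$2
    # Take the last plain assignment of each variable (assumed KEY=value form).
    host=$(sed -n 's/^EOS_MGM_HOST=//p' "$env_file" | tail -n 1)
    target=$(sed -n 's/^EOS_MGM_HOST_TARGET=//p' "$env_file" | tail -n 1)
    if [ -z "$host" ] || [ -z "$target" ]; then
        echo "missing EOS_MGM_HOST or EOS_MGM_HOST_TARGET"
        return 2
    fi
    if [ "$host" = "$target" ]; then
        echo "EOS_MGM_HOST_TARGET points to this manager ($host)"
        return 1
    fi
    if [ "$host" != "$local_host" ]; then
        echo "EOS_MGM_HOST ($host) does not match $local_host"
        return 1
    fi
    echo "ok: $host -> $target"
}
```

It could be run on each manager before starting the daemons, for example as `check_mgm_env /etc/sysconfig/eos_env "$(hostname -f)"`. With the original settings on naneosmgr01 (both variables set to naneosmgr01) it would report that the target points to the manager itself.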

Now it worked all weekend, with the slave getting the NS files from the master, but suddenly this morning around 6am the slave manager started to display error messages and appeared to hang.
When I tried to restart it, it failed with this message:

[root@naneosmgr02(EOSSLAVE) ~]#tail -2000 /var/log/eos/mgm/xrdlog.mgm
[...]
190415 07:45:08 time=1555307108.683692 func=BootNamespace            level=NOTE  logid=a09ecbc8-5f41-11e9-b40f-14187764bdce unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007fddca1db720 source=Master:1890                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos directory view configure started as slave
190415 07:45:08 time=1555307108.683927 func=BootNamespace            level=CRIT  logid=a09ecbc8-5f41-11e9-b40f-14187764bdce unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007fddca1db720 source=Master:1946                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos view initialization failed after 0 seconds
190415 07:45:08 time=1555307108.683965 func=BootNamespace            level=CRIT  logid=a09ecbc8-5f41-11e9-b40f-14187764bdce unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007fddca1db720 source=Master:1949                    tident=<service> sec=      uid=0 gid=0 name= geo="" initialization returned ec=14 Unrecognized file type: /var/eos/md/directories.naneosmgr01.in2p3.fr.mdlog

The slave manager, still in SL6, has to be reinstalled in C7 anyway, but it may be useful to understand why this error occurs.

JM

barbet · April 16, 2019, 11:59am

The error went away after I compacted the NS files.

This thread can now be closed, because the slave has been reinstalled in CentOS7, making the master/slave pair homogeneous.

Special thanks to Andreas for his patient assistance.

JM
