Hybrid SL6/C7 dual manager sync issues

Hello,

Yesterday I performed a role exchange between two managers running the same EOS version, 4.4.23. The master on SL6 became the slave, and a newly installed C7 manager was promoted to master. At the third step, when telling the previous master again that the other machine is the master, it hung and I had to restart the EOS services.

I am worried because the files holding the namespace are much smaller on the new master, and I wonder whether this is expected. I also doubt that the synchronization of the namespace files is working correctly: the machine that is now the slave does not retrieve a copy of the files from the master. I checked all the daemons and restarted them several times.

The questions are:

Is it expected that the namespace files shrink dramatically on a new master?
How can I diagnose/debug the synchronization between master and slave?

I will provide more info, but it already seems that the names of the files have changed: they now include the xrootd port number for the sync service, which they did not before. Is this a problem?

And is the bug that required commenting out the “::1” line in /etc/hosts still present? (In short: is it still necessary to do so?)

Thank you

JM

Hi Jean Michel,

The third step normally involves a full namespace restart.

Can you just paste the output of
eos ns
eos ns master
on both machines
and a listing of
ls -la /var/eos/md
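
To make the two outputs easy to compare, the three commands above could be wrapped in a small helper and run once per machine. This is only a sketch: the function name `collect_ns_report` and the report path are made up for illustration, and `MD_DIR` simply defaults to the directory named above.

```shell
#!/usr/bin/env bash
# Sketch: gather the requested output into a single file per host.
# collect_ns_report and MD_DIR are illustrative names, not part of EOS.
MD_DIR="${MD_DIR:-/var/eos/md}"

collect_ns_report() {
  local out="$1"
  {
    echo "### host: $(hostname -f 2>/dev/null || hostname)"
    echo "### eos ns"
    eos ns
    echo "### eos ns master"
    eos ns master
    echo "### ls -la ${MD_DIR}"
    ls -la "${MD_DIR}"
  } > "${out}" 2>&1
}
```

For example, `collect_ns_report /tmp/ns-report.txt` on each manager, then paste both files.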

Commenting out ::1 was only needed with xrootd versions that were not listening on IPv6. If you have 4.8.*, this workaround is no longer needed.
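
If you are unsure what state /etc/hosts is in on each manager, a quick check along these lines can tell you. It is a minimal sketch; the function name `check_ipv6_loopback` is hypothetical and it only looks at lines that start with `::1` or `#::1`.

```shell
# Report whether a hosts file has an active, a commented-out, or no "::1" entry.
# The function name is illustrative; the file defaults to /etc/hosts.
check_ipv6_loopback() {
  local hosts_file="${1:-/etc/hosts}"
  if grep -Eq '^[[:space:]]*::1([[:space:]]|$)' "${hosts_file}"; then
    echo "active"
  elif grep -Eq '^[[:space:]]*#[[:space:]]*::1([[:space:]]|$)' "${hosts_file}"; then
    echo "commented"
  else
    echo "absent"
  fi
}
```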

Hi Andreas,

Here you are:

naneosmgr02 (SL6, was the master before role exchange):

 [root@naneosmgr02(EOSSLAVE) ~]#eos ns
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL      Files                            16633579 [booted] (96s)
ALL      Directories                      54545
ALL      Total boot time                  97 s
# ------------------------------------------------------------------------------------
ALL      Compactification                 status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
# ------------------------------------------------------------------------------------
ALL      Replication                      mode=slave-ro state=slave-ro master=naneosmgr01.in2p3.fr configdir=/var/eos/config/naneosmgr01.in2p3.fr/ config=default mgm:naneosmgr01.in2p3.fr=ok mgm:mode=master-rw mq:naneosmgr01.in2p3.fr:1097=ok
ALL      Namespace Latency Files          0
ALL      Namespace Latency Directories    0
ALL      Namespace Pending Updates        0
# ------------------------------------------------------------------------------------
ALL      File Changelog Size              2.66 GB
ALL      Dir  Changelog Size              10.15 MB
# ------------------------------------------------------------------------------------
ALL      avg. File Entry Size             160 B
ALL      avg. Dir  Entry Size             186 B
# ------------------------------------------------------------------------------------
ALL      files created since boot         126
ALL      container created since boot     0
# ------------------------------------------------------------------------------------
ALL      current file id                  99421091
ALL      current container id             54546
# ------------------------------------------------------------------------------------
ALL      eosxd caps                       0
ALL      eosxd clients                    0
# ------------------------------------------------------------------------------------
ALL      memory virtual                   18.98 GB
ALL      memory resident                  14.51 GB
ALL      memory share                     2.68 GB
ALL      memory growths                   41.84 MB
ALL      threads                          153
ALL      fds                              208
ALL      uptime                           68488
# ------------------------------------------------------------------------------------
ALL      drain info                       id=default, thread_pool_min=6, thread_pool_max=400, thread_pool_size=6, queue_size=0
# ------------------------------------------------------------------------------------
[root@naneosmgr02(EOSSLAVE) ~]#eos ns master
190408 14:54:08 time=1554728048.148155 func=BootNamespace            level=NOTE  logid=65a4dee8-59fd-11e9-9f9d-14187764bdce unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1890                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos directory view configure started as slave
190408 14:54:08 time=1554728048.784112 func=BootNamespace            level=NOTE  logid=65a4dee8-59fd-11e9-9f9d-14187764bdce unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1929                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos directory view configure stopped after 0 seconds
190408 14:54:08 time=1554728048.784156 func=BootNamespace            level=NOTE  logid=65a4dee8-59fd-11e9-9f9d-14187764bdce unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1933                    tident=<service> sec=      uid=0 gid=0 name= geo="" running in slave mode
190408 14:54:08 time=1554728048.789587 func=Activate                 level=NOTE  logid=static.............................. unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1062                    tident= sec=(null) uid=99 gid=99 name=- geo="" configdir=/var/eos/config/naneosmgr01.in2p3.fr/ activating master=naneosmgr01.in2p3.fr
190408 14:54:08 time=1554728048.789654 func=Activate                 level=INFO  logid=static.............................. unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1071                    tident= sec=(null) uid=99 gid=99 name=- geo="" autoload config=default
190408 14:54:08 time=1554728048.848270 func=Activate                 level=INFO  logid=static.............................. unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1084                    tident= sec=(null) uid=99 gid=99 name=- geo="" Successful auto-load config default
190408 14:55:44 time=1554728144.150081 func=InitializeFileView       level=NOTE  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f53dc377700 source=XrdMgmOfsConfigure:171         tident=<single-exec> sec=      uid=0 gid=0 name= geo="" eos namespace file loading stopped after 96 seconds
[root@naneosmgr02(EOSSLAVE) ~]#ls -la /var/eos/md
total 10435260
drwx------.  2 daemon root         4096 Apr  9 09:55 .
drwxr-xr-x. 11 daemon daemon       4096 Apr  8 14:12 ..
-rw-r--r--   1 daemon daemon   10153784 Mar 21 11:09 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon   20785176 Apr  8 13:11 directories.naneosmgr02.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon 2663917700 Mar 21 11:10 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon 2652405648 Apr  8 13:15 files.naneosmgr02.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon 2681690572 Apr  4 08:15 files.naneosmgr02.in2p3.fr.mdlog.1554360503
-rw-r--r--   1 daemon daemon 2655689636 Apr  8 10:16 files.naneosmgr02.in2p3.fr.mdlog.1554714157
-rw-r--r--   1 daemon daemon       3210 Apr  4 09:12 iostat.naneosmgr01.in2p3.fr.dump
-rwxr--r--   1 daemon daemon       2240 Apr  9 09:55 iostat.naneosmgr02.in2p3.fr:1094.dump
-rwxr--r--   1 daemon daemon       3210 Mar 21 09:25 iostat.naneosmgr02.in2p3.fr.dump
-rw-rw-rw-   1 daemon daemon     469630 Mar 21 09:26 so.mgm.dump
-rw-r--r--   1 daemon daemon     473564 Apr  9 09:55 so.mgm.dump.naneosmgr02.in2p3.fr:1094
-rw-rw-rw-   1 daemon root        73127 Mar 21 09:26 stacktrace

naneosmgr01 (CentOS7, is the newly installed machine promoted to master):

[root@naneosmgr01(EOSMASTER) ~]#eos ns
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL      Files                            16581112 [booted] (88s)
ALL      Directories                      54545
ALL      Total boot time                  1554714310 s
# ------------------------------------------------------------------------------------
ALL      Compactification                 status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
# ------------------------------------------------------------------------------------
ALL      Replication                      mode=master-rw state=master-rw master=naneosmgr01.in2p3.fr configdir=/var/eos/config/naneosmgr01.in2p3.fr/ config=default mgm:naneosmgr02.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr02.in2p3.fr:1097=ok
# ------------------------------------------------------------------------------------
ALL      File Changelog Size              26.65 kB
ALL      Dir  Changelog Size              286.88 kB
# ------------------------------------------------------------------------------------
ALL      avg. File Entry Size             0 B
ALL      avg. Dir  Entry Size             5 B
# ------------------------------------------------------------------------------------
ALL      files created since boot         230
ALL      container created since boot     0
# ------------------------------------------------------------------------------------
ALL      current file id                  99424277
ALL      current container id             54546
# ------------------------------------------------------------------------------------
ALL      eosxd caps                       0
ALL      eosxd clients                    0
# ------------------------------------------------------------------------------------
ALL      memory virtual                   23.66 GB
ALL      memory resident                  18.43 GB
ALL      memory share                     2.67 GB
ALL      memory growths                   7.40 MB
ALL      threads                          168
ALL      fds                              224
ALL      uptime                           413496
# ------------------------------------------------------------------------------------
ALL      drain info                       id=default, thread_pool_min=6, thread_pool_max=400, thread_pool_size=6, queue_size=0
# ------------------------------------------------------------------------------------
[root@naneosmgr01(EOSMASTER) ~]#eos ns master
190404 15:06:17 time=1554383177.191864 func=BootNamespace            level=NOTE  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1890                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos directory view configure started as slave
190404 15:06:17 time=1554383177.584298 func=BootNamespace            level=NOTE  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1929                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos directory view configure stopped after 0 seconds
190404 15:06:17 time=1554383177.584341 func=BootNamespace            level=NOTE  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1933                    tident=<service> sec=      uid=0 gid=0 name= geo="" running in slave mode
190404 15:06:17 time=1554383177.591555 func=Activate                 level=NOTE  logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1062                    tident= sec=(null) uid=99 gid=99 name=- geo="" configdir=/var/eos/config/naneosmgr02.in2p3.fr/ activating master=naneosmgr02.in2p3.fr
190404 15:06:17 time=1554383177.591622 func=Activate                 level=INFO  logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1071                    tident= sec=(null) uid=99 gid=99 name=- geo="" autoload config=default
190404 15:06:17 time=1554383177.665151 func=Activate                 level=INFO  logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1084                    tident= sec=(null) uid=99 gid=99 name=- geo="" Successful auto-load config default
190404 15:07:28 time=1554383248.119660 func=InitializeFileView       level=NOTE  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcda11fc700 source=XrdMgmOfsConfigure:171         tident=<single-exec> sec=      uid=0 gid=0 name= geo="" eos namespace file loading stopped after 71 seconds
190408 11:03:44 time=1554714224.857316 func=TagNamespaceInodes       level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2038                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="tag namespace inodes"
190408 11:03:44 time=1554714224.857392 func=RedirectToRemoteMaster   level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2216                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="redirect to remote master"
190408 11:03:44 time=1554714224.857425 func=RedirectToRemoteMaster   level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2225                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="invoking slave shutdown"
190408 11:03:44 time=1554714224.865277 func=RedirectToRemoteMaster   level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2228                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="stopped namespace following"
190408 11:03:45 time=1554714225.870895 func=WaitNamespaceFilesInSync level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2059                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="check ns file synchronization"
190408 11:03:45 time=1554714225.877853 func=WaitNamespaceFilesInSync level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2086                    tident=<service> sec=      uid=0 gid=0 name= geo="" remote-sync host=naneosmgr02.in2p3.fr:1096 is reachable
190408 11:03:45 time=1554714225.881939 func=WaitNamespaceFilesInSync level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2132                    tident=<service> sec=      uid=0 gid=0 name= geo="" remote files file=2652400968 dir=20781148
190408 11:04:35 time=1554714275.882484 func=WaitNamespaceFilesInSync level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2201                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="ns files  synchronized"
190408 11:04:54 time=1554714294.263313 func=BootNamespace            level=NOTE  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:1890                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos directory view configure started as slave
190408 11:04:54 time=1554714294.675761 func=BootNamespace            level=NOTE  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:1929                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos directory view configure stopped after 0 seconds
190408 11:04:54 time=1554714294.675797 func=BootNamespace            level=NOTE  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:1933                    tident=<service> sec=      uid=0 gid=0 name= geo="" running in slave mode
190408 11:04:54 time=1554714294.675857 func=RebootSlaveNamespace     level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2290                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="starting file view loader thread"
190408 11:04:54 time=1554714294.675902 func=RebootSlaveNamespace     level=NOTE  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2314                    tident=<service> sec=      uid=0 gid=0 name= geo="" running in slave mode
190408 11:06:22 time=1554714382.537216 func=InitializeFileView       level=NOTE  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fccca1fc700 source=XrdMgmOfsConfigure:171         tident=<single-exec> sec=      uid=0 gid=0 name= geo="" eos namespace file loading stopped after 88 seconds
190408 13:54:46 time=1554724486.504885 func=Activate                 level=NOTE  logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1062                    tident= sec=(null) uid=99 gid=99 name=- geo="" configdir=/var/eos/config/naneosmgr01.in2p3.fr/ activating master=naneosmgr01.in2p3.fr
190408 13:54:46 time=1554724486.513374 func=Activate                 level=NOTE  logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1113                    tident= sec=(null) uid=99 gid=99 name=- geo="" Doing Slave=>Master transition
190408 13:54:46 time=1554724486.583820 func=Slave2Master             level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1357                    tident=<service> sec=      uid=0 gid=0 name= geo="" remote-sync host=naneosmgr02.in2p3.fr:1096 is reachable
190408 13:54:46 time=1554724486.587320 func=Slave2Master             level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1444                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="invoking slave=>master transition"
190408 13:54:46 time=1554724486.725332 func=Slave2Master             level=WARN  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1504                    tident=<service> sec=      uid=0 gid=0 name= geo="" failed to start eossync services - 1
190408 13:54:46 time=1554724486.725434 func=Slave2Master             level=INFO  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1509                    tident=<service> sec=      uid=0 gid=0 name= geo="" msg="registering new manager to nodes"
190408 13:54:46 time=1554724486.728325 func=Slave2Master             level=NOTE  logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1514                    tident=<service> sec=      uid=0 gid=0 name= geo="" running in master mode
[root@naneosmgr01(EOSMASTER) ~]#ls -la /var/eos/md
total 5204968
drwx------   2 daemon root         4096 Apr  9 09:57 .
drwxr-xr-x. 14 daemon daemon        214 Apr  8 14:54 ..
-rw-r--r--   1 daemon daemon     286880 Apr  9 09:51 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon   20785176 Apr  8 16:51 directories.naneosmgr02.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon      26652 Apr  9 09:51 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon 2652405648 Apr  8 16:52 files.naneosmgr02.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon 2655689636 Apr  8 10:16 files.naneosmgr02.in2p3.fr.mdlog.1554714224
-rwxr--r--   1 daemon daemon       2218 Apr  9 09:57 iostat.naneosmgr01.in2p3.fr:1094.dump
-rw-r--r--   1 daemon daemon       3210 Apr  9 09:57 iostat.naneosmgr02.in2p3.fr.dump
-rw-r--r--   1 daemon daemon     474774 Apr  9 09:57 so.mgm.dump.naneosmgr01.in2p3.fr:1094
-rw-r--r--   1 daemon root       122880 Apr  4 14:58 stacktrace

Thanks

JM

Ok, you should put the service in read-only mode (set all the filesystems to RO) or make the master a read-only master.

Something went wrong on your machine, because the namespace files on your new master have been truncated. Are you aware of doing anything like that?
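
The truncation is visible in the listing: the live changelog named after the new master is a few kB, while the copy named after the old master is about 2.6 GB. A small helper like the following prints just the live changelog sizes side by side. It is a sketch; `mdlog_sizes` is a made-up name, and rotated files with a timestamp suffix are deliberately skipped.

```shell
# Print "size name" for each live changelog in a directory.
# Files with a trailing timestamp (e.g. *.mdlog.1554714224) do not match the globs.
mdlog_sizes() {
  local md_dir="${1:-/var/eos/md}" f
  for f in "${md_dir}"/files.*.mdlog "${md_dir}"/directories.*.mdlog; do
    [ -f "${f}" ] || continue
    printf '%s %s\n' "$(wc -c < "${f}" | tr -d ' ')" "$(basename "${f}")"
  done
}
```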

It is not too bad, because only 230 files have been created, but to prevent further problems you should manually copy the following on the new master:

files.naneosmgr02.in2p3.fr.mdlog => files.naneosmgr01.in2p3.fr.mdlog
directories.naneosmgr02.in2p3.fr.mdlog => directories.naneosmgr01.in2p3.fr.mdlog

Then you stop the new master, set the machine as master using systemd or the service scripts, and start the service again.
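
A cautious way to do the copy is to keep the truncated changelogs aside first, since they hold the entries created after the role exchange. The sketch below assumes the MGM on the new master is not running while the files are touched; the thread does not spell out the exact order. `MD_DIR`, `OLD_MASTER`, `NEW_MASTER`, the function name and the `.truncated.<epoch>` suffix are all assumptions for illustration, not EOS conventions.

```shell
#!/usr/bin/env bash
# Sketch only: set the truncated changelogs aside, then copy the old
# master's changelogs under the new master's name.
# Assumption: the MGM on the new master is stopped while this runs.
MD_DIR="${MD_DIR:-/var/eos/md}"
OLD_MASTER="${OLD_MASTER:-naneosmgr02.in2p3.fr}"
NEW_MASTER="${NEW_MASTER:-naneosmgr01.in2p3.fr}"

restore_changelogs() {
  local stamp kind src dst
  stamp="$(date +%s)"
  # Check that both source files exist before changing anything.
  for kind in files directories; do
    src="${MD_DIR}/${kind}.${OLD_MASTER}.mdlog"
    [ -f "${src}" ] || { echo "missing ${src}" >&2; return 1; }
  done
  for kind in files directories; do
    src="${MD_DIR}/${kind}.${OLD_MASTER}.mdlog"
    dst="${MD_DIR}/${kind}.${NEW_MASTER}.mdlog"
    # Keep the truncated log; it records what was created since the switch.
    if [ -f "${dst}" ]; then
      mv "${dst}" "${dst}.truncated.${stamp}" || return 1
    fi
    cp -p "${src}" "${dst}" || return 1
  done
}
```

Calling `restore_changelogs` with the defaults performs the two copies listed above and leaves the small files next to them with a `.truncated.<epoch>` suffix.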

If you have logging enabled, you can check the file names of the 230 files created since yesterday and either tell Latchezar to remove them from the catalogue or recover them.

If you want to recover them, I will write up how ...


Thanks Andreas,

I cannot act now, I have an important meeting. I will do it this afternoon at 14:00.

How do I put the master in RO ?
How can I check what were the files created ?

Back at 14:00.

JM

You just issue the command once, on naneosmgr02, to make naneosmgr01 the master. This puts mgr02 into MasterRO mode, and mgr01 stays a slave.

Let me check that I understand correctly:

Currently naneosmgr01 is the master and naneosmgr02 the slave. You are asking me to do the following ?

[root@naneosmgr02(EOSSLAVE) ~]#eos -b ns master naneosmgr01.in2p3.fr

naneosmgr01 is already master and will not change to slave this way, or am I wrong?

JM

Sorry, I mixed up the two names :wink:
You issue this command on 01 and let it go into MasterRO by pushing the master variable to 02.
root@naneosmgr01(EOSMASTER)# eos -b ns master naneosmgr02.in2p3.fr

Hi Andreas,

I am ready. Here is the situation with NS files:

[root@naneosmgr01(EOSMASTER) ~]#ls -lart /var/eos/md
total 5205000
-rw-r--r--   1 daemon root       122880 Apr  4 14:58 stacktrace
-rw-r--r--   1 daemon daemon 2655689636 Apr  8 10:16 files.naneosmgr02.in2p3.fr.mdlog.1554714224
drwxr-xr-x. 14 daemon daemon        214 Apr  8 14:54 ..
-rw-r--r--   1 daemon daemon   20785176 Apr  8 16:51 directories.naneosmgr02.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon 2652405648 Apr  8 16:52 files.naneosmgr02.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon       5940 Apr  9 13:46 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon     337092 Apr  9 13:46 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon     475740 Apr  9 13:52 so.mgm.dump.naneosmgr01.in2p3.fr:1094
-rwxr--r--   1 daemon daemon       2218 Apr  9 13:52 iostat.naneosmgr01.in2p3.fr:1094.dump
drwx------   2 daemon root         4096 Apr  9 13:52 .
-rw-r--r--   1 daemon daemon       3210 Apr  9 13:52 iostat.naneosmgr02.in2p3.fr.dump

[root@naneosmgr02(EOSSLAVE) ~]#ls -lart /var/eos/md
total 10435260
-rwxr--r--   1 daemon daemon       3210 Mar 21 09:25 iostat.naneosmgr02.in2p3.fr.dump
-rw-rw-rw-   1 daemon daemon     469630 Mar 21 09:26 so.mgm.dump
-rw-rw-rw-   1 daemon root        73127 Mar 21 09:26 stacktrace
-rw-r--r--   1 daemon daemon   10153784 Mar 21 11:09 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon 2663917700 Mar 21 11:10 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon 2681690572 Apr  4 08:15 files.naneosmgr02.in2p3.fr.mdlog.1554360503
-rw-r--r--   1 daemon daemon       3210 Apr  4 09:12 iostat.naneosmgr01.in2p3.fr.dump
-rw-r--r--   1 daemon daemon 2655689636 Apr  8 10:16 files.naneosmgr02.in2p3.fr.mdlog.1554714157
-rw-r--r--   1 daemon daemon   20785176 Apr  8 13:11 directories.naneosmgr02.in2p3.fr.mdlog
-rw-r--r--   1 daemon daemon 2652405648 Apr  8 13:15 files.naneosmgr02.in2p3.fr.mdlog
drwxr-xr-x. 11 daemon daemon       4096 Apr  8 14:12 ..
-rwxr--r--   1 daemon daemon       2240 Apr  9 13:52 iostat.naneosmgr02.in2p3.fr:1094.dump
-rw-r--r--   1 daemon daemon     473993 Apr  9 13:52 so.mgm.dump.naneosmgr02.in2p3.fr:1094
drwx------.  2 daemon root         4096 Apr  9 13:52 .

I believe that files.naneosmgr01.in2p3.fr.mdlog and directories.naneosmgr01.in2p3.fr.mdlog on naneosmgr01 are the current namespace, and I suppose that I have to back them up somewhere, as they contain the names of the files created since the role exchange.

The namespace as it was before the role exchange should be in files.naneosmgr02.in2p3.fr.mdlog and directories.naneosmgr02.in2p3.fr.mdlog, which are already on naneosmgr01, and I have to rename them to naneosmgr01 after putting the manager in read-only mode.

Then if I restart naneosmgr01 as master (RW) it will pick up the files, right ?

Then we look at the synchronization that is currently not working so that naneosmgr02 (slave) gets the NS from naneosmgr01 (master). Right ?

BTW, I did nothing to truncate the NS files, I suppose this is the result of a bad synchro… ?

I will act as soon as you confirm the above.

JM

Yes, that is right… In principle you could just append the new changelog to the old one with a small trick… If you can wait 2h, I will send you the info around 4…

Yes, I think I had better wait and do everything in a clean manner with your support.

JM

Jean Michel, can you send me the first 1024 bytes of /var/eos/md/files.*.mdlog from the new master, i.e. the file that was truncated?
I will then tell you how to concatenate the files, and everything will be fine afterwards. You can send it by mail.

Andreas, I sent the whole file by mail

JM

Hello,

The files that make up the namespace have been concatenated following a procedure given by Andreas, and the master manager has been restarted with the new namespace.

Thank you Andreas for your help recovering from a bad situation.

I think however that the synchronization between the master and the slave manager is not working and this has to be checked.

JM

I tried to restart the eossync service on both nodes but it does not change anything: the slave manager does not get an updated copy of the NS files from the master manager.

I forgot which process pushes or pulls the data, but it seems that it does not work in my current special setup: master manager on CentOS 7 and slave on SL6.

As the goal is to have both machines in C7, I may very well go directly to the next step which is reinstalling the slave manager in C7…

Ideas, advice ?

Thank you

JM

You also have to run the server part (eossync is the client doing the pushing).

To start the server you do:

systemctl restart eos@sync

Hi Andreas,

The sync services, which run as xrootd plugins, were already running, but I restarted them on both the master and the slave. It does not change anything.

Curiously, there is a file iostat.naneosmgr02.in2p3.fr.dump on the master naneosmgr01 that is updated but it does not correspond to the active file on naneosmgr02:

[root@naneosmgr01(EOSMASTER) ~]#ls -alsrt /var/eos/md/iostat.naneosmgr02.in2p3.fr.dump 
4 -rw-r--r-- 1 daemon daemon 3210 Apr 11 09:55 /var/eos/md/iostat.naneosmgr02.in2p3.fr.dump
[root@naneosmgr02(EOSSLAVE) ~]#ls -alsrt /var/eos/md/iostat.naneosmgr02.in2p3.fr*
4 -rwxr--r-- 1 daemon daemon 3210 Mar 21 09:25 /var/eos/md/iostat.naneosmgr02.in2p3.fr.dump
4 -rwxr--r-- 1 daemon daemon 2242 Apr 11 09:54 /var/eos/md/iostat.naneosmgr02.in2p3.fr:1094.dump

… and no synchro for the NS files, nor the config.

JM

Oh,
can you have a look at the log files in /var/log/eos/sync/

There is a log file of the server xrdlog.sync
but also log files for the clients … do you have the eosfilesync commands running?

ps aux | grep eosfilesync

The log files in /var/log/eos/sync/ do not say much:

On master:

 [root@naneosmgr01(EOSMASTER) ~]#tail -50 /var/log/eos/sync/xrdlog.sync | grep -v nangios
[...]
190411 09:48:49 138119 Starting on Linux 3.10.0-957.1.3.el7.x86_64
Copr.  2004-2012 Stanford University, xrd version v4.8.4
++++++ xrootd sync@naneosmgr01.in2p3.fr initialization started.
Config using configuration file /etc/xrd.cf.sync
=====> xrd.network keepalive
=====> xrd.port 1096
Config maximum number of connections restricted to 65000
Copr.  2012 Stanford University, xrootd protocol 3.1.0 version v4.8.4
++++++ xrootd protocol initialization started.
=====> xrootd.async off nosf
=====> xrootd.seclib libXrdSec.so
=====> all.export /var/eos/ nolock
Config exporting /var/eos/
Plugin loaded 
++++++ Authentication system initialization started.
Plugin loaded 
=====> sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
Config warning: protocol sss previously defined.
=====> sec.protocol sss
Config 2 authentication directives processed in /etc/xrd.cf.sync
------ Authentication system initialization completed.
++++++ Protection system initialization started.
Config warning: Security level is set to none; request protection disabled!
Config Local  protection level: none
Config Remote protection level: none
------ Protection system initialization completed.
Config Routing for naneosmgr01.in2p3.fr: local pub4 prv4 pub6 prv6
Config Route all4: naneosmgr01.in2p3.fr Dest=[::193.48.101.204]:1096
Config Route all6: naneosmgr01.in2p3.fr Dest=[2001:660:7224:100:193:48:101:204]:1096
++++++ File system initialization started.
=====> ofs.trace open close
++++++ Storage system initialization started.
=====> all.export /var/eos/ nolock
Config effective /etc/xrd.cf.sync oss configuration:
       oss.alloc        0 0 0
       oss.cachescan    600
       oss.fdlimit      32500 65000
       oss.maxsize      0
       oss.trace        0
       oss.xfr          1 deny 10800 keep 1200
       oss.memfile off  max 33565681664
       oss.defaults  r/w  nocheck nodread nomig norcreate nopurge nostage xattr
       oss.path /var/eos/ r/w  nocheck nodread nomig norcreate nopurge nostage xattr
------ Storage system initialization completed.
Config effective /etc/xrd.cf.sync ofs configuration:
       all.role server
       ofs.maxdelay   60
       ofs.persist    manual hold 600 logdir /tmp/sync/.ofs/posc.log
       ofs.trace      4
------ File system server initialization completed.
Config warning: asynchronous I/O has been disabled!
Config warning: sendfile I/O has been disabled!
Config warning: 'xrootd.prepare logdir' not specified; prepare tracking disabled.
------ xrootd protocol initialization completed.
------ xrootd sync@naneosmgr01.in2p3.fr:1096 initialization completed.
190411 09:48:49 138126 XrootdXeq: daemon.116933:20@naneosmgr02 pub IP64 login as daemon
190411 09:48:49 138126 daemon.116933:20@naneosmgr02 ofs_open: 2-664 fn=/var/eos/md/iostat.naneosmgr02.in2p3.fr.dump
190411 09:54:18 138137 XrootdXeq: daemon.135909:22@naneosmgr pub IP64 login as daemon
190411 09:54:18 138137 daemon.135909:22@naneosmgr ofs_open: 2-664 fn=/var/eos/md/files.naneosmgr01.in2p3.fr.mdlog
190411 09:54:18 138297 XrootdXeq: daemon.135910:23@naneosmgr pub IP64 login as daemon
190411 09:54:18 138297 daemon.135910:23@naneosmgr ofs_open: 2-664 fn=/var/eos/md/directories.naneosmgr01.in2p3.fr.mdlog

On slave:

[root@naneosmgr02(EOSSLAVE) ~]#tail -50 /var/log/eos/sync/xrdlog.sync | grep -v nangios
=====> all.export /var/eos/ nolock
Config exporting /var/eos/
Plugin loaded 
++++++ Authentication system initialization started.
Plugin loaded 
=====> sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
Config warning: protocol sss previously defined.
=====> sec.protocol sss
Config 2 authentication directives processed in /etc/xrd.cf.sync
------ Authentication system initialization completed.
++++++ Protection system initialization started.
Config warning: Security level is set to none; request protection disabled!
Config Local  protection level: none
Config Remote protection level: none
------ Protection system initialization completed.
Config Routing for naneosmgr02.in2p3.fr: local pub4 prv4 pub6 prv6
Config Route all4: naneosmgr02.in2p3.fr Dest=[::193.48.101.205]:1096
Config Route all6: naneosmgr02.in2p3.fr Dest=[2001:660:7224:100:193:48:101:205]:1096
++++++ File system initialization started.
=====> ofs.trace open close
++++++ Storage system initialization started.
=====> all.export /var/eos/ nolock
Config effective /etc/xrd.cf.sync oss configuration:
       oss.alloc        0 0 0
       oss.cachescan    600
       oss.fdlimit      32500 65000
       oss.maxsize      0
       oss.trace        0
       oss.xfr          1 deny 10800 keep 1200
       oss.memfile off  max 33650980864
       oss.defaults  r/w  nocheck nodread nomig norcreate nopurge nostage xattr
       oss.path /var/eos/ r/w  nocheck nodread nomig norcreate nopurge nostage xattr
------ Storage system initialization completed.
Config effective /etc/xrd.cf.sync ofs configuration:
       all.role server
       ofs.maxdelay   60
       ofs.persist    manual hold 600 logdir /tmp/sync/.ofs/posc.log
       ofs.trace      4
------ File system server initialization completed.
Config warning: asynchronous I/O has been disabled!
Config warning: sendfile I/O has been disabled!
Config warning: 'xrootd.prepare logdir' not specified; prepare tracking disabled.
------ xrootd protocol initialization completed.
------ xrootd sync@naneosmgr02.in2p3.fr:1096 initialization completed.

Processes are running:

[root@naneosmgr01(EOSMASTER) ~]#ps aux | grep eosfilesync
daemon   135909  0.0  0.0 217896 23268 ?        Ssl  09:08   0:00 /usr//sbin/eosfilesync /var/eos/md/files.naneosmgr01.in2p3.fr.mdlog root://naneosmgr01.in2p3.fr:1096///var/eos/md/files.naneosmgr01.in2p3.fr.mdlog
daemon   135910 21.1  0.4 488232 299876 ?       Ssl  09:08  16:45 /usr//sbin/eosfilesync /var/eos/md/directories.naneosmgr01.in2p3.fr.mdlog root://naneosmgr01.in2p3.fr:1096///var/eos/md/directories.naneosmgr01.in2p3.fr.mdlog
daemon   135913  0.0  0.0 141992 12476 ?        Ss   09:08   0:00 /usr//sbin/eosfilesync /var/eos/md/iostat.naneosmgr01.in2p3.fr.dump root://naneosmgr01.in2p3.fr:1096///var/eos/md/iostat.naneosmgr01.in2p3.fr.dump

… and on the slave:

[root@naneosmgr02(EOSSLAVE) ~]#ps aux | grep eosfilesync
daemon   116893  0.0  0.0 440492 10096 ?        Sl   09:09   0:00 /usr//sbin/eosfilesync /var/eos/md/files.naneosmgr02.in2p3.fr.mdlog root://naneosmgr01.in2p3.fr:1096///var/eos/md/files.naneosmgr02.in2p3.fr.mdlog
daemon   116913  0.0  0.0 440492  8592 ?        Sl   09:09   0:00 /usr//sbin/eosfilesync /var/eos/md/directories.naneosmgr02.in2p3.fr.mdlog root://naneosmgr01.in2p3.fr:1096///var/eos/md/directories.naneosmgr02.in2p3.fr.mdlog
daemon   116933  0.0  0.0 506028  7824 ?        Sl   09:09   0:00 /usr//sbin/eosfilesync /var/eos/md/iostat.naneosmgr02.in2p3.fr.dump root://naneosmgr01.in2p3.fr:1096///var/eos/md/iostat.naneosmgr02.in2p3.fr.dump

JM

Ah, the master syncs to itself … why is that? Do you have both machines defined in /etc/sysconfig/eos_env, like
EOS_MGM_MASTER1="naneosmgr01…"
EOS_MGM_MASTER2="naneosmgr02…"
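
Both points can be checked mechanically. The following is a rough sketch, not an EOS tool: it pulls the destination host out of each eosfilesync command line (the `root://host:port//path` argument seen in the `ps` output above), warns when that host is the local machine, and checks that both variables appear in the environment file. All function names are made up for illustration, and whether a missing variable is really what causes the self-sync is only a guess at this point.

```shell
# Print the destination host of every eosfilesync command line read on stdin.
sync_targets() {
  sed -n 's#.*eosfilesync .* root://\([^:/]*\)[:/].*#\1#p'
}

# Read destination hosts on stdin; return 1 if any equals the given local host.
check_self_sync() {
  local me="$1" target rc=0
  while read -r target; do
    if [ "${target}" = "${me}" ]; then
      echo "eosfilesync pushes to the local host ${me}" >&2
      rc=1
    fi
  done
  return "${rc}"
}

# Return 1 if EOS_MGM_MASTER1 or EOS_MGM_MASTER2 is not assigned in the file.
check_master_vars() {
  local env_file="${1:-/etc/sysconfig/eos_env}" var rc=0
  for var in EOS_MGM_MASTER1 EOS_MGM_MASTER2; do
    grep -Eq "^[[:space:]]*(export[[:space:]]+)?${var}=" "${env_file}" || {
      echo "${var} not set in ${env_file}" >&2
      rc=1
    }
  done
  return "${rc}"
}
```

On each manager this could be run as `ps aux | grep '[e]osfilesync' | sync_targets | check_self_sync "$(hostname -f)"`, followed by `check_master_vars`.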