barbet
(Jean Michel Barbet)
April 9, 2019, 5:55am
1
Hello,
Yesterday I performed a role exchange between two managers running the same EOS version, 4.4.23. The master on SL6 became slave and a newly installed C7 manager was promoted to master. At the third step, when again telling the previous master that the other machine is master, it hung and I had to restart the EOS services.
But I got worried that the files holding the namespace are much smaller on the new master. I am wondering if this is expected. Also I doubt that the synchronization of the namespace files is working correctly. I observe that the now slave does not retrieve a copy of the files from the master. I checked all the daemons, tried to restart them several times.
The questions are:
Is it expected that the namespace files shrink dramatically on a new master?
How can I diagnose/debug the synchronization between master and slave?
I will provide info, but it already seems that the names of the files have changed and now include the xrootd port number for the sync service, which they did not before. Is this a problem?
And is the bug that required commenting out the "::1" line in /etc/hosts still there? (In short: is it still necessary to do so?)
Thank you
JM
apeters
(Andreas Joachim Peters)
April 9, 2019, 7:15am
2
Hi Jean Michel,
the third step normally involves a full namespace restart.
Can you just paste the output of
eos ns
eos ns master
on both machines
and a listing of
ls -la /var/eos/md
Commenting out the ::1 line was only needed with an xrootd version that did not listen on IPv6. If you have 4.8.*, this workaround can be removed.
barbet
(Jean Michel Barbet)
April 9, 2019, 7:59am
3
Hi Andreas,
Here you are:
naneosmgr02 (SL6, was the master before role exchange):
[root@naneosmgr02(EOSSLAVE) ~]#eos ns
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL Files 16633579 [booted] (96s)
ALL Directories 54545
ALL Total boot time 97 s
# ------------------------------------------------------------------------------------
ALL Compactification status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
# ------------------------------------------------------------------------------------
ALL Replication mode=slave-ro state=slave-ro master=naneosmgr01.in2p3.fr configdir=/var/eos/config/naneosmgr01.in2p3.fr/ config=default mgm:naneosmgr01.in2p3.fr=ok mgm:mode=master-rw mq:naneosmgr01.in2p3.fr:1097=ok
ALL Namespace Latency Files 0
ALL Namespace Latency Directories 0
ALL Namespace Pending Updates 0
# ------------------------------------------------------------------------------------
ALL File Changelog Size 2.66 GB
ALL Dir Changelog Size 10.15 MB
# ------------------------------------------------------------------------------------
ALL avg. File Entry Size 160 B
ALL avg. Dir Entry Size 186 B
# ------------------------------------------------------------------------------------
ALL files created since boot 126
ALL container created since boot 0
# ------------------------------------------------------------------------------------
ALL current file id 99421091
ALL current container id 54546
# ------------------------------------------------------------------------------------
ALL eosxd caps 0
ALL eosxd clients 0
# ------------------------------------------------------------------------------------
ALL memory virtual 18.98 GB
ALL memory resident 14.51 GB
ALL memory share 2.68 GB
ALL memory growths 41.84 MB
ALL threads 153
ALL fds 208
ALL uptime 68488
# ------------------------------------------------------------------------------------
ALL drain info id=default, thread_pool_min=6, thread_pool_max=400, thread_pool_size=6, queue_size=0
# ------------------------------------------------------------------------------------
[root@naneosmgr02(EOSSLAVE) ~]#eos ns master
190408 14:54:08 time=1554728048.148155 func=BootNamespace level=NOTE logid=65a4dee8-59fd-11e9-9f9d-14187764bdce unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1890 tident=<service> sec= uid=0 gid=0 name= geo="" eos directory view configure started as slave
190408 14:54:08 time=1554728048.784112 func=BootNamespace level=NOTE logid=65a4dee8-59fd-11e9-9f9d-14187764bdce unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1929 tident=<service> sec= uid=0 gid=0 name= geo="" eos directory view configure stopped after 0 seconds
190408 14:54:08 time=1554728048.784156 func=BootNamespace level=NOTE logid=65a4dee8-59fd-11e9-9f9d-14187764bdce unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1933 tident=<service> sec= uid=0 gid=0 name= geo="" running in slave mode
190408 14:54:08 time=1554728048.789587 func=Activate level=NOTE logid=static.............................. unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1062 tident= sec=(null) uid=99 gid=99 name=- geo="" configdir=/var/eos/config/naneosmgr01.in2p3.fr/ activating master=naneosmgr01.in2p3.fr
190408 14:54:08 time=1554728048.789654 func=Activate level=INFO logid=static.............................. unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1071 tident= sec=(null) uid=99 gid=99 name=- geo="" autoload config=default
190408 14:54:08 time=1554728048.848270 func=Activate level=INFO logid=static.............................. unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f5484b5b720 source=Master:1084 tident= sec=(null) uid=99 gid=99 name=- geo="" Successful auto-load config default
190408 14:55:44 time=1554728144.150081 func=InitializeFileView level=NOTE logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@naneosmgr02.in2p3.fr:1094 tid=00007f53dc377700 source=XrdMgmOfsConfigure:171 tident=<single-exec> sec= uid=0 gid=0 name= geo="" eos namespace file loading stopped after 96 seconds
[root@naneosmgr02(EOSSLAVE) ~]#ls -la /var/eos/md
total 10435260
drwx------. 2 daemon root 4096 Apr 9 09:55 .
drwxr-xr-x. 11 daemon daemon 4096 Apr 8 14:12 ..
-rw-r--r-- 1 daemon daemon 10153784 Mar 21 11:09 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 20785176 Apr 8 13:11 directories.naneosmgr02.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 2663917700 Mar 21 11:10 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 2652405648 Apr 8 13:15 files.naneosmgr02.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 2681690572 Apr 4 08:15 files.naneosmgr02.in2p3.fr.mdlog.1554360503
-rw-r--r-- 1 daemon daemon 2655689636 Apr 8 10:16 files.naneosmgr02.in2p3.fr.mdlog.1554714157
-rw-r--r-- 1 daemon daemon 3210 Apr 4 09:12 iostat.naneosmgr01.in2p3.fr.dump
-rwxr--r-- 1 daemon daemon 2240 Apr 9 09:55 iostat.naneosmgr02.in2p3.fr:1094.dump
-rwxr--r-- 1 daemon daemon 3210 Mar 21 09:25 iostat.naneosmgr02.in2p3.fr.dump
-rw-rw-rw- 1 daemon daemon 469630 Mar 21 09:26 so.mgm.dump
-rw-r--r-- 1 daemon daemon 473564 Apr 9 09:55 so.mgm.dump.naneosmgr02.in2p3.fr:1094
-rw-rw-rw- 1 daemon root 73127 Mar 21 09:26 stacktrace
naneosmgr01 (CentOS7, is the newly installed machine promoted to master):
[root@naneosmgr01(EOSMASTER) ~]#eos ns
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL Files 16581112 [booted] (88s)
ALL Directories 54545
ALL Total boot time 1554714310 s
# ------------------------------------------------------------------------------------
ALL Compactification status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
# ------------------------------------------------------------------------------------
ALL Replication mode=master-rw state=master-rw master=naneosmgr01.in2p3.fr configdir=/var/eos/config/naneosmgr01.in2p3.fr/ config=default mgm:naneosmgr02.in2p3.fr=ok mgm:mode=slave-ro mq:naneosmgr02.in2p3.fr:1097=ok
# ------------------------------------------------------------------------------------
ALL File Changelog Size 26.65 kB
ALL Dir Changelog Size 286.88 kB
# ------------------------------------------------------------------------------------
ALL avg. File Entry Size 0 B
ALL avg. Dir Entry Size 5 B
# ------------------------------------------------------------------------------------
ALL files created since boot 230
ALL container created since boot 0
# ------------------------------------------------------------------------------------
ALL current file id 99424277
ALL current container id 54546
# ------------------------------------------------------------------------------------
ALL eosxd caps 0
ALL eosxd clients 0
# ------------------------------------------------------------------------------------
ALL memory virtual 23.66 GB
ALL memory resident 18.43 GB
ALL memory share 2.67 GB
ALL memory growths 7.40 MB
ALL threads 168
ALL fds 224
ALL uptime 413496
# ------------------------------------------------------------------------------------
ALL drain info id=default, thread_pool_min=6, thread_pool_max=400, thread_pool_size=6, queue_size=0
# ------------------------------------------------------------------------------------
[root@naneosmgr01(EOSMASTER) ~]#eos ns master
190404 15:06:17 time=1554383177.191864 func=BootNamespace level=NOTE logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1890 tident=<service> sec= uid=0 gid=0 name= geo="" eos directory view configure started as slave
190404 15:06:17 time=1554383177.584298 func=BootNamespace level=NOTE logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1929 tident=<service> sec= uid=0 gid=0 name= geo="" eos directory view configure stopped after 0 seconds
190404 15:06:17 time=1554383177.584341 func=BootNamespace level=NOTE logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1933 tident=<service> sec= uid=0 gid=0 name= geo="" running in slave mode
190404 15:06:17 time=1554383177.591555 func=Activate level=NOTE logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1062 tident= sec=(null) uid=99 gid=99 name=- geo="" configdir=/var/eos/config/naneosmgr02.in2p3.fr/ activating master=naneosmgr02.in2p3.fr
190404 15:06:17 time=1554383177.591622 func=Activate level=INFO logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1071 tident= sec=(null) uid=99 gid=99 name=- geo="" autoload config=default
190404 15:06:17 time=1554383177.665151 func=Activate level=INFO logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcddd3878c0 source=Master:1084 tident= sec=(null) uid=99 gid=99 name=- geo="" Successful auto-load config default
190404 15:07:28 time=1554383248.119660 func=InitializeFileView level=NOTE logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcda11fc700 source=XrdMgmOfsConfigure:171 tident=<single-exec> sec= uid=0 gid=0 name= geo="" eos namespace file loading stopped after 71 seconds
190408 11:03:44 time=1554714224.857316 func=TagNamespaceInodes level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2038 tident=<service> sec= uid=0 gid=0 name= geo="" msg="tag namespace inodes"
190408 11:03:44 time=1554714224.857392 func=RedirectToRemoteMaster level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2216 tident=<service> sec= uid=0 gid=0 name= geo="" msg="redirect to remote master"
190408 11:03:44 time=1554714224.857425 func=RedirectToRemoteMaster level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2225 tident=<service> sec= uid=0 gid=0 name= geo="" msg="invoking slave shutdown"
190408 11:03:44 time=1554714224.865277 func=RedirectToRemoteMaster level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2228 tident=<service> sec= uid=0 gid=0 name= geo="" msg="stopped namespace following"
190408 11:03:45 time=1554714225.870895 func=WaitNamespaceFilesInSync level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2059 tident=<service> sec= uid=0 gid=0 name= geo="" msg="check ns file synchronization"
190408 11:03:45 time=1554714225.877853 func=WaitNamespaceFilesInSync level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2086 tident=<service> sec= uid=0 gid=0 name= geo="" remote-sync host=naneosmgr02.in2p3.fr:1096 is reachable
190408 11:03:45 time=1554714225.881939 func=WaitNamespaceFilesInSync level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2132 tident=<service> sec= uid=0 gid=0 name= geo="" remote files file=2652400968 dir=20781148
190408 11:04:35 time=1554714275.882484 func=WaitNamespaceFilesInSync level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2201 tident=<service> sec= uid=0 gid=0 name= geo="" msg="ns files synchronized"
190408 11:04:54 time=1554714294.263313 func=BootNamespace level=NOTE logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:1890 tident=<service> sec= uid=0 gid=0 name= geo="" eos directory view configure started as slave
190408 11:04:54 time=1554714294.675761 func=BootNamespace level=NOTE logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:1929 tident=<service> sec= uid=0 gid=0 name= geo="" eos directory view configure stopped after 0 seconds
190408 11:04:54 time=1554714294.675797 func=BootNamespace level=NOTE logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:1933 tident=<service> sec= uid=0 gid=0 name= geo="" running in slave mode
190408 11:04:54 time=1554714294.675857 func=RebootSlaveNamespace level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2290 tident=<service> sec= uid=0 gid=0 name= geo="" msg="starting file view loader thread"
190408 11:04:54 time=1554714294.675902 func=RebootSlaveNamespace level=NOTE logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fc9b8fff700 source=Master:2314 tident=<service> sec= uid=0 gid=0 name= geo="" running in slave mode
190408 11:06:22 time=1554714382.537216 func=InitializeFileView level=NOTE logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fccca1fc700 source=XrdMgmOfsConfigure:171 tident=<single-exec> sec= uid=0 gid=0 name= geo="" eos namespace file loading stopped after 88 seconds
190408 13:54:46 time=1554724486.504885 func=Activate level=NOTE logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1062 tident= sec=(null) uid=99 gid=99 name=- geo="" configdir=/var/eos/config/naneosmgr01.in2p3.fr/ activating master=naneosmgr01.in2p3.fr
190408 13:54:46 time=1554724486.513374 func=Activate level=NOTE logid=static.............................. unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1113 tident= sec=(null) uid=99 gid=99 name=- geo="" Doing Slave=>Master transition
190408 13:54:46 time=1554724486.583820 func=Slave2Master level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1357 tident=<service> sec= uid=0 gid=0 name= geo="" remote-sync host=naneosmgr02.in2p3.fr:1096 is reachable
190408 13:54:46 time=1554724486.587320 func=Slave2Master level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1444 tident=<service> sec= uid=0 gid=0 name= geo="" msg="invoking slave=>master transition"
190408 13:54:46 time=1554724486.725332 func=Slave2Master level=WARN logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1504 tident=<service> sec= uid=0 gid=0 name= geo="" failed to start eossync services - 1
190408 13:54:46 time=1554724486.725434 func=Slave2Master level=INFO logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1509 tident=<service> sec= uid=0 gid=0 name= geo="" msg="registering new manager to nodes"
190408 13:54:46 time=1554724486.728325 func=Slave2Master level=NOTE logid=6e9b9a26-56da-11e9-bbe6-14187764b113 unit=mgm@naneosmgr01.in2p3.fr:1094 tid=00007fcdd89e3700 source=Master:1514 tident=<service> sec= uid=0 gid=0 name= geo="" running in master mode
[root@naneosmgr01(EOSMASTER) ~]#ls -la /var/eos/md
total 5204968
drwx------ 2 daemon root 4096 Apr 9 09:57 .
drwxr-xr-x. 14 daemon daemon 214 Apr 8 14:54 ..
-rw-r--r-- 1 daemon daemon 286880 Apr 9 09:51 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 20785176 Apr 8 16:51 directories.naneosmgr02.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 26652 Apr 9 09:51 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 2652405648 Apr 8 16:52 files.naneosmgr02.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 2655689636 Apr 8 10:16 files.naneosmgr02.in2p3.fr.mdlog.1554714224
-rwxr--r-- 1 daemon daemon 2218 Apr 9 09:57 iostat.naneosmgr01.in2p3.fr:1094.dump
-rw-r--r-- 1 daemon daemon 3210 Apr 9 09:57 iostat.naneosmgr02.in2p3.fr.dump
-rw-r--r-- 1 daemon daemon 474774 Apr 9 09:57 so.mgm.dump.naneosmgr01.in2p3.fr:1094
-rw-r--r-- 1 daemon root 122880 Apr 4 14:58 stacktrace
Thanks
JM
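The size gap visible in the listings above (files.naneosmgr01.in2p3.fr.mdlog versus files.naneosmgr02.in2p3.fr.mdlog on the new master) can be checked mechanically. This is only a minimal sketch: the function name `check_mdlog_size` and the 1% threshold are illustrative assumptions, not part of EOS, and a smaller changelog is not by itself proof of a problem.

```shell
# Hypothetical helper (not an EOS tool): warn when changelog A is tiny
# compared with changelog B.
# usage: check_mdlog_size <changelog_a> <changelog_b>
check_mdlog_size() {
    local a=$1 b=$2 size_a size_b
    if [ ! -f "$a" ] || [ ! -f "$b" ]; then
        echo "missing changelog" >&2
        return 2
    fi
    # Arithmetic expansion strips any padding that wc may print.
    size_a=$(( $(wc -c < "$a") ))
    size_b=$(( $(wc -c < "$b") ))
    # Arbitrary threshold for illustration: flag A if it is under 1% of B.
    if [ $((size_a * 100)) -lt "$size_b" ]; then
        echo "SUSPECT: $a is $size_a bytes, $b is $size_b bytes"
        return 1
    fi
    echo "OK: $a is $size_a bytes, $b is $size_b bytes"
}
```

On the new master it could be called as `check_mdlog_size /var/eos/md/files.naneosmgr01.in2p3.fr.mdlog /var/eos/md/files.naneosmgr02.in2p3.fr.mdlog`.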
apeters
(Andreas Joachim Peters)
April 9, 2019, 8:29am
4
Ok, you should put the service in read-only mode (set all the filesystems to RO) or make the master a read-only master.
Something went wrong on your machine, because the namespace files on your new master have been truncated. Are you aware of doing anything like that?
It is not too bad, because only 230 files have been created, but to prevent further problems, you should manually copy on the new master:
files.naneosmgr02.in2p3.fr.mdlog => files.naneosmgr01.in2p3.fr.mdlog
directories.naneosmgr02.in2p3.fr.mdlog => directories.naneosmgr01.in2p3.fr.mdlog
Then you stop the new master, set the machine as master using systemd or service scripts and start the service again.
If you have the logging enabled, you can check the file names of the 230 files created since yesterday and either tell latchezar to remove them from the catalogue or you can also recover them.
If you want to recover them, I will write you how ...
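The manual copy described above could be scripted roughly as follows. This is a sketch under assumptions: `restore_mdlogs` and the backup directory are hypothetical, not an EOS command or path, and it should only be run while the MGM is stopped or read-only. It first saves the truncated changelogs, since they record the files created since the role exchange.

```shell
# Hypothetical helper (not an EOS command): back up the truncated changelogs
# named after the new master, then overwrite them with the copies named
# after the old master.
# usage: restore_mdlogs <md_dir> <old_master> <new_master> <backup_dir>
restore_mdlogs() {
    local md=$1 old=$2 new=$3 backup=$4 kind
    mkdir -p "$backup" || return 1
    # Refuse to touch anything unless both source changelogs exist.
    for kind in files directories; do
        if [ ! -f "$md/$kind.$old.mdlog" ]; then
            echo "missing $md/$kind.$old.mdlog" >&2
            return 1
        fi
    done
    for kind in files directories; do
        # Keep the truncated changelog before replacing it.
        if [ -f "$md/$kind.$new.mdlog" ]; then
            cp -p "$md/$kind.$new.mdlog" "$backup/" || return 1
        fi
        cp -p "$md/$kind.$old.mdlog" "$md/$kind.$new.mdlog" || return 1
    done
}
```

With the host names from this thread the call would look like `restore_mdlogs /var/eos/md naneosmgr02.in2p3.fr naneosmgr01.in2p3.fr /root/mdlog-backup`, where `/root/mdlog-backup` is just an example location.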
barbet
(Jean Michel Barbet)
April 9, 2019, 8:50am
5
Thanks Andreas,
I cannot act now, I have an important meeting. I will do it this afternoon at 14:00.
How do I put the master in RO ?
How can I check what were the files created ?
Back at 14:00.
JM
apeters
(Andreas Joachim Peters)
April 9, 2019, 9:05am
6
You just issue once the command to make naneosmgr01 as master on naneosmgr02. This puts mgr02 to masterRO mode and mgr01 stays as slave.
barbet
(Jean Michel Barbet)
April 9, 2019, 9:56am
7
I have to check that I understand well:
Currently naneosmgr01 is the master and naneosmgr02 the slave. You are asking me to do the following ?
[root@naneosmgr02(EOSSLAVE) ~]#eos -b ns master naneosmgr01.in2p3.fr
naneosmgr01 is already master and will not change to slave this way, or am I wrong?
JM
apeters
(Andreas Joachim Peters)
April 9, 2019, 10:18am
8
Sorry, I mixed up the two names
You issue this command on 01 and let it go into MasterRO by pushing the master variable to 02.
root@naneosmgr01(EOSMASTER)# eos -b ns master naneosmgr02.in2p3.fr
barbet
(Jean Michel Barbet)
April 9, 2019, 11:59am
9
Hi Andreas,
I am ready. Here is the situation with NS files:
[root@naneosmgr01(EOSMASTER) ~]#ls -lart /var/eos/md
total 5205000
-rw-r--r-- 1 daemon root 122880 Apr 4 14:58 stacktrace
-rw-r--r-- 1 daemon daemon 2655689636 Apr 8 10:16 files.naneosmgr02.in2p3.fr.mdlog.1554714224
drwxr-xr-x. 14 daemon daemon 214 Apr 8 14:54 ..
-rw-r--r-- 1 daemon daemon 20785176 Apr 8 16:51 directories.naneosmgr02.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 2652405648 Apr 8 16:52 files.naneosmgr02.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 5940 Apr 9 13:46 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 337092 Apr 9 13:46 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 475740 Apr 9 13:52 so.mgm.dump.naneosmgr01.in2p3.fr:1094
-rwxr--r-- 1 daemon daemon 2218 Apr 9 13:52 iostat.naneosmgr01.in2p3.fr:1094.dump
drwx------ 2 daemon root 4096 Apr 9 13:52 .
-rw-r--r-- 1 daemon daemon 3210 Apr 9 13:52 iostat.naneosmgr02.in2p3.fr.dump
[root@naneosmgr02(EOSSLAVE) ~]#ls -lart /var/eos/md
total 10435260
-rwxr--r-- 1 daemon daemon 3210 Mar 21 09:25 iostat.naneosmgr02.in2p3.fr.dump
-rw-rw-rw- 1 daemon daemon 469630 Mar 21 09:26 so.mgm.dump
-rw-rw-rw- 1 daemon root 73127 Mar 21 09:26 stacktrace
-rw-r--r-- 1 daemon daemon 10153784 Mar 21 11:09 directories.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 2663917700 Mar 21 11:10 files.naneosmgr01.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 2681690572 Apr 4 08:15 files.naneosmgr02.in2p3.fr.mdlog.1554360503
-rw-r--r-- 1 daemon daemon 3210 Apr 4 09:12 iostat.naneosmgr01.in2p3.fr.dump
-rw-r--r-- 1 daemon daemon 2655689636 Apr 8 10:16 files.naneosmgr02.in2p3.fr.mdlog.1554714157
-rw-r--r-- 1 daemon daemon 20785176 Apr 8 13:11 directories.naneosmgr02.in2p3.fr.mdlog
-rw-r--r-- 1 daemon daemon 2652405648 Apr 8 13:15 files.naneosmgr02.in2p3.fr.mdlog
drwxr-xr-x. 11 daemon daemon 4096 Apr 8 14:12 ..
-rwxr--r-- 1 daemon daemon 2240 Apr 9 13:52 iostat.naneosmgr02.in2p3.fr:1094.dump
-rw-r--r-- 1 daemon daemon 473993 Apr 9 13:52 so.mgm.dump.naneosmgr02.in2p3.fr:1094
drwx------. 2 daemon root 4096 Apr 9 13:52 .
I believe that the files files.naneosmgr01.in2p3.fr.mdlog and directories.naneosmgr01.in2p3.fr.mdlog on naneosmgr01 are the current namespace, and I suppose that I have to back them up somewhere, as they contain the names of the files created since the role exchange.
The namespace as it was before the role exchange should be in the files files.naneosmgr02.in2p3.fr.mdlog and directories.naneosmgr02.in2p3.fr.mdlog, already on naneosmgr01, and I have to rename them with naneosmgr01 after having put the manager in read-only.
Then if I restart naneosmgr01 as master (RW) it will pick up the files, right ?
Then we look at the synchronization that is currently not working so that naneosmgr02 (slave) gets the NS from naneosmgr01 (master). Right ?
BTW, I did nothing to truncate the NS files, I suppose this is the result of a bad synchro… ?
I will act as soon as you confirm the above.
JM
apeters
(Andreas Joachim Peters)
April 9, 2019, 12:15pm
10
Yes, that is right… in principle you could just append the new changelog to the old one with a small trick… if you can wait 2h, I will send you the info around 4…
barbet
(Jean Michel Barbet)
April 9, 2019, 12:31pm
11
Yes, I think I had better wait and do everything in a clean manner with your support.
JM
apeters
(Andreas Joachim Peters)
April 9, 2019, 1:49pm
12
Jean Michel, can you send me the first 1024 bytes of the /var/eos/md/files.*.mdlog from the new master, i.e. the one which was truncated?
I will then tell you how you can concatenate the files, and everything will be just OK afterwards. You can send it by mail.
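Extracting the first 1024 bytes of a changelog can be done with `head -c`. A minimal sketch; the function name `mdlog_header` and the output path are hypothetical:

```shell
# Hypothetical helper: copy the first 1024 bytes of a changelog to a
# separate file that can be attached to a mail.
# usage: mdlog_header <changelog> <output_file>
mdlog_header() {
    head -c 1024 "$1" > "$2"
}
```

For example `mdlog_header /var/eos/md/files.naneosmgr01.in2p3.fr.mdlog /tmp/files-header.bin`, after which `od -c /tmp/files-header.bin` shows the bytes in readable form.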
barbet
(Jean Michel Barbet)
April 9, 2019, 2:11pm
13
Andreas, I sent the whole file by mail
JM
barbet
(Jean Michel Barbet)
April 10, 2019, 8:30am
14
Hello,
The files that make the namespace have been concatenated following a procedure given by Andreas and the master manager restarted with the new namespace.
Thank you Andreas for your help recovering from a bad situation.
I think however that the synchronization between the master and the slave manager is not working and this has to be checked.
JM
barbet
(Jean Michel Barbet)
April 11, 2019, 7:14am
15
I tried to restart the eossync service on both nodes but it does not change anything: the slave manager does not get an updated copy of the NS files from the master manager.
I forgot which process pushes or pulls the data but it seems that it does not work in the special setting that I currently have : master manager in CentOS7 and slave in SL6.
As the goal is to have both machines in C7, I may very well go directly to the next step which is reinstalling the slave manager in C7…
Ideas, advice ?
Thank you
JM
apeters
(Andreas Joachim Peters)
April 11, 2019, 7:36am
16
You also have to run the server part (eossync is the client, which does the pushing).
To start the server you do:
systemctl restart eos@sync
barbet
(Jean Michel Barbet)
April 11, 2019, 7:57am
17
Hi Andreas,
The sync services, which run as xrootd plugins, were already running, but I restarted them on both the master and the slave. It does not change anything.
Curiously, there is a file iostat.naneosmgr02.in2p3.fr.dump on the master naneosmgr01 that is updated but it does not correspond to the active file on naneosmgr02:
[root@naneosmgr01(EOSMASTER) ~]#ls -alsrt /var/eos/md/iostat.naneosmgr02.in2p3.fr.dump
4 -rw-r--r-- 1 daemon daemon 3210 Apr 11 09:55 /var/eos/md/iostat.naneosmgr02.in2p3.fr.dump
[root@naneosmgr02(EOSSLAVE) ~]#ls -alsrt /var/eos/md/iostat.naneosmgr02.in2p3.fr*
4 -rwxr--r-- 1 daemon daemon 3210 Mar 21 09:25 /var/eos/md/iostat.naneosmgr02.in2p3.fr.dump
4 -rwxr--r-- 1 daemon daemon 2242 Apr 11 09:54 /var/eos/md/iostat.naneosmgr02.in2p3.fr:1094.dump
… and no synchro for the NS files, nor the config.
JM
apeters
(Andreas Joachim Peters)
April 11, 2019, 8:10am
18
Oh,
can you have a look at the log files in /var/log/eos/sync/
There is a log file of the server xrdlog.sync
but there are also log files for the clients … do you have the eosfilesync commands running?
ps aux | grep eosfilesync
barbet
(Jean Michel Barbet)
April 11, 2019, 8:28am
19
Logfiles /var/log/eos/sync/ do not say much:
On master:
[root@naneosmgr01(EOSMASTER) ~]#tail -50 /var/log/eos/sync/xrdlog.sync | grep -v nangios
[...]
190411 09:48:49 138119 Starting on Linux 3.10.0-957.1.3.el7.x86_64
Copr. 2004-2012 Stanford University, xrd version v4.8.4
++++++ xrootd sync@naneosmgr01.in2p3.fr initialization started.
Config using configuration file /etc/xrd.cf.sync
=====> xrd.network keepalive
=====> xrd.port 1096
Config maximum number of connections restricted to 65000
Copr. 2012 Stanford University, xrootd protocol 3.1.0 version v4.8.4
++++++ xrootd protocol initialization started.
=====> xrootd.async off nosf
=====> xrootd.seclib libXrdSec.so
=====> all.export /var/eos/ nolock
Config exporting /var/eos/
Plugin loaded
++++++ Authentication system initialization started.
Plugin loaded
=====> sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
Config warning: protocol sss previously defined.
=====> sec.protocol sss
Config 2 authentication directives processed in /etc/xrd.cf.sync
------ Authentication system initialization completed.
++++++ Protection system initialization started.
Config warning: Security level is set to none; request protection disabled!
Config Local protection level: none
Config Remote protection level: none
------ Protection system initialization completed.
Config Routing for naneosmgr01.in2p3.fr: local pub4 prv4 pub6 prv6
Config Route all4: naneosmgr01.in2p3.fr Dest=[::193.48.101.204]:1096
Config Route all6: naneosmgr01.in2p3.fr Dest=[2001:660:7224:100:193:48:101:204]:1096
++++++ File system initialization started.
=====> ofs.trace open close
++++++ Storage system initialization started.
=====> all.export /var/eos/ nolock
Config effective /etc/xrd.cf.sync oss configuration:
oss.alloc 0 0 0
oss.cachescan 600
oss.fdlimit 32500 65000
oss.maxsize 0
oss.trace 0
oss.xfr 1 deny 10800 keep 1200
oss.memfile off max 33565681664
oss.defaults r/w nocheck nodread nomig norcreate nopurge nostage xattr
oss.path /var/eos/ r/w nocheck nodread nomig norcreate nopurge nostage xattr
------ Storage system initialization completed.
Config effective /etc/xrd.cf.sync ofs configuration:
all.role server
ofs.maxdelay 60
ofs.persist manual hold 600 logdir /tmp/sync/.ofs/posc.log
ofs.trace 4
------ File system server initialization completed.
Config warning: asynchronous I/O has been disabled!
Config warning: sendfile I/O has been disabled!
Config warning: 'xrootd.prepare logdir' not specified; prepare tracking disabled.
------ xrootd protocol initialization completed.
------ xrootd sync@naneosmgr01.in2p3.fr:1096 initialization completed.
190411 09:48:49 138126 XrootdXeq: daemon.116933:20@naneosmgr02 pub IP64 login as daemon
190411 09:48:49 138126 daemon.116933:20@naneosmgr02 ofs_open: 2-664 fn=/var/eos/md/iostat.naneosmgr02.in2p3.fr.dump
190411 09:54:18 138137 XrootdXeq: daemon.135909:22@naneosmgr pub IP64 login as daemon
190411 09:54:18 138137 daemon.135909:22@naneosmgr ofs_open: 2-664 fn=/var/eos/md/files.naneosmgr01.in2p3.fr.mdlog
190411 09:54:18 138297 XrootdXeq: daemon.135910:23@naneosmgr pub IP64 login as daemon
190411 09:54:18 138297 daemon.135910:23@naneosmgr ofs_open: 2-664 fn=/var/eos/md/directories.naneosmgr01.in2p3.fr.mdlog
On slave:
[root@naneosmgr02(EOSSLAVE) ~]#tail -50 /var/log/eos/sync/xrdlog.sync | grep -v nangios
=====> all.export /var/eos/ nolock
Config exporting /var/eos/
Plugin loaded
++++++ Authentication system initialization started.
Plugin loaded
=====> sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
Config warning: protocol sss previously defined.
=====> sec.protocol sss
Config 2 authentication directives processed in /etc/xrd.cf.sync
------ Authentication system initialization completed.
++++++ Protection system initialization started.
Config warning: Security level is set to none; request protection disabled!
Config Local protection level: none
Config Remote protection level: none
------ Protection system initialization completed.
Config Routing for naneosmgr02.in2p3.fr: local pub4 prv4 pub6 prv6
Config Route all4: naneosmgr02.in2p3.fr Dest=[::193.48.101.205]:1096
Config Route all6: naneosmgr02.in2p3.fr Dest=[2001:660:7224:100:193:48:101:205]:1096
++++++ File system initialization started.
=====> ofs.trace open close
++++++ Storage system initialization started.
=====> all.export /var/eos/ nolock
Config effective /etc/xrd.cf.sync oss configuration:
oss.alloc 0 0 0
oss.cachescan 600
oss.fdlimit 32500 65000
oss.maxsize 0
oss.trace 0
oss.xfr 1 deny 10800 keep 1200
oss.memfile off max 33650980864
oss.defaults r/w nocheck nodread nomig norcreate nopurge nostage xattr
oss.path /var/eos/ r/w nocheck nodread nomig norcreate nopurge nostage xattr
------ Storage system initialization completed.
Config effective /etc/xrd.cf.sync ofs configuration:
all.role server
ofs.maxdelay 60
ofs.persist manual hold 600 logdir /tmp/sync/.ofs/posc.log
ofs.trace 4
------ File system server initialization completed.
Config warning: asynchronous I/O has been disabled!
Config warning: sendfile I/O has been disabled!
Config warning: 'xrootd.prepare logdir' not specified; prepare tracking disabled.
------ xrootd protocol initialization completed.
------ xrootd sync@naneosmgr02.in2p3.fr:1096 initialization completed.
Processes are running:
[root@naneosmgr01(EOSMASTER) ~]#ps aux | grep eosfilesync
daemon 135909 0.0 0.0 217896 23268 ? Ssl 09:08 0:00 /usr//sbin/eosfilesync /var/eos/md/files.naneosmgr01.in2p3.fr.mdlog root://naneosmgr01.in2p3.fr:1096///var/eos/md/files.naneosmgr01.in2p3.fr.mdlog
daemon 135910 21.1 0.4 488232 299876 ? Ssl 09:08 16:45 /usr//sbin/eosfilesync /var/eos/md/directories.naneosmgr01.in2p3.fr.mdlog root://naneosmgr01.in2p3.fr:1096///var/eos/md/directories.naneosmgr01.in2p3.fr.mdlog
daemon 135913 0.0 0.0 141992 12476 ? Ss 09:08 0:00 /usr//sbin/eosfilesync /var/eos/md/iostat.naneosmgr01.in2p3.fr.dump root://naneosmgr01.in2p3.fr:1096///var/eos/md/iostat.naneosmgr01.in2p3.fr.dump
… and on the slave:
[root@naneosmgr02(EOSSLAVE) ~]#ps aux | grep eosfilesync
daemon 116893 0.0 0.0 440492 10096 ? Sl 09:09 0:00 /usr//sbin/eosfilesync /var/eos/md/files.naneosmgr02.in2p3.fr.mdlog root://naneosmgr01.in2p3.fr:1096///var/eos/md/files.naneosmgr02.in2p3.fr.mdlog
daemon 116913 0.0 0.0 440492 8592 ? Sl 09:09 0:00 /usr//sbin/eosfilesync /var/eos/md/directories.naneosmgr02.in2p3.fr.mdlog root://naneosmgr01.in2p3.fr:1096///var/eos/md/directories.naneosmgr02.in2p3.fr.mdlog
daemon 116933 0.0 0.0 506028 7824 ? Sl 09:09 0:00 /usr//sbin/eosfilesync /var/eos/md/iostat.naneosmgr02.in2p3.fr.dump root://naneosmgr01.in2p3.fr:1096///var/eos/md/iostat.naneosmgr02.in2p3.fr.dump
JM
apeters
(Andreas Joachim Peters)
April 11, 2019, 8:44am
20
Ah, the master syncs to itself … why is that? Do you have both machines defined in /etc/sysconfig/eos_env like
EOS_MGM_MASTER1="naneosmgr01…"
EOS_MGM_MASTER2="naneosmgr02…"
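The self-sync can be spotted directly in the `ps aux | grep eosfilesync` output quoted earlier in the thread, where the last field of each line is the destination URL. A minimal sketch, assuming that output layout; `find_self_sync` is a hypothetical helper, not an EOS tool:

```shell
# Hypothetical helper: read "ps aux" lines on stdin and report eosfilesync
# processes whose destination URL points back at the given local host.
# usage: ps aux | find_self_sync <local_fqdn>
find_self_sync() {
    awk -v me="$1" '
        /eosfilesync/ {
            url = $NF
            # Skip lines without a root:// destination (e.g. the grep itself).
            if (url !~ "^root://") next
            host = url
            sub("^root://", "", host)   # drop the scheme
            sub("[:/].*$", "", host)    # drop port and path
            if (host == me) print "self-sync: " $(NF-1) " -> " url
        }'
}
```

On the master this would be run as `ps aux | find_self_sync naneosmgr01.in2p3.fr`; any line printed is a client pushing to its own host rather than to the other manager.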