EOS mgm with qdb crashes

hello once again! I figured this belongs in a new thread, since it’s a different question.

I followed the process described in the previous thread:

  1. ran eos_ns_convert
  2. created new qdb directory
  3. moved new raft-journal into qdb directory with existing state_machine
  4. copied data to all nodes (3 total)
  5. started everything up in raft mode
  6. quarkdb on its own seemed stable enough
  7. started up eos mq & mgm
  8. eos mgm booted okay
EOS Console [root://localhost] |/> ns
# ------------------------------------------------------------------------------------
# Namespace Statistics
# ------------------------------------------------------------------------------------
ALL      Files                            105792732 [booted] (11s)
ALL      Directories                      20066027
# ------------------------------------------------------------------------------------
ALL      Compactification                 status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
# ------------------------------------------------------------------------------------
ALL      Replication                      mode=master-rw state=master-rw master=crlt-c2.cdndev.aarnet.edu.au configdir=/var/eos/config/crlt-c2.cdndev.aarnet.edu.au/ config=default active=true mgm:gdpt-c2.cdndev.aarnet.edu.au=down mq:gdpt-c2.cdndev.aarnet.edu.au:1097=down
# ------------------------------------------------------------------------------------
ALL      File Changelog Size
ALL      Dir  Changelog Size
# ------------------------------------------------------------------------------------
ALL      avg. File Entry Size
ALL      avg. Dir  Entry Size
# ------------------------------------------------------------------------------------
ALL      files created since boot         5
ALL      container created since boot     8
# ------------------------------------------------------------------------------------
ALL      current file id                  1213967100
ALL      current container id             22635899
# ------------------------------------------------------------------------------------
ALL      memory virtual                   1.43 GB
ALL      memory resident                  213.51 MB
ALL      memory share                     28.11 MB
ALL      memory growths                   0 B
ALL      threads                          101
ALL      uptime                           23
# ------------------------------------------------------------------------------------

(yaaay)
(really enjoying that boot time and amount of memory used :ok_hand: )
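Since I’ll be watching these numbers for a while, a quick way to scrape the counters out of the `eos ns` output shown above (field positions are an assumption and may shift between EOS versions):

```shell
# Sample of the `eos ns` output above, trimmed to the lines we parse.
ns_output='ALL      Files                            105792732 [booted] (11s)
ALL      Directories                      20066027
ALL      current file id                  1213967100
ALL      current container id             22635899'

# Pull each counter out by its label; the numeric field position
# matches the sample output above.
files=$(echo "$ns_output"   | awk '/Files/       {print $3}')
dirs=$(echo "$ns_output"    | awk '/Directories/ {print $3}')
max_fid=$(echo "$ns_output" | awk '/current file id/      {print $5}')
max_cid=$(echo "$ns_output" | awk '/current container id/ {print $5}')

echo "files=$files dirs=$dirs max_fid=$max_fid max_cid=$max_cid"
```

On a live instance you’d pipe `eos ns` into the same awk filters instead of the sample string.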

At some point, though, I started seeing the mgm crash:

Thread 86 (Thread 0x7f1e89ff1700 (LWP 131)):
#0  0x00007f1ebdad9e4d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f1ebdf16929 in XrdSysTimer::Wait(int) ()
   from /lib64/libXrdUtils.so.2
#2  0x00007f1eb7ee3e51 in eos::common::ShellCmd::wait(unsigned long) ()
   from /lib64/libeosCommonServer.so.4
#3  0x00007f1eb8555198 in eos::common::StackTrace::GdbTrace(char const*, int, char const*) () from /lib64/libXrdEosMgm.so
#4  0x00007f1eb84f7cfb in xrdmgmofs_stacktrace(int) ()
   from /lib64/libXrdEosMgm.so
#5  <signal handler called>
#6  0x00007f1ebcd151f7 in raise () from /lib64/libc.so.6
#7  0x00007f1ebcd168e8 in abort () from /lib64/libc.so.6
#8  0x00007f1ebd61bac5 in __gnu_cxx::__verbose_terminate_handler() ()
   from /lib64/libstdc++.so.6
#9  0x00007f1ebd619a36 in ?? () from /lib64/libstdc++.so.6
#10 0x00007f1ebd619a63 in std::terminate() () from /lib64/libstdc++.so.6
#11 0x00007f1e96bcea26 in qclient::BackgroundFlusher::FlusherCallback::handleResponse(std::shared_ptr<redisReply>&&) () from /usr/lib64/libEosNsQuarkdb.so
#12 0x00007f1e96ab4fe4 in qclient::CallbackExecutorThread::main(qclient::ThreadAssistant&) () from /usr/lib64/libEosNsQuarkdb.so
#13 0x00007f1eb753ae2f in execute_native_thread_routine ()
   from /opt/eos-folly/lib/libfolly.so
#14 0x00007f1ebdad2e25 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f1ebcdd834d in clone () from /lib64/libc.so.6

and on the quarkdb nodes:

watf? exit

at least, I think that message is from quarkdb…

I’m assuming I missed some kind of configuration, maybe on the MGM side?

More stack traces I found in the qdb logs:

terminate called after throwing an instance of 'quarkdb::FatalException'
  what():  assertion violation, condition is not true: commit == 0 ----- Stack trace (most recent call last) in thread 71:
#8    Object ", at 0xffffffffffffffff, in
#7    Object ", at 0x7fda5f8a334c, in
#6    Object ", at 0x7fda6059de24, in
#5    Object ", at 0x7fda5bc5159e, in
#4    Object ", at 0x7fda5ba4c096, in
#3    Object ", at 0x7fda5ba4c00a, in
#2    Object ", at 0x7fda5ba4bb9b, in
#1    Object ", at 0x7fda5b9fab37, in
#0    Object ", at 0x7fda5b9c4693, in

Stack trace (most recent call last) in thread 71:
#8    Object ", at 0xffffffffffffffff, in
#7    Object ", at 0x7fda5f8a334c, in
#6    Object ", at 0x7fda6059de24, in
#5    Object ", at 0x7fda5bc515cd, in
#4    Object ", at 0x7fda600e4a62, in
#3    Object ", at 0x7fda600e4a35, in
#2    Object ", at 0x7fda600e6ac4, in
#1    Object ", at 0x7fda5f7e18e7, in
#0    Object ", at 0x7fda5f7e01f7, in
Aborted (Signal sent by tkill() 1 0)

I don’t think I saw issues before I started the mgm, but restarting quarkdb without the MGM now continuously crashes with this error.

(I probably should install debuginfos… brb)

Hi Crystal,

Which QDB versions are you using?

It looks to me like an entry has been inserted into the journal which some of the nodes do not understand, and they crash as a result. Are you running the same QDB version across all nodes?
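One quick way to compare is to ask each node for its version over the redis protocol; this is a sketch, assuming `quarkdb-info` reports a VERSION line (the sample reply below is illustrative and the exact field layout may differ between QDB releases):

```shell
# Sample `quarkdb-info` reply -- field layout is an assumption; on a
# live node you would run something like:
#   redis-cli -h <node> -p 7777 quarkdb-info
info='NODE-ID qdb1:7777
VERSION 0.2.3
RAFT true'

# Extract the VERSION field; do this on every node and compare results.
version=$(echo "$info" | awk '$1 == "VERSION" {print $2}')
echo "version=$version"
```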

Cheers,
Georgios

sorry!! I didn’t get a chance to look into this again for a couple of days.

QDB version is 0.2.3-1.el7.cern, same on all nodes.

When I restarted the mgm, it didn’t crash, but it complained about this:

initialization returned ec=17 SafetyCheck FATAL: Risk of data loss, found container with id bigger than max container id
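From the wording, that SafetyCheck looks like a boot-time invariant: no container id in the namespace may exceed the stored “current container id” counter, so a stale or mismatched state-machine would trip it. A sketch of the idea (hypothetical ids, not the actual MGM code):

```shell
# Illustration only: the invariant the MGM seems to enforce at boot.
max_counter=22635899                     # "current container id" from `eos ns`
found_ids="22635801 22635899 22640000"   # hypothetical ids found in QDB

# Collect any ids that exceed the stored counter -- these would
# trigger the "Risk of data loss" fatal above.
violations=""
for id in $found_ids; do
  if [ "$id" -gt "$max_counter" ]; then
    violations="$violations$id "
  fi
done
echo "ids above counter: ${violations:-none}"
```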

anyway, I shut everything down, cleaned out everything on the mgm side, reloaded a clean state-machine on all the qdb nodes, and everything seems okay again now. going to leave it for a while and see how long it survives!

I mostly just wanted to work through the namespace switchover process - not entirely sure what went wrong the first time :confused: