
Geobalancer crash


(Andrey Zarochentsev) #1

Good day!

In my experience, the EOS geobalancer worked well in version 4.1.11.
Now, after upgrading to version 4.2.22, it does not work anymore.

After some investigation I found the following:

The log file /var/log/eos/mgm/GeoBalancer.log was empty.
After turning debug on:
{
EOS Console [root://localhost] |/> debug debug /eos//mgm
success: switched to mgm.debuglevel=debug on nodes mgm.nodename=/eos/
/mgm
EOS Console [root://localhost] |/>
}

I see this:
{
/var/log/eos/mgm/GeoBalancer.log
180516 15:23:04 DEBUG GeoBalancer:648 Converter is off for! It needs to be on for the geotag balancer to work. space=default
}

It seems the converter needs to be turned on as well:
{
EOS Console [root://localhost] |/> space config default space.converter=on
success: converter is enabled!
EOS Console [root://localhost] |/>
}

After that, the geobalancer appears to start working.
From /var/log/eos/mgm/GeoBalancer.log:
{
180516 15:24:24 INFO GeoBalancer:663 geobalancer is enabled ntx=10
180516 15:24:24 INFO GeoBalancer:408 scheduledtransfers=0
180516 15:24:24 INFO GeoBalancer:273 New average calculated: average=16.39 %
180516 15:24:24 INFO GeoBalancer:205 geotag=ITEP average=35.87
180516 15:24:24 INFO GeoBalancer:205 geotag=JINR average=9.86
180516 15:24:24 INFO GeoBalancer:205 geotag=KIAE average=3.45
180516 15:24:24 DEBUG GeoBalancer:539 Couldn't choose any FID to schedule: failedgeotag=ITEP
}
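
To make the numbers in this log easier to read, here is a minimal Python sketch (not part of EOS) that parses the lines above. The overall average of 16.39 % matches the plain mean of the three per-geotag values, and ITEP is the only geotag above it. The function names and the 5-point threshold are assumptions for illustration only; the log does not show how the geobalancer itself applies its threshold.

```python
import re

# Lines copied from the GeoBalancer.log excerpt above.
LOG = """\
180516 15:24:24 INFO GeoBalancer:273 New average calculated: average=16.39 %
180516 15:24:24 INFO GeoBalancer:205 geotag=ITEP average=35.87
180516 15:24:24 INFO GeoBalancer:205 geotag=JINR average=9.86
180516 15:24:24 INFO GeoBalancer:205 geotag=KIAE average=3.45
"""

AVG_RE = re.compile(r"New average calculated: average=([\d.]+)")
TAG_RE = re.compile(r"geotag=(\S+) average=([\d.]+)")


def parse_geobalancer_log(text):
    """Return (overall_average, {geotag: fill_percent}) from log text."""
    overall = None
    fills = {}
    for line in text.splitlines():
        match = AVG_RE.search(line)
        if match:
            overall = float(match.group(1))
            continue
        match = TAG_RE.search(line)
        if match:
            fills[match.group(1)] = float(match.group(2))
    if overall is None:
        raise ValueError("no overall average found in the log text")
    return overall, fills


def classify(overall, fills, threshold=5.0):
    """Label each geotag relative to the overall average.

    The threshold (in percentage points) is a hypothetical value chosen
    for this example, not one read from the EOS configuration.
    """
    status = {}
    for tag, fill in fills.items():
        if fill > overall + threshold:
            status[tag] = "over"
        elif fill < overall - threshold:
            status[tag] = "under"
        else:
            status[tag] = "balanced"
    return status


if __name__ == "__main__":
    overall, fills = parse_geobalancer_log(LOG)
    for tag, label in classify(overall, fills).items():
        print(f"{tag}: {fills[tag]:.2f} % ({label}, average {overall:.2f} %)")
```

With these lines, ITEP is labelled "over" and JINR and KIAE "under", which seems consistent with ITEP being the geotag named in the failedgeotag message.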

But after a short time the MGM crashes, and the whole EOS instance goes down.
The MGM process remains in memory, but every connection attempt fails with a timeout.
From /var/log/eos/mgm/xrdlog.mgm:
{
#########################################################################

-----------------------------------------------------------------------

Responsible thread =>

-----------------------------------------------------------------------

Thread 22 (Thread 0x7f8235e13700 (LWP 10592)):

#########################################################################
#5
#6 0x00007f82af551658 in eos::mgm::GeoBalancer::chooseFidFromGeotag(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib64/libXrdEosMgm.so
#7 0x00007f82af555077 in eos::mgm::GeoBalancer::prepareTransfer() () from /usr/lib64/libXrdEosMgm.so
#8 0x00007f82af55522b in eos::mgm::GeoBalancer::prepareTransfers(int) () from /usr/lib64/libXrdEosMgm.so
#9 0x00007f82af555d61 in eos::mgm::GeoBalancer::GeoBalance() () from /usr/lib64/libXrdEosMgm.so
#10 0x00007f82b56ea53f in XrdSysThread_Xeq () from /usr/lib64/libXrdUtils.so.2
#11 0x0000003bc3407aa1 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003bc30e8bcd in clone () from /lib64/libc.so.6
}

Do you have any clues?

Cheers,
Andrey


(Andrey Zarochentsev) #2

Oops… I think I have found the cause of the MGM crash.

http://eos-docs.web.cern.ch/eos-docs/configuration/groupbalancer.html?highlight=balancer

{
Make sure that you have enabled the converter and the converter.ntx space variable is bigger than geobalancer.ntx :
}

After the change:
{
EOS Console [root://localhost] |/> space config default space.converter.ntx=15
success: setting converter.ntx=15
EOS Console [root://localhost] |/>
}
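
The rule quoted from the documentation can be written down as a small check. This is only a sketch in Python, not an EOS tool: the function name and the hand-filled dictionary are assumptions. The value geobalancer.ntx=10 comes from the "geobalancer is enabled ntx=10" log line and converter.ntx=15 from the change above.

```python
def check_geobalancer_prereqs(space_config):
    """Return a list of problems; an empty list means the quoted rule holds.

    space_config is a hand-filled dict of space variables (hypothetical
    input format, not something produced by EOS itself).
    """
    problems = []
    if space_config.get("converter") != "on":
        problems.append("converter is not enabled")
    converter_ntx = int(space_config.get("converter.ntx", 0))
    geobalancer_ntx = int(space_config.get("geobalancer.ntx", 0))
    if converter_ntx <= geobalancer_ntx:
        problems.append(
            f"converter.ntx={converter_ntx} is not bigger than "
            f"geobalancer.ntx={geobalancer_ntx}"
        )
    return problems


if __name__ == "__main__":
    # Values taken from this thread after the change.
    after_change = {"converter": "on", "converter.ntx": 15, "geobalancer.ntx": 10}
    print(check_geobalancer_prereqs(after_change))
```

For the values after the change the list of problems is empty, so the documented condition is satisfied.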

I have not seen any further problems with the MGM, but the geobalancer is again not working :)