MGM crash loop

We started a project today to evacuate data from an FST. The first step was adding a new FST with 90 new FSIDs. Almost immediately the MGM began crashing constantly, and it's basically in a loop at this point. I can't tell exactly where it's failing, but it gets a signal 11 (or 6), dumps, and restarts. This goes on for some number of hours before something (seemingly systemd?) gives up and just stops it.

Log snippet at https://flexo-external.lbl.gov/mgm.txt

Hi John,

Could you paste a few hundred lines after the stacktrace itself, something that looks like the example below? This helps us narrow down where the call comes from.

 <signal handler called>
#6  0x00007f20b49da907 in eos::mgm::GeoTreeEngine::accessHeadReplicaMultipleGroup (this=0x7f20aaafb000, nAccessReplicas=@0x7f1ac65af460: 1, 
    fsIndex=@0x7f1ac65af738: 0, existingReplicas=0x7f1ac65afba0, 
    inode=1162884456, dataProxys=0x7f1ac65afbe0, 
    firewallEntryPoint=0x7f1ac65afc00, 
    type=eos::mgm::GeoTreeEngine::regularRO, accesserGeotag="test", 
    forcedFsId=@0x7f1ac65af470: 0, unavailableFs=0x7f1ac65afc20)
    at /root/rpmbuild/BUILD/eos-4.8.62-1/mgm/GeoTreeEngine.cc:1971

...

Hi John,

On top of what Abishek already mentioned, can you double-check that the following package is installed on the MGM machine: devtoolset-8-gdb? It should help with getting a full stacktrace when the MGM crashes. The stacktraces are stored in /var/eos/md/. If you have any recent files in there whose names start with stacktrace-DATE, please send us the most recent one.
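For reference, the two checks above could be scripted roughly as follows. This is only a sketch: the function names latest_stacktrace and check_gdb_pkg are made up for illustration, and it assumes an RPM-based host where the stacktrace files match /var/eos/md/stacktrace-*.

```shell
# Print the most recently modified stacktrace-* file in a directory.
# The directory defaults to /var/eos/md, the location mentioned above.
latest_stacktrace() {
  local dir="${1:-/var/eos/md}"
  # ls -t sorts newest first; stderr is silenced in case nothing matches.
  ls -1t "$dir"/stacktrace-* 2>/dev/null | head -n 1
}

# Report whether the gdb package is installed.
# rpm -q exits non-zero when the package is missing.
check_gdb_pkg() {
  rpm -q devtoolset-8-gdb
}
```

Running check_gdb_pkg and then latest_stacktrace on the MGM host should show the package version and the newest stacktrace file, if there is one.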

Can you also let us know the exact version of EOS you are running, both on the MGM and on the FSTs?
Are you absolutely sure you are not, by any chance, trying to add the same file system identifier twice?
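One possible way to rule out a duplicated file system identifier is to scan a key=value listing of the file systems for repeated id= values. The sketch below assumes that "eos fs ls -m" prints one file system per line with an id=<number> field; please verify that against your version. The function name duplicate_fsids is hypothetical.

```shell
# Read key=value lines on stdin (one file system per line) and print every
# id= value that appears more than once. Only fields that are exactly
# id=<number> are considered, so uuid=... is not picked up by mistake.
duplicate_fsids() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^id=[0-9]+$/) { sub(/^id=/, "", $i); print $i }
  }' | sort -n | uniq -d
}

# Possible usage on the MGM (assumes the -m output format described above):
#   eos fs ls -m | duplicate_fsids
```

Empty output would mean no identifier shows up twice in the listing.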

Thanks,
Elvin

https://flexo-external.lbl.gov/xrdlog.mgm.gdb.gz

(thanks, that package was missing)

Hi John,

What is the output of the following command?
eos fileinfo fid:267781660

What is the status of the file system that holds the replica for the above file?
eos fs ls <fsid_from_above>
eos fs status <fsid_from_above>
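The steps above could be chained roughly like this. It is a sketch only: the helper names fid_to_fxid and replica_fsids are made up, and the parsing assumes the human-readable fileinfo layout, where each replica row starts with a row number followed by the fs-id.

```shell
# Convert a decimal file id into the 8-digit hex form that fileinfo
# prints as Fxid.
fid_to_fxid() {
  printf '%08x\n' "$1"
}

# Read `eos fileinfo` text on stdin and print the fs-id of each replica row.
# A replica row is assumed to begin with two numeric columns (no., fs-id)
# and to carry at least nine columns in total.
replica_fsids() {
  awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && NF >= 9 { print $2 }'
}

# Possible usage on the MGM:
#   eos fileinfo fid:267781660 | replica_fsids | while read -r fsid; do
#     eos fs status "$fsid"
#   done
```

fid_to_fxid is only a convenience for cross-checking the decimal Fid against the hex Fxid that fileinfo prints for the same file.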

Is this a new file system that was added recently?

Cheers,
Elvin

-bash-4.2# eos fileinfo fid:267781660
  File: '/eos/alicelblhpcs/grid/11/19045/98aad5d6-eadc-11ed-8086-b82a72dfdc5b'  Flags: 0644  Clock: 17775807955d9e71
  Size: 224644592
Status: healthy
Modify: Thu May  4 17:34:43 2023 Timestamp: 1683246883.369223000
Change: Thu May  4 17:34:30 2023 Timestamp: 1683246870.314002166
 Birth: Thu May  4 17:34:30 2023 Timestamp: 1683246870.314002166
  CUid: 900 CGid: 900 Fxid: 0ff6061c Fid: 267781660 Pid: 23122 Pxid: 00005a52
XStype: adler    XS: 6d f6 b2 ec    ETAGs: "71882092010536960:6df6b2ec"
Layout: plain Stripes: 1 Blocksize: 4k LayoutId: 00100002 Redundancy: d1::t0 
  #Rep: 1
┌───┬──────┬────────────────────────┬────────────────┬────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│            path│      boot│  configstatus│       drain│  active│                  geotag│
└───┴──────┴────────────────────────┴────────────────┴────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0       97       alicefst01.lbl.gov        default.0          /data31     booted             rw      nodrain   online                  50B1275 

*******
-bash-4.2# eos fs ls 97
┌────────────────────────┬────┬──────┬────────────────────────────────┬────────────────┬────────────────┬────────────┬──────────────┬────────────┬──────┬────────┬────────────────┐
│host                    │port│    id│                            path│      schedgroup│          geotag│        boot│  configstatus│       drain│ usage│  active│          health│
└────────────────────────┴────┴──────┴────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────────┴────────────┴──────┴────────┴────────────────┘
 alicefst00.lbl.gov       1095      6                           /data5        default.0          50B1275       booted             rw      nodrain  78.97   online               OK 
 alicefst01.lbl.gov       1095     97                          /data31        default.0          50B1275       booted             rw      nodrain  80.46   online               OK 
 alicefst00.lbl.gov       1095    150                          /data84        default.0          50B1275       booted             rw      nodrain  78.97   online               OK 
 alicefst00.lbl.gov       1095    162                          /data96        default.0          50B1275       booted             rw      nodrain  78.97   online               OK 
 alicefst00.lbl.gov       1095    163                          /data97        default.0          50B1275       booted             rw      nodrain  78.18   online               OK 
 alicefst00.lbl.gov       1095    168                         /data102        default.0          50B1275       booted             rw      nodrain  77.97   online               OK 
 alicefst01.lbl.gov       1095    197                          /data71        default.0          50B1275       booted             rw      nodrain  80.85   online               OK 
 alicefst01.lbl.gov       1095    223                          /data97        default.0          50B1275       booted             rw      nodrain  80.82   online               OK 
 alicefst03.lbl.gov       1095    297                           /data1        default.0          50B1275       booted            off      nodrain   0.17   online               OK 

-bash-4.2# eos fs status 97
# ------------------------------------------------------------------------------------
# FileSystem Variables
# ------------------------------------------------------------------------------------
bootcheck                        := 0
bootsenttime                     := 1534635679
configstatus                     := rw
drainperiod                      := 86400
forcegeotag                      := 50B1275
graceperiod                      := 86400
headroom                         := 5100000000
host                             := alicefst01.lbl.gov
hostport                         := alicefst01.lbl.gov:1095
id                               := 97
local.drain                      := nodrain
path                             := /data31
port                             := 1095
queue                            := /eos/alicefst01.lbl.gov:1095/fst
queuepath                        := /eos/alicefst01.lbl.gov:1095/fst/data31
scaninterval                     := 1814400
schedgroup                       := default.0
stat.active                      := online
stat.balancer.running            := 0
stat.boot                        := booted
stat.disk.bw                     := 235
stat.disk.iops                   := 78
stat.disk.load                   := 0.000000
stat.disk.readratemb             := 0.000000
stat.disk.writeratemb            := 0.000000
stat.fsck.blockxs_err            := 16
stat.fsck.d_cx_diff              := 0
stat.fsck.d_mem_sz_diff          := 229
stat.fsck.d_sync_n               := 166602
stat.fsck.m_cx_diff              := 0
stat.fsck.m_mem_sz_diff          := 2151
stat.fsck.m_sync_n               := 166611
stat.fsck.mem_n                  := 166618
stat.fsck.orphans_n              := 2396
stat.fsck.rep_diff_n             := 1
stat.fsck.rep_missing_n          := 4791
stat.fsck.unreg_n                := 0
stat.geotag                      := 50B1275
stat.health                      := OK
stat.health.drives_failed        := 0
stat.health.drives_total         := 1
stat.health.indicator            := N/A
stat.health.redundancy_factor    := 1
stat.http.port                   := 8001
stat.nominal.filled              := 79.421005
stat.publishtimestamp            := 1690917004048
stat.ropen                       := 0
stat.ropen.hotfiles              :=  
stat.statfs.bavail               := 478203951
stat.statfs.bfree                := 478203951
stat.statfs.blocks               := 2441087488
stat.statfs.bsize                := 4096
stat.statfs.capacity             := 9998694350848
stat.statfs.ffree                := 976453002
stat.statfs.files                := 976643648
stat.statfs.filled               := 80.410208
stat.statfs.freebytes            := 1958723383296
stat.statfs.fused                := 190646
stat.statfs.namelen              := 255
stat.statfs.type                 := 1481003842
stat.statfs.usedbytes            := 8039970967552
stat.usedfiles                   := 166618
stat.wopen                       := 0
stat.wopen.hotfiles              :=  
uuid                             := 83af0e28-032f-40b9-bd47-511d24c384cf
-bash-4.2#

And sorry, no, this fsid is not a new file system. The .eosfsid file is dated Aug 9, 2018.
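For anyone wanting to repeat that check, something along these lines should work on the FST. It is a sketch assuming GNU stat and that the .eosfsid marker sits at the top of the file system mount point (here /data31) and holds the numeric fsid. The function name fsid_marker_info is hypothetical.

```shell
# Print the modification time and content of the .eosfsid marker under a
# given mount point. Returns non-zero if the marker file does not exist.
fsid_marker_info() {
  local marker="$1/.eosfsid"
  [ -f "$marker" ] || { echo "no .eosfsid under $1" >&2; return 1; }
  # %y is the human-readable modification time (GNU stat).
  printf 'mtime: %s\n' "$(stat -c '%y' "$marker")"
  printf 'content: %s\n' "$(cat "$marker")"
}

# Possible usage on alicefst01:
#   fsid_marker_info /data31
```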