Fusex crashing with segfault

Hello friends,

We have seen some rare crashes of the eosxd process in our CTA deployment. We are running eos-fusex 5.1.11 and eos-server 5.1.11 (XRootD 5.5.7).

Error:

Segmentation fault      eosxd -ofsname=${name} -f

Config:

{
  "name" : "aarnet-cloudstor",
  "hostport" : "ourmgm.local:1094",
  "remotemountdir" : "/eos/aarnet-cloudstor",
  "localmountdir" : "/eos/aarnet-cloudstor",
  "mdcachedir" : "/srv/eos/fusex/aarnet-cloudstor",
  "mdzmqtarget" : "tcp://ourmgm.local:1100",
  "mdzmqidentity" : "crlt-e10",
  "options" : {
    "debug" : 0,
    "debuglevel" : 4,
    "libfusethreads" : 0,
    "md-kernelcache" : 1,
    "md-kernelcache.enoent.timeout" : 5,
    "md-backend.timeout" : 86400,
    "md-backend.put.timeout" : 120,
    "data-kernelcache" : 1,
    "mkdir-is-sync" : 1,
    "create-is-sync" : 1,
    "symlink-is-sync" : 1,
    "rename-is-sync" : 1,
    "rmdir-is-sync" : 0,
    "global-flush" : 0,
    "global-locking" : 1,
    "fd-limit" : 524288,
    "no-fsync" : [ ".db", ".db-journal", ".sqlite", ".sqlite-journal", ".db3", ".db3-journal", "*.o" ],
    "overlay-mode" : 000,
    "rm-rf-protect-levels" : 1,
    "rm-rf-bulk" : 1,
    "show-tree-size" : 1,
    "free-md-asap" : 1,
    "cpu-core-affinity" : 0,
    "no-xattr" : 1,
    "no-link" : 1,
    "nocache-graceperiod" : 5,
    "leasetime" : 300,
    "write-size-flush-interval" : 5
  },
  "auth" : {
    "krb5" : 0,
    "gsi-first" : 0,
    "sss" : 0,
    "ssskeytab" : "/etc/eos.cdn.keytab",
    "shared-mount" : 1,
    "environ-deadlock-timeout" : 100,
    "forknoexec-heuristic" : 1
  },
  "recovery" : {
    "read-open" : 1,
    "read-open-noserver" : 1,
    "read-open-noserver-retrywindow" : 15,
    "write-open" : 0,
    "write-open-noserver" : 0,
    "write-open-noserver-retrywindow" : 15
   },
  "cache" : {
    "type" : "disk",
    "size-mb" : 1000,
    "size-ino" : 65536,
    "journal-mb" : 16134,
    "journal-ino" : 65536,
    "clean-threshold" : 85.0,
    "location" : "/srv/eos/fusex/cache/aarnet-cloudstor/",
    "journal" : "/srv/eos/fusex/journal/aarnet-cloudstor/",
    "read-ahead-strategy" : "dynamic",
    "read-ahead-bytes-nominal" : 1048576,
    "read-ahead-bytes-max" : 8388608,
    "read-ahead-blocks-max" : 8388608,
    "max-read-ahead-buffer" : 1073741824,
    "max-write-buffer" : 1073741824
  }
}

Fusex logs:

230512 09:39:54 t=1683884394.102888 f=WaitPrefetch     l=ERROR ino:800000008b1d519b s=data:970                 pre-read failed error=[FATAL] Unknown error code: software caused connection abort: request timeout
230512 09:39:54 t=1683884394.105715 f=recover_ropen    l=WARN  ino:800000008b1d5198 s=data:1174                recover read-open [1]
230512 09:39:54 t=1683884394.105727 f=recover_ropen    l=WARN  ino:800000008b1d5198 s=data:1207                recover reopening file for read
230512 09:39:54 t=1683884394.105851 f=recover_ropen    l=WARN  ino:800000008b1d5198 s=data:1222                applying exclusion list: tried=crlt-s56.cdn.aarnet.edu.au,
230512 09:39:54 t=1683884394.106236 f=HandleResponseWithHosts l=ERROR tid=00007f05ec3f7700 s=xrdclproxy:559           state=failed async open returned errmsg=[ERROR] Socket timeout
                 ---- high rate error messages suppressed ----
fusermount: failed to unmount /eos/aarnet-cloudstor: Invalid argument
# umounthandler: executing fusermount -u -z /eos/aarnet-cloudstor
# umounthandler: sighandler received signal 11 - emitting signal 11 again
230512 09:39:54 t=1683884394.236312 f=lookupNonLocalJail l=ALERT tid=00007f058e7ff700 s=SecurityChecker:212      Failed to openat file
# umounthandler: executing fusermount -u -z /eos/aarnet-cloudstor
# umounthandler: sighandler received signal 11 - emitting signal 11 again
fusermount: failed to unmount /eos/aarnet-cloudstor: Invalid argument
# umounthandler: executing fusermount -u -z /eos/aarnet-cloudstor
# umounthandler: sighandler received signal 11 - emitting signal 11 again
fusermount: failed to unmount /eos/aarnet-cloudstor: Invalid argument
fusermount: failed to unmount /eos/aarnet-cloudstor: Invalid argument
fusermount: failed to unmount /eos/aarnet-cloudstor: Invalid argument

This has happened a couple of times now, and each time we see "Unknown error code: software caused connection abort: request timeout" in the logs.

Any ideas on how to debug this further?

Thank you :slight_smile:

Denis

Hmm, I will check the code around that, but the easiest would be if you could enable core dumps!

Thanks Andreas! Core dumps are enabled; we'll wait for the next crash and submit one :slight_smile:
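For the record, this is roughly how we enabled them on the client host (a sketch only; the file names, paths and values below are examples, not our exact setup):

```
# /etc/security/limits.d/90-core.conf: allow core files to be written
*  soft  core  unlimited
*  hard  core  unlimited

# /etc/sysctl.d/90-core.conf: write cores somewhere with enough space
kernel.core_pattern = /var/tmp/core.%e.%p
```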

Warm Regards,

Denis

Hi Andreas,

We finally managed to capture some core dumps; please use the following link to download them:

https://filesender.aarnet.edu.au/?s=download&token=288038d8-00ae-406b-a68d-df59b2cd5fa0

Warm Regards,

Denis

Ok,
I will see what we can do with that, since I don't have your executable. For the time being, you could run

gdb eosxd corefile <<< "thread apply all bt"

and post the output here; I guess that might be enough.

Hi Andreas,

Thanks, here is the gdb.txt link:

https://cloudstor.aarnet.edu.au/plus/s/KQgnX69jFBnaED3

Warm Regards,

Denis

Howdy @apeters ,

Just checking in to see whether you've had a chance to look at the core dump. Is there anything else we can supply for debugging?

Warm Regards,

Denis

Sorry, yes, I checked it out … can you see which thread is creating the SEGV? It is probably Thread 1, in jemalloc?

Could it be that you are listing very large directories? It seems to fail in an allocation while listing a directory …

Hi @apeters ,

Thanks for that. Sorry, I'm not entirely sure what you mean. I don't really know how to interpret gdb.txt, and I can't see in that text file which thread caused the segfault. I do see a path that thread 1 was working with; I've looked into it, and it has a few subdirectories, most of which contain only four .png files.

It’s entirely possible we have directories that are too large. How large is too large?

We set the fusex children limit to 500k files per directory in the MGM:

export EOS_MGM_FUSEX_MAX_CHILDREN=500000

However, that doesn't appear to be enforced (last I checked, such a listing still completes just fine). We can chase down users with too many files in a single directory; I just need to know what counts as "too many".
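In the meantime, this is the sort of thing we can run against the mount to spot the widest directories (a rough sketch; widest_dirs is just a throwaway helper name, and the mount path is our own):

```shell
# widest_dirs: print the direct-child count of every directory under $1,
# widest first. Point it at the FUSE mount, e.g. /eos/aarnet-cloudstor.
widest_dirs() {
  find "$1" -type d -exec sh -c '
    printf "%s\t%s\n" "$(ls -1A "$1" | wc -l)" "$1"
  ' _ {} \; | sort -rn | head -n 20
}
```

It walks the whole tree, so on a big namespace we'd probably scope it to one subtree at a time.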

Thanks very much again,

Denis