A strange behaviour when trying to setup a user home directory upon fusex

Dear Expert,

I have found a very strange behaviour when trying to log on as a user whose home directory is on EOS fusex. The password authentication is through internal FreeIPA (Kerberos V5) and a user account has a valid Kerberos keyring as follows:

-bash-4.2$ klist -A
Ticket cache: KEYRING:persistent:556951043:krb_ccache_duXEK6R
Default principal: eos_admin1@SDFARM.KR

Valid starting       Expires              Service principal
09/22/2020 16:48:41  09/23/2020 16:48:41  krbtgt/SDFARM.KR@SDFARM.KR

However, when it tried to log on to a client, it does not get its home directory with a permission problem like,

❯ ssh eos_admin1@fcc
eos_admin1@fcc's password:
Last login: Tue Sep 22 16:44:54 2020 from 
Could not chdir to home directory /eos/gsdc/user/e/eos_admin1: Permission denied
-bash: /eos/gsdc/user/e/eos_admin1/.bash_profile: Operation not permitted

Strangely, once it drops to / because it cannot change to its home directory, then it can get the home when entering bash in interactive mode:

-bash-4.2$ pwd
/
-bash-4.2$ bash -i
[eos_admin1@fcc /]$ cd
[eos_admin1@fcc ~]$ pwd
/eos/gsdc/user/e/eos_admin1

Afterwards, it can get to its home outside bash:

[eos_admin1@fcc ~]$ exit
exit
-bash-4.2$ cd
-bash-4.2$ pwd
/eos/gsdc/user/e/eos_admin1

Furthermore, when it failed to change to its home directory initially after establishing ssh session, one can list the contents of the home, create(touch), copy, and remove a file (while still cd does not work…)

❯ ssh eos_admin1@fcc.sdfarm.kr -p4280
eos_admin1@fcc.sdfarm.kr's password:
Last login: Tue Sep 22 16:48:42 2020 from 150.183.234.130
Could not chdir to home directory /eos/gsdc/user/e/eos_admin1: Permission denied
-bash: /eos/gsdc/user/e/eos_admin1/.bash_profile: Operation not permitted
-bash-4.2$ ls -al /eos/gsdc/user/e/eos_admin1/
total 16
drwx------. 1 eos_admin1 eos_admins 4096 Sep 23 09:04 .
drwxr-xr-x. 2 root       root       4096 Sep 21 16:09 ..
drwxr-x---. 2 eos_admin1 eos_admins 4096 Sep 22 14:16 testdir
-rwxr-xr-x. 1 eos_admin1 eos_admins    0 Sep 22 13:27 testfile

-bash-4.2$ touch /eos/gsdc/user/e/eos_admin1/testfile2
-bash-4.2$ ls -al /eos/gsdc/user/e/eos_admin1/
total 16
drwx------. 2 eos_admin1 eos_admins 4096 Sep 23 09:09 .
drwxr-xr-x. 2 root       root       4096 Sep 21 16:09 ..
drwxr-x---. 2 eos_admin1 eos_admins 4096 Sep 22 14:16 testdir
-rwxr-xr-x. 1 eos_admin1 eos_admins    0 Sep 22 13:27 testfile
-rw-r-----. 1 eos_admin1 eos_admins    0 Sep 23 09:09 testfile2

-bash-4.2$ rm /eos/gsdc/user/e/eos_admin1/testfile2
-bash-4.2$ ls -l /eos/gsdc/user/e/eos_admin1/
total 0
drwxr-x---. 2 eos_admin1 eos_admins 4096 Sep 22 14:16 testdir
-rwxr-xr-x. 1 eos_admin1 eos_admins    0 Sep 22 13:27 testfile

-bash-4.2$ rmdir /eos/gsdc/user/e/eos_admin1/testdir
-bash-4.2$ ls -l /eos/gsdc/user/e/eos_admin1/
total 0
-rwxr-xr-x. 1 eos_admin1 eos_admins 0 Sep 22 13:27 testfile

The followings are krb5 configuration in MGM, auth config in fuse.log, mount command output and a snippet of fuse.log on client side

MGM sec.protocol & sec.protbind configuration (/etc/xrd.cf.mgm):

sh-4.2# cat /etc/xrd.cf.mgm | grep -E 'protocol|protbind' | grep -v '#'
sec.protocol unix
sec.protocol sss -c /etc/eos.keytab -s /etc/eos.keytab
sec.protocol krb5 host/jbod-mgmt-01.sdfarm.kr@SDFARM.KR
sec.protbind localhost.localdomain sss unix
sec.protbind localhost sss unix
sec.protbind * only krb5 sss unix

auth part of /etc/eos/fuse.conf in client:

  "auth" : {
    "krb5" : 1,
    "gsi-first" : 0,
    "sss" : 0,
    "ssskeytab" : "/etc/eos.keytab",
    "shared-mount" : 1,
    "environ-deadlock-timeout" : 100,
    "forknoexec-heuristic" : 1
  },

Fusex mount on client:

[root@fcc ~]# mount -t fuse eosxd /eos
# fsname=''
# -o allow_other enabled on shared mount
# -o big_writes enabled
# JSON parsing successful
# File descriptor limit: 524288 soft, 524288 hard
# Disabling nagle algorithm (XRD_NODELAY=1)
# Setting MALLOC_CONF=dirty_decay_ms:0

Fusex client log: (/var/log/eos/fusex/fuse.log)

200922 16:48:37 t=1600760917.754907 f=HandleResponseWithHosts l=ERROR tid=00007faeea7fc700 s=xrdclproxy:468           state=failed async open returned errmsg=[ERROR] Error response: Permission denied
200922 16:48:37 t=1600760917.754974 f=TryRecovery      l=ERROR ino:00000008e0000000 s=data:962                 unrecoverable error - code=400 errNo=3010
200922 16:48:37 t=1600760917.755003 f=flush_nolock     l=ERROR ino:00000008e0000000 s=data:369                 remote open failed - returning 13
200922 16:48:37 t=1600760917.755027 f=flush            l=WARN  tid=00007faef03ff700 s=eosfuse:4520             invalidating ino=0x8e0000000 after flush error
200922 16:48:37 t=1600760917.755067 f=TryRecovery      l=ERROR ino:00000008e0000000 s=data:962                 unrecoverable error - code=400 errNo=3010
200922 16:48:37 t=1600760917.755095 f=flush_nolock     l=ERROR ino:00000008e0000000 s=data:369                 remote open failed - returning 13
200922 16:48:37 t=1600760917.784151 f=TryRecovery      l=ERROR ino:00000008e0000000 s=data:962                 unrecoverable error - code=400 errNo=3010
200922 16:48:37 t=1600760917.784183 f=flush_nolock     l=ERROR ino:00000008e0000000 s=data:369                 remote open failed - returning 13
200922 16:48:37 t=1600760917.916266 f=ioflush          l=WARN  tid=00007faeed7f0700 s=data:2896                OpenAsync failed - trying recovery - ino:       8e0000000 err-code:400
200922 16:48:37 t=1600760917.916293 f=ioflush          l=WARN  tid=00007faeed7f0700 s=data:2922                giving up OpenAsync request - ino:       8e0000000 err-code:400
200922 16:48:37 t=1600760917.916544 f=ioflush          l=CRIT  tid=00007faeed7f0700 s=data:2950                ino:       8e0000000 msg=file open failed file-recovery=/var/eos/fusex/cache/jbod-mgmt-01.sdfarm.kr/08E/8E0000000.dc.recover journal-recovery=/var/eos/fusex/journal/jbod-mgmt-01.sdfarm.kr/08E/8E0000000.jc.recover
#      -------------------
#      - recovery record -
#      -------------------
#        path := '/eos/gsdc/user/e/eos_admin1/.bash_history'
#        fid  := 142
#        -[ 000 ] 200922 16:48:37 t=1600760917.754963 f=flush_nolock     l= ino:00000008e0000000 s=data:364                 status='XrdCl::Proxy::FAILED' hint='will TryRecovery'
#        -[ 001 ] 200922 16:48:37 t=1600760917.754999 f=flush_nolock     l= ino:00000008e0000000 s=data:368                 errno='13' hint='failed TryRecovery'
#        -[ 002 ] 200922 16:48:37 t=1600760917.755061 f=flush_nolock     l= ino:00000008e0000000 s=data:364                 status='XrdCl::Proxy::FAILED' hint='will TryRecovery'
#        -[ 003 ] 200922 16:48:37 t=1600760917.755091 f=flush_nolock     l= ino:00000008e0000000 s=data:368                 errno='13' hint='failed TryRecovery'
#        -[ 004 ] 200922 16:48:37 t=1600760917.784143 f=flush_nolock     l= ino:00000008e0000000 s=data:364                 status='XrdCl::Proxy::FAILED' hint='will TryRecovery'
#        -[ 005 ] 200922 16:48:37 t=1600760917.784167 f=flush_nolock     l= ino:00000008e0000000 s=data:368                 errno='13' hint='failed TryRecovery'

The versions of EOS server and client are 4.8.12 and 4.7.7.

It would be very helpful if you point out anything that I missed or made something wrong.

Thank you.

Best regards,
Sang-Un

@apeters any thoughts on this?