Problems getting attr default=replica to work

Hello,

I have installed a small EOS demonstrator in order to play with some settings, etc. Currently, I am looking at the "replica" mode. I have 2 filesystems on 2 different FSTs in the same group, and I have set a directory to do replicas. What I observe when I write a file there is that the replication process starts and works, but the manager is not informed that the operation was successful. Here are extracts of the operations and logs:

EOS Console [root://localhost] |/> attr set default=replica /eos/testarea/testmirror/
EOS Console [root://localhost] |/> attr ls /eos/testarea/testmirror/
sys.forced.blocksize="4k"
sys.forced.checksum="adler"
sys.forced.layout="replica"
sys.forced.nstripes="2"
sys.forced.space="default"

setenv XrdSecDEBUG 1 ; xrdcp -v -f file:///dlocal2/Video/JdL_Subatech_2017.mp4 xroot://nanxrd15.in2p3.fr//eos/testarea/testmirror/videolabo.mp4
sec_Client: protocol request for host nanxrd16.in2p3.fr token='&P=unix&P=sss,0.13:/etc/eos.keytab'
sec_PM: Loaded unix protocol object from libXrdSecunix.so
sec_PM: Using unix protocol, args=''
[513.8MB/513.8MB][100%][==================================================][102.8MB/s]
Run: [ERROR] Server responded with an error: [3005] Unable to close failed ; remote I/O error

EOS Console [root://localhost] |/> file check /eos/testarea/testmirror/videolabo.mp4 %nrep%output
path="/eos/testarea/testmirror/videolabo.mp4" fid="(null)" size="0" nrep="0" checksumtype="adler" checksum="0000000000000000000000000000000000000000"
INCONSISTENCY REPLICA path=/eos/testarea/testmirror/videolabo.mp4 fid=(null) size=0 stripes=2 nrep=0 nrepstored=0 nreponline=0 checksumtype=adler checksum=0000000000000000000000000000000000000000

I can see the file on both FSTs:

[root@nanxrd15 ~]# ls -als /data01/00000000/00000039
526104 -rw-r--r--. 1 daemon daemon 538729736 Aug 14 09:15 /data01/00000000/00000039
[root@nanxrd16 ~]# ls -als /data01/00000000/00000039
526104 -rw-r--r--. 1 daemon daemon 538729736 Aug 14 09:15 /data01/00000000/00000039

There are errors in the logs:

gid=99 name=- geo="" msg="query error" status=1 code=400
180814 09:15:38 9714 FstOfs_CallManager: ? Unable to Unable to give access - sys
tem access restricted - unauthorized identity used /eos/testarea/testmirror/vid
eolabo.mp4; communication error on send
180814 09:15:38 time=1534230938.955891 func=close level=CRIT
logid==d53a0c28-9f91-11e8-8e60-003048de1e44 unit=fst@nanxrd16.in2p3.fr:1095 tid
=00007f30b1533700 source=XrdFstOfsFile:1391 tident=barbet.19335:56@n
anpc117 sec= uid=99 gid=99 name=nobody geo="" commit returned an uncatched
error msg=Unable to Unable to give access - system access restricted - unauthori
zed identity used /eos/testarea/testmirror/videolabo.mp4; communication error o
n send [probably timeout] - closing transaction to keep the file save
180814 09:15:38 time=1534230938.972660 func=Close level=ERROR
logid==d53a0c28-9f91-11e8-8e60-003048de1e44 unit=fst@nanxrd16.in2p3.fr:1095 tid
=00007f30b1533700 source=ReplicaParLayout:494 tident=barbet.19335:56@n
anpc117 sec=unix uid=0 gid=0 name=barbet geo="" error=failed to close replica r
oot://nanxrd15.in2p3.fr:1095///eos/testarea/testmirror/videolabo.mp4?&cap.msg=Z/
kidSdAyI/juId7KxH2ZUMODZmbFjsNuigQbcn/wuRxPrycWW1jqHIYWVKH1WHkjTkIfNTF4E9ZMVEuJA
IOWLRhNjiNj5ZXjbMswy3IQFHxExjzrLdwf5vYFUY9GVvh2NDTwfCMiOIlyFBNj5WO9fv1rqf7WRByT1
6SuiG9cWTlmPaOZoWwRZY8zAnrcj7OyeJ0sZtqdey9QOCOVq2qCp1bU8LOQ3AFVAYfyHcBcixvCadnMA
05QddEuylr8NkNGBCw7P3aHTUrTxtbh5PiuAUlgO5smBNqFURc3jm5VCk8a9glXSAAj0+X+sIhmXCV+P
uX4BZY5zuCl9aNgeE0i5iyB4m7fCy2IMQRm8xuH+HJXP4jwM/lMP3lyI0sw9xeBtEjAG+PNYQVNGsPcF
AYqwH9ZO2GWF2GMLoET82qmDM/aCs1caY18HWoTyJmH+O8FcbUiiT0fJZyMoudGx/h8bbbp0p9T5uPUa
JpvirdF5Bt/uxI+dciIjyJnzkQsSx25HDM6Bz9mzaexFO0JGt7ABUOCuYqe7guHHsH4XfqPAySleOXn8
/KPg==&cap.sym=iV1RwnSEr1+FpZdsw5F5WM8r538=&mgm.id=00000039&mgm.logid=d53a0c28-9
f91-11e8-8e60-003048de1e44&mgm.replicahead=0&mgm.replicaindex=1&oss.asize=538729
736&mgm.path=/eos/testarea/testmirror/videolabo.mp4
180814 09:15:38 9714 FstOfs_ReplicaParClose: barbet.19335:56@nanpc117 Unable to
close failed ; remote I/O error
180814 09:15:38 time=1534230938.972795 func=close level=INFO
logid==d53a0c28-9f91-11e8-8e60-003048de1e44 unit=fst@nanxrd16.in2p3.fr:1095 tid
=00007f30b1533700 source=XrdFstOfsFile:1691 tident=barbet.19335:56@n
anpc117 sec= uid=99 gid=99 name=nobody geo="" info="repair on close" path=/
eos/testarea/testmirror/videolabo.mp4
180814 09:15:38 time=1534230938.973478 func=CallManager level=ERROR
logid=static… unit=fst@nanxrd16.in2p3.fr:1095 tid=
00007f30b1533700 source=XrdFstOfs:911 tident= sec=(null) uid=99
gid=99 name=- geo="" msg="query error" status=1 code=400
180814 09:15:38 9714 FstOfs_CallManager: ? Unable to Unable to give access - sys
tem access restricted - unauthorized identity used /eos/testarea/testmirror/vid
eolabo.mp4; communication error on send
180814 09:15:38 time=1534230938.973541 func=close level=ERROR
logid==d53a0c28-9f91-11e8-8e60-003048de1e44 unit=fst@nanxrd16.in2p3.fr:1095 tid
=00007f30b1533700 source=XrdFstOfsFile:1696 tident=barbet.19335:56@n
anpc117 sec= uid=99 gid=99 name=nobody geo="" failed to execute 'adjustrepl
ica' for path=/eos/testarea/testmirror/videolabo.mp4
180814 09:15:38 9714 FstOfs_close: ? Unable to create all replicas - uploaded fi
le is at risk - only one replica has been successfully stored for fn= /eos/testa
rea/testmirror/videolabo.mp4; input/output error
180814 09:15:38 time=1534230938.973576 func=close level=WARN
logid==d53a0c28-9f91-11e8-8e60-003048de1e44 unit=fst@nanxrd16.in2p3.fr:1095 tid
=00007f30b1533700 source=XrdFstOfsFile:1709 tident=barbet.19335:56@n
anpc117 sec= uid=99 gid=99 name=nobody geo="" executed 'adjustreplica' for
path=/eos/testarea/testmirror/videolabo.mp4 - file is at low risk due to missing
replica's

=> It seems there are communication problems between the members of the cluster:
nanxrd15 is manager+FST
nanxrd16 is FST

I also tried to force replication with:

EOS Console [root://localhost] |/> file adjustreplica /eos/testarea/testmirror/videolabo2.mp4
success: scheduled replication from source fs=1 => target fs=2

nanxrd16: ==> /var/log/eos/fst/eoscp.log <==

error: [SUCCESS]
[eoscp] #################################################################
[eoscp] # Date : ( 1534231634 ) Tue Aug 14 09:27:14 2018
[eoscp} # auth forced=sss krb5= gsi=
[eoscp] # Source Name [00] : root://nanxrd15.in2p3.fr:1095//replicate:0000003a
[eoscp] # Destination Name [00] : root://nanxrd16.in2p3.fr:1095//replicate:0000003a
[eoscp] # Data Copied [bytes] : 538729736
[eoscp] # Realtime [s] : 5.539000
[eoscp] # Eff.Copy. Rate[MB/s] : 97.261188
[eoscp] # Bandwidth[MB/s] : 100
[eoscp] # Write Start Position : 0
[eoscp] # Write Stop Position : 538729736

But:

EOS Console [root://localhost] |/> file check /eos/testarea/testmirror/videolabo2.mp4 %nrep%output
path="/eos/testarea/testmirror/videolabo2.mp4" fid="0000003a" size="538729736" nrep="1" checksumtype="adler" checksum="033a17cf00000000000000000000000000000000"
nrep="00" fsid="1" host="nanxrd15.in2p3.fr:1095" fstpath="/data01/00000000/0000003a" size="538729736" statsize="538729736" checksum="033a17cf00000000000000000000000000000000"
INCONSISTENCY REPLICA path=/eos/testarea/testmirror/videolabo2.mp4 fid=0000003a size=538729736 stripes=2 nrep=1 nrepstored=1 nreponline=1 checksumtype=adler checksum=033a17cf00000000000000000000000000000000

Any ideas?

Thanks JM

Hi Jean-Michel, the problem is that the FST running on 16 cannot talk to the MGM on 15 via sss authentication. Can you dump the sec.protbind entries in /etc/xrd.cf.mgm and try to connect from 16 via 'eos root://naxrd15.in2p3.fr whoami'?

Cheers Andreas.

Hi Andreas, thanks,

[root@nanxrd16 ~]# eos root://naxrd15.in2p3.fr whoami
error: MGM root://naxrd15.in2p3.fr not online/reachable

[root@nanxrd15 ~]# grep sec.protbind /etc/xrd.cf.mgm
sec.protbind localhost.localdomain unix sss
sec.protbind localhost unix sss
#sec.protbind * only krb5 gsi sss unix
#sec.protbind * only sss unix
sec.protbind * host

=> I have played with the security model, and I guess I should put sss back before unix, right?

JM

You need:

sec.protbind * only sss unix
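
To see which wildcard binding is actually in effect, a small shell check along these lines may help. It is only a sketch: the function names `active_wildcard_protbind` and `wildcard_offers_sss` are made up for illustration, and it assumes the bindings live in a single file such as the /etc/xrd.cf.mgm shown above.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not part of EOS or XRootD.

# Print the active (uncommented) sec.protbind lines whose host pattern is "*".
active_wildcard_protbind() {
    # Drop comment lines first, then keep only the wildcard bindings.
    grep -v '^[[:space:]]*#' "$1" |
        grep -E '^[[:space:]]*sec\.protbind[[:space:]]+\*[[:space:]]'
}

# Succeed only if an active wildcard binding lists sss as a protocol.
wildcard_offers_sss() {
    active_wildcard_protbind "$1" |
        grep -Eq '(^|[[:space:]])sss([[:space:]]|$)'
}
```

On the configuration posted above, `active_wildcard_protbind /etc/xrd.cf.mgm` would print only `sec.protbind * host`, because both `only ... sss` variants are commented out, and `wildcard_offers_sss /etc/xrd.cf.mgm` would fail.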

Thanks Andreas, it works now. The "not reachable" message was caused by a typo in the hostname; it should have been:
[root@nanxrd16 ~]# eos root://nanxrd15.in2p3.fr whoami
Virtual Identity: uid=99 (99) gid=99 (99) [authz:host] host=nanxrd16.in2p3.fr

and after applying the configuration you gave:
[root@nanxrd16 ~]# eos root://nanxrd15.in2p3.fr whoami
Virtual Identity: uid=2 (2,99) gid=2 (2,99) [authz:sss] host=nanxrd16.in2p3.fr

Then the replication works.

JM
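
The two whoami outputs in this thread differ in the `[authz:...]` field: `host` before the fix, `sss` after it. A minimal sketch for pulling that field out of a whoami line, so the check can be scripted from an FST node, could look like this. The helper name `authz_method` is hypothetical, and the output format is assumed to match the lines quoted above.

```shell
#!/bin/sh
# Sketch only: hypothetical helper, assumes the whoami format shown in this thread.

# Print the authentication method found in a line such as
# "Virtual Identity: uid=2 (2,99) gid=2 (2,99) [authz:sss] host=...".
# Prints nothing if the line has no [authz:...] field.
authz_method() {
    printf '%s\n' "$1" | sed -n 's/.*\[authz:\([^]]*\)].*/\1/p'
}
```

For example, `authz_method "$(eos root://nanxrd15.in2p3.fr whoami)"` run on nanxrd16 should print `sss` once the FST can authenticate to the MGM as intended.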