Many files with zero replicas in MGM

We’ve had some complaints from users running crab jobs about files not being found. I don’t have full details yet, but I’ve been doing some poking around ahead of getting the details.

I’ve noticed there are a very large number of files with no replicas listed in the MGM metadata. The files exist on disk and the FST LevelDB has the correct information, but the MGM does not. We recently drained 6 filesystems across two nodes and added 8 new filesystems across four nodes which caused a lot of rebalancing.

Here is an example typical of what I am seeing:

# eos file info fxid:256e99c3
  File: '/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt'  Flags: 0600  Clock: 161ff225586e21bb
  Size: 1584
Modify: Wed Mar 18 10:58:28 2020 Timestamp: 1584547108.208782965
Change: Wed Mar 18 10:58:28 2020 Timestamp: 1584547108.208782965
 Birth: Wed Jul  1 13:49:53 2020 Timestamp: 1593629393.518846921
  CUid: 47644 CGid: 5063 Fxid: 256e99c3 Fid: 628005315 Pid: 12864852 Pxid: 00c44d54
XStype: adler    XS: af 77 15 90    ETAGs: "168578893102448640:af771590"
Layout: replica Stripes: 2 Blocksize: 4k LayoutId: 00100112 Redundancy: d0::t0 
  #Rep: 0
*******

I have a script that finds files by fxid on all FST nodes:

# ./getfilefsts.sh 256e99c3
cmseos69.fnal.gov[206]: -rw-r----- 1 daemon daemon 1584 Jul 2 23:46 /storage/data1/0000f550/256e99c3
cmseos71.fnal.gov[210]: -rw------- 1 daemon daemon 1584 Jul 1 13:50 /storage/data1/0000f550/256e99c3

(I also wondered why some are 600 and some are 640)

The FST LevelDB has the info:

# eos-leveldb-inspect --dbpath /var/eos/md/fmd.0210.LevelDB --fid 628005315
fxid=256e99c3 id=628005315 cid=12864852 fsid=210 ctime=1584547108 ctime_ns=208782965 mtime=1584547108 mtime_ns=208782965 atime=1593629414 atime_ns=335646000 size=1584 disksize=1584 mgmsize=1584 lid=0x100112 uid=47644 gid=5063 filecxerror=0x0 blockcxerror=0x0 layouterror=0x0 checksum=af771590 diskchecksum=af771590 mgmchecksum=af771590 locations=210,206, 

Any idea what could be happening here? Running a eos-ns-inspect, I get 539436 files with no replicas. This isn’t an exact number as some would probably be recently deleted, but pretty alarming that there would be close to that many.


Dan Szkola
FNAL

Oh forgot:

EOS_SERVER_VERSION=4.7.16 EOS_SERVER_RELEASE=1
EOS_CLIENT_VERSION=4.7.16 EOS_CLIENT_RELEASE=1

Hi Dan,

Could you collect some logs from both the MGM and the FSTs about this particular file? It’s a bit surprising that the modification/creation dates are before the birth date. Search for the logs from 1st of July around 13:50.

Do you know what type of access was this? Normal xrootd? Fusex?

Thanks,
Elvin

The different mode bits come from internal operations like draining or balancing so this is not something particularly worrying.

EOS Report entries:

/var/eos/report/2020/07/20200701.eosreport.gz:log=b2f25618-bbcb-11ea-838b-ac1f6b05cb96&path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt&fstpath=/storage/data3/0000f550/256e99c3&ruid=47644&rgid=5063&td=daemon.29884:147@cmseos71&host=cmseos11.fnal.gov&lid=1048850&fid=628005315&fsid=46&ots=1593629414&otms=324&cts=1593629414&ctms=339&nrc=0&nwc=1&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&rt=0.00&rvt=0.00&wt=0.06&osize=0&csize=1584&delete_on_close=0&sec.prot=unix&sec.name=root&sec.host=cmslpc113.fnal.gov&sec.vorg=&sec.grps=root&sec.role=&sec.info=&sec.app=fuse

/var/eos/report/2020/07/20200701.eosreport.gz:log=b2f25618-bbcb-11ea-838b-ac1f6b05cb96&path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt&fstpath=/storage/data1/0000f550/256e99c3&ruid=47644&rgid=5063&td=*LocE8cA.1372:392@cmslpc113&host=cmseos71.fnal.gov&lid=1048850&fid=628005315&fsid=210&ots=1593629414&otms=321&cts=1593629414&ctms=339&nrc=0&nwc=1&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=1584&wb_min=1584&wb_max=1584&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&rt=0.00&rvt=0.00&wt=0.04&osize=0&csize=1584&delete_on_close=0&sec.prot=unix&sec.name=root&sec.host=cmslpc113.fnal.gov&sec.vorg=&sec.grps=root&sec.role=&sec.info=&sec.app=fuse

/var/eos/report/2020/07/20200702.eosreport.gz:log=unknown&path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt&fstpath=/storage/data3/0000f550/256e99c3&ruid=1&rgid=1&td=daemon.28645:981@cmseos69&host=cmseos11.fnal.gov&lid=1048834&fid=628005315&fsid=46&ots=1593751618&otms=315&cts=1593751618&ctms=446&nrc=3&nwc=0&rb=1584&rb_min=1584&rb_max=1584&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&rt=28.51&rvt=0.00&wt=0.00&osize=1584&csize=1584&delete_on_close=0&sec.prot=sss&sec.name=eos&sec.host=eos&sec.vorg=-&sec.grps=-&sec.role=-&sec.info=-&sec.app=eos/balancing

/var/eos/report/2020/07/20200702.eosreport.gz:log=unknown&path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt&fstpath=/storage/data1/0000f550/256e99c3&ruid=1&rgid=1&td=daemon.28645:165@cmseos69&host=cmseos69.fnal.gov&lid=1048834&fid=628005315&fsid=206&ots=1593751618&otms=367&cts=1593751618&ctms=449&nrc=0&nwc=1&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=1584&wb_min=1584&wb_max=1584&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&rt=0.00&rvt=0.00&wt=0.04&osize=0&csize=1584&delete_on_close=0&sec.prot=sss&sec.name=eos&sec.host=eos&sec.vorg=-&sec.grps=-&sec.role=-&sec.info=-&sec.app=eos/balancing

/var/eos/report/2020/07/20200702.eosreport.gz:log=67d7e726-ab1b-11ea-b60c-000e1e52b0a2&host=cmseos11.fnal.gov&fid=628005315&fxid=256e99c3&fsid=46&del_ts=1593751948&del_tns=291575593&dc_ts=1593629414&dc_tns=337444040&dm_ts=1593629414&dm_tns=332444250&da_ts=1593751618&da_tns=444395747&dsize=1584&sec.app=deletion

cmseos69 (FST):
eoscp.log-20200703.gz:[eoscp] # Source Name [00] : root://cmseos11.fnal.gov:1095//replicate:256e99c3
eoscp.log-20200703.gz:[eoscp] # Destination Name [00] : root://cmseos69.fnal.gov:1095//replicate:256e99c3
xrdlog.fst-20200703.gz:200702 23:46:58 time=1593751618.238903 func=ThreadLoop level=INFO logid=302c4ac2-b4c8-11ea-8f59-a0369f4c48b0 unit=fst@cmseos69.fnal.gov:1095 tid=00007f78797ff700 source=TransferMultiplexer:139 tident=<service> sec= uid=0 gid=0 name= geo="" msg="new transfer source.url=root://cmseos11.fnal.gov:1095//replicate:256e99c3 target.url=root://cmseos69.fnal.gov:1095//replicate:256e99c3 "
xrdlog.fst-20200703.gz:200702 23:46:58 time=1593751618.367952 func=open level=INFO logid=unknown unit=fst@cmseos69.fnal.gov:1095 tid=00007f78131d8700 source=XrdFstOfsFile:119 tident=daemon.28645:165@cmseos69 sec=sss uid=0 gid=0 name=daemon geo="" path=/replicate:256e99c3 info=cap.msg=<...>&cap.sym=<...>&fst.valid=1593751677 open_mode=200
xrdlog.fst-20200703.gz:200702 23:46:58 time=1593751618.368318 func=ProcessCapOpaque level=INFO logid=unknown unit=fst@cmseos69.fnal.gov:1095 tid=00007f78131d8700 source=XrdFstOfsFile:2232 tident=daemon.28645:165@cmseos69 sec=(null) uid=99 gid=99 name=(null) geo="" capability=&mgm.access=write&mgm.lid=1048834&mgm.cid=12864852&mgm.ruid=1&mgm.rgid=1&mgm.uid=1&mgm.gid=1&mgm.path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt&mgm.manager=cmseosmgm01.fnal.gov:1094&mgm.fid=256e99c3&mgm.sec=sss|eos|eos|-|-|-|-|eos/balancing&mgm.drainfsid=46&mgm.localprefix=/storage/data1&mgm.fsid=206&mgm.targethostport=cmseos69.fnal.gov:1095&mgm.bookingsize=1584&mgm.source.lid=1048850&mgm.source.ruid=47644&mgm.source.rgid=5063&cap.valid=1593755217
xrdlog.fst-20200703.gz:200702 23:46:58 time=1593751618.368403 func=open level=INFO logid=unknown unit=fst@cmseos69.fnal.gov:1095 tid=00007f78131d8700 source=XrdFstOfsFile:197 tident=daemon.28645:165@cmseos69 sec=(null) uid=1 gid=1 name=nobody geo="" ns_path=/replicate:256e99c3 fst_path=/storage/data1/0000f550/256e99c3
xrdlog.fst-20200703.gz:200702 23:46:58 time=1593751618.399068 func=open level=INFO logid=unknown unit=fst@cmseos69.fnal.gov:1095 tid=00007f78131d8700 source=XrdFstOfsFile:462 tident=daemon.28645:165@cmseos69 sec=(null) uid=1 gid=1 name=nobody geo="" fst_path=/storage/data1/0000f550/256e99c3 open-mode=300 create-mode=41a0 layout-name=plain oss-opaque=&mgm.lid=1048834&mgm.bookingsize=1584
xrdlog.fst-20200703.gz:200702 23:46:58 time=1593751618.399120 func=fileOpen level=INFO logid=3a3c474c-bce8-11ea-bdff-a0369f4c48b0 unit=fst@cmseos69.fnal.gov:1095 tid=00007f78131d8700 source=LocalIo:70 tident=<service> sec= uid=0 gid=0 name= geo="" flags=300, path=/storage/data1/0000f550/256e99c3
xrdlog.fst-20200703.gz:200702 23:46:58 time=1593751618.430219 func=stat level=INFO logid=unknown unit=fst@cmseos69.fnal.gov:1095 tid=00007f78131d8700 source=XrdFstOfsFile:922 tident=daemon.28645:165@cmseos69 sec= uid=1 gid=1 name=nobody geo="" path=/replicate:256e99c3 fxid=256e99c3 size=0 mtime=1593751618.25770222
xrdlog.fst-20200703.gz:200702 23:46:58 time=1593751618.928013 func=Report level=INFO logid=static.............................. unit=fst@cmseos69.fnal.gov:1095 tid=00007f78852fd700 source=Report:51 tident= sec=(null) uid=99 gid=99 name=- geo="" log=unknown&path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt&fstpath=/storage/data1/0000f550/256e99c3&ruid=1&rgid=1&td=daemon.28645:165@cmseos69&host=cmseos69.fnal.gov&lid=1048834&fid=628005315&fsid=206&ots=1593751618&otms=367&cts=1593751618&ctms=449&nrc=0&nwc=1&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=1584&wb_min=1584&wb_max=1584&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&rt=0.00&rvt=0.00&wt=0.04&osize=0&csize=1584&delete_on_close=0&sec.prot=sss&sec.name=eos&sec.host=eos&sec.vorg=-&sec.grps=-&sec.role=-&sec.info=-&sec.app=eos/balancing
xrdlog.fst-20200703.gz:200702 23:47:58 time=1593751678.472794 func=MgmSyncer level=INFO logid=static.............................. unit=fst@cmseos69.fnal.gov:1095 tid=00007f78822fd700 source=MgmSyncer:96 tident= sec=(null) uid=99 gid=99 name=- geo="" fxid=256e99c3 mtime=1593751618
xrdlog.fst-20200707.gz:200706 13:29:27 time=1594060167.741095 func=MgmSyncer level=INFO logid=static.............................. unit=fst@cmseos69.fnal.gov:1095 tid=00007f78822fd700 source=MgmSyncer:96 tident= sec=(null) uid=99 gid=99 name=- geo="" fxid=256e99c3 mtime=1584547108
xrdlog.fst-20200707.gz:200706 14:28:14 time=1594063694.680843 func=MgmSyncer level=INFO logid=static.............................. unit=fst@cmseos69.fnal.gov:1095 tid=00007f78822fd700 source=MgmSyncer:96 tident= sec=(null) uid=99 gid=99 name=- geo="" fxid=256e99c3 mtime=1584547108

cmseos71 (FST):
xrdlog.fst-20200702.gz:200701 13:50:14 time=1593629414.321904 func=open level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=fst@cmseos71.fnal.gov:1095 tid=00007fa549afe700 source=XrdFstOfsFile:119 tident=*LocE8cA.1372:392@cmslpc113 sec=unix uid=0 gid=0 name=root geo="" path=/#curl#/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt info=cap.msg=<...>&cap.sym=<...>&eos.app=fuse&eos.bookingsize=0&eos.encodepath=1&eos.lfn=fxid:256e99c3&fst.blocksize=262144&fst.readahead=true&fst.valid=1593629473&mgm.id=256e99c3&mgm.logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96&mgm.mtime=0&mgm.replicahead=0&mgm.replicaindex=0 open_mode=2
xrdlog.fst-20200702.gz:200701 13:50:14 time=1593629414.322285 func=ProcessCapOpaque level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=fst@cmseos71.fnal.gov:1095 tid=00007fa549afe700 source=XrdFstOfsFile:2232 tident=*LocE8cA.1372:392@cmslpc113 sec=(null) uid=99 gid=99 name=(null) geo="" capability=&mgm.access=update&mgm.ruid=47644&mgm.rgid=5063&mgm.uid=99&mgm.gid=99&mgm.path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt&mgm.manager=cmseosmgm01.fnal.gov:1094&mgm.fid=256e99c3&mgm.cid=12864852&mgm.sec=unix|root|cmslpc113.fnal.gov||root|||fuse&mgm.lid=1048850&mgm.bookingsize=0&mgm.fsid=210&mgm.url0=root://cmseos71.fnal.gov:1095//&mgm.fsid0=210&mgm.url1=root://cmseos11.fnal.gov:1095//&mgm.fsid1=46&cap.valid=1593633014
xrdlog.fst-20200702.gz:200701 13:50:14 time=1593629414.322374 func=open level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=fst@cmseos71.fnal.gov:1095 tid=00007fa549afe700 source=XrdFstOfsFile:197 tident=*LocE8cA.1372:392@cmslpc113 sec=(null) uid=47644 gid=5063 name=nobody geo="" ns_path=/#curl#/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt fst_path=/storage/data1/0000f550/256e99c3
xrdlog.fst-20200702.gz:200701 13:50:14 time=1593629414.322784 func=open level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=fst@cmseos71.fnal.gov:1095 tid=00007fa549afe700 source=XrdFstOfsFile:462 tident=*LocE8cA.1372:392@cmslpc113 sec=(null) uid=47644 gid=5063 name=nobody geo="" fst_path=/storage/data1/0000f550/256e99c3 open-mode=102 create-mode=4180 layout-name=replica oss-opaque=&mgm.lid=1048850&mgm.bookingsize=0
xrdlog.fst-20200702.gz:200701 13:50:14 time=1593629414.322908 func=fileOpen level=INFO logid=b2f2bc3e-bbcb-11ea-993d-a0369f4c4926 unit=fst@cmseos71.fnal.gov:1095 tid=00007fa549afe700 source=LocalIo:70 tident=<service> sec= uid=0 gid=0 name= geo="" flags=102, path=/storage/data1/0000f550/256e99c3
xrdlog.fst-20200702.gz:200701 13:50:14 time=1593629414.326799 func=stat level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=fst@cmseos71.fnal.gov:1095 tid=00007fa549afe700 source=XrdFstOfsFile:922 tident=*LocE8cA.1372:392@cmslpc113 sec= uid=47644 gid=5063 name=nobody geo="" path=/#curl#/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt fxid=256e99c3 size=0 mtime=1593629414.53364308
xrdlog.fst-20200702.gz:200701 13:50:14 time=1593629414.331996 func=stat level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=fst@cmseos71.fnal.gov:1095 tid=00007fa549afe700 source=XrdFstOfsFile:922 tident=*LocE8cA.1372:392@cmslpc113 sec= uid=47644 gid=5063 name=nobody geo="" path=/#curl#/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt fxid=256e99c3 size=0 mtime=1593629414.53364308
xrdlog.fst-20200702.gz:200701 13:50:14 time=1593629414.911571 func=Report level=INFO logid=static.............................. unit=fst@cmseos71.fnal.gov:1095 tid=00007fa58aafd700 source=Report:51 tident= sec=(null) uid=99 gid=99 name=- geo="" log=b2f25618-bbcb-11ea-838b-ac1f6b05cb96&path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt&fstpath=/storage/data1/0000f550/256e99c3&ruid=47644&rgid=5063&td=*LocE8cA.1372:392@cmslpc113&host=cmseos71.fnal.gov&lid=1048850&fid=628005315&fsid=210&ots=1593629414&otms=321&cts=1593629414&ctms=339&nrc=0&nwc=1&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=1584&wb_min=1584&wb_max=1584&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&rt=0.00&rvt=0.00&wt=0.04&osize=0&csize=1584&delete_on_close=0&sec.prot=unix&sec.name=root&sec.host=cmslpc113.fnal.gov&sec.vorg=&sec.grps=root&sec.role=&sec.info=&sec.app=fuse
xrdlog.fst-20200702.gz:200701 13:51:14 time=1593629474.060635 func=MgmSyncer level=INFO logid=static.............................. unit=fst@cmseos71.fnal.gov:1095 tid=00007fa587afd700 source=MgmSyncer:96 tident= sec=(null) uid=99 gid=99 name=- geo="" fxid=256e99c3 mtime=1593629414
xrdlog.fst-20200707.gz:200706 13:29:26 time=1594060166.293200 func=MgmSyncer level=INFO logid=static.............................. unit=fst@cmseos71.fnal.gov:1095 tid=00007fa587afd700 source=MgmSyncer:96 tident= sec=(null) uid=99 gid=99 name=- geo="" fxid=256e99c3 mtime=1584547108
xrdlog.fst-20200707.gz:200706 14:28:18 time=1594063698.376505 func=MgmSyncer level=INFO logid=static.............................. unit=fst@cmseos71.fnal.gov:1095 tid=00007fa587afd700 source=MgmSyncer:96 tident= sec=(null) uid=99 gid=99 name=- geo="" fxid=256e99c3 mtime=1584547108

cmseosmgm01:
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:49:53 time=1593629393.519352 func=BroadcastReleaseFromExternal level=INFO logid=static.............................. unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f622cdfc700 source=Caps:155 tident= sec=(null) uid=99 gid=99 name=- geo="" id=256e99c30000000 mInodeCaps.count=0
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:49:53 time=1593629393.519928 func=open level=INFO logid=a68c2908-bbcb-11ea-838b-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f622cdfc700 source=XrdMgmOfsFile:2786 tident=*LocE8cA.1372:586@cmslpc113 sec=unix uid=47644 gid=5063 name=root geo="" op=write path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt info=eos.app=fuse&mgm.pcmd=redirect&eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&fst.readahead=true&fst.blocksize=262144&eos.client.openflags=rwcr&eos.client.openmode=100600 target[0]=(cmseos71.fnal.gov,210) target[1]=(cmseos11.fnal.gov,46) redirection=cmseos71.fnal.gov?&cap.sym=<...>&cap.msg=<...>&mgm.logid=a68c2908-bbcb-11ea-838b-ac1f6b05cb96&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=256e99c3&mgm.mtime=0 xrd_port=1095 http_port=8001
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:49:53 time=1593629393.519978 func=open level=INFO logid=a68c2908-bbcb-11ea-838b-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f622cdfc700 source=XrdMgmOfsFile:2794 tident=*LocE8cA.1372:586@cmslpc113 sec=unix uid=47644 gid=5063 name=root geo="" info="redirection" hostport=cmseos71.fnal.gov?&cap.sym=<...>&cap.msg=<...>&mgm.logid=a68c2908-bbcb-11ea-838b-ac1f6b05cb96&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=256e99c3&mgm.mtime=0:1095
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:49:53 time=1593629393.523668 func=BroadcastReleaseFromExternal level=INFO logid=static.............................. unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f622cdfc700 source=Caps:155 tident= sec=(null) uid=99 gid=99 name=- geo="" id=256e99c30000000 mInodeCaps.count=0
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:50:14 time=1593629414.320425 func=open level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f622cdfc700 source=XrdMgmOfsFile:459 tident=*LocE8cA.1372:586@cmslpc113 sec=unix uid=47644 gid=5063 name=root geo="" op=write trunc=0 path=fxid:256e99c3 info=eos.app=fuse&eos.bookingsize=0&eos.encodepath=1&eos.lfn=fxid:256e99c3&fst.blocksize=262144&fst.readahead=true&fst.valid=1593629473
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:50:14 time=1593629414.320547 func=open level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f622cdfc700 source=XrdMgmOfsFile:493 tident=*LocE8cA.1372:586@cmslpc113 sec=unix uid=47644 gid=5063 name=root geo="" msg="access by inode" ino=fxid:256e99c3 path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:50:14 time=1593629414.321242 func=open level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f622cdfc700 source=XrdMgmOfsFile:2786 tident=*LocE8cA.1372:586@cmslpc113 sec=unix uid=47644 gid=5063 name=root geo="" op=write path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt info=eos.app=fuse&eos.bookingsize=0&eos.encodepath=1&eos.lfn=fxid:256e99c3&fst.blocksize=262144&fst.readahead=true&fst.valid=1593629473 target[0]=(cmseos71.fnal.gov,210) target[1]=(cmseos11.fnal.gov,46) redirection=cmseos71.fnal.gov?&cap.sym=<...>&cap.msg=<...>&mgm.logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=256e99c3&mgm.mtime=0 xrd_port=1095 http_port=8001
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:50:14 time=1593629414.321275 func=open level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f622cdfc700 source=XrdMgmOfsFile:2794 tident=*LocE8cA.1372:586@cmslpc113 sec=unix uid=47644 gid=5063 name=root geo="" info="redirection" hostport=cmseos71.fnal.gov?&cap.sym=<...>&cap.msg=<...>&mgm.logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96&mgm.replicaindex=0&mgm.replicahead=0&mgm.id=256e99c3&mgm.mtime=0:1095
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:50:14 time=1593629414.336153 func=log_info level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f508eafc700 source=CommitHelper:207 tident=daemon.29884:1051@cmseos71 sec=sss uid=2 gid=2 name=daemon geo="" subcmd=commit path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt size=1584 fxid=256e99c3 fsid=210 dropfsid= checksum=af771590 mtime=0 mtime.nsec=0 oc-chunk=0 oc-n=0 oc-max=0 oc-uuid=
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:50:14 time=1593629414.336532 func=BroadcastRefreshFromExternal level=INFO logid=static.............................. unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f508eafc700 source=Caps:202 tident= sec=(null) uid=99 gid=99 name=- geo="" id=256e99c30000000 pid=c44d54
xrdlog.mgm-2020-07-01-1593630702.gz:200701 13:50:14 time=1593629414.338353 func=log_info level=INFO logid=b2f25618-bbcb-11ea-838b-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f50b62fd700 source=CommitHelper:207 tident=daemon.26000:731@cmseos11 sec=sss uid=2 gid=2 name=daemon geo="" subcmd=commit path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt size=1584 fxid=256e99c3 fsid=46 dropfsid= checksum=af771590 mtime=0 mtime.nsec=0 oc-chunk=0 oc-n=0 oc-max=0 oc-uuid=

xrdlog.mgm-2020-07-03-1593752807.gz:200702 23:46:57 time=1593751617.234846 func=Schedule2Balance level=INFO logid=FstOfsStorage unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f626ffca700 source=Schedule2Balance:436 tident=daemon.5121:877@cmseos69 sec=sss uid=2 gid=2 name=daemon geo="" subcmd=scheduling fxid=256e99c3 src_fsid=46 tgt_fsid=206
xrdlog.mgm-2020-07-03-1593752807.gz:200702 23:46:57 time=1593751617.235204 func=Schedule2Balance level=INFO logid=FstOfsStorage unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f626ffca700 source=Schedule2Balance:492 tident=daemon.5121:877@cmseos69 sec=sss uid=2 gid=2 name=daemon geo="" cmd=schedule2balance fxid=256e99c3 source_fs=46 target_fs=206
xrdlog.mgm-2020-07-03-1593752807.gz:200702 23:46:58 time=1593751618.448348 func=log_info level=INFO logid=unknown unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f6236efd700 source=CommitHelper:207 tident=daemon.5121:877@cmseos69 sec=sss uid=2 gid=2 name=daemon geo="" subcmd=commit path=/eos/uscms/store/user/guerrero/HHFall19code/HH2019/Limits/CMSSW_10_2_13/src/HiggsAnalysis/CombinedLimit/data/lhc-hxswg/sm/xs/13TeV/13TeV-tHq.txt size=1584 fxid=256e99c3 fsid=206 dropfsid=46 checksum=af771590 mtime=1593751618 mtime.nsec=447745000 oc-chunk=0 oc-n=0 oc-max=0 oc-uuid=
xrdlog.mgm-2020-07-03-1593752807.gz:200702 23:51:56 time=1593751916.053040 func=Schedule2Delete level=INFO logid=static.............................. unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f2ad5225700 source=Schedule2Delete:169 tident= sec=(null) uid=99 gid=99 name=- geo="" msg="add to deletion message" fid=256e99c3 fsid=46
xrdlog.mgm-2020-07-03-1593752807.gz:200702 23:52:28 time=1593751948.292580 func=Drop level=INFO logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f521aafb700 source=Drop:55 tident=daemon.26000:731@cmseos11 sec=sss uid=2 gid=2 name=daemon geo="" drop request for &mgm.pcmd=drop&mgm.fsid=46&mgm.fid=256e99c3&mgm.localprefix=/storage/data3

Also, not sure if it’s related, but for a while now I see a lot of files with no disk checksums being stored:

# eos file check /eos/uscms/store/user/nschonbe/run_2020_7_6_11_12/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC_0.root
path="/eos/uscms/store/user/nschonbe/run_2020_7_6_11_12/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC_0.root" fxid="25eaccf8" size="891" nrep="2" checksumtype="adler" checksum="6fc4ef4900000000000000000000000000000000"
nrep="00" fsid="203" host="cmseos67.fnal.gov:1095" fstpath="/storage/data3/0000f87e/25eaccf8" size="891" statsize="891" checksum="6fc4ef4900000000000000000000000000000000" diskchecksum="6fc4ef4900000000000000000000000000000000" error_label="none"
nrep="01" fsid="135" host="cmseos39.fnal.gov:1095" fstpath="/storage/data2/0000f87e/25eaccf8" size="891" statsize="891" checksum="6fc4ef4900000000000000000000000000000000" diskchecksum="0000000000000000000000000000000000000000" error_label="DISK_CHECKSUM"
# eos file check /eos/uscms/store/user/nschonbe/run_2020_7_6_11_12/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC_1.root
path="/eos/uscms/store/user/nschonbe/run_2020_7_6_11_12/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC_1.root" fxid="25eacd1f" size="891" nrep="2" checksumtype="adler" checksum="cd55f2f700000000000000000000000000000000"
nrep="00" fsid="53" host="cmseos15.fnal.gov:1095" fstpath="/storage/data1/0000f87e/25eacd1f" size="891" statsize="891" checksum="cd55f2f700000000000000000000000000000000" diskchecksum="0000000000000000000000000000000000000000" error_label="DISK_CHECKSUM"
nrep="01" fsid="178" host="cmseos56.fnal.gov:1095" fstpath="/storage/data1/0000f87e/25eacd1f" size="891" statsize="891" checksum="cd55f2f700000000000000000000000000000000" diskchecksum="0000000000000000000000000000000000000000" error_label="DISK_CHECKSUM"
# eos file check /eos/uscms/store/user/nschonbe/run_2020_7_6_11_12/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC_2.root
path="/eos/uscms/store/user/nschonbe/run_2020_7_6_11_12/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC/DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeVMLM_TeV_MG_MC_2.root" fxid="25eacd7b" size="891" nrep="2" checksumtype="adler" checksum="d3e0eecb00000000000000000000000000000000"
nrep="00" fsid="202" host="cmseos67.fnal.gov:1095" fstpath="/storage/data2/0000f87e/25eacd7b" size="891" statsize="891" checksum="d3e0eecb00000000000000000000000000000000" diskchecksum="d3e0eecb00000000000000000000000000000000" error_label="none"
nrep="01" fsid="134" host="cmseos39.fnal.gov:1095" fstpath="/storage/data1/0000f87e/25eacd7b" size="891" statsize="891" checksum="d3e0eecb00000000000000000000000000000000" diskchecksum="0000000000000000000000000000000000000000" error_label="DISK_CHECKSUM"

Hi Dan,

The diskchecksum no set means that the ScanDisk process didn’t run yet on that file. This is not something worrying, it should happen at some point.

Concerning the initial question, so the replica with different mode bits and timestamp was coming from a balancing job. Now, everything looks ok from the logs that you posted. The file was initially written to fsids 210 and 46 and then on 03.07.2020 at 23:52 it was balanced from 46 to 206.

Now can you grep only the xrdmgm.log for the next few days for the fxid of the file? We need to find what triggered the drop of the replicas. Since they were clearly dropped only at the MGM.

When running eos-ns-inspect did you exclude the ones which are 0 size?

Cheers,
Elvin

Running the ns-inspect eliminating 0 size files now. Is it expected that 0 size files will have no on-disk replicas?

I see some odd (to me) things like this:

id=353870275 container=0 size=6789 actual-stripes=0 expected-stripes=2 unlinked-stripes=1 locations= unlinked-locations=62

# eos file info fid:353870275
  File: 'fid:353870275'  Flags: 0644  Clock: 1620667bc4733ae4
  Size: 6789
Modify: Wed Jul 12 15:22:17 2017 Timestamp: 1499890937.929775000
Change: Wed Jul 12 15:22:17 2017 Timestamp: 1499890937.885435089
 Birth: Wed Dec 31 18:00:00 1969 Timestamp: 0.000000000
  CUid: 47534 CGid: 5063 Fxid: 1517a1c3 Fid: 353870275 Pid: 0 Pxid: 00000000
XStype: adler    XS: 72 96 ce f0    ETAGs: "94991328634470400:7296cef0"
Layout: replica Stripes: 2 Blocksize: 4k LayoutId: 00100112 Redundancy: d0::t0 
  #Rep: 0
(undeleted) $ 62
*******

# ./getfilefsts.sh 1517a1c3
(nothing returned)

There are a good number of these container=0 files. I’m sure there is a reason for them, I’m always trying to learn more about the inner workings of EOS. It looks like a deleted file where there is still supposedly a replica out on disk (thus the undeleted) but there is no fs 62.

OK, so I re-ran the ns-inspect and there are very few (if any) size=0 files. I still get over 539000 files.

I will attach the log entries for FXID 256e99c3 or send them to you directly as there are too many to put here.

Sent all logs for that fxid to your email, but mostly seems to be just this series of logfiles being repeated.

/var/log/eos/mgm/xrdlog.mgm-2020-07-06-1594062319.gz:200706 13:28:26 time=1594060106.255184 func=CollectFstInfo           level=ERROR logid=6aca1620-bfb6-11ea-90ff-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f4c0bffc700 source=FsckEntry:163                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="failed stat" fxid=256e99c3 fsid=46 local_path=/storage/data3/0000f550/256e99c3
/var/log/eos/mgm/xrdlog.mgm-2020-07-06-1594062319.gz:200706 13:28:26 time=1594060106.255282 func=RepairReplicaInconsistencies level=INFO  logid=6aca1620-bfb6-11ea-90ff-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f4c0bffc700 source=FsckEntry:474                  tident=<service> sec=      uid=0 gid=0 name= geo="" fxid=256e99c3 fsid=210
/var/log/eos/mgm/xrdlog.mgm-2020-07-06-1594062319.gz:200706 13:28:26 time=1594060106.255318 func=RepairReplicaInconsistencies level=INFO  logid=6aca1620-bfb6-11ea-90ff-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f4c0bffc700 source=FsckEntry:474                  tident=<service> sec=      uid=0 gid=0 name= geo="" fxid=256e99c3 fsid=206
/var/log/eos/mgm/xrdlog.mgm-2020-07-06-1594062319.gz:200706 13:28:26 time=1594060106.255342 func=DropReplica              level=INFO  logid=6aca1620-bfb6-11ea-90ff-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f4c0bffc700 source=FsckEntry:675                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="drop (unregistered) replica" fxid=256e99c3 fsid=46
/var/log/eos/mgm/xrdlog.mgm-2020-07-06-1594062319.gz:200706 13:28:26 time=1594060106.257415 func=RepairReplicaInconsistencies level=INFO  logid=6aca1620-bfb6-11ea-90ff-ac1f6b05cb96 unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f4c0bffc700 source=FsckEntry:639                  tident=<service> sec=      uid=0 gid=0 name= geo="" msg="file replicas consistent" fxid=256e99c3
/var/log/eos/mgm/xrdlog.mgm-2020-07-06-1594062319.gz:200706 13:28:28 time=1594060108.190305 func=Drop                     level=INFO  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@cmseosmgm01.fnal.gov:1094 tid=00007f622effe700 source=Drop:55                        tident=daemon.2012:1542@cmseos11 sec=sss   uid=2 gid=2 name=daemon geo="" drop request for &mgm.pcmd=drop&mgm.fsid=46&mgm.fid=256e99c3&mgm.localprefix=/storage/data3

Hi Daniel,

Yes, files created by the eos file touch command have 0 size and no replicas attached. This is normal, but we changed this behavior recently to align it with the rest of the commands like xrdcp or touch through the fuse interface.

The odd files that you see are files deleted but waiting for a replica on the disk to be deleted before the namespace entry is cleaned up. When you see a file name that contains the fxid it means that a remove was issues, the file object is now detached from its parent container (that is why you have the container 0) and it’s waiting for the physical replicas to be deleted.

If you don’t have anymore the file system on which such a replica exist then you will need to force remove them. You can achieve this by using the “rm -F fxid:” command available since 4.7.11.

So just for me to make sure I understand correctly, you have 539k file that have no replicas and size is different from 0?

The log entries you mention here don’t explain who/why the replicas on file systems 210 and 206 were deleted/dropped.

I didn’t receive anything by email, could you maybe resend them?

Cheers,
Elvin

I did some further analysis of these files. As it turns out, the numbers breakdown like this:

539144 - total number of files with non-zero size and zero replicas

24246 - files with container=0, entries look like this:

id=288651086 container=0 size=18498557 actual-stripes=0 expected-stripes=2 unlinked-stripes=1 locations= unlinked-locations=63

# eos file info fid:288651086
  File: 'fid:288651086'  Flags: 0644  Clock: 16223e9995f30edf
  Size: 18498557
Modify: Sun Dec 25 16:10:51 2016 Timestamp: 1482703851.279158000
Change: Sun Dec 25 16:10:49 2016 Timestamp: 1482703849.454184684
 Birth: Wed Dec 31 18:00:00 1969 Timestamp: 0.000000000
  CUid: 11594 CGid: 5063 Fxid: 1134774e Fid: 288651086 Pid: 0 Pxid: 00000000
XStype: adler    XS: cf 0b 4f f9    ETAGs: "77484185895305216:cf0b4ff9"
Layout: replica Stripes: 2 Blocksize: 4k LayoutId: 00100112 Redundancy: d0::t0 
  #Rep: 0
(undeleted) $ 63
*******
error: cannot retrieve file meta data - Container #0 not found (errc=0) (Success)

537965 - number of entries with unlinked-locations=6[234]

This one bothers me. I haven’t (at least recently) had any filesystems 62, 63, or 64. It’s very likely they did exist at one time, but I haven’t recently removed them. They may go back to before I was even at FNAL.

Any metadata I can retrieve shows older (2017 or older) files:

# eos file info fid:291258707
  File: '/eos/uscms/store/user/lpcdm/noreplica/ygule/fPort-Moriond17/fPort_v1/fPort_dirac_v_lu1_mPhi2000_mChi300_Moriond17_MINIAODSIM/170109_140502/0000/log/cmsRun_761.log.tar.gz'  Flags: 0644  Clock: 16223ef67a6ef2e0
  Size: 10514
Modify: Mon Jan  9 22:05:55 2017 Timestamp: 1484021155.425122000
Change: Mon Jan  9 22:05:54 2017 Timestamp: 1484021154.886821148
 Birth: Wed Dec 31 18:00:00 1969 Timestamp: 0.000000000
  CUid: 49665 CGid: 5063 Fxid: 115c4153 Fid: 291258707 Pid: 5318404 Pxid: 00512704
XStype: adler    XS: 43 b8 36 af    ETAGs: "78184163827515392:43b836af"
Layout: replica Stripes: 1 Blocksize: 4k LayoutId: 00100012 Redundancy: d0::t0 
  #Rep: 0
(undeleted) $ 63
*******

I have not been able to find one that still exists on disk anywhere.

So the vast majority of these are files that were deleted but the metadata was never cleaned up. What is the best (safest!) way to clean these up?

Still waiting to hear on this one as well. These are obviously old entries because locations 62, 63, and 64 no longer exist. I’d like to remove these 500,000+ entries. Just use regular commands? Is that safe? Will it work?

Hi Daniel,

Yes, you can use the normal eos rm command. Unfortunately, if you run an older version of EOS (< 4.7.11) then you have to do some extra steps.

For eos >=4.7.11 it is enough to run the following command twice on each such file:
eos rm -rF fxid:<fxid_file>
eos rm -rF fxid:<fxid_file>

The first command will detach the file from the parent container and send a requests for deletion to the file systems. Since some of them might not exist anymore (62, 63 etc.) there will be no reply and so you are left with a file metadata object not attached to a parent waiting for a deletion confirmation that will never come. That is why you need the second command to force the deletion of the file metadata object.

If you run eos < 4.7.11 then you need to drop the offending file systems manually. For each entry you should do:
find the file systems to drop which don’t exist anymore
eos file drop fxid:<fxid_file> <fsid> -f
eos rm <path_of_file>

Cheers,
Elvin

Just to have a record here, here is what happened to the first one I did. We are running 4.7.16, so I chose to do the eos rm command.

id=30108795 container=0 size=34 actual-stripes=0 expected-stripes=2 unlinked-stripes=1 locations= unlinked-locations=64

# eos file info fid:30108795

File: 'fid:30108795'  Flags: 0644  Clock: 16281275362b9b2e
      Size: 34
    Modify: Fri Jan 23 14:13:01 2015 Timestamp: 1422043981.594388000
    Change: Sat Mar  1 23:55:45 2014 Timestamp: 1393739745.338615000
     Birth: Wed Dec 31 18:00:00 1969 Timestamp: 0.000000000
      CUid: 45721 CGid: 5063 Fxid: 01cb6c7b Fid: 30108795 Pid: 0 Pxid: 00000000
    XStype: adler    XS: bd 1a 09 51    ETAGs: "8082268115435520:bd1a0951"
    Layout: replica Stripes: 2 Blocksize: 4k LayoutId: 00100112 Redundancy: d0::t0 
      #Rep: 0
    (undeleted) $ 64
    *******
    error: cannot retrieve file meta data - Container #0 not found (errc=0) (Success)

I verified no FST held a file with that fid/fxid. (I won’t do that each time. With 500000+ of these it would take forever. Plus most, if not all, have no path and a container id of 0.)

# eos rm -rF fxid:01cb6c7b
# pre-configuring default route to /eos/user/r/root/
# -use $EOSHOME variable to override
info: file object removed from namespace

# eos file info fid:30108795
error: cannot retrieve file meta data - File #30108795 does not exist (found deletion tombstone) (errc=2) (No such file or directory)

# eos rm -rF fxid:01cb6c7b
# pre-configuring default route to /eos/user/r/root/
# -use $EOSHOME variable to override
error: File #30108795 does not exist (found deletion tombstone)

I assume by your description above, because the metadata contained a container id of 0, the second rm was probably not needed in this case.

Searching for the file using eos-ns-inspect shows it is gone:

Error while fetching metadata for FileMD #30108795: Error while fetching FileMD #30108795 protobuf from QDB: Empty response

So this should work to clean up some cobwebs from the namespace. Thanks for your time on this issue.

Yes, everything looks correct.