Hello,
I keep seeing the following error messages repeated in the FST logs. It would be very helpful for us to understand their root causes.
- Cannot stat unlinked file
200520 05:53:19 time=1589953999.549347 func=SendMessage level=ERROR logid=2cf79a48-7993-11ea-82d6-b8599fa51320 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1096 tid=00007f64a8efe700 source=XrdMqClient:269 tident= sec= uid=0 gid=0 name= geo="" msg="failed to send message" dst="root://jbod-mgmt-01.eoscluster.sdfarm.kr:1097//eos/jbod-mgmt-09.eoscluster.sdfarm.kr:1096/fst?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0" msg="/eos//errorreport?xrdmqmessage.header=35073680-9a5e-11ea-8375-b8599fa51320^^/eos/jbod-mgmt-09.eoscluster.sdfarm.kr:1096/fst^^^/eos//errorreport^errorreport^1589953999^548996000^0^0^0^0^^^^0^0^&xrdmqmessage.body=200409 06:14:49 time=1586412889.525521 func=Close level=ERROR logid=6ad592de-7a29-11ea-99be-b8599fa51320 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1096 tid=00007f64c1b3e700 source=XrdFstOssFile:538 tident= sec= uid=0 gid=0 name= geo="" error=close - cannot stat unlinked file: /jbod/box_18_disk_005/00000000/00000008"
Does this mean that the file chunk /jbod/box_18_disk_005/00000000/00000008 no longer exists on disk (already deleted or missing) but is still referenced somewhere in the namespace?
- "No space left on device", although the device is not actually full, and the open failed on a file that does not exist.
200520 05:53:19 time=1589953999.549282 func=SendMessage level=ERROR logid=2c597d36-7993-11ea-9079-b8599fa51320 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1095 tid=00007f667d2fe700 source=XrdMqClient:269 tident= sec= uid=0 gid=0 name= geo=“” msg=“failed to send message” dst=“root://jbod-mgmt-01.eoscluster.sdfarm.kr:1097//eos/jbod-mgmt-09.eoscluster.sdfarm.kr:1095/fst?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0” msg=“/eos//errorreport?xrdmqmessage.header=350736bc-9a5e-11ea-8529-b8599fa51320^^/eos/jbod-mgmt-09.eoscluster.sdfarm.kr:1095/fst^^^/eos//errorreport^errorreport^1589953999^549006000^0^0^0^0^^^^0^0^&xrdmqmessage.body=200409 06:14:49 time=1586412889.421602 func=fileOpen level=ERROR logid=6ae98b54-7a29-11ea-ac45-b8599fa51320 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1095 tid=00007f6695f3e700 source=XrdIo:233 tident= sec= uid=0 gid=0 name= geo=”" error= “open failed url=root://1@jbod-mgmt-07.eoscluster.sdfarm.kr:1096///eos/gsdc/testrain/file1g-jbod-mgmt-06?cap.msg=dSmf7AbtE6fqPrXYZ84pG0aUzhyUTOs8Smne8PsWeXvOzJ8Dc0teWGQMvZNp8ev9W46WjcYtJvh3IqQ9RroKkXr6Ds/u8Sc4VqrrHMDX8tElrWd3Kp/XmTdjgBwGtI4NgLbpG/EMMxGILq3lEfe7K92SQgUNp19IEz55A5DOX/ES9hEbeuAOsOg4q7Ww87C+J9m/auCkI0+8/AbZhGkKHv5DCGufb4ZpxZSkYWq+5W9wWWfvtEz02l/DWJfne1DxQH2xXlbov7OTnjWsIvy2HQQ1MpAzMTV6e04YgnKo5CgchPfy/dQIBQ8R6TXoII+LyS6q9HLYF9E/LXHfbBeEPItlodb5oqlbyxvU7lsHInSvs98xAiViZ49onmUz8Iv9lFVgnwlrWU2sU2sGah4owjBrDL2raSuaamIb/fKlFY9bhBdDEHX8BfCWjwVFE8kHxK2nsEmvmPbym2KlQThCGNErtfnwvlarwBP8mGBc6HfQw/bCPmk7qkJsIA5zdx3N6YG8s40SQdgUwFQM8Hj+k0Arl4bSjnS7pGQ7UCAZVoceCY+0qLVOydARRTq0VQRqsdd4hJx8+xrPz5BmDc0Ia5vnarI5cs2ZNqjvd40bNnM9wvYvZf6PGPUSYVdCSp908OJizqHywHvFw7PJLODN9kVLKzGERE8vcxqwQ4U7AVMHfQ/OxNnZn52uOGIYdhlV4Bk79kfpX9hcbin8n16m1l4q78cp8NjLZNx/sz9xp7LBVPM4ChJ4dkJmiUSGQ/MsDhfh6UHXqKMNnrBiKIK8+sLP0CxWyR3i+tv4V5SN3p1UGfClevvw4K+S4s7yfSZ93+tmZ6MA2npT1lS7RWmbp7pLdAvtMaHC0QUc4rwcu+xzZbtaMQnYLnV/4Niuol540NPNldBoN/vGJJsGTKWL35YkbVgM2cQ0AL0T4GPNltGg8dH/sbqQriCKsQq9ZyuAA8YHP1S9zWNqg
Mlum4arpS7nEGJTHUTedKKb0/lcqJx9uyafOgHgogFEjGAL7itWfaVtW0fonAty/QmtC/6dTuykQwD0+6xDoCPb6CMCeI8irglwZ4TVcE8+db60N7+pE7odRFCMuOeJdFJp+A2Ma+h/EuNsRgeaN1VHUyYprG6qN3DJiOCEyml54eYrDVvm/Fs43XiF10gzMVkjv+ygNe38TKul19zO4Zp5QVss5CBKUj+EoIbpetWsUXTUi205BBlA7BJQdsH6y3ttexIAe8mJBmjnX79gmAz63pByKPt9zo57IMngJF2fJJwlUwyfbItHfmmm/IbvrYXVUZCUC3iVl8iq3jgUIKxW8Q/PviNHKKbCQjgNLMNVY3mUI6h/uObjmabzQYNiXZHtXQmiLJzaBc6u4P45q72NVBC+kyI5qleV6fKUspZPCyJhpz1JRhwee/1QAsPbieMnujRVu6Bj5lfXHYQBOFrV5uM6UMVw1DlDD5WP2wdthMJrYVy1Nizyq+0tzD6Wj19vbE7p7sxl1DNmw6ERiyOWi+Q2Hid8xycaekTmR/LmLB+n7X6HZ/Jn91KaJdrGKBI8NL27eJOFzqBDv1/jdWRp7kG2dw1nF06HsFOCb9Vp1zudqw2pIUKMRHyfj+J7L4ynqlCqSzaVbv48sEyk0aGfeIbMzE2Ln2ipWwWjeQXZNH/dYlEP1d60OM2cObrSV2t27g1eb9EE88kUTf0qccf8TTbPGI+Y1ljMgDrR+9HQ/CXH8Ck3NetXX4kb9saYP3y9WS9Z6up3ZbcFhMpVTnngbz+wWHJNUxcr1ISmEtmuUxEd1G2b77H34ys9zwXDEd3ZfSCCdZz0jgdbRaHsTJ4Y15es+FJaHlS6EXzm345WTiMfiQTLBT4rDONRLyaNvlsG3PTfNuHtz6E6c/GIDJn6UFm3/VaicDwsLcWK5Q/eupMPZ/PW5u1SGqy3q+85lYFSTH7JSGThAENW3POpDz1EoiSvKiLYhBE6Uxg/qnRIkI8QG2xk+ngWG4b4LxwzprXze8rh+zYtUJyTkjqlJHaGVPrdi2o=#and#cap.sym=GK+jmevjSPwa/N0MaGwsvDmrjrk=#and#eos.app=eoscp#and#eos.bookingsize=1073741824#and#eos.targetsize=1073741824#and#fst.blocksize=1048576#and#fst.readahead=true#and#fst.valid=1586412948#and#mgm.bookingsize=1073741824#and#mgm.id=00000010#and#mgm.lid=1080299346#and#mgm.logid=6adb06d8-7a29-11ea-a846-b8599fa51330#and#mgm.path=/eos/gsdc/testrain/file1g-jbod-mgmt-06#and#mgm.replicahead=0#and#mgm.replicaindex=9#and#tried=jbod-mgmt-03.eoscluster.sdfarm.kr#and#triedrc=ioerr, errno=3009, errc=400, msg=[ERROR] Error response: No space left on device”"
Is this related to the first error?
There are also two minor error messages that appear repeatedly:
- “debug level is not known”
200520 06:14:10 time=1589955250.217266 func=SendMessage level=ERROR logid=29977c6a-7993-11ea-a076-b8599fa51240 unit=fst@jbod-mgmt-02.eoscluster.sdfarm.kr:1095 tid=00007f51aeefd700 source=XrdMqClient:269 tident= sec= uid=0 gid=0 name= geo="" msg="failed to send message" dst="root://jbod-mgmt-01.eoscluster.sdfarm.kr:1097//eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1095/fst?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0" msg="/eos//errorreport?xrdmqmessage.header=1e7c010e-9a61-11ea-8371-b8599fa51240^^/eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1095/fst^^^/eos//errorreport^errorreport^1589955250^216990000^0^0^0^0^^^^0^0^&xrdmqmessage.body=200408 12:19:17 time=1586348357.354038 func=ProcessFstConfigChange level=ERROR logid=static… unit=fst@jbod-mgmt-02.eoscluster.sdfarm.kr:1095 tid=00007f51ad7ff700 source=Communicator:191 tident= sec=(null) uid=99 gid=99 name=- geo="" debug level is not known!"
- Cannot send errorreport broadcast
200520 06:22:00 time=1589955720.285211 func=ErrorReport level=ERROR logid=FstOfsStorage unit=fst@jbod-mgmt-02.eoscluster.sdfarm.kr:1095 tid=00007f51aeefd700 source=ErrorReport:92 tident= sec= uid=0 gid=0 name= geo="" cannot send errorreport broadcast
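One detail that may or may not be relevant: in the SendMessage lines above, the outer log entry is dated 200520, while the message embedded after `xrdmqmessage.body=` is dated 200409 (or 200408). To compare the two timestamps across many lines, a minimal Python sketch like the following could be used. It is my own helper, not an EOS tool; the function name `report_times` and the regular expression are assumptions based only on the log format shown in this post.

```python
import re
from datetime import datetime, timezone

# Matches a "time=<epoch seconds>.<fraction>" field as seen in the logs above.
TIME_RE = re.compile(r"\btime=(\d+)\.(\d+)")


def _to_utc(match):
    """Convert a TIME_RE match to a UTC datetime (whole seconds only)."""
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


def report_times(line):
    """Return (outer, embedded) UTC datetimes for one FST log line.

    'outer' is the time of the log entry itself; 'embedded' is the time of
    the message carried after 'xrdmqmessage.body=', or None if the line has
    no such payload.
    """
    head, sep, body = line.partition("xrdmqmessage.body=")
    outer = _to_utc(TIME_RE.search(head))
    embedded = _to_utc(TIME_RE.search(body)) if sep else None
    return outer, embedded


if __name__ == "__main__":
    # Shortened copy of the first error line in this post.
    sample = (
        "200520 05:53:19 time=1589953999.549347 func=SendMessage level=ERROR "
        'msg="/eos//errorreport?xrdmqmessage.body=200409 06:14:49 '
        'time=1586412889.525521 func=Close level=ERROR"'
    )
    outer, embedded = report_times(sample)
    print("outer:   ", outer)
    print("embedded:", embedded)
    print("age:     ", outer - embedded)
```

Running it over the full FST log would show whether every failed SendMessage carries an equally old embedded error report, or only some of them.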
This is the EOS version I am using:
sh-4.2# eos version
EOS_INSTANCE=gsdc
EOS_SERVER_VERSION=4.7.7 EOS_SERVER_RELEASE=1
EOS_CLIENT_VERSION=4.7.7 EOS_CLIENT_RELEASE=1
The QuarkDB information is as follows:
sh-4.2# redis-cli -p 7777 raft-info
- TERM 1696
- LOG-START 0
- LOG-SIZE 4225838
- LEADER jbod-mgmt-08.eoscluster.sdfarm.kr:7777
- CLUSTER-ID 49612085-f367-4a4d-a181-97706ec69d20
- COMMIT-INDEX 4225837
- LAST-APPLIED 4225837
- BLOCKED-WRITES 0
- LAST-STATE-CHANGE 755935 (8 days, 17 hours, 58 minutes, 55 seconds)
- MYSELF jbod-mgmt-08.eoscluster.sdfarm.kr:7777
- VERSION 0.4.2
- STATUS LEADER
- NODE-HEALTH GREEN
- JOURNAL-FSYNC-POLICY sync-important-updates
- MEMBERSHIP-EPOCH 0
- NODES jbod-mgmt-02.eoscluster.sdfarm.kr:7777,jbod-mgmt-05.eoscluster.sdfarm.kr:7777,jbod-mgmt-08.eoscluster.sdfarm.kr:7777
- OBSERVERS
- QUORUM-SIZE 2
- REPLICA jbod-mgmt-02.eoscluster.sdfarm.kr:7777 | ONLINE | UP-TO-DATE | NEXT-INDEX 4225838 | VERSION 0.4.2
- REPLICA jbod-mgmt-05.eoscluster.sdfarm.kr:7777 | ONLINE | UP-TO-DATE | NEXT-INDEX 4225838 | VERSION 0.4.2
Please just let me know if you need any further information.
Thank you.
Best regards,
Sang-Un