A couple of questions regarding error messages on some of FSTs

sahn · May 20, 2020, 6:31am

Hello,

I have seen some repeated error messages in FSTs as below. It would be very helpful for us to understand what the root causes are.

Cannot stat unlinked flie

200520 05:53:19 time=1589953999.549347 func=SendMessage level=ERROR logid=2cf79a48-7993-11ea-82d6-b8599fa51320 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1096 tid=00007f64a8efe700 source=XrdMqClient:269 tident= sec= uid=0 gid=0 name= geo=“” msg=“failed to send message” dst=“root://jbod-mgmt-01.eoscluster.sdfarm.kr:1097//eos/jbod-mgmt-09.eoscluster.sdfarm.kr:1096/fst?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0” msg=“/eos//errorreport?xrdmqmessage.header=35073680-9a5e-11ea-8375-b8599fa51320^^/eos/jbod-mgmt-09.eoscluster.sdfarm.kr:1096/fst^^^/eos//errorreport^errorreport^1589953999^548996000^0^0^0^0^^^^0^0^&xrdmqmessage.body=200409 06:14:49 time=1586412889.525521 func=Close level=ERROR logid=6ad592de-7a29-11ea-99be-b8599fa51320 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1096 tid=00007f64c1b3e700 source=XrdFstOssFile:538 tident= sec= uid=0 gid=0 name= geo=”" error=close - cannot stat unlinked file: /jbod/box_18_disk_005/00000000/00000008"

The file chunk /jbod/box_18_disk_005/00000000/00000008 does not exist (already deleted or missing…) but still remains somewhere in namespace?

“No space left on device” but which is not true and the open failed on the file that does not exist.

200520 05:53:19 time=1589953999.549282 func=SendMessage level=ERROR logid=2c597d36-7993-11ea-9079-b8599fa51320 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1095 tid=00007f667d2fe700 source=XrdMqClient:269 tident= sec= uid=0 gid=0 name= geo=“” msg=“failed to send message” dst=“root://jbod-mgmt-01.eoscluster.sdfarm.kr:1097//eos/jbod-mgmt-09.eoscluster.sdfarm.kr:1095/fst?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0” msg=“/eos//errorreport?xrdmqmessage.header=350736bc-9a5e-11ea-8529-b8599fa51320^^/eos/jbod-mgmt-09.eoscluster.sdfarm.kr:1095/fst^^^/eos//errorreport^errorreport^1589953999^549006000^0^0^0^0^^^^0^0^&xrdmqmessage.body=200409 06:14:49 time=1586412889.421602 func=fileOpen level=ERROR logid=6ae98b54-7a29-11ea-ac45-b8599fa51320 unit=fst@jbod-mgmt-09.eoscluster.sdfarm.kr:1095 tid=00007f6695f3e700 source=XrdIo:233 tident= sec= uid=0 gid=0 name= geo=”" error= “open failed url=root://1@jbod-mgmt-07.eoscluster.sdfarm.kr:1096///eos/gsdc/testrain/file1g-jbod-mgmt-06?cap.msg=dSmf7AbtE6fqPrXYZ84pG0aUzhyUTOs8Smne8PsWeXvOzJ8Dc0teWGQMvZNp8ev9W46WjcYtJvh3IqQ9RroKkXr6Ds/u8Sc4VqrrHMDX8tElrWd3Kp/XmTdjgBwGtI4NgLbpG/EMMxGILq3lEfe7K92SQgUNp19IEz55A5DOX/ES9hEbeuAOsOg4q7Ww87C+J9m/auCkI0+8/AbZhGkKHv5DCGufb4ZpxZSkYWq+5W9wWWfvtEz02l/DWJfne1DxQH2xXlbov7OTnjWsIvy2HQQ1MpAzMTV6e04YgnKo5CgchPfy/dQIBQ8R6TXoII+LyS6q9HLYF9E/LXHfbBeEPItlodb5oqlbyxvU7lsHInSvs98xAiViZ49onmUz8Iv9lFVgnwlrWU2sU2sGah4owjBrDL2raSuaamIb/fKlFY9bhBdDEHX8BfCWjwVFE8kHxK2nsEmvmPbym2KlQThCGNErtfnwvlarwBP8mGBc6HfQw/bCPmk7qkJsIA5zdx3N6YG8s40SQdgUwFQM8Hj+k0Arl4bSjnS7pGQ7UCAZVoceCY+0qLVOydARRTq0VQRqsdd4hJx8+xrPz5BmDc0Ia5vnarI5cs2ZNqjvd40bNnM9wvYvZf6PGPUSYVdCSp908OJizqHywHvFw7PJLODN9kVLKzGERE8vcxqwQ4U7AVMHfQ/OxNnZn52uOGIYdhlV4Bk79kfpX9hcbin8n16m1l4q78cp8NjLZNx/sz9xp7LBVPM4ChJ4dkJmiUSGQ/MsDhfh6UHXqKMNnrBiKIK8+sLP0CxWyR3i+tv4V5SN3p1UGfClevvw4K+S4s7yfSZ93+tmZ6MA2npT1lS7RWmbp7pLdAvtMaHC0QUc4rwcu+xzZbtaMQnYLnV/4Niuol540NPNldBoN/vGJJsGTKWL35YkbVgM2cQ0AL0T4GPNltGg8dH/sbqQriCKsQq9ZyuAA8YHP1S9zWNqgMlum4arpS7nEGJTHUTedKKb0/lcqJx9uyafOgHgogFEjGAL7itWfaVtW0fonAty/QmtC/6dTuykQwD0+6xDoCPb6CMCeI8irglwZ4TVcE8+db60N7+pE7odRFCMuOeJdFJp+A2Ma+h/EuNsRgeaN1VHUyYprG6qN3DJiOCEyml54eYrDVvm/Fs43XiF10gzMVkjv+ygNe38TKul19zO4Zp5QVss5CBKUj+EoIbpetWsUXTUi205BBlA7BJQdsH6y3ttexIAe8mJBmjnX79gmAz63pByKPt9zo57IMngJF2fJJwlUwyfbItHfmmm/IbvrYXVUZCUC3iVl8iq3jgUIKxW8Q/PviNHKKbCQjgNLMNVY3mUI6h/uObjmabzQYNiXZHtXQmiLJzaBc6u4P45q72NVBC+kyI5qleV6fKUspZPCyJhpz1JRhwee/1QAsPbieMnujRVu6Bj5lfXHYQBOFrV5uM6UMVw1DlDD5WP2wdthMJrYVy1Nizyq+0tzD6Wj19vbE7p7sxl1DNmw6ERiyOWi+Q2Hid8xycaekTmR/LmLB+n7X6HZ/Jn91KaJdrGKBI8NL27eJOFzqBDv1/jdWRp7kG2dw1nF06HsFOCb9Vp1zudqw2pIUKMRHyfj+J7L4ynqlCqSzaVbv48sEyk0aGfeIbMzE2Ln2ipWwWjeQXZNH/dYlEP1d60OM2cObrSV2t27g1eb9EE88kUTf0qccf8TTbPGI+Y1ljMgDrR+9HQ/CXH8Ck3NetXX4kb9saYP3y9WS9Z6up3ZbcFhMpVTnngbz+wWHJNUxcr1ISmEtmuUxEd1G2b77H34ys9zwXDEd3ZfSCCdZz0jgdbRaHsTJ4Y15es+FJaHlS6EXzm345WTiMfiQTLBT4rDONRLyaNvlsG3PTfNuHtz6E6c/GIDJn6UFm3/VaicDwsLcWK5Q/eupMPZ/PW5u1SGqy3q+85lYFSTH7JSGThAENW3POpDz1EoiSvKiLYhBE6Uxg/qnRIkI8QG2xk+ngWG4b4LxwzprXze8rh+zYtUJyTkjqlJHaGVPrdi2o=#and#cap.sym=GK+jmevjSPwa/N0MaGwsvDmrjrk=#and#eos.app=eoscp#and#eos.bookingsize=1073741824#and#eos.targetsize=1073741824#and#fst.blocksize=1048576#and#fst.readahead=true#and#fst.valid=1586412948#and#mgm.bookingsize=1073741824#and#mgm.id=00000010#and#mgm.lid=1080299346#and#mgm.logid=6adb06d8-7a29-11ea-a846-b8599fa51330#and#mgm.path=/eos/gsdc/testrain/file1g-jbod-mgmt-06#and#mgm.replicahead=0#and#mgm.replicaindex=9#and#tried=jbod-mgmt-03.eoscluster.sdfarm.kr#and#triedrc=ioerr, errno=3009, errc=400, msg=[ERROR] Error response: No space left on device”"

Is this related to the first one?

And the two minor error messages repeatedly shown…

“debug level is not known”

200520 06:14:10 time=1589955250.217266 func=SendMessage level=ERROR logid=29977c6a-7993-11ea-a076-b8599fa51240 unit=fst@jbod-mgmt-02.eoscluster.sdfarm.kr:1095 tid=00007f51aeefd700 source=XrdMqClient:269 tident= sec= uid=0 gid=0 name= geo=“” msg=“failed to send message” dst=“root://jbod-mgmt-01.eoscluster.sdfarm.kr:1097//eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1095/fst?xmqclient.advisory.flushbacklog=0&xmqclient.advisory.query=0&xmqclient.advisory.status=0” msg=“/eos//errorreport?xrdmqmessage.header=1e7c010e-9a61-11ea-8371-b8599fa51240^^/eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1095/fst^^^/eos//errorreport^errorreport^1589955250^216990000^0^0^0^0^^^^0^0^&xrdmqmessage.body=200408 12:19:17 time=1586348357.354038 func=ProcessFstConfigChange level=ERROR logid=static… unit=fst@jbod-mgmt-02.eoscluster.sdfarm.kr:1095 tid=00007f51ad7ff700 source=Communicator:191 tident= sec=(null) uid=99 gid=99 name=- geo=”" debug level is not known!"

Cannot send errorreport broadcast

200520 06:22:00 time=1589955720.285211 func=ErrorReport level=ERROR logid=FstOfsStorage unit=fst@jbod-mgmt-02.eoscluster.sdfarm.kr:1095 tid=00007f51aeefd700 source=ErrorReport:92 tident= sec= uid=0 gid=0 name= geo=“” cannot send errorreport broadcast

This is the version information that I am using:
sh-4.2# eos version
EOS_INSTANCE=gsdc
EOS_SERVER_VERSION=4.7.7 EOS_SERVER_RELEASE=1
EOS_CLIENT_VERSION=4.7.7 EOS_CLIENT_RELEASE=1

QuarkDB information is following:

sh-4.2# redis-cli -p 7777 raft-info

TERM 1696

LOG-START 0

LOG-SIZE 4225838

LEADER jbod-mgmt-08.eoscluster.sdfarm.kr:7777

CLUSTER-ID 49612085-f367-4a4d-a181-97706ec69d20

COMMIT-INDEX 4225837

LAST-APPLIED 4225837

BLOCKED-WRITES 0

LAST-STATE-CHANGE 755935 (8 days, 17 hours, 58 minutes, 55 seconds)

MYSELF jbod-mgmt-08.eoscluster.sdfarm.kr:7777

VERSION 0.4.2

STATUS LEADER

NODE-HEALTH GREEN

JOURNAL-FSYNC-POLICY sync-important-updates

MEMBERSHIP-EPOCH 0

NODES jbod-mgmt-02.eoscluster.sdfarm.kr:7777,jbod-mgmt-05.eoscluster.sdfarm.kr:7777,jbod-mgmt-08.eoscluster.sdfarm.kr:7777

OBSERVERS

QUORUM-SIZE 2

REPLICA jbod-mgmt-02.eoscluster.sdfarm.kr:7777 | ONLINE | UP-TO-DATE | NEXT-INDEX 4225838 | VERSION 0.4.2

REPLICA jbod-mgmt-05.eoscluster.sdfarm.kr:7777 | ONLINE | UP-TO-DATE | NEXT-INDEX 4225838 | VERSION 0.4.2

Please just let me know if you need any further information.

Thank you.

Best regards,
Sang-Un

sahn · June 21, 2020, 11:18pm

Hi,

The first error disappeared when the whole EOS instance was rebuilt from scratch.

Regarding the question number 4, I think that FSTs cannot communicate with MQ. While in MQ log on master node look like:

200621 23:12:23 026 MqOfs_FSctl: Unable to submit message - no listener on requested queue:  /eos/*/errorreport; Invalid argument
200621 23:12:23 026 daemon.6:23@jbod-mgmt-02.eoscluster MqOfs_FSctl: no listener on requested queue:
200621 23:12:23 026 daemon.6:23@jbod-mgmt-02.eoscluster MqOfs_FSctl: /eos/*/errorreport
200621 23:12:23 383 MqOfs_FSctl: Unable to submit message - no listener on requested queue:  /eos/*/errorreport; Invalid argument
200621 23:12:23 383 daemon.6:30@jbod-mgmt-02.eoscluster MqOfs_FSctl: no listener on requested queue:
200621 23:12:23 383 daemon.6:30@jbod-mgmt-02.eoscluster MqOfs_FSctl: /eos/*/errorreport
200621 23:12:23 time=1592781143.839040 func=open                     level=INFO  logid=aa56ec4e-b414-11ea-8f8d-b8599fa51330 unit=mq@jbod-mgmt-01.eoscluster.sdfarm.kr:1097 tid=00007f066fa2a700 source=XrdMqOfs:94                    tident=daemon.6:23@jbod-mgmt-02.eoscluster sec=      uid=0 gid=0 name= geo="" connecting queue: /eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1095/fst_mq_test
200621 23:12:23 time=1592781143.839064 func=open                     level=INFO  logid=aa56ec4e-b414-11ea-8f8d-b8599fa51330 unit=mq@jbod-mgmt-01.eoscluster.sdfarm.kr:1097 tid=00007f066fa2a700 source=XrdMqOfs:139                   tident=daemon.6:23@jbod-mgmt-02.eoscluster sec=      uid=0 gid=0 name= geo="" connected queue: /eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1095/fst_mq_test
200621 23:12:23 time=1592781143.839151 func=open                     level=INFO  logid=aa56f0a4-b414-11ea-84a8-b8599fa51330 unit=mq@jbod-mgmt-01.eoscluster.sdfarm.kr:1097 tid=00007f0666ab7700 source=XrdMqOfs:94                    tident=daemon.6:30@jbod-mgmt-02.eoscluster sec=      uid=0 gid=0 name= geo="" connecting queue: /eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1096/fst_mq_test
200621 23:12:23 time=1592781143.839171 func=open                     level=INFO  logid=aa56f0a4-b414-11ea-84a8-b8599fa51330 unit=mq@jbod-mgmt-01.eoscluster.sdfarm.kr:1097 tid=00007f0666ab7700 source=XrdMqOfs:139                   tident=daemon.6:30@jbod-mgmt-02.eoscluster sec=      uid=0 gid=0 name= geo="" connected queue: /eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1096/fst_mq_test
200621 23:12:23 time=1592781143.839253 func=close                    level=INFO  logid=aa56ec4e-b414-11ea-8f8d-b8599fa51330 unit=mq@jbod-mgmt-01.eoscluster.sdfarm.kr:1097 tid=00007f066fa2a700 source=XrdMqOfs:256                   tident=daemon.6:23@jbod-mgmt-02.eoscluster sec=      uid=0 gid=0 name= geo="" disconnecting queue: /eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1095/fst_mq_test
200621 23:12:23 time=1592781143.839320 func=close                    level=INFO  logid=aa56ec4e-b414-11ea-8f8d-b8599fa51330 unit=mq@jbod-mgmt-01.eoscluster.sdfarm.kr:1097 tid=00007f066fa2a700 source=XrdMqOfs:292                   tident=daemon.6:23@jbod-mgmt-02.eoscluster sec=      uid=0 gid=0 name= geo="" disconnected queue: /eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1095/fst_mq_test
200621 23:12:23 time=1592781143.839340 func=close                    level=INFO  logid=aa56f0a4-b414-11ea-84a8-b8599fa51330 unit=mq@jbod-mgmt-01.eoscluster.sdfarm.kr:1097 tid=00007f0666ab7700 source=XrdMqOfs:256                   tident=daemon.6:30@jbod-mgmt-02.eoscluster sec=      uid=0 gid=0 name= geo="" disconnecting queue: /eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1096/fst_mq_test
200621 23:12:23 time=1592781143.839392 func=close                    level=INFO  logid=aa56f0a4-b414-11ea-84a8-b8599fa51330 unit=mq@jbod-mgmt-01.eoscluster.sdfarm.kr:1097 tid=00007f0666ab7700 source=XrdMqOfs:292                   tident=daemon.6:30@jbod-mgmt-02.eoscluster sec=      uid=0 gid=0 name= geo="" disconnected queue: /eos/jbod-mgmt-02.eoscluster.sdfarm.kr:1096/fst_mq_test

I have tried to look at MQ with debug mode but I could not get any hints. Is there any way to have further information to tackle down this issue?

Thank you.

Best regards,
Sang-Un

esindril · June 22, 2020, 8:17am

Hi Sang-Un,

I think your MQ daemon is working fine, there is not much log that you can print. Actually you can set the value below to high in /etc/xrd.cf.mq but usually it’s not very informative:

# low|medium|high as trace levels
mq.trace low

What might be the issue here is that on the MGM there should be the following process that consumes these messages that the FSTs are trying to sent related to “errorreports”:

ps aux | grep "console"
root      4891  0.6  0.0 636336 14868 ?        Sl   Jun20  14:33 eos -b console log _MGMID_

Can you check if you have this running? This is normally started automatically tby the MGM process.

Cheers,
Elvin

sahn · June 24, 2020, 11:20pm

Hi Elvin,

Yes. I can see “console” running in MGM.

sh-4.2# ps aux | grep "console"
root       404  0.0  0.0 224416  9724 pts/0    Sl+  Jun17   0:41 eos -b console log _MGMID_

Then, what would you suggest to do? If this is a normal process and the error logs are not actually erroneous, can it be left running?

When I tried to lift the trace level to high, as you noted there are huge bunch of messages which I cannot figure out what they mean… so then went back to low level.

Best regards,
Sang-Un

esindril · June 25, 2020, 8:26am

Hi Sang-Un,

We are also experiencing similar error messages (Unable to submit message - no listener on requested queue: /eos/*/errorreport) in some of our instances but we haven’t yet figured it out how to mitigate this. This does not affect the actual performance of the instance, but we’ll look into fixing it. I will let you know when I have more news.

Thanks,
Elvin

sahn · June 26, 2020, 1:29am

Hi Elvin,

Thanks a lot for the explanation. I will note and hope to be fixed sooner or later.

Best regards,
Sang-Un

CERN Accelerating science

A couple of questions regarding error messages on some of FSTs