Errors in eosd logs

Hi CERN,

I am hoping you can help us.

On 2020-08-03 UTC we suffered an outage, which triggered an investigation of our full production stack. Looking at our eosd logs I noticed the following:

Our environment consists of 10 eosd clients per site with 4 sites around the country. I have also included a count (grep -c) and 3 examples of the log lines for the particular errors, with filenames replaced.

What do all these errors mean?

Which of these errors should we be concerned about?

What else should we be looking at?

Cheers
Michael


grep “No route to host” eosdlog.200803 | sort -R | head -n3
200803 03:01:01 t=1596423661.029421 f=fileOpen l=ERROR tid=00007faf207ff700 s=XrdIo:233 error= “open failed url=root://*user@eos:1094///#curl#/eos/randomfolder/randomfile?eos.app=fuse&eos.bookingsize=0&eos.encodepath=1&eos.lfn=fxid:6730f6bf&fst.valid=1596423719, errno=3014, errc=400, msg=[ERROR] Error response: No route to host”
200803 03:00:27 t=1596423627.003590 f=fileOpen l=ERROR tid=00007faf327ff700 s=XrdIo:233 error= “open failed url=root://*user@eos:1094///#curl#/eos/randomfolder/randomfile?eos.app=fuse&eos.bookingsize=0&eos.encodepath=1&eos.lfn=fxid:67315ab2&fst.valid=1596423686, errno=3014, errc=400, msg=[ERROR] Error response: No route to host”
200803 13:17:35 t=1596460655.044267 f=fileOpen l=ERROR tid=00007f81d7bff700 s=XrdIo:233 error= “open failed url=root://*user@eos:1094///#curl#/eos/randomfolder/randomfile?eos.app=fuse&eos.checksum=ignore&eos.encodepath=1&fst.valid=1596460714, errno=3014, errc=400, msg=[ERROR] Error response: No route to host”

grep -c “No route to host” eosdlog.200803
4149


grep “error code is 3014” eosdlog.200803 | sort -R | head -n3
200803 00:03:04 t=1596412984.200811 f=open l=ERROR tid=00007fa608fff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3014.
200803 04:56:07 t=1596430567.150310 f=open l=ERROR tid=00007faf317ff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3014.
200804 08:08:24 t=1596528504.675494 f=open l=ERROR tid=00007f81b8bff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3014.

grep -c “error code is 3014” eosdlog.200803
2761


grep “Sync-open got errNo=3011 errCode=400” eosdlog.200803 | sort -R | head -n3
200803 18:24:15 t=1596479055.425264 f=Open l=ERROR tid=00007ff431ffe700 s=LayoutWrapper:667 Sync-open got errNo=3011 errCode=400
200803 02:05:46 t=1596420346.415664 f=Open l=ERROR tid=00007fa5fabff700 s=LayoutWrapper:667 Sync-open got errNo=3011 errCode=400
200803 10:13:49 t=1596449629.340119 f=Open l=ERROR tid=00007f484ebef700 s=LayoutWrapper:667 Sync-open got errNo=3011 errCode=400

grep -c “Sync-open got errNo=3011 errCode=400” eosdlog.200803
79192


grep “Sync-open got errNo=3014 errCode=400” eosdlog.200803 | sort -R | head -n3
200803 04:38:24 t=1596429504.051449 f=Open l=ERROR tid=00007faf227ff700 s=LayoutWrapper:667 Sync-open got errNo=3014 errCode=400
200803 14:33:18 t=1596465198.353751 f=Open l=ERROR tid=00007f81dabff700 s=LayoutWrapper:667 Sync-open got errNo=3014 errCode=400
200803 01:52:43 t=1596419563.371132 f=Open l=ERROR tid=00007faf2d7ff700 s=LayoutWrapper:667 Sync-open got errNo=3014 errCode=400

grep -c “Sync-open got errNo=3014 errCode=400” eosdlog.200803
3684


grep “error while writing piece” eosdlog.200803 | sort -R | head -n3
200803 06:18:54 t=1596435534.551920 f=DoWrite l=ERROR tid=00007fa60bfff700 s=CacheEntry:342 error while writing piece off=72359936, len=8192
200803 07:15:46 t=1596438946.993203 f=DoWrite l=ERROR tid=00007fd274ffd700 s=CacheEntry:342 error while writing piece off=9961472, len=262144
200803 09:13:35 t=1596446015.685017 f=DoWrite l=ERROR tid=00007fa6323fd700 s=CacheEntry:342 error while writing piece off=7602176, len=262144

grep -c “error while writing piece” eosdlog.200803
2476


grep “WARNING - read lock held for” eosdlog.200803 | sort -R | head -n3
200803 08:35:24 t=1596443724.542205 f=Release l=WARN tid=00007f9c077fd700 s=RWMutex:1372 WARNING - read lock held for 20483 milliseconds by this thread:
200803 08:01:01 t=1596441661.118927 f=Release l=WARN tid=00007f30e8fff700 s=RWMutex:1372 WARNING - read lock held for 11852 milliseconds by this thread:
200803 08:12:42 t=1596442362.335476 f=Release l=WARN tid=00007f995d3fc700 s=RWMutex:1372 WARNING - read lock held for 14287 milliseconds by this thread:

grep -c “WARNING - read lock held for” eosdlog.200803
529


grep “WARNING - write lock held for” eosdlog.200803 | sort -R | head -n3
200803 09:19:46 t=1596446386.596939 f=UnLockWrite l=WARN tid=00007ff42c7ff700 s=RWMutex:424 WARNING - write lock held for 31220 milliseconds by this thread:
200803 10:49:38 t=1596451778.509734 f=UnLockWrite l=WARN tid=00007f1ed7fff700 s=RWMutex:424 WARNING - write lock held for 10191 milliseconds by this thread:
200803 09:20:25 t=1596446425.104409 f=UnLockWrite l=WARN tid=00007f32ea3ff700 s=RWMutex:424 WARNING - write lock held for 23938 milliseconds by this thread:

grep -c “WARNING - write lock held for” eosdlog.200803
15301


grep “failed to lazy open request” eosdlog.200803 | sort -R | head -n3
200803 03:39:32 t=1596425972.791198 f=LazyOpen l=ERROR tid=00007f47993ff700 s=LayoutWrapper:336 failed to lazy open request //#curl#/eos/randomfolder/randomfile?eos.app=fuse&mgm.pcmd=redirect&eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.client.openflags=rwcr&eos.client.openmode=100644 at url root://*user@eos:1094/ code=400 errno=3006
200803 01:12:25 t=1596417145.542493 f=LazyOpen l=ERROR tid=00007f0e92ffd700 s=LayoutWrapper:336 failed to lazy open request //#curl#/eos/randomfolder/randomfile?eos.app=fuse&mgm.pcmd=redirect&eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.client.openflags=rwcr&eos.client.openmode=100644 at url root://*user@eos:1094/ code=400 errno=3006
200803 01:12:44 t=1596417164.588678 f=LazyOpen l=ERROR tid=00007faf367ff700 s=LayoutWrapper:336 failed to lazy open request //#curl#/eos/randomfolder/randomfile?eos.app=fuse&mgm.pcmd=redirect&eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.client.openflags=rwcr&eos.client.openmode=100644 at url root://*user@eos:1094/ code=400 errno=3006

grep -c “failed to lazy open request” eosdlog.200803
356


grep “open failed for root://” eosdlog.200803 | sort -R | head -n3
200803 00:57:52 t=1596416272.983293 f=open l=ERROR tid=00007fe29f3ff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3011.
200803 00:44:38 t=1596415478.433658 f=open l=ERROR tid=00007f99553ff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3011.
200803 03:56:51 t=1596427011.298709 f=open l=ERROR tid=00007f38437ff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3011.

grep -c “open failed for root://” eosdlog.200803
82833


grep “Sync-open got errNo=3012 errCode=400” eosdlog.200803 | sort -R | head -n3
200803 07:43:15 t=1596440595.478528 f=Open l=ERROR tid=00007fec847ff700 s=LayoutWrapper:667 Sync-open got errNo=3012 errCode=400
200803 08:24:41 t=1596443081.700483 f=Open l=ERROR tid=00007f994c3ff700 s=LayoutWrapper:667 Sync-open got errNo=3012 errCode=400
200803 08:24:54 t=1596443094.305392 f=Open l=ERROR tid=00007f61013fe700 s=LayoutWrapper:667 Sync-open got errNo=3012 errCode=400

grep -c “Sync-open got errNo=3012 errCode=400” eosdlog.200803
31


grep “Sync-open got errNo=22 errCode=306” eosdlog.200803 | sort -R | head -n3
200803 13:41:53 t=1596462113.785068 f=Open l=ERROR tid=00007fb65e7fd700 s=LayoutWrapper:667 Sync-open got errNo=22 errCode=306
200803 13:16:21 t=1596460581.477106 f=Open l=ERROR tid=00007f451affd700 s=LayoutWrapper:667 Sync-open got errNo=22 errCode=306
200803 01:51:17 t=1596419477.201847 f=Open l=ERROR tid=00007f164a7fd700 s=LayoutWrapper:667 Sync-open got errNo=22 errCode=306

grep -c “Sync-open got errNo=22 errCode=306” eosdlog.200803
19


grep “Sync-open got errNo=3005 errCode=400” eosdlog.200803 | sort -R | head -n3
200803 10:11:08 t=1596449468.020137 f=Open l=ERROR tid=00007fe5783eb700 s=LayoutWrapper:667 Sync-open got errNo=3005 errCode=400
200803 10:12:55 t=1596449575.094346 f=Open l=ERROR tid=00007f9c057ff700 s=LayoutWrapper:667 Sync-open got errNo=3005 errCode=400
200803 10:12:59 t=1596449579.433609 f=Open l=ERROR tid=00007f9bfafff700 s=LayoutWrapper:667 Sync-open got errNo=3005 errCode=400

grep -c “Sync-open got errNo=3005 errCode=400” eosdlog.200803
300


grep -e “file root.* is already closed - won’t open” eosdlog.200803 | sort -R | head -n3
200803 08:50:17 t=1596444617.123185 f=MakeOpen l=ERROR tid=00007fa6323fd700 s=LayoutWrapper:131 file root://*user@eos///#curl#/eos/randomfolder/randomfile is already closed - won’t open
200803 01:21:32 t=1596417692.194734 f=MakeOpen l=ERROR tid=00007f0ac7bfd700 s=LayoutWrapper:131 file root://*user@eos///#curl#/eos/randomfolder/randomfile is already closed - won’t open
200803 01:31:47 t=1596418307.048399 f=MakeOpen l=ERROR tid=00007f587bbfd700 s=LayoutWrapper:131 file root://*user@eos///#curl#/eos/randomfolder/randomfile is already closed - won’t open

grep -c -e “file root.* is already closed - won’t open” eosdlog.200803
2088


grep “Error writing from wrapper : file root” eosdlog.200803 | sort -R | head -n3
200803 01:49:07 t=1596419347.084079 f=Write l=ERROR tid=00007f6121bfd700 s=LayoutWrapper:877 Error writing from wrapper : file root://*user@eos///#curl#/eos/acti-k7.cdn.aarnet.edu.au opaque eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.lfn=fxid:6730773e
200803 01:49:10 t=1596419350.557547 f=Write l=ERROR tid=00007f6121bfd700 s=LayoutWrapper:877 Error writing from wrapper : file root://*user@eos///#curl#/eos/acti-k7.cdn.aarnet.edu.au opaque eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.lfn=fxid:6730773e
200803 06:18:54 t=1596435534.054823 f=Write l=ERROR tid=00007fa60bfff700 s=LayoutWrapper:877 Error writing from wrapper : file root://*user@eos///#curl#/eos/acti-k7.cdn.aarnet.edu.au opaque eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.lfn=fxid:67339727

grep -c “Error writing from wrapper : file root” eosdlog.200803
315


grep “status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character” eosdlog.200803 | sort -R | head -n3
200803 08:22:05 t=1596442925.201014 f=xrdreq_retryonnullbuf l=ERROR tid=00007ff4337ff700 s=xrdutils:86 status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character
200803 08:22:04 t=1596442924.312292 f=xrdreq_retryonnullbuf l=ERROR tid=00007f1ebfbff700 s=xrdutils:86 status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character
200803 08:21:57 t=1596442917.841808 f=xrdreq_retryonnullbuf l=ERROR tid=00007f5f017ff700 s=xrdutils:86 status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character

grep -c “status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character” eosdlog.200803
80


grep “giving up retrying after attempt=2” eosdlog.200803 | sort -R | head -n3
200803 10:11:42 t=1596449502.190715 f=Open l=ERROR tid=00007fe5b07ff700 s=LayoutWrapper:677 giving up retrying after attempt=2
200803 10:13:06 t=1596449586.094669 f=Open l=ERROR tid=00007f9bc8bf1700 s=LayoutWrapper:677 giving up retrying after attempt=2
200803 10:14:28 t=1596449668.798087 f=Open l=ERROR tid=00007f329f3ff700 s=LayoutWrapper:677 giving up retrying after attempt=2

grep -c “giving up retrying after attempt=2” eosdlog.200803
150


errno: 3014: no route to host: network to target host does not work or target host name is bogus
errno: 3011: no such file or directory
errno: 3014: no FST available to read the file
readlock/writelock held is just slow IO
errno: 3006: invalid request (little weired), very few of them
errno: 3012: server error (not sure what this is), very few of them
errno: 3005: filesystem error (can be many things), very few
already closed (can be ignored)
writing from warpper is a top level message of one of the previous one
Invalid or incomplete multibyte or wide character: this means, that the server wanted a signature (xroot 5), which is weired
giving up retrying is just s econdary error from a previous one like no route to host or similiar

Most of this are harmless, but no route to host and no fst avilable points to network problems.

Cheers Andreas.