Hi CERN,
I am hoping you can help us.
On 2020-08-03 UTC we suffered an outage, which triggered an investigation of our full production stack. Looking at our eosd logs I noticed the following:
Our environment consists of 10 eosd clients per site with 4 sites around the country. I have also included a count (grep -c) and 3 examples of the log lines for the particular errors, with filenames replaced.
What do all these errors mean?
Which of these errors should we be concerned about?
What else should we be looking at?
Cheers
Michael
grep “No route to host” eosdlog.200803 | sort -R | head -n3
200803 03:01:01 t=1596423661.029421 f=fileOpen l=ERROR tid=00007faf207ff700 s=XrdIo:233 error= “open failed url=root://*user@eos:1094///#curl#/eos/randomfolder/randomfile?eos.app=fuse&eos.bookingsize=0&eos.encodepath=1&eos.lfn=fxid:6730f6bf&fst.valid=1596423719, errno=3014, errc=400, msg=[ERROR] Error response: No route to host”
200803 03:00:27 t=1596423627.003590 f=fileOpen l=ERROR tid=00007faf327ff700 s=XrdIo:233 error= “open failed url=root://*user@eos:1094///#curl#/eos/randomfolder/randomfile?eos.app=fuse&eos.bookingsize=0&eos.encodepath=1&eos.lfn=fxid:67315ab2&fst.valid=1596423686, errno=3014, errc=400, msg=[ERROR] Error response: No route to host”
200803 13:17:35 t=1596460655.044267 f=fileOpen l=ERROR tid=00007f81d7bff700 s=XrdIo:233 error= “open failed url=root://*user@eos:1094///#curl#/eos/randomfolder/randomfile?eos.app=fuse&eos.checksum=ignore&eos.encodepath=1&fst.valid=1596460714, errno=3014, errc=400, msg=[ERROR] Error response: No route to host”
grep -c “No route to host” eosdlog.200803
4149
grep “error code is 3014” eosdlog.200803 | sort -R | head -n3
200803 00:03:04 t=1596412984.200811 f=open l=ERROR tid=00007fa608fff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3014.
200803 04:56:07 t=1596430567.150310 f=open l=ERROR tid=00007faf317ff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3014.
200804 08:08:24 t=1596528504.675494 f=open l=ERROR tid=00007f81b8bff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3014.
grep -c “error code is 3014” eosdlog.200803
2761
grep “Sync-open got errNo=3011 errCode=400” eosdlog.200803 | sort -R | head -n3
200803 18:24:15 t=1596479055.425264 f=Open l=ERROR tid=00007ff431ffe700 s=LayoutWrapper:667 Sync-open got errNo=3011 errCode=400
200803 02:05:46 t=1596420346.415664 f=Open l=ERROR tid=00007fa5fabff700 s=LayoutWrapper:667 Sync-open got errNo=3011 errCode=400
200803 10:13:49 t=1596449629.340119 f=Open l=ERROR tid=00007f484ebef700 s=LayoutWrapper:667 Sync-open got errNo=3011 errCode=400
grep -c “Sync-open got errNo=3011 errCode=400” eosdlog.200803
79192
grep “Sync-open got errNo=3014 errCode=400” eosdlog.200803 | sort -R | head -n3
200803 04:38:24 t=1596429504.051449 f=Open l=ERROR tid=00007faf227ff700 s=LayoutWrapper:667 Sync-open got errNo=3014 errCode=400
200803 14:33:18 t=1596465198.353751 f=Open l=ERROR tid=00007f81dabff700 s=LayoutWrapper:667 Sync-open got errNo=3014 errCode=400
200803 01:52:43 t=1596419563.371132 f=Open l=ERROR tid=00007faf2d7ff700 s=LayoutWrapper:667 Sync-open got errNo=3014 errCode=400
grep -c “Sync-open got errNo=3014 errCode=400” eosdlog.200803
3684
grep “error while writing piece” eosdlog.200803 | sort -R | head -n3
200803 06:18:54 t=1596435534.551920 f=DoWrite l=ERROR tid=00007fa60bfff700 s=CacheEntry:342 error while writing piece off=72359936, len=8192
200803 07:15:46 t=1596438946.993203 f=DoWrite l=ERROR tid=00007fd274ffd700 s=CacheEntry:342 error while writing piece off=9961472, len=262144
200803 09:13:35 t=1596446015.685017 f=DoWrite l=ERROR tid=00007fa6323fd700 s=CacheEntry:342 error while writing piece off=7602176, len=262144
grep -c “error while writing piece” eosdlog.200803
2476
grep “WARNING - read lock held for” eosdlog.200803 | sort -R | head -n3
200803 08:35:24 t=1596443724.542205 f=Release l=WARN tid=00007f9c077fd700 s=RWMutex:1372 WARNING - read lock held for 20483 milliseconds by this thread:
200803 08:01:01 t=1596441661.118927 f=Release l=WARN tid=00007f30e8fff700 s=RWMutex:1372 WARNING - read lock held for 11852 milliseconds by this thread:
200803 08:12:42 t=1596442362.335476 f=Release l=WARN tid=00007f995d3fc700 s=RWMutex:1372 WARNING - read lock held for 14287 milliseconds by this thread:
grep -c “WARNING - read lock held for” eosdlog.200803
529
grep “WARNING - write lock held for” eosdlog.200803 | sort -R | head -n3
200803 09:19:46 t=1596446386.596939 f=UnLockWrite l=WARN tid=00007ff42c7ff700 s=RWMutex:424 WARNING - write lock held for 31220 milliseconds by this thread:
200803 10:49:38 t=1596451778.509734 f=UnLockWrite l=WARN tid=00007f1ed7fff700 s=RWMutex:424 WARNING - write lock held for 10191 milliseconds by this thread:
200803 09:20:25 t=1596446425.104409 f=UnLockWrite l=WARN tid=00007f32ea3ff700 s=RWMutex:424 WARNING - write lock held for 23938 milliseconds by this thread:
grep -c “WARNING - write lock held for” eosdlog.200803
15301
grep “failed to lazy open request” eosdlog.200803 | sort -R | head -n3
200803 03:39:32 t=1596425972.791198 f=LazyOpen l=ERROR tid=00007f47993ff700 s=LayoutWrapper:336 failed to lazy open request //#curl#/eos/randomfolder/randomfile?eos.app=fuse&mgm.pcmd=redirect&eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.client.openflags=rwcr&eos.client.openmode=100644 at url root://*user@eos:1094/ code=400 errno=3006
200803 01:12:25 t=1596417145.542493 f=LazyOpen l=ERROR tid=00007f0e92ffd700 s=LayoutWrapper:336 failed to lazy open request //#curl#/eos/randomfolder/randomfile?eos.app=fuse&mgm.pcmd=redirect&eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.client.openflags=rwcr&eos.client.openmode=100644 at url root://*user@eos:1094/ code=400 errno=3006
200803 01:12:44 t=1596417164.588678 f=LazyOpen l=ERROR tid=00007faf367ff700 s=LayoutWrapper:336 failed to lazy open request //#curl#/eos/randomfolder/randomfile?eos.app=fuse&mgm.pcmd=redirect&eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.client.openflags=rwcr&eos.client.openmode=100644 at url root://*user@eos:1094/ code=400 errno=3006
grep -c “failed to lazy open request” eosdlog.200803
356
grep “open failed for root://” eosdlog.200803 | sort -R | head -n3
200803 00:57:52 t=1596416272.983293 f=open l=ERROR tid=00007fe29f3ff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3011.
200803 00:44:38 t=1596415478.433658 f=open l=ERROR tid=00007f99553ff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3011.
200803 03:56:51 t=1596427011.298709 f=open l=ERROR tid=00007f38437ff700 s=filesystem:3380 open failed for root://*user@eos///#curl#/eos/randomfolder/randomfile : error code is 3011.
grep -c “open failed for root://” eosdlog.200803
82833
grep “Sync-open got errNo=3012 errCode=400” eosdlog.200803 | sort -R | head -n3
200803 07:43:15 t=1596440595.478528 f=Open l=ERROR tid=00007fec847ff700 s=LayoutWrapper:667 Sync-open got errNo=3012 errCode=400
200803 08:24:41 t=1596443081.700483 f=Open l=ERROR tid=00007f994c3ff700 s=LayoutWrapper:667 Sync-open got errNo=3012 errCode=400
200803 08:24:54 t=1596443094.305392 f=Open l=ERROR tid=00007f61013fe700 s=LayoutWrapper:667 Sync-open got errNo=3012 errCode=400
grep -c “Sync-open got errNo=3012 errCode=400” eosdlog.200803
31
grep “Sync-open got errNo=22 errCode=306” eosdlog.200803 | sort -R | head -n3
200803 13:41:53 t=1596462113.785068 f=Open l=ERROR tid=00007fb65e7fd700 s=LayoutWrapper:667 Sync-open got errNo=22 errCode=306
200803 13:16:21 t=1596460581.477106 f=Open l=ERROR tid=00007f451affd700 s=LayoutWrapper:667 Sync-open got errNo=22 errCode=306
200803 01:51:17 t=1596419477.201847 f=Open l=ERROR tid=00007f164a7fd700 s=LayoutWrapper:667 Sync-open got errNo=22 errCode=306
grep -c “Sync-open got errNo=22 errCode=306” eosdlog.200803
19
grep “Sync-open got errNo=3005 errCode=400” eosdlog.200803 | sort -R | head -n3
200803 10:11:08 t=1596449468.020137 f=Open l=ERROR tid=00007fe5783eb700 s=LayoutWrapper:667 Sync-open got errNo=3005 errCode=400
200803 10:12:55 t=1596449575.094346 f=Open l=ERROR tid=00007f9c057ff700 s=LayoutWrapper:667 Sync-open got errNo=3005 errCode=400
200803 10:12:59 t=1596449579.433609 f=Open l=ERROR tid=00007f9bfafff700 s=LayoutWrapper:667 Sync-open got errNo=3005 errCode=400
grep -c “Sync-open got errNo=3005 errCode=400” eosdlog.200803
300
grep -e “file root.* is already closed - won’t open” eosdlog.200803 | sort -R | head -n3
200803 08:50:17 t=1596444617.123185 f=MakeOpen l=ERROR tid=00007fa6323fd700 s=LayoutWrapper:131 file root://*user@eos///#curl#/eos/randomfolder/randomfile is already closed - won’t open
200803 01:21:32 t=1596417692.194734 f=MakeOpen l=ERROR tid=00007f0ac7bfd700 s=LayoutWrapper:131 file root://*user@eos///#curl#/eos/randomfolder/randomfile is already closed - won’t open
200803 01:31:47 t=1596418307.048399 f=MakeOpen l=ERROR tid=00007f587bbfd700 s=LayoutWrapper:131 file root://*user@eos///#curl#/eos/randomfolder/randomfile is already closed - won’t open
grep -c -e “file root.* is already closed - won’t open” eosdlog.200803
2088
grep “Error writing from wrapper : file root” eosdlog.200803 | sort -R | head -n3
200803 01:49:07 t=1596419347.084079 f=Write l=ERROR tid=00007f6121bfd700 s=LayoutWrapper:877 Error writing from wrapper : file root://*user@eos///#curl#/eos/acti-k7.cdn.aarnet.edu.au opaque eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.lfn=fxid:6730773e
200803 01:49:10 t=1596419350.557547 f=Write l=ERROR tid=00007f6121bfd700 s=LayoutWrapper:877 Error writing from wrapper : file root://*user@eos///#curl#/eos/acti-k7.cdn.aarnet.edu.au opaque eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.lfn=fxid:6730773e
200803 06:18:54 t=1596435534.054823 f=Write l=ERROR tid=00007fa60bfff700 s=LayoutWrapper:877 Error writing from wrapper : file root://*user@eos///#curl#/eos/acti-k7.cdn.aarnet.edu.au opaque eos.app=fuse&eos.encodepath=1&eos.bookingsize=0&eos.lfn=fxid:67339727
grep -c “Error writing from wrapper : file root” eosdlog.200803
315
grep “status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character” eosdlog.200803 | sort -R | head -n3
200803 08:22:05 t=1596442925.201014 f=xrdreq_retryonnullbuf l=ERROR tid=00007ff4337ff700 s=xrdutils:86 status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character
200803 08:22:04 t=1596442924.312292 f=xrdreq_retryonnullbuf l=ERROR tid=00007f1ebfbff700 s=xrdutils:86 status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character
200803 08:21:57 t=1596442917.841808 f=xrdreq_retryonnullbuf l=ERROR tid=00007f5f017ff700 s=xrdutils:86 status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character
grep -c “status is NOT ok : [ERROR] Error response: Invalid or incomplete multibyte or wide character” eosdlog.200803
80
grep “giving up retrying after attempt=2” eosdlog.200803 | sort -R | head -n3
200803 10:11:42 t=1596449502.190715 f=Open l=ERROR tid=00007fe5b07ff700 s=LayoutWrapper:677 giving up retrying after attempt=2
200803 10:13:06 t=1596449586.094669 f=Open l=ERROR tid=00007f9bc8bf1700 s=LayoutWrapper:677 giving up retrying after attempt=2
200803 10:14:28 t=1596449668.798087 f=Open l=ERROR tid=00007f329f3ff700 s=LayoutWrapper:677 giving up retrying after attempt=2
grep -c “giving up retrying after attempt=2” eosdlog.200803
150