peby
(Pete Eby)
January 9, 2020, 2:01am
1
eos community,
eos@mgm service was failing to start, reporting an issue with the directories md file:
ident=<service> sec= uid=0 gid=0 name= geo="" initialization returned ec=14 Unrecognized file type: /var/eos/md/directories.ornl-eos-01.ornl.gov.mdlog
Running file on the directories*.mdlog reported it as type gzip rather than data.
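A minimal sketch of the same check without `file`, assuming the "gzip" verdict comes from the standard two-byte gzip signature (`1f 8b`) at offset 0. `file` consults a much larger magic database, so this is only an approximation; `looks_like_gzip` is a hypothetical helper name, not part of EOS.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file at `path` begins with the gzip signature."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


if __name__ == "__main__":
    import sys

    # Usage (assumed invocation): python check_gzip.py /var/eos/md/*.mdlog
    for p in sys.argv[1:]:
        print(p, "gzip signature" if looks_like_gzip(p) else "no gzip signature")
```

A healthy changelog should not start with these two bytes, so a True result only indicates that the start of the file has been overwritten or damaged.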
Backed up the existing md files and ran eos-log-repair (output below). After the repair the file type is reported as ‘data’; however, eos now fails to start with:
d ec=14 Log file exists: /var/eos/md/directories.ornl-eos-01.ornl.gov.mdlog and the requested content flag (0x2) does not match the one read from file (0x0)
Any suggestions are appreciated.
Cheers,
Pete
eos-log-repair output:
[root@ornl-eos-01 md]# eos-log-repair /var/eos/md/directories.$HOSTNAME.mdlog.tmp /var/eos/md/directories.$HOSTNAME.mdlog && eos-log-repair /var/eos/md/files.$HOSTNAME.mdlog.tmp /var/eos/md/files.$HOSTNAME.mdlog
Header status: broken (Unrecognized file type: /var/eos/md/directories.ornl-eos-01.ornl.gov.mdlog.tmp)
error: discarded block from offset [ 8 <=> 208 ] [ len=512 ]
Elapsed time: 26 m. 27 s. Progress: 26.791 GB / 27.868 GB
error: discarded block from offset [ 6b398dc68 <=> 6b398f800 ] [ len=7064 ]
Elapsed time: 27 m. 17 s. Progress: 27.868 GB / 27.868 GB
Scanned: 227243988
Healthy: 227243986
Bytes total: 29923771184
Bytes accepted: 29923763608
Bytes discarded: 7576
Not fixed: 2
Fixed (wrong magic): 0
Fixed (wrong checksum): 0
Fixed (wrong size): 0
Elapsed time: 27 m. 17 s.
Header status: OK (version: 0x1, content: 0x1)
Elapsed time: 28 m. 57 s. Progress: 27.819 GB / 27.819 GB
Scanned: 228458001
Healthy: 228458001
Bytes total: 29871443796
Bytes accepted: 29871443796
Bytes discarded: 0
Not fixed: 0
Fixed (wrong magic): 0
Fixed (wrong checksum): 0
Fixed (wrong size): 0
Elapsed time: 28 m. 57 s.
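A rough sketch for sanity-checking a repair summary like the one above. It collects the `Key: integer` lines and tests whether total bytes equal accepted plus discarded bytes. That relation is an assumption based on the numbers posted here (29923771184 = 29923763608 + 7576), not a documented guarantee of eos-log-repair, and the function names are hypothetical.

```python
import re


def parse_repair_summary(text):
    """Collect 'Key: integer' lines from a repair summary into a dict."""
    stats = {}
    for line in text.splitlines():
        # Only lines of the exact form "<key>: <digits>" are kept, so
        # "Elapsed time: ..." and "Header status: ..." lines are skipped.
        m = re.fullmatch(r"\s*([^:]+):\s*(\d+)\s*", line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2))
    return stats


def bytes_balance(stats):
    """Assumed invariant: total bytes = accepted bytes + discarded bytes."""
    return stats["Bytes total"] == stats["Bytes accepted"] + stats["Bytes discarded"]


if __name__ == "__main__":
    import sys

    # Usage (assumed): eos-log-repair old.mdlog new.mdlog | python check_summary.py
    summary = parse_repair_summary(sys.stdin.read())
    print(summary)
    if "Bytes total" in summary:
        print("bytes balance:", bytes_balance(summary))
```

For the directories log above, the two discarded blocks (len=512 and len=7064) add up to the reported 7576 discarded bytes.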
apeters
(Andreas Joachim Peters)
January 10, 2020, 10:01am
2
You have a file corruption inside the header.
Have a look with
od -x directories.mdlog | less
and the first 8 bytes have to look like:
0000000 4847 4543 0201 0100
and probably you have 00?? instead of 02?? …
If you fix that, it will boot …
Cheers Andreas.
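A small sketch of the same header inspection in Python, for anyone who prefers a script to `od`. It prints the first 8 bytes as 16-bit little-endian words, which is how `od -x` displays them on an x86 host, and compares them with the value quoted above. The names `header_words` and `EXPECTED_WORDS` are made up for this example, and the meaning of the individual header fields is not assumed here.

```python
import struct

EXPECTED_WORDS = "4847 4543 0201 0100"  # value quoted in the post above


def header_words(path, nbytes=8):
    """Return the first `nbytes` of a file as 16-bit little-endian hex words."""
    if nbytes % 2:
        raise ValueError("nbytes must be even")
    with open(path, "rb") as f:
        data = f.read(nbytes)
    if len(data) != nbytes:
        raise ValueError("file is shorter than %d bytes" % nbytes)
    # "<" forces little-endian so the result does not depend on the host.
    words = struct.unpack("<%dH" % (nbytes // 2), data)
    return " ".join("%04x" % w for w in words)


if __name__ == "__main__":
    import sys

    # Usage (assumed): python header_words.py directories.mdlog
    for p in sys.argv[1:]:
        found = header_words(p)
        print(p, found, "OK" if found == EXPECTED_WORDS else "differs")
```

This only reads the file. Any fix to the header should still be made on a copy, with a hex editor or similar.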
peby
(Pete Eby)
January 10, 2020, 3:41pm
3
Hi Andreas,
Thanks for the hint. Edited that octet (which was 00 00) with hexedit, and the header is now:
# hexdump -n8 directories.ornl-eos-01.ornl.gov.mdlog
0000000 4847 4543 0201 0000
Restart now scans the file okay, but then boot fails with:
PROGRESS [ scan directories.ornl-eos-01.ornl.gov.mdlog ] 98% estimate 1.8s [ 89s/91s ]
ALERT [ directories.ornl-eos-01.ornl.gov.mdlog ] finished in 90s
200110 16:12:05 time=1578669125.773108 func=BootNamespace level=CRIT logid=5a53da88-33bb-11ea-85c5-0060dd4265f8 unit=mgm@ornl-eos-01.ornl.gov:1094 tid=00007f6687c7c880 source=Master:1946 tident=<service> sec= uid=0 gid=0 name= geo="" eos view initialization failed after 90 seconds
200110 16:12:05 time=1578669125.781815 func=BootNamespace level=CRIT logid=5a53da88-33bb-11ea-85c5-0060dd4265f8 unit=mgm@ornl-eos-01.ornl.gov:1094 tid=00007f6687c7c880 source=Master:1949 tident=<service> sec= uid=0 gid=0 name= geo="" initialization returned ec=22 Not enough data to fulfil the request
File sizes:
[root@ornl-eos-01 md]# du -sh *.mdlog
28G directories.ornl-eos-01.ornl.gov.mdlog
28G files.ornl-eos-01.ornl.gov.mdlog
apeters
(Andreas Joachim Peters)
January 10, 2020, 3:53pm
4
Can you re-repair these files?
peby
(Pete Eby)
January 10, 2020, 5:03pm
5
Re-repaired, nothing reported this time:
[root@ornl-eos-01 md]# eos-log-repair directories.ornl-eos-01.ornl.gov.mdlog.headerFixed directories.ornl-eos-01.ornl.gov.mdlog
Header status: OK (version: 0x1, content: 0x2)
Elapsed time: 28 m. 19 s. Progress: 27.868 GB / 27.868 GB
Scanned: 227243986
Healthy: 227243986
Bytes total: 29923763608
Bytes accepted: 29923763608
Bytes discarded: 0
Not fixed: 0
Fixed (wrong magic): 0
Fixed (wrong checksum): 0
Fixed (wrong size): 0
Elapsed time: 28 m. 19 s.
eos@mgm still fails to start, with the same “ec=22 Not enough data to fulfil the request” error.
Any other necromancy to try?
Cheers,
Pete