Eos-log-repair content flag mismatch

peby · January 9, 2020, 2:01am

eos community,

eos@mgm service was failing to start, reporting an issue with the directories md file:

ident=<service> sec= uid=0 gid=0 name= geo="" initialization returned ec=14 Unrecognized file type: /var/eos/md/directories.ornl-eos-01.ornl.gov.mdlog

Running file on the directories*.mdlog reported it as type gzip rather than data.

Backed up the existing md files and ran eos-log-repair (output below). After the repair the file type is reported as ‘data’; however, eos now fails to start with:

d ec=14 Log file exists: /var/eos/md/directories.ornl-eos-01.ornl.gov.mdlog and the requested content flag (0x2) does not match the one read from file (0x0)

Any suggestions are appreciated.

Cheers,
Pete

eos-log-repair output:

[root@ornl-eos-01 md]# eos-log-repair /var/eos/md/directories.$HOSTNAME.mdlog.tmp /var/eos/md/directories.$HOSTNAME.mdlog && eos-log-repair /var/eos/md/files.$HOSTNAME.mdlog.tmp /var/eos/md/files.$HOSTNAME.mdlog
Header status: broken (Unrecognized file type: /var/eos/md/directories.ornl-eos-01.ornl.gov.mdlog.tmp)
error: discarded block from offset [ 8 <=> 208 ] [ len=512 ]
Elapsed time: 26 m. 27 s. Progress: 26.791 GB / 27.868 GB
error: discarded block from offset [ 6b398dc68 <=> 6b398f800 ] [ len=7064 ]
Elapsed time: 27 m. 17 s. Progress: 27.868 GB / 27.868 GB
Scanned: 227243988
Healthy: 227243986
Bytes total: 29923771184
Bytes accepted: 29923763608
Bytes discarded: 7576
Not fixed: 2
Fixed (wrong magic): 0
Fixed (wrong checksum): 0
Fixed (wrong size): 0
Elapsed time: 27 m. 17 s.

Header status: OK (version: 0x1, content: 0x1)
Elapsed time: 28 m. 57 s. Progress: 27.819 GB / 27.819 GB
Scanned: 228458001
Healthy: 228458001
Bytes total: 29871443796
Bytes accepted: 29871443796
Bytes discarded: 0
Not fixed: 0
Fixed (wrong magic): 0
Fixed (wrong checksum): 0
Fixed (wrong size): 0
Elapsed time: 28 m. 57 s.

apeters · January 10, 2020, 10:01am

You have file corruption inside the header.
Have a look with
od -x directories.mdlog | less

and the first 8 bytes have to look like:
0000000 4847 4543 0201 0100

and probably you have 00?? instead of 02?? …

If you fix that, it will boot …
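
For anyone who wants to check the same thing without od, here is a minimal Python sketch that decodes those first 8 bytes. The layout is an assumption read off this thread, not taken from the EOS sources: 4 magic bytes, then a one-byte version and a one-byte content flag (the two values eos-log-repair prints in its "Header status" line), then 2 bytes whose meaning is not stated here. Note that od -x prints 16-bit words in host byte order, so on a little-endian machine "4847 4543 0201" corresponds to the bytes 47 48 43 45 01 02 on disk. The names decode_header and read_header are made up for illustration.

```python
import struct

# Assumption: magic bytes as seen in the od output above (little-endian host).
MAGIC_BYTES = bytes.fromhex("47484345")


def decode_header(first8: bytes) -> dict:
    """Split the first 8 bytes of an mdlog file into its assumed fields."""
    if len(first8) < 8:
        raise ValueError("need at least 8 bytes of header")
    # 4s = magic, B = version, B = content flag, H = remaining 2 bytes (meaning unknown)
    magic, version, content, rest = struct.unpack("<4sBBH", first8[:8])
    return {
        "magic_ok": magic == MAGIC_BYTES,
        "version": version,
        "content": content,
        "rest": rest,
    }


def read_header(path: str) -> dict:
    """Read and decode the header of the file at path (hypothetical helper)."""
    with open(path, "rb") as f:
        return decode_header(f.read(8))
```

If the assumed layout holds, a directories log that boots should decode to content 2 and a files log to content 1, which is what the repair tool reports for healthy headers in this thread.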

Cheers Andreas.

peby · January 10, 2020, 3:41pm

Hi Andreas,

Thanks for the hint. Edited that octet (which was 00 00) with hexedit; it is now:

# hexdump -n8 directories.ornl-eos-01.ornl.gov.mdlog 
0000000 4847 4543 0201 0000
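
The same one-byte edit can be scripted instead of done by hand in hexedit. This is only a sketch under assumptions: the content flag is taken to live at byte offset 5, which is inferred from the dump above and from the version/content values eos-log-repair prints, and patch_content_flag is a hypothetical helper, not an EOS tool. It works on a copy so the original file is left untouched.

```python
import shutil

# Assumption: content flag is the single byte at offset 5 (after 4 magic bytes and 1 version byte).
CONTENT_FLAG_OFFSET = 5


def patch_content_flag(src: str, dst: str, flag: int) -> None:
    """Copy src to dst, then overwrite the assumed content-flag byte in dst."""
    if not 0 <= flag <= 0xFF:
        raise ValueError("flag must fit in one byte")
    shutil.copyfile(src, dst)  # never modify the original in place
    with open(dst, "r+b") as f:
        f.seek(CONTENT_FLAG_OFFSET)
        f.write(bytes([flag]))
```

For a directories log this would be called with flag 0x02, the value the MGM requested in the error message earlier in the thread.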

Restart now scans the file okay, but then boot fails with:

PROGRESS [ scan directories.ornl-eos-01.ornl.gov.mdlog                           ] 98% estimate 1.8s [ 89s/91s ]
ALERT    [ directories.ornl-eos-01.ornl.gov.mdlog                           ] finished in 90s
200110 16:12:05 time=1578669125.773108 func=BootNamespace            level=CRIT  logid=5a53da88-33bb-11ea-85c5-0060dd4265f8 unit=mgm@ornl-eos-01.ornl.gov:1094 tid=00007f6687c7c880 source=Master:1946                    tident=<service> sec=      uid=0 gid=0 name= geo="" eos view initialization failed after 90 seconds
200110 16:12:05 time=1578669125.781815 func=BootNamespace            level=CRIT  logid=5a53da88-33bb-11ea-85c5-0060dd4265f8 unit=mgm@ornl-eos-01.ornl.gov:1094 tid=00007f6687c7c880 source=Master:1949                    tident=<service> sec=      uid=0 gid=0 name= geo="" initialization returned ec=22 Not enough data to fulfil the request

File sizes:

[root@ornl-eos-01 md]# du -sh *.mdlog
28G     directories.ornl-eos-01.ornl.gov.mdlog
28G     files.ornl-eos-01.ornl.gov.mdlog

apeters · January 10, 2020, 3:53pm

Can you re-repair these files?

peby · January 10, 2020, 5:03pm

Re-repaired; nothing reported this time:

[root@ornl-eos-01 md]# eos-log-repair directories.ornl-eos-01.ornl.gov.mdlog.headerFixed directories.ornl-eos-01.ornl.gov.mdlog
Header status: OK (version: 0x1, content: 0x2)
Elapsed time: 28 m. 19 s. Progress: 27.868 GB / 27.868 GB  
Scanned:                227243986
Healthy:                227243986
Bytes total:            29923763608
Bytes accepted:         29923763608
Bytes discarded:        0
Not fixed:              0
Fixed (wrong magic):    0
Fixed (wrong checksum): 0
Fixed (wrong size):     0
Elapsed time:           28 m. 19 s.

eos@mgm still fails to start, with the same “ec=22 Not enough data to fulfil the request” error.

Any other necromancy to try?

Cheers,
Pete
