Files mdlog compaction failed

Dear Experts,

I have encountered the following errors while trying to compact files.$HOSTNAME.mdlog in /var/eos/md/.

[root@alice-t1-eos-mgm01 ~]# tail -f /var/log/eos/mgm/xrdlog.mgm | grep -i compact
200320 11:30:13 time=1584671413.208608 func=Compacting               level=ALERT logid=fec9a8de-6a4f-11ea-83a3-001a4a615c30 unit=mgm@alice-t1-eos-mgm01.sdfarm.kr:1094 tid=00007f347dbff700 source=Master:663                     tident=<service> sec=      uid=0 gid=0 name= geo="" msg="online-compacting running"        
200320 11:30:13 time=1584671413.208865 func=Compacting               level=NOTE  logid=fec9a8de-6a4f-11ea-83a3-001a4a615c30 unit=mgm@alice-t1-eos-mgm01.sdfarm.kr:1094 tid=00007f347dbff700 source=Master:665                     tident=<service> sec=      uid=0 gid=0 name= geo="" msg="starting online compaction"       
200320 11:31:06 time=1584671466.280476 func=Compacting               level=ALERT logid=fec9a8de-6a4f-11ea-83a3-001a4a615c30 unit=mgm@alice-t1-eos-mgm01.sdfarm.kr:1094 tid=00007f347dbff700 source=Master:838                     tident=<service> sec=      uid=0 gid=0 name= geo="" msg="compact done"                     
200320 13:17:06 time=1584677826.296367 func=Compacting               level=ALERT logid=fec9a8de-6a4f-11ea-83a3-001a4a615c30 unit=mgm@alice-t1-eos-mgm01.sdfarm.kr:1094 tid=00007f347dbff700 source=Master:663                     tident=<service> sec=      uid=0 gid=0 name= geo="" msg="online-compacting running"        
200320 13:17:06 time=1584677826.296560 func=Compacting               level=NOTE  logid=fec9a8de-6a4f-11ea-83a3-001a4a615c30 unit=mgm@alice-t1-eos-mgm01.sdfarm.kr:1094 tid=00007f347dbff700 source=Master:665                     tident=<service> sec=      uid=0 gid=0 name= geo="" msg="starting online compaction"       
200320 13:36:09 time=1584678969.998252 func=Compacting               level=CRIT  logid=fec9a8de-6a4f-11ea-83a3-001a4a615c30 unit=mgm@alice-t1-eos-mgm01.sdfarm.kr:1094 tid=00007f347dbff700 source=Master:832                     tident=<service> sec=      uid=0 gid=0 name= geo="" online-compacting returned ec=5 error: Changelog file has corruption - autorepair is disabled
200320 13:36:11 time=1584678971.008025 func=Compacting               level=CRIT  logid=fec9a8de-6a4f-11ea-83a3-001a4a615c30 unit=mgm@alice-t1-eos-mgm01.sdfarm.kr:1094 tid=00007f347dbff700 source=Master:872                     tident=<service> sec=      uid=0 gid=0 name= geo="" failed online compactification

Before trying compaction, I stopped all EOS MGM services (mgm, mq and sync) and then repaired the mdlogs of both files and directories. The repair went fine, as you can see below.

[root@alice-t1-eos-mgm01 md]# eos-log-repair files.alice-t1-eos-mgm01.sdfarm.kr.mdlog.tmp files.alice-t1-eos-mgm01.sdfarm.kr.mdlog                                                                                                                                                                                           
Header status: OK (version: 0x1, content: 0x1)                                                                                                                                                                                                                                                                               
Elapsed time: 83 m. 58 s. Progress: 42.489 GB / 42.489 GB                                                                                                                                                                                                                                                                    
Scanned:                336537874                                                                                                                                                                                                                                                                                            
Healthy:                336537874                                                                                                                                                                                                                                                                                            
Bytes total:            45623333224                                                                                                                                                                                                                                                                                          
Bytes accepted:         45623333224                                                                                                                                                                                                                                                                                           
Bytes discarded:        0                                                                                                                                                                                                                                                                                                     
Not fixed:              0                                                                                                                                                                                                                                                                                                     
Fixed (wrong magic):    0                                                                                                                                                                                                                                                                                                     
Fixed (wrong checksum): 0                                                                                                                                                                                                                                                                                                     
Fixed (wrong size):     0                                                                                                                                                                                                                                                                                                     
Elapsed time:           83 m. 58 s. 

[root@alice-t1-eos-mgm01 md]# eos-log-repair directories.alice-t1-eos-mgm01.sdfarm.kr.mdlog.tmp directories.alice-t1-eos-mgm01.sdfarm.kr.mdlog                                                                                                                                                                               
Header status: OK (version: 0x1, content: 0x2)                                                                                                                                                                                                                                                                               
Elapsed time: 73 m. 47 s. Progress: 42.505 GB / 42.505 GB                                                                                                                                                                                                                                                                     
Scanned:                271753438                                                                                                                                                                                                                                                                                             
Healthy:                271753438                                                                                                                                                                                                                                                                                             
Bytes total:            45641029140                                                                                                                                                                                                                                                                                           
Bytes accepted:         45641029140                                                                                                                                                                                                                                                                                           
Bytes discarded:        0                                                                                                                                                                                                                                                                                                     
Not fixed:              0                                                                                                                                                                                                                                                                                                     
Fixed (wrong magic):    0                                                                                                                                                                                                                                                                                                     
Fixed (wrong checksum): 0                                                                                                                                                                                                                                                                                                     
Fixed (wrong size):     0                                                                                                                                                                                                                                                                                                    
Elapsed time:           73 m. 47 s. 

As shown above, compaction of the directories mdlog completed, but compaction of the files mdlog failed, even though the repair found nothing suspicious.
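For anyone who wants to check a saved repair summary automatically rather than by eye, here is a minimal sketch. It assumes the eos-log-repair output has been captured to a text file in the same layout as the transcripts above; `repair_summary_clean` is a made-up helper name, not part of EOS.

```shell
# Hypothetical helper (not an EOS tool): reads an eos-log-repair summary on
# stdin and succeeds only if every scanned record was healthy and nothing
# was discarded, fixed, or left unfixed.
repair_summary_clean() {
    awk -F': *' '
        /^Scanned:/         { scanned = $2 + 0; seen_scanned = 1 }
        /^Healthy:/         { healthy = $2 + 0; seen_healthy = 1 }
        /^Bytes discarded:/ { bad += $2 + 0 }
        /^Not fixed:/       { bad += $2 + 0 }
        /^Fixed \(/         { bad += $2 + 0 }
        END {
            if (!seen_scanned || !seen_healthy) exit 2  # no summary found
            if (scanned != healthy || bad != 0) exit 1  # something was off
            exit 0
        }'
}
```

Usage would look like `repair_summary_clean < repair-files.log && echo clean`, where `repair-files.log` is whatever file the summary was saved to. A clean summary only says the records passed the repair tool's checks; as this thread shows, compaction can still complain afterwards.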

By the way, MGM booting is OK and the instance is working fine. Do you have any idea on this?

I plan to convert this EOS instance, which is (still!) using the in-memory namespace, to one with QuarkDB. I would just like to make sure that everything is OK before the conversion.

Thank you in advance.

Best regards,
Sang-Un

Hi Sang-Un,

I guess this is no longer relevant since you switched to QuarkDB, right?

Cheers,
Elvin

Hi Elvin,

Unfortunately not. We are testing another EOS instance with QuarkDB (for archiving purposes), but this one is a production instance for ALICE Disk. I do want to switch it from the in-memory namespace to QuarkDB, and I want to make sure everything is fine beforehand. Any suggestions?

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

Can you please retry these operations using the following sequence:

  • repair both file and directory changelogs using the eos-log-repair tool
  • do an offline compaction of both changelogs using the eos-log-compact tool

And then post the output. If this is successful, you can stop your service and redo the procedure to update the changelog - apparently online compaction is not possible.
For the QuarkDB migration you will need a properly compacted version of the changelogs for the conversion to work.
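The two steps above could be chained roughly as in the sketch below. Treat it as an illustration under assumptions: it assumes eos-log-compact takes `<input> <output>` arguments the same way eos-log-repair does in the transcripts earlier in this thread, and the `.repaired` / `.compacted` suffixes, the `REPAIR` / `COMPACT` override variables and the function name are all made up for the example.

```shell
# Sketch only: repair one changelog, then compact the repaired copy offline.
# Assumption: eos-log-compact is called as "<input> <output>", like
# eos-log-repair in the transcripts above. REPAIR and COMPACT are
# hypothetical overrides so the tools can be swapped out for a dry run.
repair_and_compact() {
    changelog=$1
    # Step 1: write a repaired copy next to the original.
    "${REPAIR:-eos-log-repair}" "$changelog" "$changelog.repaired" || return 1
    # Step 2: compact the repaired copy; stop here if the repair failed.
    "${COMPACT:-eos-log-compact}" "$changelog.repaired" "$changelog.compacted" || return 1
}
```

It would be called once per changelog, for example `repair_and_compact files.$HOSTNAME.mdlog` and then the same for the directories changelog, from /var/eos/md/. The original file is left untouched, so enough free disk space for two extra copies of each changelog is needed.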

Cheers,
Elvin

Hi Elvin,

Thanks a lot for the suggestions. One question regarding the procedure: should I stop EOS services before starting repair steps?

Last time I did the repair offline, meaning I stopped all EOS-related services and then ran it. I am asking because stopping EOS requires a scheduled downtime to maintain site reliability.

Best regards,
Sang-Un

Hi Sang-Un,

Yes, the service should be stopped. Otherwise entries would be added to or removed from the changelog while you work, so the newly repaired and compacted changelog would not be a reflection of the reality that you could use later on when restarting the service.
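One way to guard against starting too early is to check that no process still has the changelog open before touching it. This is a rough sketch, not an EOS feature: it relies on fuser(1) from psmisc, which exits 0 when at least one process is accessing the file, and the function names and the `FUSER` override are made up for illustration.

```shell
# Returns 0 (true) if some process still has the given file open.
# FUSER is a hypothetical override so the check can be stubbed; by default
# this calls fuser(1), which exits 0 when the file is in use.
changelog_in_use() {
    "${FUSER:-fuser}" "$1" >/dev/null 2>&1
}

# Refuse to start offline work while the changelog is still open somewhere.
require_stopped() {
    if changelog_in_use "$1"; then
        echo "refusing: $1 is still open by a running process" >&2
        return 1
    fi
}
```

For example, `require_stopped /var/eos/md/files.$HOSTNAME.mdlog && echo "safe to repair"`. Run it as root so that fuser can see processes owned by other users.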

Cheers,
Elvin

Hi Elvin,

Thanks a lot for the help. I will follow the procedure during the scheduled downtime and come back to you if there are any issues.

Best regards,
Sang-Un