Some questions regarding fsck stat

Dear Experts,

We are running EOS in RAIN with 16 stripes (12+4) on 18 FSTs for archiving ALICE data. At the end of April, we upgraded EOS to 4.8.82 (from 4.8.31 or so), which is the version in which fsck can be enabled. We enabled it initially, but turned it off a while later because it seemed to generate much more traffic than we expected (the MGMs became unresponsive rather frequently).

More recently we enabled fsck again, after configuring LACP on both the switches and the servers, each with two 40G NICs. The internal bandwidth can now sustain up to 10 GB/s in both the IN and OUT directions, while the traffic generated by the TPC transfers that fsck induces reaches around 5 GB/s in both directions.

Anyway, for the moment, a snapshot of eos fsck stat looks like this:

Info: collection thread status → enabled
Info: repair thread status → enabled
220804 03:32:34 1659583954.006593 Start error collection
220804 03:32:34 1659583954.006693 Filesystems to check: 1512
220804 03:32:39 1659583959.309439 d_mem_sz_diff : 1710916
220804 03:32:39 1659583959.309506 orphans_n : 31
220804 03:32:39 1659583959.309530 rep_diff_n : 4589768
220804 03:32:39 1659583959.309551 rep_missing_n : 95201
220804 03:32:39 1659583959.309558 unreg_n : 936
220804 03:32:39 1659583959.309570 Finished error collection
220804 03:32:39 1659583959.309577 Next run in 30 minutes

I was just wondering whether these will eventually go away if we let fsck fix them, or whether any action needs to be taken by the admins. d_mem_sz_diff looks ever-increasing, while rep_diff_n keeps decreasing slightly.

FYI, the number of orphans was originally in the millions, but it went down to a few tens after running the command eos fsck clean_orphans.

Please just let me know if you need more information about our instance.

Thank you.

Best regards,
Sang-Un

Hi Sang-Un,

One can control how many threads are doing the fsck repair, and therefore limit the amount of activity generated by fsck in general, by using the eos fsck config max-thread-pool-size <value> command. You can see the current values by running eos ns | grep fsck.
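For reference, a minimal command sequence for this (using only the commands mentioned above; the value 10 is just an example) would be:

# check the current fsck thread pool settings
eos ns | grep fsck

# cap the fsck repair thread pool at 10 threads
eos fsck config max-thread-pool-size 10

# verify the change; the max value should now reflect the new limit
eos ns | grep fsck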

The d_mem_sz_diff errors are not actually real errors, especially given that you are using RAIN files. Therefore, my first suggestion would be to upgrade to the 4.8.88 release, since it includes some important fixes to fsck in general. Furthermore, the 4.8.88 release also introduces the possibility to enable fsck repair only for certain categories. I would enable it starting with the unreg_n and rep_missing_n categories, one category at a time. Once these are fixed, the reported rep_diff_n errors should also decrease.
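As a rough sketch of that workflow (the exact syntax for selecting a repair category may differ between releases, so treat the category argument below as an assumption and check the eos fsck help output on your version):

# enable the fsck repair thread
eos fsck config toggle-repair

# HYPOTHETICAL: restrict repair to one category at a time, e.g. unreg_n first,
# then rep_missing_n; the actual option name may differ in your release
eos fsck config toggle-repair unreg_n

# watch the per-category counters over the next collection cycles
eos fsck stat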

Concerning the d_mem_sz_diff errors, the “fix” is a bit more involved. Let’s first tackle the other categories and then we can also look into this. Indeed, for the orphans, you can regularly run the eos fsck clean_orphans command to clean them up.
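If you want to keep the orphan count low without manual intervention, one purely illustrative option (not something EOS ships; the schedule and log path below are made up) is a periodic cron entry on the MGM node:

# hypothetical cron entry: clean up orphans once a day at 02:00
0 2 * * * root eos fsck clean_orphans >> /var/log/eos-fsck-clean-orphans.log 2>&1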

If you enable fsck repair and the unreg_n and rep_missing_n categories still don’t go down, let me know and I can have a look at some of these files.

Cheers,
Elvin

Hi Elvin,

Thanks a lot for the suggestion and clarifications.

The current value for the fsck thread pool size is 20, as shown below:

ALL fsck info thread_pool=fsck min=2 max=20 size=20 queue_size=1001
ALL tracker info tracker=fsck size=1021

After running the command eos fsck config max-thread-pool-size 10, it shows the following:

ALL fsck info thread_pool=fsck min=2 max=10 size=20 queue_size=1000
ALL tracker info tracker=fsck size=1020

max has changed to 10, but size remains the same. Is that normal?

I will come back to you after updating EOS to 4.8.88 as you suggested, and then try running fsck repair for specific categories.

By the way, what do you think about upgrading to EOS v5 with the current RAIN layout? Would it be safe, or should we install it from scratch?

Thank you.

Best regards,
Sang-Un

size also changed to 10 after a while, and the traffic went down by half. Thanks!

Hi Sang-Un,

Yes, the pool will scale down once the number of jobs in the queue decreases; this can take some time.
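If you want to follow the scale-down, one simple approach (reusing the eos ns | grep fsck command from above) is to poll the fsck thread pool line until size drops to the new max:

# re-check the fsck thread pool every 60 seconds
watch -n 60 'eos ns | grep fsck'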

Yes, I would strongly recommend upgrading to EOS 5. All our physics instances at CERN are now on EOS 5. We also have a couple of production instances with EOS 5 and RAIN, so we are in a better position to support this use case. I think the upcoming 5.0.30 is a very good candidate for the upgrade.

Cheers,
Elvin

Thanks, Elvin, for the recommendation. I look forward to it coming out soon. It would definitely be very helpful to know when 5.0.30 becomes available.

Best regards,
Sang-Un

Send me privately an email address where you would like to be notified when new releases are out, and I can add it to our existing distribution list.

Cheers,
Elvin

Hi @esindril again,

This is just a follow-up: I would like to post the current status of fsck after we successfully upgraded our instance from v4 to v5, more precisely to v5.1.22 (thanks for adding me to the distribution list :slight_smile: ).

The fsck stat now looks far better than what we had, as posted earlier in this thread.

Info: collection thread status → enabled
Info: repair thread status → enabled
Info: repair category → all
230623 00:06:26 1687478786.648484 Start error collection
230623 00:06:26 1687478786.648545 Filesystems to check: 1512
230623 00:06:28 1687478788.196974 d_mem_sz_diff : 383
230623 00:06:28 1687478788.197041 orphans_n : 640
230623 00:06:28 1687478788.197052 rep_diff_n : 220
230623 00:06:28 1687478788.197059 rep_missing_n : 118103
230623 00:06:28 1687478788.197064 unreg_n : 15
230623 00:06:28 1687478788.197071 Finished error collection
230623 00:06:28 1687478788.197076 Next run in 30 minutes

The initial data just after the upgrade looked like this:

Info: collection thread status → enabled
Info: repair thread status → enabled
Info: repair category → all
230612 04:26:59 1686544019.267555 Start error collection
230612 04:26:59 1686544019.267599 Filesystems to check: 1512
230612 04:27:00 1686544020.620404 blockxs_err : 13
230612 04:27:00 1686544020.620487 d_mem_sz_diff : 754815
230612 04:27:00 1686544020.620532 orphans_n : 642
230612 04:27:00 1686544020.620543 rep_diff_n : 11019
230612 04:27:00 1686544020.620562 rep_missing_n : 919207
230612 04:27:00 1686544020.620578 unreg_n : 453637
230612 04:27:00 1686544020.620584 Finished error collection
230612 04:27:00 1686544020.620589 Next run in 30 minutes

As you can see, the numbers in this stat report have decreased dramatically over the last ten days. rep_missing_n still looks high, but it keeps decreasing slowly and steadily. The performance of fsck in EOS v5 compared to v4 looks great and makes the instance more reliable.

Thanks a lot to you and your team for the great work.

Best regards,
Sang-Un