We are running EOS in RAIN with 16 stripes (12+4) on 18 FSTs for archiving ALICE data. At the end of April we upgraded EOS from 4.8.31 (or thereabouts) to 4.8.82, the version from which fsck can be enabled. We enabled it initially but turned it off a while later because it generated much more traffic than we expected (the MGMs became unresponsive rather frequently).
More recently we enabled fsck again, after configuring LACP on both the switches and the servers, each of which has two 40G NICs. The internal bandwidth can therefore sustain up to 10 GB/s in both the IN and OUT directions, while the TPC traffic induced by fsck reaches around 5 GB/s in both directions.
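For reference, a minimal sketch of the server-side bond via NetworkManager; the connection and interface names (bond0, ens1f0/ens1f1) are placeholders for our actual NICs, and the switch side needs a matching LACP port-channel:
# 802.3ad (LACP) bond over the two 40G NICs; interface names are assumptions
nmcli con add type bond con-name bond0 ifname bond0 mode 802.3ad
nmcli con add type ethernet con-name bond0-p1 ifname ens1f0 master bond0
nmcli con add type ethernet con-name bond0-p2 ifname ens1f1 master bond0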
Anyway, for the moment, the snapshot of eos fsck stat looks as below:
Info: collection thread status → enabled
Info: repair thread status → enabled
220804 03:32:34 1659583954.006593 Start error collection
220804 03:32:34 1659583954.006693 Filesystems to check: 1512
220804 03:32:39 1659583959.309439 d_mem_sz_diff : 1710916
220804 03:32:39 1659583959.309506 orphans_n : 31
220804 03:32:39 1659583959.309530 rep_diff_n : 4589768
220804 03:32:39 1659583959.309551 rep_missing_n : 95201
220804 03:32:39 1659583959.309558 unreg_n : 936
220804 03:32:39 1659583959.309570 Finished error collection
220804 03:32:39 1659583959.309577 Next run in 30 minutes
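In case it is useful, the per-file details behind these counters can also be dumped with the report subcommand (the filtering options seem to vary between releases, so I am only showing the plain form):
# list the individual errors behind the fsck counters
eos fsck report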
I was just wondering whether these errors will eventually go away if we let fsck fix them, or whether any action is needed from the admins. d_mem_sz_diff looks ever-increasing, while rep_diff_n keeps decreasing slightly.
FYI, the number of orphans was originally in the millions, but it went down to a few tens after running eos fsck clean_orphans.
Please just let me know if you need more information about our instance.
You can control how many threads perform the fsck repair, and therefore limit the amount of activity that fsck generates in general, by using the eos fsck config max-thread-pool-size <value> command. You can see the current defaults by doing eos ns | grep fsck.
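For example, to lower the limit and then verify it (the value here is just an illustration):
# cap the fsck repair thread pool, then inspect the fsck counters
eos fsck config max-thread-pool-size 10
eos ns | grep fsck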
d_mem_sz_diff errors are not actually real errors, especially given the fact that you are using RAIN files. Therefore, my first suggestion would be to upgrade to the 4.8.88 release, since it includes some important fixes to fsck in general. Furthermore, the 4.8.88 release also introduces the possibility to enable fsck repair only for certain categories. I would enable it one category at a time, starting with rep_missing_n. Once these are fixed, the reported rep_diff_n errors should also decrease.
For the d_mem_sz_diff errors the “fix” is a bit more involved; let’s first tackle the other categories and then we can also look into this. Indeed, for the orphans you can regularly run the eos fsck clean_orphans command to clean them up.
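A minimal sketch of such a regular cleanup as a cron job, assuming the eos CLI is available to root on the MGM node; the file path, schedule, and log file are placeholders:
# /etc/cron.d/eos-clean-orphans (hypothetical): clean fsck orphans daily at 05:00
0 5 * * * root eos fsck clean_orphans >> /var/log/eos-clean-orphans.log 2>&1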
If you enable fsck repair and the rep_missing_n category still doesn’t go down, then let me know and I can have a look at some of these files.
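A small sample of the affected files would be enough, for instance just the first entries of the report (plain shell, nothing EOS-specific in the truncation):
# grab a sample of the reported errors to share
eos fsck report | head -n 20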
Thanks a lot for the suggestion and clarifications.
The current value of the fsck thread pool size is 20, as shown below:
ALL fsck info thread_pool=fsck min=2 max=20 size=20 queue_size=1001
ALL tracker info tracker=fsck size=1021
After running the command eos fsck config max-thread-pool-size 10, it gives the following:
ALL fsck info thread_pool=fsck min=2 max=10 size=20 queue_size=1000
ALL tracker info tracker=fsck size=1020
max has changed to 10, but size remains the same. Is that normal?
I will come back to you after updating EOS to 4.8.88 as you suggested, and will then try running fsck repair for certain categories.
By the way, what do you think about upgrading to EOS v5 with the current RAIN layout? Would it be safe, or should we install it from scratch?
size also changed to 10 after a while, and the traffic went down by half. Thanks!
Yes, the pool will scale down once the number of jobs in the queue decreases; this can take some time.
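If you want to watch it drain, polling the same counters is enough; a trivial sketch:
# re-check every 60 seconds until size matches the new max
watch -n 60 'eos ns | grep fsck'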
Yes, I would strongly recommend upgrading to EOS 5. All our physics instances at CERN are now on EOS 5. We also have a couple of production instances with EOS 5 and RAIN, so we are in a better position to support this use case. I think the upcoming 5.0.30 is a very good candidate for the upgrade.
Thanks a lot, Elvin, for the recommendation. I look forward to the release; it will definitely be very helpful to know when 5.0.30 becomes available.
Send me privately an email address where you would like to be notified when new releases are out, and I can add it to our existing distribution list.