While we’re not CERN, there are some tunings you may find useful, mostly to do with the higher latency between our FSTs (a consequence of our geographic distribution) and with the artifacts of using SMR storage media for lots of small files. 96% of our data is below 10MB in size, and 92% is below 1MB.
If you haven’t seen them, Ted Ts’o and others have written many papers on the performance of Linux with certain write patterns on SMR media; https://www.usenix.org/conference/fast17/technical-sessions/presentation/aghayev is one he co-authored with Abutalib Aghayev, Garth Gibson, and Peter Desnoyers. We came across these issues ourselves, and confirmed them, during an encryption-at-rest conversion of some older FSTs, particularly while unpacking disk images consisting of many small files back onto disk.
Standard TCP buffer tunings for 65+ ms RTT at > 10Gbps:
net.ipv4.tcp_rmem=4096 262144 268431360
net.ipv4.tcp_wmem=4096 131072 134215680
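As a rough sanity check on those maximums: the bandwidth-delay product for this path works out to about 81 MB, so the ~128 MB write and ~256 MB read ceilings above leave headroom for retransmits and longer RTTs.

```shell
# Bandwidth-delay product: bytes in flight = (bits/sec * RTT in sec) / 8
# 10 Gbps at 65 ms:
echo $(( 10 * 1000 * 1000 * 1000 * 65 / 1000 / 8 ))   # 81250000 bytes, ~81 MB
```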
We do use HTCP in some cases, but we prefer to run the elrepo mainline kernels, as they include BBR and more efficient IO and TCP stacks. Across distance, BBR achieves far higher goodput and ramps up far faster.
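Enabling BBR is the usual pair of sysctls (these are the standard settings, not necessarily our exact config):

```
# BBR (v1) is designed to pair with the fq qdisc for packet pacing;
# newer kernels can pace internally, but fq is the safe default.
net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr
```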
We found some switches try to do ECN at L3 when they’re supposed to be operating at L2, so we play it safe and tell the host to ignore ECN entirely.
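Disabling ECN is a single sysctl (0 means never request or accept ECN):

```
net.ipv4.tcp_ecn=0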
Ensure we have enough sockets when things get busy.
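The exact numbers depend on your load; the knobs involved are typically along these lines (values here are illustrative, not necessarily ours):

```
# Listen and SYN backlogs for connection bursts
net.core.somaxconn=4096
net.ipv4.tcp_max_syn_backlog=8192
# Frames queued from the NIC before the kernel starts dropping
net.core.netdev_max_backlog=30000
# More ephemeral ports, and reuse TIME_WAIT sockets for outbound connections
net.ipv4.ip_local_port_range=10000 65535
net.ipv4.tcp_tw_reuse=1
```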
Another performance item: we use 802.1AX (LACP) bonding in some places, so we tune TCP to tolerate out-of-order packets.
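Tolerance for reordering mostly comes down to not treating early out-of-order delivery as loss; the relevant knob is tcp_reordering (kernel default is 3, value below is illustrative):

```
# Initial number of out-of-order segments tolerated before TCP assumes loss
net.ipv4.tcp_reordering=10
```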
Per https://www.kernel.org/doc/Documentation/sysctl/vm.txt, we ensure dirty pages are written out as soon as we come under RAM pressure, rather than letting them accumulate until the kernel starts swapping or simply runs out of memory.
Keep these numbers low, as they’re a percentage of RAM. If the SMR media can’t write lots of files out quickly enough, we start seeing IO timeout reports in the kernel log, and the FSTs flap according to both monitoring and EOS. Each of our FSTs has 128GB of RAM; we force the kernel to start flushing to disk at 1% of RAM, and force IO to wait at 3%.
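With 128GB of RAM, those two percentages translate directly into the ratio sysctls:

```
# Start background writeback at 1% of RAM dirty (~1.3GB of 128GB)
vm.dirty_background_ratio=1
# Block writers (force IO to wait) once 3% of RAM is dirty (~3.8GB)
vm.dirty_ratio=3
```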
Set vm.min_free_kbytes to avoid fragmentation issues and to ensure we don’t page to disk excessively in some cases.
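The value is machine-specific; the figure below is purely illustrative (~1GB reserved), not our setting:

```
# Reserve free pages so atomic and network allocations don't fail
# under fragmentation; tune to your RAM size
vm.min_free_kbytes=1048576
```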
We also use XFS as our filesystem, mounted with a handful of non-default options.
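Our exact option set isn’t reproduced here, but a small-file-heavy XFS data mount often looks something like this (illustrative only; device, mount point, and options are assumptions, not our config):

```
# /etc/fstab - illustrative XFS mount for an FST data disk
/dev/sdb1  /data01  xfs  noatime,inode64,logbsize=256k  0 0
```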
We run our FST metadata on SSD, although the metadata isn’t a large IO problem by any stretch of the imagination.
As for ulimits: since serving files is all our FSTs do, we permit 100k processes (nproc) and up to 1 million open file handles.
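In limits.conf terms that comes out to something like the following (the `eos` user name is an assumption for illustration):

```
# /etc/security/limits.d/90-eos.conf - user name is illustrative
eos  soft  nproc   100000
eos  hard  nproc   100000
eos  soft  nofile  1000000
eos  hard  nofile  1000000
```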