Hi Dan,
While we’re not CERN, there are some tunings you may find useful, mostly to do with the higher latency between FSTs from our geographic distribution, and with artifacts of using SMR storage media for lots of small files. 96% of our data is below 10 MB in size, and 92% is below 1 MB.
If you haven’t seen them, Ted Ts’o and others have written several papers on Linux write performance on SMR media; “Evolving Ext4 for Shingled Disks” (USENIX) is one he co-authored with Abutalib Aghayev, Garth Gibson, and Peter Desnoyers. We came across these issues and confirmed them during an encryption-at-rest conversion of some older FSTs, particularly when we were unpacking disk images onto disk and the data consisted of many small files.
Standard TCP tunings for 65+ ms RTT at >10 Gbps:
net.ipv4.tcp_syncookies=1
net.ipv4.ipfrag_high_thresh=4194303
net.ipv4.ipfrag_low_thresh=1048575
net.core.rmem_max=268431360
net.core.wmem_max=134215680
net.core.rmem_default=262144
net.core.wmem_default=131072
net.core.somaxconn=16384
net.ipv4.tcp_rmem=4096 262144 268431360
net.ipv4.tcp_wmem=4096 131072 134215680
net.ipv4.tcp_adv_win_scale=1
net.core.netdev_max_backlog=30000
net.ipv4.tcp_max_syn_backlog=30000
net.ipv4.tcp_no_metrics_save=0
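If it helps, we keep settings like these in a sysctl drop-in so they survive reboots. A sketch (the file name is a hypothetical example following the standard systemd convention, not our exact layout):

```shell
# /etc/sysctl.d/90-fst-tcp.conf -- hypothetical file name; the values are
# the ones listed above. Apply with: sysctl --system
net.core.rmem_max=268431360
net.core.wmem_max=134215680
net.ipv4.tcp_rmem=4096 262144 268431360
net.ipv4.tcp_wmem=4096 131072 134215680
# ...remaining net.* settings from the list above...
```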
We do use HTCP in some cases, but we prefer to run the ELRepo mainline kernels, as they have BBR and more efficient IO and TCP stacks. BBR across distance gives far higher goodput and a far faster ramp-up.
net.ipv4.tcp_congestion_control=bbr
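One caveat: on kernels where BBR is built as a module, tcp_bbr has to be loaded before that sysctl takes effect. A sketch (the modules-load.d path is the standard systemd convention, not something copied from our config):

```shell
# Load tcp_bbr at boot (hypothetical drop-in path), then verify availability.
echo "tcp_bbr" > /etc/modules-load.d/tcp_bbr.conf
modprobe tcp_bbr
sysctl net.ipv4.tcp_available_congestion_control   # bbr should be listed
```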
We found some switches try to do ECN at L3 when they’re supposed to be operating at L2, so we play it safe and tell the host to ignore ECN entirely:
net.ipv4.tcp_ecn=0
Ensure we have enough sockets when things get busy:
net.ipv4.ip_local_port_range=10000 64512
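As a quick sanity check, the arithmetic on that range (purely illustrative, nothing beyond the values above):

```shell
# Count the ephemeral ports made available by the port range above.
port_low=10000
port_high=64512
port_count=$((port_high - port_low + 1))
echo "$port_count ephemeral ports"   # 54513
```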
Another performance tweak: we use 802.1AX link aggregation with LACP in some places, and we can handle out-of-order packets:
net.ipv4.tcp_max_reordering=3000
net.ipv4.tcp_reordering=300
https://www.kernel.org/doc/Documentation/sysctl/vm.txt - ensure we write dirty pages out to disk under RAM pressure, rather than trying to page them out or simply running out of memory:
vm.zone_reclaim_mode=3
Keep these numbers low, as they’re a percentage of RAM. If the SMR media can’t write lots of files out quickly enough, we start seeing IO timeout reports in the kernel output, and we start seeing the FSTs flap according to both monitoring and EOS. Each of our FSTs has 128 GB of RAM, and we force the kernel to start flushing to disk at 1% of RAM and force IO to wait at 3%:
vm.dirty_background_ratio=1
vm.dirty_ratio=3
vm.dirty_expire_centisecs=250
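To see what those percentages mean in absolute terms on one of our 128 GB hosts, a quick back-of-the-envelope (just integer arithmetic on the ratios above, nothing measured):

```shell
# Convert the dirty ratios above into absolute bytes for a 128 GiB FST,
# to sanity-check how much dirty data the kernel may buffer.
ram_bytes=$((128 * 1024 * 1024 * 1024))   # 128 GiB of RAM
bg_bytes=$((ram_bytes * 1 / 100))         # vm.dirty_background_ratio=1
max_bytes=$((ram_bytes * 3 / 100))        # vm.dirty_ratio=3
echo "background flush starts at: $((bg_bytes / 1024 / 1024)) MiB"
echo "writers blocked at: $((max_bytes / 1024 / 1024)) MiB"
```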
Set vm.min_free_kbytes to stop fragmentation issues and to ensure we don’t page to disk excessively in some cases.
vm.min_free_kbytes=1048576
We also use XFS as our filesystem, and we mount with the options:
noatime,nodiratime,swalloc,attr2,inode64,logbufs=8,logbsize=256k,noquota
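For reference, a mount of that shape in /etc/fstab might look like the following (the device path and mount point are placeholders, not our real layout):

```shell
# /etc/fstab entry -- /dev/sdX1 and /srv/fst are hypothetical placeholders
/dev/sdX1  /srv/fst  xfs  noatime,nodiratime,swalloc,attr2,inode64,logbufs=8,logbsize=256k,noquota  0 0
```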
We run our FST metadata on SSD, although the metadata isn’t a large IO problem by any stretch of the imagination.
As for ulimits, we permit 100k nprocs on our FSTs as that’s all they do, and up to 1 million file handles.
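Those limits can be persisted with a limits.d drop-in; a sketch (the file name and the wildcard user are hypothetical, scope it to your daemon user as appropriate):

```shell
# /etc/security/limits.d/90-fst.conf -- hypothetical path;
# 100k processes and up to 1M open file handles, per the limits above.
*  soft  nproc   100000
*  hard  nproc   100000
*  soft  nofile  1000000
*  hard  nofile  1000000
```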