EOS Server Tuning

I would be interested in hearing from the CERN EOS team members on what, if any, specific server tuning they do for MGM and FST nodes.

I’m specifically thinking about:

  • ulimit settings like file descriptors
  • TCP parameters like txqueuelen
  • Any TCP/IP or other settings that are tuned via sysctl
  • Any other server tuning that would be relevant to the type of node (FST/MGM)

If someone there wanted to run down what kind of server tuning the CERN team does when installing EOS nodes, I and others, I’m sure, would appreciate it.

I understand the tuning CERN uses may not be appropriate for all installations, but since they are most familiar with how these settings would affect EOS, it could be a good starting point. If anyone else wanted to add anything they’ve learned, that would also be helpful.


Dan Szkola
FNAL

Hi Dan,

While we’re not CERN, there are some tunings that you may find useful, mostly to do with the higher latency between FSTs from our geographic distribution and with artifacts of using SMR storage media with lots of small files. 96% of our data is below 10MB in size, and 92% is below 1MB.

If you haven’t seen them, Ted Ts’o and others have written many papers on how Linux performs with certain write patterns on SMR media; "Evolving Ext4 for Shingled Disks" (USENIX) is one, co-authored with Abutalib Aghayev, Garth Gibson, and Peter Desnoyers. We came across these issues and confirmed them during an encryption-at-rest conversion of some older FSTs, particularly when we were unpacking disk images onto the drives and the data consisted of many small files.

Standard TCP tunings for 65+ ms RTT at > 10Gbps:

net.ipv4.tcp_syncookies=1
net.ipv4.ipfrag_high_thresh=4194303
net.ipv4.ipfrag_low_thresh=1048575
net.core.rmem_max=268431360
net.core.wmem_max=134215680
net.core.rmem_default=262144
net.core.wmem_default=131072
net.core.somaxconn=16384
net.ipv4.tcp_rmem=4096 262144 268431360
net.ipv4.tcp_wmem=4096 131072 134215680
net.ipv4.tcp_adv_win_scale=1
net.core.netdev_max_backlog=30000
net.ipv4.tcp_max_syn_backlog=30000
net.ipv4.tcp_no_metrics_save=0
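
For reference, one common way to persist settings like these is a file under /etc/sysctl.d/ plus a reload; the file name below is just an example, not something EOS requires:

# /etc/sysctl.d/90-eos-net.conf (example name, put the settings above in here)
net.core.rmem_max=268431360
net.ipv4.tcp_rmem=4096 262144 268431360
net.ipv4.tcp_wmem=4096 131072 134215680

# apply every file under /etc/sysctl.d/ without rebooting
sysctl --system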

We do use HTCP in some cases, but we prefer to run the ELRepo mainline kernels as they have BBR and more efficient IO and TCP stacks. Across distance, BBR gives far higher goodput and a far faster ramp-up.

net.ipv4.tcp_congestion_control=bbr
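
A quick way to check that the running kernel actually offers BBR before switching; the default_qdisc line is the commonly recommended pairing with BBR, included here as a suggestion rather than part of the list above:

sysctl net.ipv4.tcp_available_congestion_control   # should list bbr
modprobe tcp_bbr                                    # only needed if bbr is built as a module
net.core.default_qdisc=fq                           # common pairing with BBR (set via sysctl like the rest)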

We found some switches try to do ECN at L3 when they are only supposed to operate at L2, so we play it safe and tell the host to ignore ECN entirely:

net.ipv4.tcp_ecn=0

Ensure we have enough ephemeral ports when things get busy:

net.ipv4.ip_local_port_range=10000 64512

Another performance tweak: we use 802.1AX link aggregation (LACP) in some places, so we need to be able to handle out-of-order packets:

net.ipv4.tcp_max_reordering=3000
net.ipv4.tcp_reordering=300
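
If you want to see whether reordering is actually occurring on a link, the standard counters are enough (netstat from net-tools, ss from iproute2):

netstat -s | grep -i reorder    # cumulative TCP reordering counters
ss -ti                          # per-connection details; look for the reordering: field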

https://www.kernel.org/doc/Documentation/sysctl/vm.txt - ensure we write dirty pages out when we start having RAM pressure, rather than trying to page them out or just running out of memory

vm.zone_reclaim_mode=3

Keep these numbers low, as they’re a percentage of RAM. If the SMR media can’t write lots of files out quickly enough, we start seeing IO timeout reports in the kernel output, and the FSTs start to flap according to both our monitoring and EOS. Each of our FSTs has 128GB of RAM, and we force the kernel to start flushing to disk at 1% of RAM and force IO to wait at 3%:

vm.dirty_background_ratio=1
vm.dirty_ratio=3
vm.dirty_expire_centisecs=250
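
To watch how much dirty data is building up against those thresholds, the standard /proc interfaces are enough:

grep -E '^(Dirty|Writeback):' /proc/meminfo    # dirty / writeback data in kB
grep nr_dirty /proc/vmstat                     # includes the computed dirty thresholds, in pages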

Set vm.min_free_kbytes to avoid fragmentation issues and to ensure we don’t page to disk excessively in some cases:

vm.min_free_kbytes=1048576

We also use XFS as our filesystem, and we mount with the options

noatime,nodiratime,swalloc,attr2,inode64,logbufs=8,logbsize=256k,noquota
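
For reference, that works out to an fstab entry along these lines (the device path and mount point are just placeholders):

# /etc/fstab (example entry)
/dev/sdX1  /data01  xfs  noatime,nodiratime,swalloc,attr2,inode64,logbufs=8,logbsize=256k,noquota  0 0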

We run our FST metadata on SSD, although the metadata isn’t a large IO problem by any stretch of the imagination.

As for ulimits, we permit 100k processes (nproc) on our FSTs, since that’s all they do, and up to 1 million open files (nofile).
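
If it helps, the generic way to express those limits is via limits.conf or a systemd drop-in; treat this as a sketch, since the user the services run as and the exact unit name depend on your packaging:

# /etc/security/limits.d/90-eos.conf (example; EOS services often run as the daemon user)
daemon  soft  nproc   100000
daemon  hard  nproc   100000
daemon  soft  nofile  1000000
daemon  hard  nofile  1000000

# or a systemd drop-in for whatever unit runs the FST (unit name here is an assumption)
# /etc/systemd/system/eos@fst.service.d/limits.conf
[Service]
LimitNPROC=100000
LimitNOFILE=1000000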