Unexpected low write performance with EOS raid5/raid6 layout compared to plain layout

Hello everyone,

I am currently testing write performance in EOS, especially with the idea of using an EOS space as a CTA disk buffer.

In my setup, writing to a normal directory reaches around 2.0–2.6 GB/s.
When I write to a directory using raid5 with nstripes=6, the performance drops to around 800 MB/s.

The two directories are configured as follows:

# Normal directory
attr ls cta_test

sys.forced.blocksize="4M"
sys.forced.checksum="adler"
sys.forced.group="0"
sys.forced.iotype:w="direct"
sys.forced.space="default"

# EC / striped directory
attr ls cta_test_ec/

sys.forced.blocksize="4M"
sys.forced.checksum="adler"
sys.forced.group="0"
sys.forced.iotype:w="direct"
sys.forced.layout="raid5"
sys.forced.nstripes="6"
sys.forced.space="default"

Both directories are using the same EOS space and group:

sys.forced.space="default"
sys.forced.group="0"

The relevant EOS group layout is:

eos group ls

groupview default.0 on 60 filesystems
groupview default.2 on 8 filesystems

And the nodes are:

eos node ls

dseosfst01.gsi.de:1095 online 30 filesystems
dseosfst02.gsi.de:1095 online 30 filesystems
dseosfst05.gsi.de:1095 online 8 filesystems

So the raid5 test is currently using default.0, which contains 60 HDD-based filesystems across two FST nodes.

What I find confusing is that I expected the striped/EC layout to benefit from parallelism across multiple filesystems, but in practice it is much slower than the plain layout.

I also tested raid6, and the performance was very similar to raid5. Therefore, I suspect that the bottleneck is not only the additional parity calculation, but perhaps something related to the EC/RAIN write path, scheduling, client-side writing, network/FST behavior, or the way the stripes are distributed.

My main question is:

Given this setup, what would you consider the most likely cause of the performance drop with raid5/raid6, and would you recommend using such an EOS layout on a 60-HDD group as a CTA disk buffer at all?

More generally, is EOS striping/EC intended to improve write throughput in this kind of use case, or should it mainly be seen as a redundancy/capacity-efficiency feature, while a plain layout would be preferable for a high-throughput CTA disk buffer?

Thanks in advance for any advice.

Hi Amine,

It’s hard to give you a conclusive answer about what might be affecting the performance in your particular setup, but there are a few things that one needs to be aware of.

First of all, in order to have a more realistic setup (similar to what you might run in production for RAIN layouts) you should have at least 6 different nodes. The way RAIN layouts work in EOS is that the entry point FST acts as a gateway that will funnel all the traffic from the client to the rest of the FSTs involved in the layout for that particular file.

So for a single RAIN file, all of its traffic funnels through one FST’s CPU and NIC. For raid5 with nstripes=6, the entry FST takes 1× traffic in from the client and normally pushes ≈1× traffic back out to the 5 peers (5 data + 1 parity stripes means 6/5 of the client data is written in total, of which 5/6 leaves the box). On top of that, it runs the parity computation. Therefore, writing a single striped file does not necessarily get faster on the 60-disk group: it is funnelled through one node. Moreover, in this particular case (with only two nodes) the entry FST will also receive about 1/2 of the stripe data itself.
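To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch in Python. It is only a model of the funnel effect, not anything taken from the EOS code: the function name `entry_fst_traffic` and its parameters are made up for illustration, and it assumes that every stripe file receives an equal share of the written data and that `local_stripes` of them land on the entry FST itself.

```python
from fractions import Fraction


def entry_fst_traffic(nstripes, nparity, local_stripes=1):
    """Estimate per-file traffic at the entry FST, as multiples of the client data.

    Hypothetical model: nstripes stripe files in total, nparity of them parity,
    each stripe file getting an equal share, local_stripes stored on the entry FST.
    """
    if not 0 < nparity < nstripes:
        raise ValueError("need 0 < nparity < nstripes")
    if not 0 <= local_stripes <= nstripes:
        raise ValueError("need 0 <= local_stripes <= nstripes")

    ndata = nstripes - nparity
    per_stripe = Fraction(1, ndata)  # share of the client data held by each stripe file
    return {
        "inbound": Fraction(1),                               # client -> entry FST
        "outbound": per_stripe * (nstripes - local_stripes),  # entry FST -> peer FSTs
        "local": per_stripe * local_stripes,                  # written on the entry FST's own disks
        "total_written": per_stripe * nstripes,               # data + parity over all stripes
    }


# raid5 with nstripes=6 (5 data + 1 parity), one stripe kept on the entry FST
spread_out = entry_fst_traffic(nstripes=6, nparity=1, local_stripes=1)

# same layout on only two nodes: assume about half of the stripes stay on the entry FST
two_nodes = entry_fst_traffic(nstripes=6, nparity=1, local_stripes=3)

for name, result in (("6+ nodes", spread_out), ("2 nodes", two_nodes)):
    print(name, {key: str(value) for key, value in result.items()})
```

Under these assumptions the first case gives 6/5 of the client data written in total and 1× leaving the entry FST, which matches the numbers above. The second case shows the entry FST absorbing 3/5 of the client volume on its own disks while still forwarding 3/5 to the other node.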

raid6 is more CPU-expensive than raid5 and since in your case the performance is similar, this means that parity computation is not the bottleneck. Therefore, the bottleneck must either come from the I/O path or the internal memory bandwidth. We do have some improvements in terms of memory bandwidth optimisation for RAIN files, but this will only be available in EOS 5.5.0.

Concerning the write performance that you get, I assume these are values that you get for multiple streams, right? In general, EOS RAIN is primarily a capacity-efficiency and redundancy feature, not a single-stream write-throughput accelerator. Per-file throughput is in general lower than with a plain layout because of the gateway funnel and parity amplification.

Another thing you might want to look into is the iotype for the RAIN layouts. Direct I/O bypasses the FST page cache; full-stripe streaming writes are fine, but anything that triggers partial-group / non-streaming handling becomes read-modify-write straight against HDD. I would also run some tests with buffered I/O.

For a CTA disk buffer specifically, I’d go with a replica (or even a single-replica) layout, not RAIN, since the CTA disk buffer is just for staging. With RAIN you’re paying a throughput and CPU/network price for redundancy you mostly don’t need. CTA’s access pattern is large files, written once sequentially and then read once for migration. RAIN reads (in general) also go through the entry-server reconstruction path, so you pay the price on both ingest and tape migration.

I hope this gives you some clues on how RAIN layouts work in EOS and helps you better target the investigation on your side.

Cheers,
Elvin