Direct access to EOS data on CephFS in k8s

Hello,

I am interested in evaluating EOS to consider using it for our ATLAS T2.
The T2 compute jobs run natively on our k8s cluster on OpenStack, and we have several PB of CephFS available for storage. I have followed several recent presentations about EOS on CephFS. However, our situation provides an additional interesting opportunity: since EOS can also run on k8s, it would allow us to deploy the equivalent of a CE and SE, combined and integrated, on the same Kubernetes cluster. The compute jobs could then have direct (read-only) access to the same CephFS filesystem used for EOS storage. That raises the question I am most interested in: would it be possible for the jobs to read EOS data files directly from CephFS, instead of reading them over the network from an FST service that in turn reads from CephFS?

Aside from the obvious inefficiency of reading from CephFS via a network transfer from a different node, instead of directly reading from CephFS on the local node, there is another aspect related to containerization. Network connections between k8s pods flow through the container network fabric (e.g. Calico) which typically incurs some packet overhead (e.g. IP in IP overlay). On the other hand, filesystems mounted into the pod are managed by the kubelet (and kernel) so the CephFS traffic uses the underlying node network, with no extra overhead or performance cost.

I understand the directory structure of the underlying data store on CephFS would require some metadata operations in order for a client to locate the data file on CephFS. So maybe there could be some sort of 'eos cp' operation which contacts the MGM to locate the data file in CephFS, but then instead of conducting an actual data transfer it instantly creates a symlink to the file on CephFS. This is how Storm SE (which was designed for this purpose) handled "stage-in" of data files on a cluster filesystem. Would something like that be possible?

Thanks!

Hi Ryan, interesting topic.

So, there are several approaches to do that:

  1. Trivial approach: you mount a read-only CephFS on all clients. You always write through EOS into CephFS for T2 data, while EOS can redirect reads to the local CephFS mount. This means your clients always contact the EOS MGM on open; depending on the operation, they get redirected to the EOS FST to write or to the local mount to read. The downside of this approach is that there is no data privacy, but you probably only want to handle GRID data in this way. The positive side is that you get the full GRID stack integration and you still have all the knobs to tune access to the data on the EOS MGM. If you have the CephFS mount everywhere, it is a change of a few lines to have this configuration in the MGM. I will add this as an option. It is a good idea! One can also check whether the client version supports this local redirect, which was introduced with XRootD V4.9, and if not, fall back to redirecting to the FST, as for a WRITE.

  2. Instead of mounting CephFS on the clients, one could mount CephFS inside the XRootD client on a redirect, with the MGM handing out a read-only key for CephFS if the authenticated client has permission to get one. This, however, requires deploying a plug-in and can be messy with all the XRootD versions around in experiment frameworks.

  3. Mirror the logical namespace from EOS into CephFS and have a registration function to add part of CephFS to EOS. By doing this one could map EOS ACLs to CephFS POSIX ACLs and vice versa. This would also allow completely standalone operation on CephFS, with publishing into the GRID part (EOS).

Probably the simplest solution is the best, so 1) is doable with a small add-on, which I will do now.

Let me know what you think.

Hi Andreas,

Yes #1 sounds like what I had in mind. Great! The whole cluster is for ATLAS only so it is fine for all the jobs to have RO access to all the EOS data. Glad to hear it can be done easily.

Interesting to learn that an XRootD client can natively mount CephFS; however, it might be difficult or insecure to integrate that with our Manila CephFS provider on OpenStack.

As for writing, I imagine it would be problematic for a client to write a file directly into the filesystem of an FST, which it should not have privileges for anyway. I think the main benefit to be gained is from reading the input directly; however, I wonder if there is also a way to improve writing, compared to a network transfer to an FST which then writes to CephFS.

Suppose CephFS is not mounted RO, but instead there are simply two locations, one writeable by EOS FST servers and one writeable by ATLAS jobs.

drwxr-xr-x 3 eos eos 40 Jan 28 2021 FST-data
drwxr-xr-x 3 atlas atlas 40 Jan 28 2021 job-data

A job could write its output data in job-data/, then notify the MGM that it wants to do a copy operation of the file into EOS. However, if the FST knows that the data is already available on the same CephFS filesystem, then it only needs to do 'cp -l', and the stage-in can be instant via a hardlink. Just an idea, what do you think? We might need to finesse the job config a bit to set the output file location in CephFS, while keeping the rest of the job work dir on local disk of the compute node.
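
The hardlink idea can be tried out on any POSIX filesystem. Below is a minimal sketch, assuming GNU coreutils; the directory names mirror the listing above, and the function name `stage_out` is made up for illustration (it is not an EOS command):

```shell
#!/usr/bin/env bash
# Hypothetical helper: "copy" a job output file into the FST area by
# hardlinking it. Both paths must be on the same mounted filesystem.
stage_out() {
    local src=$1 dst=$2
    # cp -l creates a hard link instead of copying the data blocks
    cp -l -- "$src" "$dst"
}

# Demo in a scratch directory standing in for the CephFS root
demo_root=$(mktemp -d)
mkdir -p "$demo_root/job-data" "$demo_root/FST-data"
echo "payload" > "$demo_root/job-data/output.root"
stage_out "$demo_root/job-data/output.root" "$demo_root/FST-data/output.root"

# -ef is true when both names refer to the same device and inode
if [ "$demo_root/job-data/output.root" -ef "$demo_root/FST-data/output.root" ]; then
    echo "same inode: no data was copied"
fi
```

Because both names point at the same inode, removing the file from job-data/ afterwards leaves the data reachable under FST-data/.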

I have added feature 1) to the EOS master branch; it is available since EOS 4.8.79. See Cern Authentication
Assuming you have created the space SHARED with a shared filesystem, one can enable this feature by doing:
eos space config SHARED space.policy.localredirect=1

You can find the description in the documentation under "Using EOS/Policies" (Space and Application Policies - EOS CITRINE documentation).

I am currently finishing a file adoption feature, where you can sort of grab an arbitrary file in a shared filesystem via a hardlink or symlink into EOS.

Cheers Andreas.

Fantastic, thanks a lot Andreas! I'll give it a try as soon as I get it working on my cluster.

For the record I am using EOS 5.0.14 on k8s via Helm.

I realized we might encounter a difficulty with the writing use-case:
As I understand it, each FST must have its own separate, independent storage directory, and with the Helm chart this is achieved by mounting only each FST's own subdirectory (e.g. /eos-data/fst-1) to the local location /fst_storage in the FST pod, so it is not possible for an FST to see outside of its subdirectory. Changing that might require some significant reworking of how the Helm chart works.

However, it would be relatively straightforward to also mount the ATLAS /job-data location as another (read-only) volume in the FST pods, but I am not sure if it would be possible to hardlink a file from one mount to another, even if both mounts are of the same underlying CephFS filesystem?

Hi Ryan,
if it is cross-device I fall back to a symlink, but then you can fall victim to a deletion in the external filesystem, because you will be left with a dangling symlink. In case of a hardlink you take ownership if the file is deleted in the location from where it got adopted.
The best solution for me is to have a Helm chart which supports the mounting of the shared filesystem root. If you want to go for the read-only thing, then we have to add a garbage collection feature for invalid symlinks.

Cheers Andreas.
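
The fallback and the clean-up described above can be sketched in shell. This is only an illustration of the idea, not how EOS implements file adoption; the function names `adopt_file` and `list_dangling` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try a hard link first, fall back to a symlink
# when the link fails (for example with EXDEV across mount points).
# src should be an absolute path so that a symlink resolves correctly.
adopt_file() {
    local src=$1 dst=$2
    if ln -- "$src" "$dst" 2>/dev/null; then
        echo hardlink
    else
        ln -s -- "$src" "$dst" && echo symlink
    fi
}

# Hypothetical garbage-collection pass: print symlinks under a directory
# whose target no longer exists.
list_dangling() {
    local dir=$1 link
    find "$dir" -type l | while IFS= read -r link; do
        # -e follows the symlink, so it is false for a dangling one
        [ -e "$link" ] || printf '%s\n' "$link"
    done
}
```

With a hardlink the adopted name survives deletion of the original; with a symlink, `list_dangling` would report the entry once the original is gone.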

Yes, hardlinking would be best; symlinking would cause a risk of data loss.
I opened "different way of mounting FST volumes to support direct writing via CephFS" (#72, Issues, eos / eos-charts, GitLab) to consider chart changes in the longer term.

The only other option I can think of would be some server-side copy functionality in CephFS (not sure if that exists) but coupled with some cleverness in the kernel to be able to detect that two different devices actually use the same underlying filesystem, which might be complicated or impossible.

There is one important piece of information for you. The whole filesystem only has to be visible on the MGM; the FSTs need only their 'local' submount! The MGM is the one creating a hardlink, and the FSTs then see their inodes inside their mounts.

Cheers Andreas.

Oh perfect, that will definitely make it easier for the Helm part!
I didn't know the MGM could use a CephFS mount, but that makes sense (though I wonder how the MGM would know where to find the FST dirs in CephFS, given that the FSTs themselves do not know where they are in the filesystem tree).

There is one slight problem. If a node has a different CephFS mounted on the same path as the FSTs use, the empty dirs are created within that foreign CephFS, e.g.

find /ceph/eos/ -type d

/ceph/eos/
/ceph/eos/fst01
/ceph/eos/fst01/02
/ceph/eos/fst01/02/0000003c
/ceph/eos/fst01/02/0000003f
/ceph/eos/fst01/02/0000003a
/ceph/eos/fst01/02/00000015
/ceph/eos/fst01/02/00000040
/ceph/eos/fst01/02/0000003d
/ceph/eos/fst01/02/00000041
/ceph/eos/fst01/02/0000003e
...

/ceph/eos and the structure below it are created. Is it possible to protect against this? The client still seems to work properly, though, even when direct reading from CephFS fails.

I seem to vaguely recall the FSTs use a hidden file in the storage directory containing an ID as a sort of marker to make sure it is the right place.

Or do you mean this happened on a client node, not a FST node? Probably an EOS client should not have permissions to create /ceph/eos? (But it should not try anyway)

The dirs are created by eosxd. It runs as root, so it has permission to write to the CephFS mountpoint. Not trivial to prevent. I guess it's an eosxd bug.

Some further issues: I noticed that 'cp /eos/...' with an eosxd mount works fine with localredirect from a remote client without the sharedfs mount (still creating the dummy dirs), but xrdcp and 'eoscp' fail:
$ xrdcp -d 2 root://eoshome.sling.si//eos/test/andrej/0 /tmp/0
...
[2023-02-10 12:53:44.871429 +0100][Debug ][XRootD ] Redirect trace-back:
[2023-02-10 12:53:44.871429 +0100][Debug ][XRootD ] 0. Redirected from: root://eoshome.sling.si:1094//eos/test/andrej/0 to: root://eosmgm2.sling.si:1094//eos/test/andrej/0
[2023-02-10 12:53:44.871429 +0100][Debug ][XRootD ] 1. Redirected from: root://eosmgm2.sling.si:1094//eos/test/andrej/0 to: file://localhost/ceph/eos/eosfst03/02/00000037/00086cbf
[2023-02-10 12:53:44.871429 +0100][Debug ][XRootD ] 2. Retrying: root://eoshome.sling.si:1094//eos/test/andrej/0
[2023-02-10 12:53:44.871444 +0100][Debug ][ExDbgMsg ] [eoshome.sling.si:1094] Destroying MsgHandler: 0xa04527b0.
...
$ eoscp -d root://eoshome.sling.si//eos/test/andrej/0 /tmp/0
[eoscp]: allocate copy buffer with 8388608 bytes
[eoscp] src<0>=root://eoshome.sling.si//eos/test/andrej/0 dst<0>=/tmp/0
[eoscp] having PIO_ACCESS for source location=0 size=547
[eoscp] doing standard access...
[eoscp]: copy protocol xroot:=>file:

[eoscp]: doing XROOT/RAIDIO stat on /eos/test/andrej/0
[eoscp]: XROOT is transparent for staging - nothing to check
[eoscp]: doing XROOT open to read /eos/test/andrej/0
error: [ERROR] Server responded with an error: [3011] Unable to open file /ceph/eos/eosfst01/02/00000038/00089ca5; No such file or directory
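
One way to narrow this down is to check on the client whether the path handed out in the `file://localhost` redirect is actually readable there. A small sketch; the function name `redirect_target` is hypothetical, and the parsing assumes the debug line format shown in the xrdcp output above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read xrdcp debug output on stdin and print the local
# path of the first file://localhost redirect, if any.
redirect_target() {
    sed -n 's|.* to: file://localhost\(/.*\)$|\1|p' | head -n 1
}

# Example with the redirect line from the trace above
line='[2023-02-10 12:53:44.871429 +0100][Debug ][XRootD ] 1. Redirected from: root://eosmgm2.sling.si:1094//eos/test/andrej/0 to: file://localhost/ceph/eos/eosfst03/02/00000037/00086cbf'
target=$(printf '%s\n' "$line" | redirect_target)

if [ -r "$target" ]; then
    echo "readable locally: $target"
else
    echo "not readable on this client: $target"
fi
```

If the path is not readable, the client presumably has no shared mount (or has a different filesystem on that path), which would be consistent with the "No such file or directory" error.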