MGM Sync and QDB replication over distances of tens or hundreds of milliseconds

sahn · September 10, 2019, 12:47am

Dear Experts and Developers,

I was wondering whether the Sync daemon between Master and Slave MGMs works over distances with tens or hundreds of milliseconds of latency, e.g. around 100 ms and under 200 ms. It would be very helpful to have a brief picture of how the Sync between MGMs works, e.g. how it determines that a sync between two MGMs is complete, and whether it does so synchronously or asynchronously.

I was also wondering whether backup and replication between two or more QuarkDB clusters in different locations is already implemented, or whether there are any workarounds.

I ask because I am thinking of a distributed setup of two or more EOS instances across countries in our region. The aim is to help local administrators reduce operational cost and effort by simply setting up a single EOS instance on top of FSTs distributed across countries. I believe that Geotag would be a useful tool to identify FSTs. If MGM sync works well with such long latencies, it would be feasible to set up the MGM master in one place and the MGM slave in another, so that the two sites could back each other up.

For this setup to work, QuarkDB would also have to be replicated properly. As far as I understand, RocksDB does not support synchronous replication between two different QuarkDB clusters. To keep two or more QDB clusters synchronized(?), one has to ensure that a periodic and frequent backup and replication process runs automatically. I believe this is almost pointless if one expects frequent writes in a short period of time, because one might lose a fraction of the data between backup and replication runs. However, if one expects only sparse writes (or a series of writes at specific times), using backup and replication to synchronize QuarkDB would work quite well.

I might be unclear or completely ignorant here, simply because I do not understand well how EOS works. Any comments or suggestions would definitely help me understand and make things work.

Thank you.

Best regards,
Sang-Un

crystal · September 10, 2019, 3:13am

Hi, this is a little out of date, but may be of interest for your reading: https://www.researchgate.net/publication/321236974_Global_EOS_exploring_the_300-ms-latency_region

Our largest production cluster has a master and slave 20ms away from each other, and FSTs a little further out than that (across the country, about 35-40ms). As far as the sync daemon goes, we haven’t seen any problems.

With regard to QuarkDB, I’m also keen to find out. I know CERN had a site in Wigner, reasonably far away from Geneva, but I’m not sure whether a QuarkDB cluster was ever run across those sites. In Australia, we’re looking to deploy a QuarkDB cluster with nodes about 20ms away from each other; we are currently in the testing phase for that.

gbitzes · September 10, 2019, 3:03pm

Hi Sang-Un,

There is some confusion in terminology in your post which I’ll try to clarify - please let me know if I’ve misunderstood something, or you’d like some further clarification about a topic.

I believe this process works well; however, the sync daemon is only used together with in-memory namespaces. Setting up brand new EOS instances with an in-memory namespace is strongly discouraged - please start with the QDB namespace instead. In that case, the sync daemon will not be used at all.

An “EOS instance” is a separate world of its own - there is normally no interaction between different EOS instances, they are completely independent, each with its own separate MGMs, FSTs, and filesystems. Different EOS instances will also use different QuarkDB instances.

Yes, setting up a distributed setup where FSTs are spread around the world within a single EOS instance is certainly possible. Geotags will indeed be very helpful for indicating the geographic relationship of different FSTs.

Yes, having multiple MGMs within a single EOS instance, geographically distributed to reduce latency for clients, will certainly be possible. Please note this does not constitute a backup in any sense - no actual data will be backed up (the served files remain on the same FSTs), nor any metadata (handled by QuarkDB - the MGM is simply a metadata cache).

RocksDB is the key-value store for organizing the data on each node. RocksDB offers no replication of its own; that is achieved by QuarkDB through Raft consensus.

That is not possible. Each QDB cluster (or, equivalently, QDB instance) is a world of its own: each QDB cluster holds different data, and the nodes in different QDB clusters will not interact with one another.

Yes, the replication process within a QuarkDB cluster is fully automatic. Backups are for disaster recovery, and it’s certainly recommended to take backups of the QuarkDB contents on a regular schedule. (check the documentation on how to do that)

Replication within a single QuarkDB cluster should always work correctly and reliably, no matter how frequent writes are.

Let us know if you have any more questions. You are also welcome to tell us what you are trying to achieve at a high level, and we’ll try to offer advice on how to do that.

Cheers,
Georgios

gbitzes · September 10, 2019, 3:08pm

I’m keen to find out too, keep us updated. We have not done extensive testing of a QDB cluster with nodes in both Geneva and Wigner. I believe it will work, as QDB replication should be quite efficient. (it’s asynchronous + pipelined + batched)

The only negative might be the increased write latency, since writes are acknowledged to clients once they have been safely replicated, and with a geographically distributed setup this will take more time.

Still, EOS heavily pipelines writes to QuarkDB, so increased write latency should not be a problem either. But the only way to know for sure is to try it.
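
As a rough illustration of the write-latency point above: in Raft, a write is committed once a majority of the nodes (the leader included) has persisted it, so the leader waits for the fastest followers needed to complete the majority rather than for the slowest node. The sketch below estimates that wait from round-trip times. The node layouts and all numbers are hypothetical assumptions for illustration, not measurements of QuarkDB, and the model ignores disk and processing time.

```python
def quorum_commit_latency_ms(follower_rtts_ms):
    """Estimate the replication wait for one write in a Raft cluster.

    follower_rtts_ms: round-trip times from the leader to each follower.
    The leader counts towards the majority itself, so it only needs
    acknowledgements from the fastest (majority - 1) followers.
    """
    cluster_size = len(follower_rtts_ms) + 1      # followers + leader
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1                    # leader already has the entry
    if acks_needed == 0:
        return 0.0                                # single-node cluster
    return float(sorted(follower_rtts_ms)[acks_needed - 1])


# Hypothetical 3-node layouts (all RTT values are made up):
# leader with one nearby follower (1 ms) and one distant follower (100 ms)
near_and_far = quorum_commit_latency_ms([1, 100])
# leader whose two followers are both distant
both_far = quorum_commit_latency_ms([100, 120])
print(near_and_far, both_far)
```

Under these assumptions, a leader with one nearby follower would barely notice a distant third node, while a leader whose followers are all distant would pay roughly one long round trip per commit. Since any node can become leader, both cases can occur in the same cluster.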

sahn · September 10, 2019, 11:58pm

Hi Georgios,

Thanks a lot for the answers. Sorry for the confusion in terminology. I think you understood most of what I meant correctly.

Yes. I will deploy the latest release of EOS with QuarkDB. It is good to know that sync only works with the in-memory namespace.

Thanks for this clarification. I was wondering whether a single EOS instance on top of various FSTs across countries would work, or whether it is even feasible to configure.

So, what I understood from your comments regarding QuarkDB is that a single EOS instance should have a single QuarkDB cluster for the namespace. Would it also be possible, as with distributing the MGM Master/Slave to different locations, to place the QuarkDB cluster nodes in a distributed way across the region, e.g. with the long latencies mentioned earlier? Would replication within QuarkDB still work?

Last month we started a pilot project to combine two storage systems in Korea and Thailand using EOS. Lacking knowledge of EOS, we deployed two separate EOS instances, one here in Korea and one in Thailand, and then tried to figure out how to integrate them into one single storage. Based on your comments, we will try to deploy a single EOS instance with distributed FSTs + distributed MGM Master/Slave (perhaps master in Korea and slave in Thailand) + (only if possible) distributed QuarkDB cluster nodes (maybe two in Korea and one in Thailand).

The purpose of this work is to demonstrate a real-world working example for measuring network performance in the region, which has recently improved significantly in terms of LHCONE, and to open up the possibility of collaborative work with Asian partners on storage consolidation in line with WLCG DOMA.

Your advice has already helped me a lot in picturing how EOS can be deployed and operated in a distributed way. I would be grateful for any further comments or suggestions regarding our project.

Thank you.

Best regards,
Sang-Un

sahn · September 11, 2019, 12:09am

Hi @crystal

Thanks a lot for reminding me of this work. I had come across this project before but had forgotten about it. The paper reassures me that what I am planning should work just fine.

I am also quite curious about the status of the work in Australia. The paper mentions the CloudStor project there, but I cannot find much information about it on its webpage: https://cloudstor.aarnet.edu.au

I would be very grateful if you could provide more information on your project. You may also want to have a look at our project here: The 5th Asia Tier Center Forum & 1st Asian HTCondor workshop 2019 (24-26 October 2019): Overview · Indico
The new event will be held in Mumbai in October this year, where we will present the status of our distributed storage setup in the region and discuss it intensively. You can also look at the previous events held under this Asian initiative: https://atcforum.org:19141

If I may, and if you are interested, I would like to invite you and your colleagues to this workshop; it would be very helpful for the community to hear about your experience and share knowledge.

Thank you.

Best regards,
Sang-Un

crystal · September 11, 2019, 2:49am

Hi,

Here’s the general information page for CloudStor: https://www.aarnet.edu.au/network-and-services/cloud-services/cloudstor

CloudStor is run by AARNet, the Australian NREN, providing cloud storage and some other services such as OnlyOffice and Jupyter Notebooks (SWAN) that leverage the underlying distributed EOS storage. We have four sites across Australia (https://status.aarnet.edu.au/cloudstor/), hence our interest in high-latency discussions!

Thank you for the link to that workshop, it looks incredibly interesting! I don’t think we’ll be able to attend on short notice, but will definitely keep it in mind for next year. We’re always happy to chat about what we do, so please feel free to contact @davidjericho or myself if you have any questions!

sahn · September 17, 2019, 2:05am

@gbitzes, one more question: what authentication method do you recommend for the distributed setup? Our initial setup uses Kerberos 5, but I was wondering whether it would work with GSI. Since we need to cover different countries, Kerberos 5 does not seem efficient. What do you think?

sahn · September 17, 2019, 2:18am

Hi @crystal,

Thanks a lot for the links. This could be a good example to deploy within our country as a file-sharing service for domestic research communities. In recent years, we have had many requests from these communities for storage resources to store their large-scale experimental data (e.g. PB scale, they say, in structural biology) safely and access it freely.

Sorry for the short notice. I would have contacted you had I noticed the CloudStor project much earlier. If you allow me, I would like to mention this project as a reference in the discussion session at the forum, and to invite you to the next forum.

gbitzes · September 17, 2019, 8:34am

Hi Sang-Un,

Yes, replication within QuarkDB should still work fine across high latencies. We haven’t tested this setup, so there might be hiccups; please let us know about anything weird you find. My biggest concern would be higher acknowledgement latencies for writes towards QDB on the EOS side, but this should not be visible to end-users.

I believe the Kerberos servers are contacted very infrequently, something like once per client / server combination per day. Even if the Kerberos server is far away, the delay should be imperceptible to clients given how rarely it occurs.

EOS supports GSI as well, and it should be possible to use both authentication methods within the same instance at the same time, so that each client uses whichever they prefer. Up to you to decide whether to enable it.

Cheers,
Georgios

sahn · September 20, 2019, 2:11am

Thanks. We will test and hopefully report back to you.

We have tried to bring up the MGM with the QDB backend, but it looks like the MGM still wants the Sync daemon to be running. We run the EOS components in Docker containers (eos-docker). When an MGM that has two different MGM nodes (such as MGM1 and MGM2 for Master/Slave) in its configuration (/etc/xrd.cf.mgm) starts, it tries to bring up Sync via systemd, which is not installed in the base CentOS image, and then it fails.

I am quite confused: you said Sync is only used for the in-memory namespace, but it seems to be involved in the HA configuration between the MGM Master and Slave. Would you please have a look at the log below?

Thank you.

Best regards,
Sang-Un

++++++ (c) 2015 CERN/IT-DSS MgmOfs (meta data redirector) 4.5.8
=====> mgmofs enforces SSS authentication for XROOT clients
jemalloc is loaded!
jemalloc heap profiling is disabled
=====> mgmofs.hostname: eos-mgm-01.eoscluster.sdfarm.kr
=====> mgmofs.hostpref: eos-mgm-01
=====> mgmofs.managerid: eos-mgm-01.eoscluster.sdfarm.kr:1094
=====> mgmofs.fs: /
=====> mgmofs.targetport: 1095
=====> mgmofs.instance : eostestatcf
=====> mgmofs.metalog: /var/eos/md
=====> mgmofs.txdir:   /var/eos/tx
=====> mgmofs.authdir:   /var/eos/auth
=====> mgmofs.reportstorepath: /var/eos/report
=====> mgmofs.cfgtype: quarkdb
=====> mgmofs.fstgw: someproxy.cern.ch:3001
=====> mgmofs.nslib : /usr/lib64/libEosNsQuarkdb.so
=====> mgmofs.qdbcluster : eos-qdb-01.eoscluster.sdfarm.kr:7777 eos-qdb-02.eoscluster.sdfarm.kr:7777 eos-qdb-03.eoscluster.sdfarm.kr:7777                                                                                             
=====> mgmofs.redirector : false
=====> mgmofs.broker : root://eos-mgm.eoscluster.sdfarm.kr:1097//eos/eos-mgm-01.eoscluster.sdfarm.kr/mgm
=====> mgmofs.defaultreceiverqueue : /eos/*/fst
=====> mgmofs.fs: /
=====> mgmofs.errorlog : enabled
=====> all.role: manager
=====> setting message filter: Process,AddQuota,Update,UpdateHint,Deletion,PrintOut,SharedHash,work
=====> comment log in /var/log/eos/mgm/logbook.log
=====> eosxd stacktraces log in /var/log/eos/mgm/eosxd-stacktraces.log
=====> eosxd logtraces log in /var/log/eos/mgm/eosxd-logtraces.log
=====> mgmofs.alias: eos-mgm.eoscluster.sdfarm.kr
190920 01:20:13 time=1568942413.171078 func=Configure                level=NOTE  logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=mgm@eos-mgm-01.eoscluster.sdfarm.kr:1094 tid=00007f530a1f28c0 source=XrdMgmOfsConfigure:1324        tident=<single-exec> sec=      uid=0 gid=0 name= geo="" MGM_HOST=eos-mgm-01.eoscluster.sdfarm.kr MGM_PORT=1094 VERSION=4.5.8 RELEASE=20190910164416gitbb18f96 KEYTABADLER=136b2796 SYMKEY=Smb4JQeUdyJPQf+C+8w47in4j8g=                  
190920 01:20:13 time=1568942413.171219 func=set                      level=INFO  logid=static.............................. unit=mgm@eos-mgm-01.eoscluster.sdfarm.kr:1094 tid=00007f530a1f28c0 source=InstanceName:39                tident= sec=(null) uid=99 gid=99 name=- geo="" Setting global instance name => eostestatcf
190920 01:20:13 time=1568942413.193492 func=Init                     level=INFO  logid=cb9bce6e-db44-11e9-801a-0242864b7d17 unit=mgm@eos-mgm-01.eoscluster.sdfarm.kr:1094 tid=00007f530a1f28c0 source=Master:82                      tident=<service> sec=      uid=0 gid=0 name= geo="" systemd found on the machine = 0
190920 01:20:13 time=1568942413.324585 func=Init                     level=CRIT  logid=cb9bce6e-db44-11e9-801a-0242864b7d17 unit=mgm@eos-mgm-01.eoscluster.sdfarm.kr:1094 tid=00007f530a1f28c0 source=Master:178                     tident=<service> sec=      uid=0 gid=0 name= geo="" failed to start sync service
190920 01:20:13 251 XrootdConfig: Unable to create file system object via libXrdEosMgm.so
190920 01:20:13 251 XrootdConfig: Unable to load file system.
------ xrootd protocol initialization failed.
190920 01:20:13 251 XrdProtocol: Protocol xrootd could not be loaded
190920 01:20:13 251 XrdConfig: Unable to create home directory //mgm; permission denied
------ xrootd mgm@eos-mgm-01.eoscluster.sdfarm.kr:-1 initialization failed.

crystal · September 20, 2019, 3:51am

Hi @sahn - we have run into this issue before as well, as we also run our EOS in docker containers + no systemd.
Try setting the environment variable EOS_START_SYNC_SEPARATELY=1 to stop EOS from trying to start the sync process?
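
For container setups without systemd, one way to pass this variable is through the container environment. This is a minimal sketch, assuming the MGM container is launched with `docker run`. Only the variable name comes from this thread; the image name `my-eos-image` is a hypothetical placeholder, and the helper function is illustrative.

```python
import shlex


def mgm_docker_command(image, extra_env=None):
    """Build a `docker run` argv that sets EOS_START_SYNC_SEPARATELY=1."""
    env = {"EOS_START_SYNC_SEPARATELY": "1"}   # variable name from this thread
    env.update(extra_env or {})
    argv = ["docker", "run", "--detach"]
    for name, value in sorted(env.items()):
        argv += ["--env", f"{name}={value}"]   # --env is the long form of -e
    argv.append(image)
    return argv


# "my-eos-image" is a hypothetical image name, not a real eos-docker tag.
command = mgm_docker_command("my-eos-image")
print(shlex.join(command))
```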

This does sound kind of like a bug though, as QDB shouldn’t require the sync client to start at all.

sahn · September 20, 2019, 4:28am

Hi @crystal, thanks a lot!!

We have also just found the issue you had already experienced, and we set EOS_START_SYNC_SEPARATELY to stop EOS from starting sync.

sahn · September 20, 2019, 7:07am

Hi @gbitzes,

Thinking about expanding the QuarkDB cluster: one might have to add the additional nodes in /etc/xrd.cf.mgm, restart the MGM and QuarkDB, and then run the quarkdb-create command including the new nodes. This looks quite tricky and carries a risk of service interruption. Is there any other (easier) way to add QuarkDB nodes?

Could you also point me to a guide to basic QuarkDB administration? I have referred to the redis cli documentation, but many redis commands do not work in QuarkDB. Please correct me if I am wrong.

Finally, is it correct that no files need to be synced via eossync any longer when we rely on QuarkDB?

Best regards,
Sang-Un

gbitzes · September 20, 2019, 8:33am

Hi Sang-Un,

Expanding or shrinking a QDB cluster is done through “membership updates”, and is performed on-the-fly while the cluster is running: Membership updates - QuarkDB Documentation

After the process on the QDB side is complete, you may update the list of nodes in xrd.cf.mgm and restart the MGM.

Alternatively, you may set up a DNS alias which points to all QDB nodes, such as my-qdb-cluster.example.com that maps to qdb-host-1.com, qdb-host-2.com, qdb-host-3.com. Upon adding a fourth QDB node, you update the alias to point to qdb-host-4.com as well, and no changes to xrd.cf.mgm or MGM restarts are necessary.

Have a look at the documentation: Introduction - QuarkDB Documentation
Yes, not all redis commands are implemented yet, only the subset we need for EOS. Try running raft-info and quarkdb-info to check the status of a node.
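
Since QuarkDB is reached with redis clients, these status commands can also be sent from a script. The sketch below rests on several assumptions: it encodes a command in the redis wire format (RESP) and sends it over a plain TCP socket, with no authentication or TLS handling, and it returns the reply as raw bytes without parsing. The host name and port in the commented example are the ones from the MGM log earlier in the thread; the function names are illustrative.

```python
import socket


def encode_resp_command(*parts):
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(parts)]
    for part in parts:
        data = part.encode("utf-8")
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def send_command(host, port, *parts, timeout=5.0):
    """Send one command and return the first chunk of the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_resp_command(*parts))
        return sock.recv(65536)   # raw, unparsed; may be a partial reply


# Example (needs a reachable node, so it is not run here):
# print(send_command("eos-qdb-01.eoscluster.sdfarm.kr", 7777, "raft-info"))
```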

Yes - if both the namespace and MGM configuration are on QDB, eossync is no longer necessary, and will be removed some time in the future once the in-memory namespace is fully deprecated. This should hopefully reduce confusion and make EOS easier to set up and operate, even for us.

Cheers,
Georgios

sahn · September 23, 2019, 4:23am

Hi Georgios,

Thank you for the useful documentation. I knew it existed but had never noticed this part. How stupid of me.

I was just wondering what will happen when a DNS alias (round-robin) points to one faulty node among others. Will the MGM retry until it gets a valid response from one of the nodes?

Good to know. Thank you.

Best regards,
Sang-Un

gbitzes · September 23, 2019, 7:41am

That’s correct. The MGM uses qclient to talk to QDB, which will continuously try every possibility until it is able to contact the cluster.

For example, if mgm.qdbcluster contains a list of hostnames (a.example.com, b.example.com, c.example.com), we will try, in the following order:

1. Do a DNS lookup on a.example.com. If it resolves, try every IP that this DNS entry points to, including all IPv4 and IPv6 endpoints, in the order given by the DNS server.
2. Do the same for b.example.com, then c.example.com.
3. If none of the above works, we sleep a bit and repeat ad infinitum, starting from the beginning. (fresh DNS lookups)
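
The order described above can be sketched as a simple loop. This is only an illustration of the described behaviour, not qclient's actual implementation: the function names, the backoff value, and the `max_rounds` limit (added so the loop can terminate in tests) are all assumptions.

```python
import socket
import time


def resolve_all(host, port):
    """Return every (ip, port) the name resolves to, in resolver order."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []                      # name did not resolve; skip this host
    return [info[4][:2] for info in infos]


def try_connect(endpoint, timeout=2.0):
    """Return a connected socket, or None if the endpoint is unreachable."""
    try:
        return socket.create_connection(endpoint, timeout=timeout)
    except OSError:
        return None


def connect_to_cluster(hosts, port, resolve=resolve_all, connect=try_connect,
                       backoff_s=1.0, max_rounds=None):
    """Walk the host list in order; sleep and start over if all fail."""
    rounds = 0
    while True:
        for host in hosts:
            for endpoint in resolve(host, port):   # fresh lookup every round
                conn = connect(endpoint)
                if conn is not None:
                    return conn
        rounds += 1
        if max_rounds is not None and rounds >= max_rounds:
            return None                # give up (hypothetical limit)
        time.sleep(backoff_s)
```

Passing `resolve` and `connect` as parameters keeps the ordering logic separate from the network calls, so it can be exercised with fake resolvers. With `max_rounds=None` the loop retries indefinitely, as described in the post.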

Cheers,
Georgios
