Hello, Elvin.
The three MGM daemons are running EOS 5.2.31-1, the three QDB daemons are running 5.1.2.5.1.22, three of the FST daemons are running 5.2.29-1, and the remaining 18 FSTs are running 5.2.31-1. We first confirmed that mixing FST versions 5.1 and 5.2 was not the cause.
I configured the FST, MGM, and MQ daemons to use only the jbod-mgmt-04.sdfarm.kr server as the MGM with the settings below. Because we run the EOS daemons inside Podman containers, we cannot use systemd's EnvironmentFile=, so each value is prefixed with export; you can ignore the export keyword. A sketch of how the file is consumed inside the container follows the FST block below.
MGM
...
export EOS_MGM_HOST=jbod-mgmt-0X.sdfarm.kr
export EOS_MGM_HOST_TARGET=jbod-mgmt-04.sdfarm.kr
export EOS_HA_REDIRECT_READS=1
export EOS_MGM_MASTER1=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_MASTER2=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_ALIAS=jbod-mgmt-04.sdfarm.kr
export EOS_BROKER_URL=root://${EOS_MGM_ALIAS}:1097//eos/
export EOS_MGM_URL="root://${EOS_MGM_ALIAS}:1094"
export EOS_FUSE_MGM_ALIAS=${EOS_MGM_ALIAS}
...
MQ
export EOS_MGM_MASTER1=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_MASTER2=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_ALIAS=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_URL="root://${EOS_MGM_ALIAS}:1094"
export EOS_BROKER_URL=root://${EOS_MGM_ALIAS}:1097//eos/
export EOS_FUSE_MGM_ALIAS=${EOS_MGM_ALIAS}
FST
export EOS_MGM_ALIAS=jbod-mgmt-04.sdfarm.kr
export EOS_MGM_URL="root://${EOS_MGM_ALIAS}:1094"
export EOS_FUSE_MGM_ALIAS=${EOS_MGM_ALIAS}
export EOS_BROKER_URL=root://${EOS_MGM_ALIAS}:1097//eos/
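For reference, a minimal sketch of how such an export-style file can be consumed inside the container, since systemd's EnvironmentFile= is not available there; the script and file paths are placeholders, not our exact setup.

#!/bin/bash
# Hypothetical container entrypoint: source the export-style file shown above
# so the variables reach the daemon environment, then start the daemon.
set -e
source /etc/sysconfig/eos_env   # placeholder path holding the export lines above
exec "$@"                       # e.g. the MGM/MQ/FST start command baked into the image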
With these settings in place, as long as jbod-mgmt-04 remains the MGM master (RW node), no "MGM server changed" messages are output; we assume such messages would only appear if the MGM master actually changed.
The problem is that the failure still occurs with this setup.
After starting the MGM, MQ, and FST daemons, everything works fine at first, but after about 3 hours the processing time for operations such as writes exceeds 1 minute. ALICE's ADD test then becomes critical, because it expects operations to complete in under a minute. After about 5 hours, GET operations exceed 4 minutes and also become critical.
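As a rough way to reproduce these timings by hand (the test path under /eos/gsdc/ is only an example, not the ALICE test path):

# Write a ~100 MB test file, then time a write and a read through the MGM alias.
dd if=/dev/urandom of=/tmp/eostest.bin bs=1M count=100
time xrdcp -f /tmp/eostest.bin root://jbod-mgmt-04.sdfarm.kr:1094//eos/gsdc/test/eostest.bin
time xrdcp -f root://jbod-mgmt-04.sdfarm.kr:1094//eos/gsdc/test/eostest.bin /tmp/eostest.readback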
Even while this is happening, the MGM shows all nodes online with heartbeat deltas of at most 3 seconds, as shown in the node ls output below (a simple polling sketch follows it).
EOS Console [root://localhost] |/eos/gsdc/> node ls
┌──────────┬────────────────────────────────┬────────────────┬──────────┬────────────┬────────────────┬─────┐
│type │ hostport│ geotag│ status│ activated│ heartbeatdelta│ nofs│
└──────────┴────────────────────────────────┴────────────────┴──────────┴────────────┴────────────────┴─────┘
nodesview jbod-mgmt-01.sdfarm.kr:1095 kisti::gsdc::g01 online on 1 83
nodesview jbod-mgmt-01.sdfarm.kr:1096 kisti::gsdc::g01 online on 2 83
nodesview jbod-mgmt-02.sdfarm.kr:1095 kisti::gsdc::g01 online on 0 84
nodesview jbod-mgmt-02.sdfarm.kr:1096 kisti::gsdc::g01 online on 2 84
nodesview jbod-mgmt-03.sdfarm.kr:1095 kisti::gsdc::g01 online on 3 84
nodesview jbod-mgmt-03.sdfarm.kr:1096 kisti::gsdc::g01 online on 0 84
nodesview jbod-mgmt-04.sdfarm.kr:1095 kisti::gsdc::g02 online on 1 84
nodesview jbod-mgmt-04.sdfarm.kr:1096 kisti::gsdc::g02 online on 0 83
nodesview jbod-mgmt-05.sdfarm.kr:1095 kisti::gsdc::g02 online on 2 84
nodesview jbod-mgmt-05.sdfarm.kr:1096 kisti::gsdc::g02 online on 1 83
nodesview jbod-mgmt-06.sdfarm.kr:1095 kisti::gsdc::g02 online on 1 84
nodesview jbod-mgmt-06.sdfarm.kr:1096 kisti::gsdc::g02 online on 0 84
nodesview jbod-mgmt-07.sdfarm.kr:1095 kisti::gsdc::g03 online on 1 84
nodesview jbod-mgmt-07.sdfarm.kr:1096 kisti::gsdc::g03 online on 2 84
nodesview jbod-mgmt-08.sdfarm.kr:1095 kisti::gsdc::g03 online on 3 84
nodesview jbod-mgmt-08.sdfarm.kr:1096 kisti::gsdc::g03 online on 1 84
nodesview jbod-mgmt-09.sdfarm.kr:1095 kisti::gsdc::g03 online on 1 84
nodesview jbod-mgmt-09.sdfarm.kr:1096 kisti::gsdc::g03 online on 0 84
nodesview jbod-mgmt-11.sdfarm.kr:1095 kisti::gsdc::e01 online on 0 84
nodesview jbod-mgmt-11.sdfarm.kr:1096 kisti::gsdc::e01 online on 0 84
nodesview jbod-mgmt-12.sdfarm.kr:1095 kisti::gsdc::e01 online on 1 84
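For completeness, a simple way to keep an eye on the heartbeat while the slowdown is ongoing; jbod-mgmt-02 is just an example node, and the command assumes the eos CLI can reach the MGM.

# Poll the node view every 5 seconds and watch the heartbeatdelta column for one host.
watch -n 5 "eos root://jbod-mgmt-04.sdfarm.kr:1094 node ls | grep jbod-mgmt-02"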
However, individual FST servers occasionally resolve a peer node as offline (__offline_jbod-mgmt-02.sdfarm.kr). The full message is long, so only a brief excerpt is shown below.
250120 04:48:11 time=1737348491.242950 func=fileOpen level=ERROR logid=c033e9ac-d6e9-11ef-b19a-b8599fa51310 unit=fst@jbod-mgmt-05.sdfarm.kr:1095 tid=00007f23183a0640 source=XrdIo:334 tident=<service> sec= uid=0 gid=0 name= geo="" error= "open failed url=root://1@__offline_jbod-mgmt-02.sdfarm.kr:1096//eos/gsdc/grid/02/63948/d8c38a44-6415-11e2-9717-2823a10abeef?cap.msg=~~~~&eos.clientinfo=zbase64:MD~~~~&fst.valid=1737348550&mgm.id=000647b1&mgm.logid=b6eb556a-d6e9-11ef-9f5a-b8599f9c4330&mgm.mtime=1642197616&mgm.replicaindex=3, errno=0, errc=101, msg=[FATAL] Invalid address"
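To get a feel for how often this happens on a given FST, the occurrences can be counted in its log; the path below assumes the default EOS FST log location.

# Count __offline_ resolutions per day in the FST log (adjust the path if needed).
grep "__offline_" /var/log/eos/fst/xrdlog.fst | awk '{print $1}' | sort | uniq -c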
The following messages were also found in the FST logs.
250120 07:45:04 time=1737359104.874558 func=Close level=WARN logid=46883f9a-d702-11ef-89ce-b8599f9c5190 unit=fst@jbod-mgmt-07.sdfarm.kr:1095 tid=00007fa5699ff640 source=RainMetaLayout:1727 tident=<service> sec= uid=0 gid=0 name= geo="" msg="failed close for null file"
250120 07:45:21 time=1737359121.774973 func=Read level=WARN logid=4b522450-d702-11ef-8aff-b8599f9c5190 unit=fst@jbod-mgmt-07.sdfarm.kr:1095 tid=00007fa512cbe640 source=RainMetaLayout:657 tident=<service> sec= uid=0 gid=0 name= geo="" msg="read too big resizing the read length" end_offset=1033895936 file_size=1032925807
250120 07:46:45 time=1737359205.172128 func=fileOpen level=WARN logid=b2339cee-d702-11ef-aa05-b8599f9c5190 unit=fst@jbod-mgmt-07.sdfarm.kr:1095 tid=00007fa5643ff640 source=XrdIo:339 tident=<service> sec= uid=0 gid=0 name= geo="" msg="error encountered despite errno=0; setting errno=22"