WFE unable to parse sys.vid

byujiang · July 16, 2025, 8:16am

Hi,

We have been using WFE to trigger file creation for a long time, and it has worked well. We recently upgraded EOS from 5.2.24 to 5.3.15, and since then the WFE engine doesn’t work as expected.

WFE.log shows errors like the following:

250716 16:05:12 INFO  WFE:496                        workflow="default" fxid=7ce04162
250716 16:05:12 CRIT  WFE:514                        parsing of f�J\
 failed - setting nobody

250716 16:05:20 INFO  WFE:381                        workflowdir="/eos/dev/proc/workflow/20250716/q/default/" retry=0 when=1752653120 job-time=1752653120
250716 16:05:22 INFO  WFE:496                        workflow="default" fxid=7ce04162
250716 16:05:22 CRIT  WFE:514                        parsing of K��\
 failed - setting nobody

It seems that WFE cannot parse something. Looking into mgm/WFE.cc, we found:

502        try {
503          time_t t_when = strtoull(when.c_str(), 0, 10);
504          AddAction(fmd->getAttribute("sys.action"), event, t_when, savedAtDay, workflow,
505                    q);
506        } catch (eos::MDException& ex) {
507          eos_static_err("msg=\"no action stored\" path=\"%s\"", f.c_str());
508        }
509
510        try {
511          auto vidstring = fmd->getAttribute("sys.vid").c_str();
512
513          if (!eos::common::Mapping::VidFromString(mVid, vidstring)) {
514            eos_static_crit("parsing of %s failed - setting nobody\n", vidstring);
515            mVid = eos::common::VirtualIdentity::Nobody();
516          }
517        } catch (eos::MDException& ex) {
518          mVid = eos::common::VirtualIdentity::Nobody();
519          eos_static_err("msg=\"no vid stored\" path=\"%s\"", f.c_str());
520        }
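
One possible reading of the garbage printed after "parsing of" (an assumption, not confirmed anywhere in this thread): if getAttribute() returns a std::string by value, then calling .c_str() directly on that temporary on line 511 leaves vidstring pointing at memory that is freed at the end of the statement, so both VidFromString and the log call would see arbitrary bytes. The sketch below uses a hypothetical stand-in class, FakeFmd, rather than the real EOS metadata class, and only shows the usual way to keep such a string alive while its c_str() pointer is in use:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the file metadata object; NOT the EOS class.
struct FakeFmd {
  std::map<std::string, std::string> xattrs;

  // Assumption: the attribute is returned by value and a missing key throws.
  std::string getAttribute(const std::string& name) const {
    auto it = xattrs.find(name);
    if (it == xattrs.end()) {
      throw std::out_of_range("no such attribute: " + name);
    }
    return it->second;
  }
};

// Hold the returned string in a named variable so that any pointer obtained
// from c_str() stays valid for as long as that variable is in scope.
std::string readVid(const FakeFmd& fmd) {
  const std::string vidstring = fmd.getAttribute("sys.vid");
  const char* raw = vidstring.c_str();  // valid while vidstring is alive
  return std::string(raw);              // copy made before vidstring dies
}
```

With the string owned by a named variable, the value that gets parsed and the value that gets logged are the same bytes that were stored in the attribute.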

We selected one file, and tried to get the fmd:

 attr -g eos.fmd /data72/eos/00033263/7ce04162
Attribute "eos.fmd" had a 146 byte value for /data72/eos/00033263/7ce04162:
        bA�|�
                l%��vh-99]5��vh=99]ES�vhN�UY��;a��;i��;c8391075zc8391075���.�D���364,3136

The output is unreadable, but the vid might be 99, whereas nobody is 65534 on AlmaLinux 9. Could this be the problem?

esindril · July 16, 2025, 8:35am

Hi Yujiang,

Could you please issue the following command in the eos console:

eos attr ls fxid:7ce04162

Thank you,
Elvin

byujiang · July 16, 2025, 8:49am

Hi Elvin,
The output is:

$ eos attr ls fxid:7ce04162
sys.eos.btime="1752626436.416291816"
sys.fs.tracking="+364+3136"
sys.utrace="7d2b7dee-61dd-11f0-b660-1423f231eb00"
sys.vtrace="[Wed Jul 16 08:40:36 2025] uid:12015[lhaasospade] gid:580[lhaaso] tident:lhaasosp.93683:771@lhaaso-fts4.lhaaso name:lhaasospade dn: prot:unix app:eoscp host:lhaaso-fts4.lhaaso.ihep.ac.cn domain:lhaaso.ihep.ac.cn geo: sudo:0 trace: onbehalf:"

esindril · July 16, 2025, 9:21am

Hi Yujiang,

OK, so this means the parsing was already failing before, since this attribute does not exist on the file. What other indication do you have in the logs that the WFE does not work as expected? There must be some other clue later in the logs, especially related to the action that is triggered by the workflow. I would recommend tracing this in the main MGM log in /var/log/eos/mgm/xrdlog.mgm, as you get more information there.

Indeed, one difference might be that user nobody has uid 65534 on Alma9. Did you also update the operating system during the upgrade? Can you check the actual error that the action performed by the workflow gets?

Thanks,
Elvin

byujiang · July 16, 2025, 11:39am

Hi Elvin,

We didn’t upgrade the OS. We were already running EOS 5.2.24 on AlmaLinux 9.4, and WFE worked well there.
We just run a simple workflow:

sys.workflow.closew.default="bash:shell:km2a echo <eos::wfe::path> <eos::wfe::size> <eos::wfe::checksum>"

In /var/log/eos/mgm/xrdlog.mgm, the errors are similar to those in WFE.log:

250716 19:35:04 time=1752665704.162280 func=Load                     level=INFO  logid=static.............................. unit=mgm@eos01.ihep.ac.cn:1094 tid=00007f0f8b7fe640 source=WFE:496                        tident= sec=(null) uid=0 gid=0 name=- geo="" xt="" ob="" workflow="default" fxid=7ce04162
250716 19:35:04 time=1752665704.162329 func=Load                     level=CRIT  logid=static.............................. unit=mgm@eos01.ihep.ac.cn:1094 tid=00007f0f8b7fe640 source=WFE:514                        tident= sec=(null) uid=0 gid=0 name=- geo="" xt="" ob="" parsing of
Rr\
 failed - setting nobody

Best regards

esindril · July 16, 2025, 12:26pm

Hi Yujiang,

Does it still work when you run 5.3.18?

I am interested in any messages that you get in the logs when the workflow fails, presumably with 5.3.18. From what I understand, the same message is displayed with both 5.2.24 and 5.3.18, so this is not the root issue per se. We can downgrade the message to a warning rather than critical, just to avoid any confusion. This still leaves open the question of why your workflow does not work as expected in 5.3.18, but for this you need to investigate the logs further and isolate the root cause.

Thanks,
Elvin

byujiang · July 16, 2025, 2:21pm

Hi Elvin,

You are right. The error already occurred in EOS 5.2.24, but WFE worked as expected there. From the current log, the WFE seems to keep retrying the same workflow:

250716 22:17:53 INFO  WFE:496                        workflow="default" fxid=7ce358c7
250716 22:17:53 CRIT  WFE:514                        parsing of WFE failed - setting nobody

250716 22:18:03 INFO  WFE:496                        workflow="default" fxid=7ce358c7
250716 22:18:03 CRIT  WFE:514                        parsing of WFE failed - setting nobody

250716 22:18:13 INFO  WFE:496                        workflow="default" fxid=7ce358c7
250716 22:18:13 CRIT  WFE:514                        parsing of WFE failed - setting nobody

250716 22:18:23 INFO  WFE:496                        workflow="default" fxid=7ce358c7
250716 22:18:23 CRIT  WFE:514                        parsing of WFE failed - setting nobody

250716 22:18:33 INFO  WFE:496                        workflow="default" fxid=7ce358c7
250716 22:18:33 CRIT  WFE:514                        parsing of WFE failed - setting nobody

Or the WFE is blocked by one workflow entry. If I delete that entry manually, the WFE gets blocked by another one.

esindril · July 30, 2025, 2:28pm

Hi Yujiang,

Did you make any progress in understanding what the root cause of this issue is?
Is there something I can do to help out?

Cheers,
Elvin

byujiang · August 1, 2025, 12:58am

Hi Elvin,

We haven’t made any progress so far. To keep our archival workflow working, we have switched to an external PostgreSQL instead of the EOS WFE. By the way, WFE did work for a long time, whereas the group balancer didn’t. Maybe it’s related to the MQ?

apeters · August 1, 2025, 7:28am

If you still have some entries in the proc directory from the workflow engine, you have to list the attributes on that file.

E.g.

find workflow/
/eos/dev/proc/workflow/
/eos/dev/proc/workflow/20250801/
/eos/dev/proc/workflow/20250801/d/
/eos/dev/proc/workflow/20250801/d/default/
/eos/dev/proc/workflow/20250801/d/default/1754033233:0001073d:closew
/eos/dev/proc/workflow/20250801/q/
/eos/dev/proc/workflow/20250801/q/default/
/eos/dev/proc/workflow/20250801/r/
/eos/dev/proc/workflow/20250801/r/default/
EOS Console [root://localhost] |/eos/dev/proc/> attr ls /eos/dev/proc/workflow/20250801/d/default/1754033233:0001073d:closew
sys.action="notify:redis|localhost|6379|notification|2000"
sys.vid="3:4:adm:adm:daemon:sss:nobody@unknown"
sys.wfe.errmsg=""
sys.wfe.log="moved to done"
sys.wfe.retc="0"
sys.wfe.retry="0"

You see, there is an extended attribute on the workflow entry with sys.vid and there is something wrong in the string there!
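
To find which queue entry belongs to a given file (for example the fxid that keeps repeating in WFE.log), it can help to split the entry names under proc/workflow. The sketch below assumes, based only on the listing above and on the when/event variables in the quoted WFE.cc excerpt, that an entry name has the form <when>:<fxid>:<event>, with a decimal Unix timestamp and a hexadecimal file id. WorkflowEntry and parseEntryName are hypothetical names introduced for illustration and are not part of EOS:

```cpp
#include <cstdint>
#include <exception>
#include <optional>
#include <string>

// Hypothetical holder for the three parts of a workflow entry name.
struct WorkflowEntry {
  std::uint64_t when;  // decimal Unix timestamp (assumed)
  std::uint64_t fid;   // file id, written in hex in the name (assumed)
  std::string event;   // e.g. "closew"
};

// Split "<when>:<fxid>:<event>"; returns nullopt if the name does not match.
std::optional<WorkflowEntry> parseEntryName(const std::string& name) {
  const auto first = name.find(':');
  if (first == std::string::npos) return std::nullopt;
  const auto second = name.find(':', first + 1);
  if (second == std::string::npos) return std::nullopt;

  try {
    WorkflowEntry entry;
    entry.when = std::stoull(name.substr(0, first), nullptr, 10);
    entry.fid =
        std::stoull(name.substr(first + 1, second - first - 1), nullptr, 16);
    entry.event = name.substr(second + 1);
    return entry;
  } catch (const std::exception&) {
    return std::nullopt;  // non-numeric timestamp or file id
  }
}
```

The fid recovered this way can then be compared with the fxid shown in the log before running attr ls on the matching entry.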

Cheers Andreas.
