Shouldn’t one of the stripes for the file also be located on fst-2 ?
Not necessarily, you have 7 stripes on 7 different machines.
If you still have the logs from the day you converted the file MD from LevelDB to extended attributes, do you see any errors related to the conversion? Anything related to FmdConverter or FmdHandler?
Anyway,
Here are my recommendations to improve the situation.
I would suggest downgrading the two problematic FSTs, deleting the .eosattrconverted file under each /srv/data/data.??/ directory, and restarting the FST to re-trigger the conversion. Maybe try this on one FST before doing the other.
You can check whether conversions are happening by running: cat /var/log/eos/fst/xrdlog.fst | grep "conversion done" | cut -c '1-12' |sort | uniq -c
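The marker-file step can be sketched as a small helper with a dry-run mode, so you can see what would be deleted before touching anything. This is only an illustration: the function name `remove_markers` and its two arguments are made up here, and only the marker name and the directory pattern come from the post above. Stop the FST before removing the markers for real.

```shell
# Sketch: list (default) or remove the conversion marker in each data directory.
# remove_markers and its arguments are hypothetical helpers, not EOS tooling.
#   $1 = data root (default: /srv/data)
#   $2 = "yes" for a dry run (default), anything else to really delete
remove_markers() {
    data_root="${1:-/srv/data}"
    dry_run="${2:-yes}"
    for marker in "$data_root"/data.??/.eosattrconverted; do
        # Skip when the glob matched nothing or the file is already gone
        [ -e "$marker" ] || continue
        if [ "$dry_run" = "yes" ]; then
            echo "would remove: $marker"
        else
            rm -f "$marker" && echo "removed: $marker"
        fi
    done
}
```

Calling `remove_markers /srv/data` only prints the candidates; `remove_markers /srv/data no` deletes them.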
Cheers,
Cedric
Hi Cedric,
As the missing xattrs were on 0-byte files, I’ve stopped the FST, removed all of those files (mostly created after the update to 5.2.0), and started the FST again. It’s now booting all filesystems correctly. However, the segfault issue remains on this one.
As we now see the “previous” behaviour, I’ll do the same on fst-5.
My impression is that the segfaulting fst might cause 0-byte files to be written without xattrs (and without EC header) which might cause the FST to not come up at all after a crash (i.e. when it encounters the file with the missing attributes).
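For anyone wanting to look for such files before a restart, here is a rough sketch. It assumes the FST metadata lives in the `user.*` xattr namespace (which is what `getfattr -d` dumps by default) and that `getfattr` from the attr package is installed; both function names are made up for this example.

```shell
# List zero-byte regular files below a directory.
zero_byte_files() {
    find "$1" -type f -size 0 -print
}

# Return 0 when the file carries no user.* extended attributes.
# Returns 2 when getfattr is not installed, so a missing tool is not
# mistaken for "no attributes". The user.* namespace is an assumption.
lacks_user_xattrs() {
    command -v getfattr >/dev/null 2>&1 || return 2
    [ -z "$(getfattr -d "$1" 2>/dev/null)" ]
}
```

A possible way to combine them on one filesystem: `zero_byte_files /srv/data/data.01 | while read -r f; do lacks_user_xattrs "$f" && echo "$f"; done`. This only reports candidates, it does not delete anything.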
All our FSTs were fully converted previously, and had the .eosconvert marker file written. I’ve checked this before I started the update.
Best,
Erich
Hi Erich,
Thanks a lot for your explanations. I think your assumption is correct. I really need to find why your FSTs are segfaulting…
We have another observation by @useren which might be related to the segfault:
On fst-1, which is currently not crashing, this log message repeats once every ~15 min:
231024 13:02:22 4450 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 13:17:23 4450 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 13:32:23 4450 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 13:47:24 4450 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:02:24 4450 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:17:24 4450 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:32:25 4450 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:47:25 4450 TPC_TempCA: Reloading the list of CAs and CRLs in directory
On fst-2, however, which is crashing regularly, we see many more instances of this message. In the last 2 hours alone we see almost 800:
[root@fst-2 ~]# egrep -c '231024 1[34]:..:.. .* TPC_TempCA: Reloading the list of CAs and CRLs in directory' /var/log/eos/fst/xrdlog.fst
765
This is multiple times per minute:
231024 14:57:02 28385 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:57:09 28690 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:57:17 28938 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:57:24 29185 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:57:32 29434 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:57:39 29684 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:57:47 29936 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:57:54 30183 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:58:02 30431 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:58:09 30731 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:58:17 30985 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:58:24 31235 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:58:32 31480 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:58:39 31728 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:58:47 31976 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:58:54 32227 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:59:02 32478 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:59:09 32780 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:59:17 33030 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:59:24 33280 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:59:32 33532 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:59:39 33779 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:59:47 34032 TPC_TempCA: Reloading the list of CAs and CRLs in directory
231024 14:59:54 34282 TPC_TempCA: Reloading the list of CAs and CRLs in directory
Let us know if there is any more information that we can provide,
Best,
Erich
That’s very weird indeed… Can you add the following directive to the end of your xrd.tlsca config line? refresh 15m
EDIT: Actually, it would not change anything. If the reloading thread fails, it retries 10 seconds later, and no parameter can override this.
I looked at the logs Uemit provided me earlier, but unfortunately I don’t see this log message repeating there. Do you think you can give me access to your machine so I can try to debug on it?
Thanks again!
Sure, I’ll follow up by PM.
Hi all,
Just a heads-up on that issue and how it got solved.
The error came from the configuration of the FST. One should use the libraries that come with eos-xrootd (located in /opt/eos/xrootd/...), not a mix of the vanilla xrootd libraries and the eos-xrootd ones.
Replacing
xrd.protocol XrdHttp:9001 /usr/lib64/libXrdHttp.so
http.exthandler xrdtpc /usr/lib64/libXrdHttpTPC.so
By
xrd.protocol XrdHttp:9001 libXrdHttp.so
http.exthandler xrdtpc libXrdHttpTPC.so
solved the issue (as /etc/sysconfig/eos_env properly sets the LD_LIBRARY_PATH variable, allowing xrootd to look into /opt/eos/xrootd/ first before trying to load the libraries from /usr/bin...).
Thanks Elvin for helping out, and thanks Uemit, Erich and everyone involved for giving me access to your machines.
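If you want to check your own FST config for the same mix-up, something along these lines may help. It is only a sketch: the function name is made up, it only looks at the two directives shown above, and it treats the third field of the line as the library path, as in those examples.

```shell
# Print xrd.protocol / http.exthandler lines that load a library via an
# absolute path outside the eos-xrootd prefix.
#   $1 = config file
#   $2 = allowed prefix (default: /opt/eos/xrootd/)
absolute_lib_lines() {
    cfg="$1"
    prefix="${2:-/opt/eos/xrootd/}"
    awk -v prefix="$prefix" '
        ($1 == "xrd.protocol" || $1 == "http.exthandler") {
            lib = $3
            # Flag absolute paths that do not start with the allowed prefix
            if (substr(lib, 1, 1) == "/" && index(lib, prefix) != 1)
                print FILENAME ":" FNR ": " $0
        }' "$cfg"
}
```

For example `absolute_lib_lines /etc/xrd.cf.fst`, where the config path is an assumption and should be adjusted to your setup. No output means neither directive pins a library outside /opt/eos/xrootd/.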