1 Jun 12:56 2010
Re: BibTeXU
Taco Hoekwater <taco <at> elvenkind.com>
2010-06-01 10:56:02 GMT
Karl Berry wrote:
> I gather this is what used to be bibtex8 in former years?
>
> bibtexu is/was a project by Yannis (and a student or two) to use the ICU
> library with BibTeX. Peter also put in the massive efforts needed to
> make this work in the TL build system and have bibtexu and xetex use the
> same ICU library

Bibtexu does not actually seem to work all that well, or at least it
has some quirks on my linux 64 box. I experimented a bit because it
sounds promising. Long email follows.

I created a small test.aux file with just this in it:

  \citation{*}
  \bibstyle{plain}
  \bibdata{xampl}

At first it complained that it could not find '88591lat.csf'. This is
probably just a packaging error: as it stands, the bibtexu package
should depend on bibtex8 (or the files have to be moved to the bibtexu
package, I do not know whether bibtex8 needs them).

I installed bibtex8, and that took care of that. But then, I got this:

  [taco@ntg tmp]$ bibtexu test
  The 8-bit codepage and sorting file: 88591lat.csf
  The top-level auxiliary file: test.aux
  The style file: plain.bst
  Database file #1: xampl.bib
  Terminated

I killed it after about five minutes, and by then it had used 2 minutes
CPU time, Resident size was 1G, and Virtual size 2.3G (and growing).
valgrind gives about a gazillion 'Conditional jump or move depends on
uninitialised value(s)' messages.

It seems \citation{*} is causing this trouble, because a test without
it runs fine (changed to \citation{article-full}).

Having found a working solution, now I wanted to see about that 'u' at
the end of the program name. Big disappointment there: from the
documentation in 'source', the 'u' apparently stands for 'Unified' or
so and at first glance it has nothing to do with Unicode at all.
(I could have stopped there because to me there would be little point
to a drop-in replacement of bibtex8).

Nevertheless, the line:

  The 8-bit codepage and sorting file: 88591lat.csf

gave the impression that that csf file is configurable.
00readme.txt from the source says there should be a command line
option:

  -c --csfile FILE

but this option does not work, nor is it listed in the -h output: I get
the help text echoed back at me (there are more options listed in
00readme.txt that do exist, but I am not in the mood to list them all).

The 00readme.txt from the source says you can set an environment
variable (BIBTEX_CSFILE), so I tried that:

  [taco@ntg tmp]$ env BIBTEX_CSFILE=cp47lat.csf bibtexu xampl-latex
  The 8-bit codepage and sorting file: 88591lat.csf
  The top-level auxiliary file: xampl-latex.aux
  The style file: plain.bst
  Database file #1: xampl.bib

Didn't work. Continuing on, it turns out that kpsewhich cannot find
cp47lat.csf either, so I tried an absolute path:

  [taco@ntg tmp]$ env BIBTEX_CSFILE=/home/taco/texlive/2010/texmf-dist/bibtex/csf/base/cp437lat.csf bibtexu xampl-latex
  The 8-bit codepage and sorting file: 88591lat.csf
  The top-level auxiliary file: xampl-latex.aux
  The style file: plain.bst
  Database file #1: xampl.bib

Doesn't work either. Then I remembered having seen a debug switch,
--debug=search:

  [taco@ntg tmp]$ env BIBTEX_CSFILE=cp437lat.csf bibtexu --debug=search xampl-latex
  The 8-bit codepage and sorting file: 88591lat.csf
  The top-level auxiliary file: xampl-latex.aux
  The style file: plain.bst
  Database file #1: xampl.bib

Also doesn't seem to do anything. Unfazed, I tried --debug=all:

  [taco@ntg tmp]$ env BIBTEX_CSFILE=cp437lat.csf bibtexu --debug=all xampl-latex

Lots of output this time, but _nothing_ related to file searching.

Now I could have given up, but then I realized that perhaps the u in
bibtexu is about *input*, not output or whatever is implied by
'8-bit codepage and sorting file'. So I created a copy of xampl.bib
and changed Aamport to "Aaämport", saved as UTF-8, and ran:

  [taco@ntg tmp]$ bibtexu xampl-latex

Much to my surprise, the output is UTF-8! That is exactly what I
wanted, but what is all this talk about 8-bit csf files about then?
I don't understand that at all.

Never mind, now for the real experiment (this is where old bibtex
fails):

  \citation{article-full}
  \bibdata{xampl-utf}
  \bibstyle{alpha}

The "Aaämport" above makes bibtex and bibtex8 generate invalid UTF-8
output in this case, because they take the first 3 bytes of the surname
instead of the first 3 sequences (an important difference in UTF-8).

Here is what happens:

  [taco@ntg tmp]$ bibtexu xampl-latex
  The 8-bit codepage and sorting file: 88591lat.csf
  The top-level auxiliary file: xampl-latex.aux
  The style file: alpha.bst
  Database file #1: xampl-utf.bib
  6there is a error: U_ZERO_ERROR[taco@ntg tmp]$

It reports an error, but it *did* generate a bbl file, and the content
of that is correct UTF-8:

  \bibitem[Aaä86]{article-full}

Then I tried

  "The ḠṈÄȚŜ and Gnus Document Preparation System"

Output UTF-8:

  "The ḡṉäțŝ and gnus document preparation system"

It does work after all! This now makes me believe that all this talk
about csf files is just a bit of leftover noise that does not actually
mean anything.

So what about that U_ZERO_ERROR report then? No idea. It happens once
for each \citation in the 'alpha' style (as well as in the cont-xx.bst
styles) but it seems harmless.

In the end, what is left is the \citation{*} bug, and a lot of obsolete
documentation, I think. (And it took me three hours to figure this
out.)

Best wishes,
Taco
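
P.S. A small illustration of the "first 3 bytes" versus "first 3
sequences" point above, as a Python sketch. This is not how bibtex,
bibtex8 or alpha.bst are implemented; it only shows the general
effect, and the variable names and the helper is_valid_utf8 are made
up for the example. It also assumes the 'ä' is stored as the single
code point U+00E4.

```python
# Illustration only: cutting a UTF-8 string at a byte count can split a
# multi-byte character, while cutting at a code point count cannot.
surname = "Aa\u00e4mport"            # "Aaämport", with 'ä' as U+00E4

encoded = surname.encode("utf-8")    # 'ä' occupies two bytes: 0xC3 0xA4

first_three_bytes = encoded[:3]      # stops in the middle of 'ä'
first_three_chars = surname[:3]      # Python slices str by code point


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(first_three_bytes, is_valid_utf8(first_three_bytes))
print(first_three_chars, is_valid_utf8(first_three_chars.encode("utf-8")))
```

The byte-based cut keeps only the lead byte 0xC3 of the 'ä', which is
why the resulting label is not valid UTF-8, whereas the cut by code
point gives the "Aaä" seen in the \bibitem above.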