Where in the Stack?
For one of my projects – not KDE-related – I have a parser, written in YACC / bison.
To build the project, the bison grammar needs to be compiled (by the bison command) to C, and then
the C can be compiled to the final executable. On my workstation, the bison step would fail when
the build was run one-process-at-a-time in KDE konsole.
Workarounds were really weird: build with a -j flag to build with more processes at once, or pipe the build-output to cat, or run the build in xterm instead of konsole. So where is the bug?
In konsole, in bison, or in something underneath? It’s definitely something to do with the terminal emulator:
here’s a screenshot of bison compiling a sample file successfully in xterm, and
crashing in konsole, roxterm and alacritty. It crashes in cool-retro-term as well, but leaves the
terminal itself in a messed-up state. It also crashes on the FreeBSD text console.
The output from bison, once piped to od -c (and then it does not crash), is intriguing. Here’s a part of the output:
0000120 ** ** , u s e ‘ ** ** % d e f i
0000140 n e a p i . p u r e ’ ** ** [
0000160 033 [ 3 5 m 033 ] 8 ; i d = 5 b 1 f
0000200 0 f d 7 0 0 0 5 d 1 7 5 2 e 1 f
0000220 4 2 3 b 0 0 0 0 0 0 0 0 ; h t t
0000240 p s : / / w w w . g n u . o r g
This is colorized output, which uses the terminal-control sequences for xterm – there is a big long list of them – to manipulate colors and styles. Most terminal emulators understand these control sequences.
Huh, there’s a URL there. It doesn’t display in the terminal, though.
URL Highlighting in Konsole
Konsole has a neat feature where if you display a URL, it can be highlighted, and you can click on it from konsole to open the URL. That looks like the screenshot here; if it looks like a URL, it gets underlined when you mouse over it, and you can ctrl-click to open.
There is a terminal-control sequence to hide a URL within terminal output. This uses the ESC ] 8 ; sequence (it probably has a name, but I haven’t been able to find it). By default, this is ignored, as a security measure: clicking on text in your text terminal should not invoke web browsers on invisible URLs. Konsole can turn the feature on, though: go to Settings -> Edit Current Profile -> Mouse -> Miscellaneous tab, then tick the box Allow escape sequences for links. Read the WARNING that is there, because it’s on-point.
You can use the printf command to construct this kind of terminal-control sequence. From the shell, something like this:
$ printf '\e]8;;https://kde.org/\e\\GNOME\e]8;;\e\\\n'
GNOME
Notice how the URL isn’t shown (it is the terminal emulator that handles that), but the text is. Notice, too, how the URL doesn’t match the text: that is what the security warning is all about.
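The same sequence can be built from C. The following is a minimal sketch that assembles the whole thing in a buffer. format_hyperlink() is a hypothetical helper, not code from konsole, bison or libtextstyle. It wraps the text in ESC ] 8 ; params ; URL ESC \ and closes the link with ESC ] 8 ; ; ESC \, just like the printf command. Passing an empty params string gives exactly the bytes of the shell example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: wraps `text` in an ESC ] 8 hyperlink to `url`.
 * `params` is the (possibly empty) parameter field, e.g. "" or "id=x".
 * Returns what snprintf() returns: the length the full sequence needs,
 * so a result >= size means the output was truncated. */
int format_hyperlink(char *buf, size_t size, const char *params,
                     const char *url, const char *text)
{
    /* ESC ] 8 ; params ; URL ESC \   text   ESC ] 8 ; ; ESC \ */
    return snprintf(buf, size, "\033]8;%s;%s\033\\%s\033]8;;\033\\",
                    params, url, text);
}
```

Writing the buffer produced by format_hyperlink(buf, sizeof buf, "", "https://kde.org/", "GNOME") to the terminal should behave like the printf one-liner. Whether the link is clickable is still up to the terminal emulator and its settings.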
This is actually pretty cool: bison prints a warning, and links from the warning message (which mentions a specific command-line flag) to the documentation which explains what the flag is for and provides additional context. It makes the warning message a bit more useful – if, and only if, that security-sensitive setting is on. (Which makes this cool feature from bison a lot less useful in most settings .. maybe GNOME terminal has a similar setting, but switched on by default).
OK, but looking at the bison output more closely, there’s more than just a URL there: there is an id= part as well.
Let’s add that to the printf command:
$ printf '\e]8;id=x;https://kde.org/\e\\GNOME\e]8;;\e\\\n'
GNOME
Up to, and including, KDE Gear 21.08.3 (the latest release as of this writing), konsole eats the terminal-control sequence and displays only the text. But now, after adding that id=x part, the URL is no longer recognized, and you can’t click on it.
I fixed this in konsole merge request #535. Konsole assumed that that id-part was empty, when clearly it isn’t (or at least, bison thinks it is reasonable to output something there).
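This is not the konsole patch, and konsole itself is C++. It is a minimal C sketch of the parsing rule the fix is about. In ESC ] 8 ; params ; URI, the params field may be non-empty (for example id=x), so the URI only starts after the second semicolon. split_osc8() is a hypothetical name. It assumes it receives the bytes between ESC ] and the terminating ESC \, and that the params field never contains a semicolon.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parser for an ESC ] 8 payload such as "8;id=x;https://kde.org/".
 * On success, points *params and *uri into `payload` and returns 0.
 * Assumes the params field contains no ';', so everything after the
 * second ';' is the URI (which may itself contain ';'). */
int split_osc8(const char *payload,
               const char **params, size_t *params_len,
               const char **uri)
{
    if (strncmp(payload, "8;", 2) != 0)
        return -1;                        /* not a hyperlink sequence */
    const char *p = payload + 2;          /* start of the params field */
    const char *semi = strchr(p, ';');    /* end of the params field */
    if (semi == NULL)
        return -1;                        /* malformed: no URI field */
    *params = p;
    *params_len = (size_t)(semi - p);     /* 0 for "8;;URI" */
    *uri = semi + 1;                      /* "" for the closing "8;;" */
    return 0;
}
```

A parser that assumes the params field is always empty would take "id=x;https://kde.org/" as the URI. That is one way a link can silently stop being recognized.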
Bug, squashed! But .. bison still crashes.
Bison Warnings
I rebuilt bison with debugging enabled. Using gdb, I found that the crash was invoked from warnings_print_categories(). The code looks like this:
const char *warning = argmatch_warning_argument (&w);
char ref[200];
snprintf (ref, sizeof ref,
"%s#W%s", diagnostics_url, warning);
begin_hyperlink (out, ref);
ostream_printf (errstream,
"-W%s%s",
s == severity_error ? "error=" : "",
warning);
end_hyperlink (out);
Bison is Free Software, copyrighted by the Free Software Foundation. The code above is licensed under the GPL Version 3.
I can see what it’s doing here: composing a URL and sending that to begin_hyperlink(), then printing the name of the warning itself, then calling end_hyperlink().
Unpacking each of those functions takes me to GNU libtextstyle, which is part of GNU gettext. Bison, it turns out, isn’t doing anything strange, but something is going wrong lower down in the stack.
Libtextstyle Code
Since messing around with all of bison at once was a bit inconvenient, and with the code from warnings_print_categories() as inspiration, I came up with this bit of code (and some boilerplate):
styled_ostream_set_hyperlink (errstream, argc > 1 ? argv[1] : NULL, argc > 2 ? argv[2] : NULL);
ostream_printf (errstream, "hello world");
styled_ostream_set_hyperlink (errstream, NULL, NULL);
ostream_flush (errstream, FLUSH_THIS_STREAM);
This allows me to test calls into libtextstyle with different hyperlink formats, URLs, etc. In the meantime I built the library with debugging support. Here is a typical crash:
#0 0x0000000000000000 in ?? ()
#1 0x000000080038230a in delay_output_sp (sp=0x0, ms=<optimized out>)
at /usr/src/src/contrib/ncurses/ncurses/tinfo/lib_tputs.c:104
#2 0x0000000800382b81 in delay_output (ms=0)
at /usr/src/src/contrib/ncurses/ncurses/tinfo/lib_tputs.c:116
#3 tputs_sp (sp=<optimized out>, sp@entry=0x7fffffffbf88,
string=0x800a08083 "", string@entry=0x800a08080 "57b",
affcnt=affcnt@entry=1, outc=<optimized out>)
at /usr/src/src/contrib/ncurses/ncurses/tinfo/lib_tputs.c:422
#4 0x0000000800382cfb in tputs (string=0x800a08080 "57b", affcnt=1,
outc=0x800278c40 <out_char>)
at /usr/src/src/contrib/ncurses/ncurses/tinfo/lib_tputs.c:444
#5 0x0000000800278bb0 in out_hyperlink_change (stream=0x800a0d000,
new_hyperlink=0x800a790c0, async_safe=false) at term-ostream.oo.c:1586
#6 0x000000080027979c in out_attr_change (stream=0x800a0d000, new_attr=...)
at term-ostream.oo.c:1737
#7 0x0000000800278f3b in output_buffer (stream=0x800a0d000, goal_attr=...)
at term-ostream.oo.c:1906
#8 0x000000080027647a in term_ostream__flush (stream=0x800a0d000,
scope=FLUSH_THIS_STREAM) at term-ostream.oo.c:2052
#9 0x0000000800276f0b in term_ostream_flush (first_arg=0x800a0d000,
scope=FLUSH_THIS_STREAM) at term-ostream.c:2729
In this particular case, I have set the ID to “57b”. Why? Well, I originally used the same ID as bison (see the octal dump, above). That crashes, so I had a reproducer but I wanted to cut down the size of what was being printed.
This test-code crashes in konsole. Crashes in alacritty. Does not crash in xterm or urxvt.
In any case, you can see that libtextstyle, in the function out_hyperlink_change(), is calling the ncurses function tputs() with a user-supplied string. The code is here in git as of this writing, but it looks like so:
tputs ("\033]8;id=", 1, out_ch);
tputs (new_hyperlink->real_id, 1, out_ch);
tputs (";", 1, out_ch);
tputs (new_hyperlink->ref, 1, out_ch);
tputs ("\033\\", 1, out_ch);
GNU libtextstyle is Free Software, copyrighted by the Free Software Foundation. The code above is licensed under the GPL Version 3.
Here tputs() is being called to send part of an escape code, and then again with the ID (“57b”), then a semicolon, etc.
Actual documentation of tputs() is rather hard to come by, but manpages are fairly consistent in their wording of what parameters there are for the function. Here’s a quote from the OpenBSD manpage for tputs:
The tputs routine applies padding information to the string str and outputs it. The str must be a terminfo string variable or the return value from tparm, tgetstr, or tgoto. affcnt is the number of lines affected, or 1 if not applicable.
The link ID is not a terminfo string variable, nor is it the return value of any of those three functions. Neither are the partial escape sequence, the single semicolon, or the other two strings.
I’m going to conclude that libtextstyle is Doing It Wrong.
In nearly all cases, there is no problem with this! What tputs() does is call out_ch() repeatedly, so in the end the string gets written.
Anyway, let’s pipe the output to od -c again to see what’s up:
$ ./example2 2>&1 | od -c
0000000 e x a m p l e . c : 033 ] 8 ; i
0000020 d = b ; h t t p : / / e x a m p
0000040 l e . c o m / 033 \ h e l l o w
0000060 o r l d 033 ] 8 ; ; 033 \
The ID is supposed to be “57b” (see the stacktrace). Where has it gone in output? Only the “b” is left.
A little experimentation shows me that whenever the ID starts with digits, they disappear, and the example program crashes (in konsole, in alacritty, … but not in xterm or urxvt). If it starts with anything else, it’s fine.
At this point I wrote a patch for libtextstyle to write the strings directly, instead of via tputs(). That seemed like the right thing to do, given the wording of the manpage, and it solved my original problem of bison crashing. I’ve got the patch lined up to apply to FreeBSD ports, but I don’t know how to go about submitting it to GNU gettext or GNU libtextstyle.
Fix submitted to the FreeBSD ports tree as PR 260016. I don’t know when (or if) that will land.
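The idea of "write the strings directly" can be sketched without any ncurses at all. This is a hypothetical illustration, not the actual patch. write_literal() sends each character through the same kind of int (*)(int) callback that tputs() uses, but skips all padding and delay parsing, so leading digits survive. The capture buffer exists only to make the example observable.

```c
#include <stddef.h>

/* Hypothetical helper (not the actual libtextstyle patch): output a literal
 * string through a per-character callback, as tputs() would, but without
 * interpreting leading digits or $<...> as padding/delay specifiers. */
void write_literal(const char *s, int (*outc)(int))
{
    for (; *s != '\0'; s++)
        outc((unsigned char)*s);
}

/* Example callback for demonstration: appends characters to a buffer. */
static char captured[256];
static size_t captured_len;

static int capture_char(int c)
{
    if (captured_len + 1 < sizeof captured) {
        captured[captured_len++] = (char)c;
        captured[captured_len] = '\0';
    }
    return c;
}
```

Calling write_literal("57b", capture_char) leaves "57b" in the buffer, with the digits intact.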
Bug, squashed! But .. why was it crashing in some terminal emulators, and why is the ID being mangled?
ncurses code
I wrote another example program. It Does It Wrong, intentionally, by calling tputs() with a user-provided string, like so:
tputs("8675309jenny", 1, out_ch);
On FreeBSD, in all the terminals I tried, this outputs “jenny” without her number. It crashes in the usual suspects. So what’s going on here? The backtrace gives me a line number, so let’s take a look at the implementation of the function in the FreeBSD git repo. The link goes to the top of the function, but scrolling down a little finds this gem:
#if BSD_TPUTS
/*
* This ugly kluge deals with the fact that some ancient BSD programs
* (like nethack) actually do the likes of tputs("50") to get delays.
*/
ncurses is Free Software, copyrighted by the Free Software Foundation and Thomas E. Dickey. The code above is licensed under the MIT license.
Well, I am on a BSD, so presumably that define is set, and it does something special with leading digits? Yes indeed: leading digits are consumed and treated as if they were in a $<> block (a not-particularly-well-documented feature where tputs() interprets some parts of the string it is outputting as delay-specifiers, from back when you had a real vt100 on a 2400 bps serial line).
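To make the kluge concrete, here is a simplified, hypothetical model in C. It was written for illustration and is not taken from ncurses. It only handles the leading-digits case and ignores anything else the real code does with the rest of the string.

```c
#include <ctype.h>

/* Simplified model of the BSD_TPUTS kluge (not the ncurses source):
 * leading decimal digits are consumed and interpreted as a delay in
 * milliseconds; only the remainder of the string would be printed.
 * No overflow check: fine for short demo strings like Jenny's number. */
const char *consume_bsd_delay(const char *s, long *delay_ms)
{
    long ms = 0;
    while (isdigit((unsigned char)*s)) {
        ms = ms * 10 + (*s - '0');
        s++;
    }
    *delay_ms = ms;
    return s;   /* first non-digit character */
}
```

With "8675309jenny" the model reports a delay of 8675309 and leaves "jenny". With "57b" it leaves just "b", which is what the od -c output showed.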
So there we have it: inside the ncurses library, Jenny’s number is consumed and turned into a delay. This is going to call delay_output(), intending to delay for 8675309 ms. That’s a little over 140 minutes, enough to let Tommy Tutone play the song thirty times! There are plenty of decent covers, but it’s still going to get pretty repetitive.
This reinforces my belief that libtextstyle is doing it wrong, in a context where the ncurses library has this kludge.
So we now know why the numbers are being eaten, but not at all why the application then crashes in some terminals and not in others.
Let’s turn back to gdb to step through the code inside ncurses.
(gdb) next
99 NCURSES_SP_OUTC my_outch = GetOutCh();
(gdb) next
102 nullcount = (ms * _nc_baudrate(ospeed)) / (BAUDBYTE * 1000);
(gdb) print my_outch
$1 = (NCURSES_OUTC_sp) 0x0
(gdb) next
103 for (_nc_nulls_sent += nullcount; nullcount > 0; nullcount--)
(gdb) print nullcount
$2 = 186509
So there are two things of note here. First, my_outch is a function pointer for printing delays, and something hasn’t been set up properly: the function pointer is NULL, and we’re going to have a bad time when calling it. Second, we’ll call it 186509 times to output those two hours of delays. (In retrospect, seeing the 0x0 address in the stack trace should have tipped me off much earlier, but I was initially chasing this bug as a konsole issue, so I went diving from the top of the application stack.)
Stepping through the same code inside xterm gives me slightly different results:
(gdb) print my_outch
$1 = (NCURSES_OUTC_sp) 0x0
(gdb) next
103 for (_nc_nulls_sent += nullcount; nullcount > 0; nullcount--)
(gdb) print nullcount
$2 = -963
Same bad function pointer, but we’ll call it -963 times. That is convenient,
since the for loop is never entered and the bad pointer isn’t called.
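Both nullcount values can be reproduced with a little arithmetic, under some stated assumptions:

- ms is 8675309 (Jenny’s number, from the test program).
- The baud rate is 9600 under konsole and -1 under xterm. The -1 is presumably an error return from the baud-rate lookup.
- BAUDBYTE is 9.
- The product ms * baudrate is computed in a 32-bit int and wraps around.

Under those assumptions, the sketch below gives exactly 186509 and -963. That match is why I believe the wraparound is what happened, but it is a model, not the ncurses code.

```c
#include <stdint.h>

/* Assumption: BAUDBYTE is 9; this value reproduces the numbers seen in gdb. */
#define BAUDBYTE 9

/* Model of  nullcount = (ms * baudrate) / (BAUDBYTE * 1000)
 * with ms * baudrate wrapping around in 32 bits. Signed overflow is
 * undefined in C, so the product is computed in 64 bits and the low
 * 32 bits are kept explicitly (the uint32_t -> int32_t step is
 * implementation-defined, two's-complement on mainstream compilers). */
int32_t model_nullcount(int32_t ms, int32_t baudrate)
{
    int64_t product = (int64_t)ms * baudrate;
    int32_t wrapped = (int32_t)(uint32_t)product;
    return wrapped / (BAUDBYTE * 1000);   /* truncates toward zero */
}
```

At 9600 baud the true product, 83282966400, does not fit in 32 bits. It wraps to 1678587776, and dividing by 9000 gives 186509. With a baud rate of -1 the product is just -8675309, which truncates to -963.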
Looking back at the calculation of nullcount here, I checked the value of ospeed when running under konsole: 9600, which is slow for a terminal but not unreasonable. Under xterm, it says -27136. Um .. so xterm is avoiding the problem by having an uninitialized (or otherwise garbage) ospeed value, and it’s nothing special about the terminal at all.
Well! Good thing we have the ol’ stty command, which we can use to manipulate speeds. We can use stty speed 9600 for regular-speed terminals, or stty speed 38400 for fast-fast terminals. Heck, I even have a 56k modem, which was fast-fast-fast! That’s bits per second, kids, from back when you could beatbox the sound of a modem syncing and doing carrier detection.
When, in konsole, I run stty speed 9600 and then the sample application, it crashes. Crank up the speed with stty speed 38400, and it’s fine. Checking in xterm, the default speed that xterm has set is 38400. Turning it down a notch: the example application crashes. Turning it back to 38400: it’s fine.
Say, -27136, that’s the 16-bit two’s complement value of 38400 (which overflows a 16-bit signed integer). Taking a look at the definition of the ospeed variable, which lives in /usr/local/termcap.h, shows me this:
#undef NCURSES_OSPEED
#define NCURSES_OSPEED short
extern NCURSES_EXPORT_VAR(NCURSES_OSPEED) ospeed;
And yes, on my platform short is a 16-bit integer.
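The arithmetic behind -27136, assuming a 16-bit two’s-complement short:

```latex
\begin{aligned}
38400 &> 2^{15} - 1 = 32767 && \text{(too big for a signed 16-bit short)}\\
38400 - 2^{16} &= 38400 - 65536 = -27136 && \text{(the same 16 bits, read as signed)}\\
-27136 + 2^{16} &= 38400 && \text{(read back as unsigned 16 bits)}
\end{aligned}
```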
Let’s summarize:
- bison uses libtextstyle,
- libtextstyle calls some ncurses functions and, it can be argued, is Doing It Wrong,
- but ncurses then uses a NULL function pointer, and that’s not libtextstyle’s fault,
- except when the terminal baud rate overflows a signed 16 bit integer.
Bug, understood but not squashed. How are things elsewhere?
Looking at Linux
You know, debuginfo is pretty darn useful, and I’m glad openSUSE’s gdb downloads it automatically. It makes running my example programs much less complicated (I know how to wrangle all the bits on FreeBSD, but not on Linux).
None of the example programs crashed. Running with gdb and setting a breakpoint on delay_output_sp() showed me the following:
- no_pad_char is set, so the problematic code path isn’t touched at all,
- ospeed has a value that is not directly the baud rate on the tty (9600 baud yields an ospeed of 13, 38400 baud yields 15, 115200 baud yields 4098).
Looking more closely at the ospeed values, this looks like a mapping of baud rates to internal “enum” values. There are constants defined in termios.h. These have names like B38400, the internal value for 38400 baud. On Linux, the internal value is 017, which (it’s octal!) means fifteen.
There is a function in the ncurses library that maps internal values (e.g. B38400) to actual baud rates (38400 in the case of B38400). On Linux, that maps internal value 15 to 38400. Then, in the calculation of how long to delay, it uses 38400 bits-per-second to figure out how many padding characters to send. It doesn’t hit that code path, but if it did, it would be calculating with the right speed.
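A table in the spirit of the ncurses one can be written against the standard termios.h constants. This is a minimal sketch with a handful of entries, and the names are hypothetical. Unlike ncurses, it stores the codes as speed_t instead of squeezing them into a short. As a result, the lookup works whether the codes are small numbers (Linux) or the baud rates themselves (FreeBSD).

```c
#include <stddef.h>
#include <termios.h>

/* Hypothetical table: termios speed code -> bits per second.
 * On Linux B38400 is 017; on FreeBSD it is 38400 itself. */
struct speed_entry {
    speed_t code;   /* value as returned by cfgetospeed() */
    long    bps;    /* actual speed in bits per second */
};

#define SPEED(number) { B##number, number }

static const struct speed_entry speed_table[] = {
    SPEED(9600),
    SPEED(38400),
    SPEED(115200),
};

/* Returns the baud rate for a termios speed code, or -1 if unknown. */
long speed_code_to_bps(speed_t code)
{
    for (size_t i = 0; i < sizeof speed_table / sizeof speed_table[0]; i++)
        if (speed_table[i].code == code)
            return speed_table[i].bps;
    return -1;
}
```

On a real terminal, the code would come from cfgetospeed() applied to the struct termios filled in by tcgetattr().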
Back to FreeBSD
On the FreeBSD side, the internal values are the same as the baud rates, e.g. B38400 is defined as 38400 rather than 017. This gives us an interesting tour of ncurses internals and C types, because we have a table of speeds, with entries that get hammered to the NCURSES_OSPEED type (short, i.e. signed short).
struct speed {
int given_speed; /* values for 'ospeed' */
int actual_speed; /* the actual speed */
};
#define DATA(number) { (NCURSES_OSPEED)B##number, number }
static struct speed const speeds[] =
{
DATA(38400),
};
There’s even a comment in the source about older FreeBSD versions that had
old-style versions of the defines, and how those do fit into a short.
In any case, we end up initializing an element of the speeds array with B38400, cast to short, then promoted to integer. Because 38400 overflows the short, it becomes a negative value, which is promoted to a negative integer: -27136.
When looking up internal values (15 on Linux, 38400 on FreeBSD), ncurses goes through that table, matching the given speed against the speed value obtained from the terminal (as an internal value, in a short, passed as an int). This happens in the function _nc_baudrate().
So we’re going to be comparing -27136 with values in the table, right? Not so fast: the authors know about overflow, and so there’s specific code in the library:
if (OSpeed < 0)
OSpeed = (unsigned short) OSpeed;
So here we take the passed-in value (-27136) and mash it to an unsigned short, which in 16 unsigned bits comes out to 38400. Then we promote that to int again and assign it back to the variable OSpeed. So we are now looking for a speed value of 38400.
The table contains -27136, and the comparison is done with (signed) integers, not shorts. Since those two values differ by 65536 (I am assuming that ints are wider than 16 bits), all the comparisons fail. For any internal value that exceeds 32767, the table lookup will always fail.
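The failing lookup can be reproduced in a few lines. This is a hypothetical model, not the ncurses source. It assumes FreeBSD-style codes, where the code is the baud rate, and a 16-bit two’s-complement short. It mimics the DATA() cast to short, then applies the unsigned-short "repair" to the incoming value before comparing as ints.

```c
#include <stddef.h>

/* Hypothetical model of the FreeBSD case: speed codes are the baud rates,
 * and each table entry is squeezed through a signed 16-bit short, like
 * DATA(number) { (NCURSES_OSPEED)B##number, number }. */
struct fbsd_speed { int given_speed; int actual_speed; };

#define FBSD_DATA(number) { (short)(number), number }

static const struct fbsd_speed fbsd_speeds[] = {
    FBSD_DATA(9600),    /* (short)9600  == 9600 */
    FBSD_DATA(38400),   /* (short)38400 == -27136 with a 16-bit short */
};

/* Mirrors the lookup described above: a negative ospeed is first
 * "repaired" through unsigned short, then compared as an int. */
int model_baudrate(int ospeed)
{
    if (ospeed < 0)
        ospeed = (unsigned short)ospeed;   /* -27136 -> 38400 */
    for (size_t i = 0; i < sizeof fbsd_speeds / sizeof fbsd_speeds[0]; i++)
        if (fbsd_speeds[i].given_speed == ospeed)
            return fbsd_speeds[i].actual_speed;
    return -1;                             /* lookup failed */
}
```

With ospeed at 9600 the model finds its entry. With -27136, the value stored for 38400, it looks for 38400 and finds only -27136 in the table, so the lookup fails.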
In PR 256731 there is a commit, taken from upstream, that avoids the NULL pointer and the crash. Issues with terminal speed remain, so don’t wait around for Jenny.
Bug, mostly-squashed by an update to FreeBSD.
Takeaway
- Search in the bug database before starting a days-long debugging marathon,
- ncurses and libtextstyle have some neat features that might make sense to use in other text-based applications of mine, too.