Is it a Lexer’s Job to Parse Numbers and Strings?

Question

The simple answer is “Yes”.

In the abstract, you don’t need lexers at all. You could simply write a grammer that used individual characters as tokens (and in fact that’s exactly what SGLR parsers do, but that’s a story for another day).

You need lexers because parsers built using characters as primitive elements aren’t as efficient as parsers that break the input stream into “tokens”, where tokens are the primitive elements of the language you are parsing (whitespace, keywords, identifiers, numbers, operators, strings, comments, …). [If you don’t care about efficiency you can skip the rest of this answer and go read about SGLR parsers].

Good lexers typically take sets of regular expressions representing the language elements, and compile them into an efficient finite state machine that can segment the input stream into such language elements quickly. (If you don’t want to user a lexer generator, for simple languages you can code the FSA yourself). Such compiled FSAs execute only a few tens of machine instructions per input character (get character from input buffer, switch on character to new state, decide if token is complete, if not do it again), and can thus be extremely fast.

The output of such lexers is typically a code representing the langauge element (or nothing for whitespace if the parser would ignore it anyway) and some position information (starts in file foo, line 17 column 3) to enable error reporting.

One can stop there and have useful lexers. It is often useful to do a conversion step, that converts the character string into the equivalent native machine value for that token, either as the characters are collected, or when the token is complete, because one still has knowledge of the specific characters involved in the token. This is used to convert numbers (of varying radixes) in the target language to their native binary equivalent, to convert literal strings containing escape sequences into the actual characters making up the string, and even taking identifier names and looking them up in a hash table so that identical identifiers are easily determined. The parser typically isn’t interested in these converted values, but steps beyond parsing (semantic analysis, checking for optimizations, code generation) needs the converted values anyway, so you might as well convert them as you discover them. (You could delay this conversion until their binary value was needed, but in practice you almost always need the value so delaying the conversion doesn’t buy very much).

Leave a Comment Cancel reply