lexer – Tarik Billa

What does an escaped ampersand mean in Haskell?

September 22, 2023 by Tarik

It escapes… no character. It is useful to “break” some escape sequences. For instance, we might want to express "\12" ++ "3" as a single string literal. If we try the obvious approach, we get "\123" ==> "{". We can, however, use "\12\&3" for the intended result. Also, "\SOH" and "\SO" are both valid single … Read more

Poor man’s “lexer” for C#

September 22, 2023 by Tarik

The original version I posted here as an answer had a problem in that it only worked while there was more than one “Regex” that matched the current expression. That is, as soon as only one Regex matched, it would return a token – whereas most people want the Regex to be “greedy”. This was … Read more
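
The “greedy” behaviour described in this excerpt is often called longest match: at each position, try every rule and keep the longest result. The following is only a minimal sketch of that idea in Python rather than the answer's C# code; the rule names and patterns in `RULES` are hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical token rules, for illustration only.
# Order matters only as a tie-breaker between matches of equal length.
RULES = [
    ("KEYWORD", re.compile(r"if|else|while")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("NUMBER", re.compile(r"\d+(?:\.\d+)?")),
    ("OP", re.compile(r"==|=|\+|-|\*|/")),
    ("SPACE", re.compile(r"\s+")),
]


def tokenize(text):
    """Return (kind, lexeme) pairs, taking the longest match at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, pattern in RULES:
            match = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule wins when two rules match the same length.
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        if best_kind is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if best_kind != "SPACE":  # whitespace is recognised but not emitted
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens
```

With these assumed rules, `tokenize("iffy == 10.5")` yields an `IDENT` for `iffy` rather than a `KEYWORD` for `if`, because the identifier rule matches more characters.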

When parsing Javascript, what determines the meaning of a slash?

September 22, 2023 by Tarik

It’s actually fairly easy, but it requires making your lexer a little smarter than usual. The division operator must follow an expression, and a regular expression literal can’t follow an expression, so in all other cases you can safely assume you’re looking at a regular expression literal. You already have to identify Punctuators as multiple-character … Read more
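
The rule in this excerpt can be sketched by having the lexer remember the previous token. The following Python sketch is a deliberate simplification, not a complete JavaScript lexer: the token patterns, the `KEYWORDS_BEFORE_EXPR` set, and the decision to treat `}` as not ending an expression are all assumptions made for illustration, and comments and string literals are not handled.

```python
import re

# Simplified, assumed token patterns; real JavaScript needs many more cases.
IDENT = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")
NUMBER = re.compile(r"\d+(?:\.\d+)?")
PUNCT = re.compile(r"[-+*%=(){}\[\],;<>!&|?:.]")
# Regex literal: body of escapes, character classes or plain characters, then flags.
REGEX = re.compile(r"/(?:\\.|\[(?:\\.|[^\]\\\n])*\]|[^/\\\n\[])+/[A-Za-z]*")

# Assumed subset of keywords after which an expression may begin.
KEYWORDS_BEFORE_EXPR = {"return", "typeof", "case", "in", "of", "delete", "void", "throw"}


def ends_expression(prev):
    """Guess whether the previous token can be the end of an expression."""
    if prev is None:
        return False
    kind, text = prev
    if kind in ("NUMBER", "REGEX"):
        return True
    if kind == "IDENT":
        return text not in KEYWORDS_BEFORE_EXPR
    # '}' is genuinely ambiguous; this sketch treats it as not ending an expression.
    return text in (")", "]")


def lex(src):
    """Return (kind, text) tokens, deciding '/' from the previous token."""
    tokens, pos, prev = [], 0, None
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        if src[pos] == "/":
            if ends_expression(prev):
                tok, pos = ("DIV", "/"), pos + 1
            else:
                match = REGEX.match(src, pos)
                if not match:
                    raise ValueError(f"bad regex literal at offset {pos}")
                tok, pos = ("REGEX", match.group()), match.end()
        else:
            for kind, pattern in (("IDENT", IDENT), ("NUMBER", NUMBER), ("PUNCT", PUNCT)):
                match = pattern.match(src, pos)
                if match:
                    tok, pos = (kind, match.group()), match.end()
                    break
            else:
                raise ValueError(f"unexpected character {src[pos]!r} at offset {pos}")
        tokens.append(tok)
        prev = tok
    return tokens
```

Under these assumptions, the slash in `(a + b) / 2` is lexed as division because it follows `)`, while the slash in `x = /ab+c/g` starts a regular expression literal because it follows `=`.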

Lexer written in Javascript? [closed]

September 9, 2023 by Tarik

Something like http://jscc.phorward-software.com/, maybe? JS/CC is the first available parser development system for JavaScript and ECMAScript derivatives. It has been developed both with the intention of building a productive compiler development system and with the intention of creating an easy-to-use academic environment for people interested in how parse table generation is done in general in bottom-up parsing. … Read more

Is it a Lexer’s Job to Parse Numbers and Strings?

September 3, 2023 by Tarik

The simple answer is “Yes”. In the abstract, you don’t need lexers at all. You could simply write a grammar that used individual characters as tokens (and in fact that’s exactly what SGLR parsers do, but that’s a story for another day). You need lexers because parsers built using characters as primitive elements aren’t as … Read more

Communication between lexer and parser

August 25, 2023 by Tarik

While I wouldn’t classify much of the above as incorrect, I do believe several items are misleading. Lexing an entire input before running a parser has many advantages over other options. Implementations vary, but in general the memory required for this operation is not a problem, especially when you consider the type of information that … Read more

Should I use a lexer when using a parser combinator library like Parsec?

May 17, 2023 by Tarik

The most important difference is that lexing will translate your input domain. A nice result of this is that you do not have to think about whitespace anymore. In a direct (non-lexing) parser, you have to sprinkle space parsers in all places where whitespace is allowed to be, which is easy to forget and it … Read more

Where can I learn the basics of writing a lexer?

January 19, 2023 by Tarik

Basically there are two main approaches to writing a lexer: (1) creating a hand-written one, in which case I recommend this small tutorial; (2) using a lexer generator tool such as lex, in which case I recommend reading the tutorials for the particular tool of choice. Also I would like to recommend the Kaleidoscope tutorial from the … Read more

Looking for a clear definition of what a “tokenizer”, “parser” and “lexers” are and how they are related to each other and used?

October 23, 2022 by Tarik

A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines). A lexer is basically a tokenizer, but it usually attaches extra context to the tokens — this token is a number, that token is a string literal, this other token is an equality operator. A parser takes … Read more
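
The tokenizer/lexer distinction drawn in this excerpt can be illustrated with a small Python sketch. The function names and the token kinds (`NUMBER`, `STRING`, `EQUALS`, `IDENT`) are hypothetical, and splitting on whitespace is an assumption taken from the excerpt's description; it means string literals containing spaces are not handled here.

```python
import re


def tokenize(text):
    """Tokenizer: break the text into pieces on whitespace (tabs, spaces, newlines)."""
    return text.split()


def classify(token):
    """Attach a kind to one token; the kinds are illustrative, not a standard set."""
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return "NUMBER"
    if re.fullmatch(r'"[^"]*"', token):
        return "STRING"
    if token == "==":
        return "EQUALS"
    if re.fullmatch(r"[A-Za-z_]\w*", token):
        return "IDENT"
    return "UNKNOWN"


def lex(text):
    """Lexer: the tokenizer's output plus a kind for every token."""
    return [(classify(token), token) for token in tokenize(text)]
```

Here `tokenize("count == 42")` returns the bare strings `count`, `==` and `42`, while `lex("count == 42")` labels them as an identifier, an equality operator and a number.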

lexers vs parsers

September 25, 2022 by Tarik

What parsers and lexers have in common: they read symbols of some alphabet from their input. Hint: the alphabet doesn’t necessarily have to consist of letters, but it does have to consist of symbols that are atomic for the language understood by the parser/lexer. Symbols for the lexer: ASCII characters. Symbols for the parser: the particular tokens, … Read more