Basic, general lexer for a programming language
I've been working on a general lexer for a programming language for a couple of days now. I don't know whether the code I've written is overcomplicated, whether there is a better way to parse the code into tokens, or whether my code could be made simpler with a library.
Code:
In several places you have code similar to:
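The pattern in question is a list comprehension used only for its side effects, something along these lines (stand-in data, not your actual lines):

```python
tokens = ["NUMBER:1", "PLUS:+", "NUMBER:3"]   # stand-in data

# A list is built whose elements are all None (the return value of print),
# and the list itself is immediately thrown away.
[print(token) for token in tokens]
```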
You are creating a list whose elements will be the return values from calling the print function. And what are you doing with this list? Nothing, of course. Then why create the list? This style of coding is non-Pythonic (not to mention inefficient). Just do:
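That is, a plain loop over the same stand-in data:

```python
for token in tokens:
    print(token)
```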
Your code has several print calls that appear to be for debugging purposes. For example:
If so, it would be preferable to use the logging API. For example:
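A sketch of what that could look like; the logger name and the message are illustrative:

```python
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG)   # configure once, at program start-up

token = "NUMBER:1"                         # stand-in for whatever you were printing
logger.debug("emitting token %s", token)   # instead of print(token)
```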
You have:
There is a huge difference between a lexical analyzer, which generally recognizes tokens that can be defined by a regular language (or, equivalently, a regular expression), and the parser, which uses those tokens to build, for example, an abstract syntax tree based on a grammar that, in general, is not regular. To make a long story short, a better name for a Lexer instance would be lexer. Likewise, I suggest renaming the method parse to lex (or anything other than parse).
You have:
We don't really need to create a list for use with join:
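For instance, a generator expression fed straight into join (assuming the tokens end up in lexer.tokens, as in the question):

```python
print("\n".join(str(token) for token in lexer.tokens))
```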
Note also I have renamed variable i to the more meaningful token (and parser to lexer).
Are we just generating tokens and never using them in a parser? Probably not. You generally start out defining a grammar for your language that will be parsed by your parser. Let's say that our parser will be based on LALR(1) grammars. An LALR(1) parser is driven by a two dimensional table indexed by "the current state number" and "the current lexical token", both of which are integers. Your definition of a Token is (in part):
Where you claim that the type attribute is an int. But when you instantiate a token, your code is:
But TokenType is a subclass of enum.Enum:
Consequently TokenType.NUMBER is not an int. This is not a big deal; it's just something I happened to catch. Anyway, you can derive an int value with the expression TokenType.NUMBER.value, which is what matters, so this definition of TokenType is perfectly fine. But one normally starts with a grammar, from which the tokens can be derived. The person coding the grammar and the person coding the lexer may not be the same person, so one of the outputs from running a parser generator is usually a module that can be imported, defining the tokens. Typically, it might be a file looking like:
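For illustration only, such a generated module might contain nothing but integer constants (the names and values here are invented):

```python
# tokens.py -- the kind of module a parser generator might emit.
ERROR = 0
NUMBER = 1
NAME = 2
PLUS = 3
MINUS = 4
```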
So in real life your token definitions would probably be within a module you import.
The parser generally does not request all of the tokens at once but rather calls a method such as lexer.next_token() to get the next token. It would be nice if you provided this. This could be implemented simply as:
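A minimal sketch, assuming the lexer has already collected its output into self.tokens and keeps a position index self.pos (initialized to 0):

```python
def next_token(self):
    """Return the next token and advance past it."""
    token = self.tokens[self.pos]
    self.pos += 1
    return token
```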
You should also define an additional token, which we might call EOF, representing the end of input:
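Continuing the illustrative tokens module from above, that is just one more integer code:

```python
EOF = 5   # end of input
```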
So our next_token method will now be:
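A sketch again, with the Token constructor signature (type, value) assumed from the question:

```python
def next_token(self):
    """Return the next token, or an EOF token once the input is exhausted."""
    if self.pos >= len(self.tokens):
        return Token(EOF, "")
    token = self.tokens[self.pos]
    self.pos += 1
    return token
```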
The purpose for this is to make sure there are no extraneous tokens following a stream of tokens that can be parsed into a sentence of the grammar. For example, if we had the grammar:
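For instance, this toy grammar (a sketch consistent with the example that follows):

```
expression : NUMBER '+' NUMBER
```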
Valid input might be: "1 + 3" but not "1 + 3 + 4". Just because we were able to reduce the tokens "1", "+" and "3" to an expression does not mean the input was valid. So we normally use an "augmented" grammar:
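Sketched, consistent with the description that follows:

```
expression_prime : expression EOF
expression       : NUMBER '+' NUMBER
```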
We now have a new goal symbol, expression_prime, which will reject "1 + 3 + 4 EOF" but accept "1 + 3 EOF".
What if in doing your lexical analysis you come upon an ill-formed token (e.g. a string that is not properly terminated)? The lexer and parser should have an additional token type called ERROR, which the lexer returns when it finds an illegal token. The grammar will generally include special productions, using the ERROR token, to handle errors discovered by the lexer or parser.
I honestly did not spend too much time trying to fully understand the code; it just seemed too complicated. I can't speak for others, but I would have a difficult time maintaining it. Here is an example of how I might code a lexer for a desk calculator that allows arithmetic expressions and assignments:
File tokens.py
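A sketch of such a file, extending the illustrative constants from earlier with the operators a desk calculator needs (the names and values are invented):

```python
# tokens.py -- illustrative token codes for the desk-calculator sketch.
ERROR = 0
NUMBER = 1
NAME = 2
PLUS = 3
MINUS = 4
EOF = 5
TIMES = 6
DIVIDE = 7
ASSIGN = 8
LPAREN = 9
RPAREN = 10
```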
File lexer.py
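A minimal sketch of one way to write it: a single regular expression with named groups, and a next_token method that hands out an EOF token once the input runs out. Everything here is illustrative.

```python
# lexer.py -- a sketch of a desk-calculator lexer.
import re
from typing import Iterator, NamedTuple

import tokens


class Token(NamedTuple):
    type: int    # one of the integer codes defined in tokens.py
    value: str   # the matched lexeme


# One alternative per token; the group name selects the token code.
_TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d*)?", tokens.NUMBER),
    ("NAME",   r"[A-Za-z_]\w*",  tokens.NAME),
    ("PLUS",   r"\+",            tokens.PLUS),
    ("MINUS",  r"-",             tokens.MINUS),
    ("TIMES",  r"\*",            tokens.TIMES),
    ("DIVIDE", r"/",             tokens.DIVIDE),
    ("ASSIGN", r"=",             tokens.ASSIGN),
    ("LPAREN", r"\(",            tokens.LPAREN),
    ("RPAREN", r"\)",            tokens.RPAREN),
    ("SKIP",   r"\s+",           None),          # whitespace: discard
    ("BAD",    r".",             tokens.ERROR),  # anything else: error token
]
_MASTER_RE = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex, _ in _TOKEN_SPEC))
_CODE_FOR = {name: code for name, _, code in _TOKEN_SPEC}


class Lexer:
    def __init__(self, text: str) -> None:
        self._stream = self._lex(text)

    def _lex(self, text: str) -> Iterator[Token]:
        for match in _MASTER_RE.finditer(text):
            code = _CODE_FOR[match.lastgroup]
            if code is None:                 # skip whitespace
                continue
            yield Token(code, match.group())
        yield Token(tokens.EOF, "")

    def next_token(self) -> Token:
        """Return the next token; once the input is exhausted, keep returning EOF."""
        return next(self._stream, Token(tokens.EOF, ""))


if __name__ == "__main__":
    lexer = Lexer("x = (3 + 4) * 2")
    while True:
        token = lexer.next_token()
        print(token)
        if token.type == tokens.EOF:
            break
```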
Prints:
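For that sketch, lexing the sample input x = (3 + 4) * 2:

```
Token(type=2, value='x')
Token(type=8, value='=')
Token(type=9, value='(')
Token(type=1, value='3')
Token(type=3, value='+')
Token(type=1, value='4')
Token(type=10, value=')')
Token(type=6, value='*')
Token(type=1, value='2')
Token(type=5, value='')
```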
Overall, the code looks good. The first suggestion I have is to start off by adding documentation to the top of the code to summarize its purpose. The most important thing to describe is, at a high level, what the code does.
The users of your code, and especially the maintainers of your code,
will appreciate a clear high-level description of what it does.
Six months after you started writing the code, you will even thank yourself
for putting in the effort now.
The PEP 8 style guide recommends
using docstrings:
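For example, a module-level docstring along these lines (the wording is only a sketch):

```python
"""Lexer for the toy language: split source text into a list of Tokens.

The Lexer class scans a source string and produces Token objects for
numbers, strings, identifiers and operators, which a parser can consume.
"""
```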
The PEP 8 guide also recommends docstrings for classes and functions.
Lexer is a fine name, but it could use some elaboration. For example:
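Something like the following; the parse method and tokens attribute are assumed from the question:

```python
class Lexer:
    """Split a source string into a list of Token objects.

    Typical use:
        lexer = Lexer(code)
        lexer.parse()
        for token in lexer.tokens:
            ...
    """
```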
For the _main_parse_loop function, it would be helpful to
document that it is printing output:
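A sketch of such a docstring:

```python
def _main_parse_loop(self):
    """Scan the source text, appending each recognized Token to self.tokens.

    Side effect: prints every token (and other debug output) as it goes.
    """
```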
To reduce clutter, delete all commented-out code and notes to yourself about
future features:
You can keep track of future enhancements in a separate file in your
version control system.
The style guide recommends separating multi-word constant names with an underscore. For example, STRINGCHARS becomes STRING_CHARS. This makes the code easier to read.
I recommend adding a comment for abbreviations like:
It would be nice to align these lines to the = operator:
Also, it would be nice to add some blank lines between sections of code:
The range function is simpler without the optional start value of 0: for example, range(0, n) can be written as range(n).
Long hand-typed strings like the run of alphabet letters are error-prone. The string.ascii_letters constant provides this for you, and string.digits gives you the digits 0123456789.
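For reference:

```python
import string

print(string.ascii_letters)  # abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
print(string.digits)         # 0123456789
```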
Thank you for sharing this lexer.
There's a lot of good code in it.
The biggest thing missing from it is a
Backus-Naur
grammar, as the _main_parse_loop is definitely not
self-documenting.
The first thing any user of your language will want
to know is, "what can I write?"
That is, what is the set of valid source code inputs?
Decades of experience shows that describing this through
code, or through English prose, will not adequately serve
this need.
So you will need a grammar (or perhaps the equivalent
railroad
diagram) to show to the user.
You could attempt to maintain an evolving grammar alongside
an evolving parse loop, but that sounds very error prone.
Much better to have code consume the grammar and follow
its instructions.
Consider adopting
Lark,
or perhaps one of the
PEG
parsers.
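For illustration, a tiny Lark grammar for the toy expression language from the earlier answer could look like this (Lark is a third-party package: pip install lark):

```python
from lark import Lark

parser = Lark(r"""
    start: expression
    expression: NUMBER "+" NUMBER

    %import common.NUMBER
    %import common.WS
    %ignore WS
""", parser="lalr")

print(parser.parse("1 + 3").pretty())
```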
Those first two lines are not good. A C programmer might write --regular, but this is Python code, where --regular simply parses as -(-regular). Much better to express that intent with ... -= 1, or simply assign ... = False.
The OP code requires a maintenance engineer to
go hunting through the code for these surprising
definitions, which have non-traditional semantics:
Nesting ModeBool within Modes seems slightly clumsy,
as revealed by the ... and i != "ModeBool" conjunct.
Prefer to organize these in a separate mode.py module,
with both up at module level.
I thank you kindly for optional type annotations -- they
help me to better understand your intent.
But it's clear you've not asked the machine to read
those annotations.
Always use a type checker if you go to the trouble
of writing them, to verify they make sense.
It's an aid to the author, allowing you to avoid silly errors
before even running any unit tests.
If you've written an >>> expression comment in a docstring,
definitely verify with $ python -m doctest *.py.
If you've written annotations, verify them with
pyright or
mypy --strict.
On which topic: adding docstrings to some of the
methods and all of the classes would benefit this project.
I would expect a type checker to infer the type from a
simple parser = Lexer(code) assignment.
This looks like a # XXX restrict to just string TODO comment,
the sort of thing we try to clean up prior to submitting
a Pull Request code review.
Write it as just Any.
Change it when your codebase expects only string input.
Consider making Token a
@dataclass,
to slightly simplify the constructor.
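A sketch, with the field names assumed from the question:

```python
from dataclasses import dataclass


@dataclass
class Token:
    type: int    # or a TokenType enum member, per the other answers
    value: str
```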
The while True: ... print ... approach to serving up
parsed results won't scale beyond your initial PoC testing,
and the .tokens result list is very batch oriented,
not well suited to online immediate error reporting.
The caller should be requesting tokens from an iterable:
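A sketch of that interface; the two helper methods are hypothetical stand-ins for your existing scanning code:

```python
from typing import Iterator


class Lexer:
    # ... existing __init__ and scanning helpers ...

    def tokens(self) -> Iterator["Token"]:
        """Yield each token as it is recognized, instead of building one big list."""
        while not self._at_end():           # hypothetical helper
            yield self._scan_one_token()    # hypothetical helper: recognize one token


# The caller then pulls tokens on demand:
#     for token in Lexer(code).tokens():
#         ...
```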
The parse loop will yield each token.
That sets you up for raising a diagnostic exception
at the point that a parse fail occurs.
There is a great deal of print() debugging output,
which can be fine during initial development.
But do yourself a favor, and
import logging.
Then you can hang onto INFO level debug statements
in the codebase, while typically running at WARN level.
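A sketch of the one-line switch:

```python
import logging

logging.basicConfig(level=logging.WARNING)  # day-to-day runs: INFO/DEBUG stay quiet
# logging.basicConfig(level=logging.INFO)   # flip back on when you need the detail
```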
I mostly agree with the existing answers; let's hope I don't add too many duplicated notes.
You probably wanted to define __invert__ instead. Fortunately you never use bitwise inversion later, so this went unnoticed. But this shouldn't matter as...
Using +x, -x and ~x for inplace modification is a bad idea, see @J_H answer. However, let's venture one step further: your modes are all mutually exclusive, that is, one and only one mode is active at any time (counting -x; +y as a single "switch mode to" operation). Well, then it isn't Modes, it's just a class Mode(Enum), and a lexer should keep a reference to the current mode. This prevents accidentally having multiple modes turned on (or none). Spell "switch to mode X" as self.mode = Mode.X.
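A sketch of that approach; the mode names here are invented:

```python
import enum


class Mode(enum.Enum):
    NORMAL = enum.auto()
    STRING = enum.auto()
    NUMBER = enum.auto()


class Lexer:
    def __init__(self, code: str) -> None:
        self.code = code
        self.mode = Mode.NORMAL      # exactly one mode is active at any time

    def _start_string(self) -> None:
        self.mode = Mode.STRING      # "switch to mode X" is a single assignment
```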
If that was a "future-proof solution", then explain that. You should have a pretty solid understanding of the future need (at least something like "here are two modes I'll introduce next month - X and Y - and they must be able to coexist"). Even then it might be better to add more such mutually exclusive enums, but that depends on the use case. Don't add premature complexity - YAGNI.
Using an Enum will also provide a built-in solution similar to your __repr__ (print(Mode._member_map_)).
self.currdata uses a dictionary (a data structure mapping arbitrary hashable keys to arbitrary values) to represent a storage for a few fixed attributes. That's a problem: once you make a typo somewhere, it'll be difficult to spot.
I only see two such attributes (notfirst and str). They can just become the lexer's own attributes. If you need more, you can always extract a helper @dataclass to store such state in a structured fashion; that would also let you replace self.currdata = {} with something like self.state.clear().
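A sketch of such a helper, using the two attributes mentioned above (str renamed so it doesn't shadow the builtin):

```python
from dataclasses import dataclass


@dataclass
class CurrentData:
    notfirst: bool = False
    text: str = ""    # "str" in the original

    def clear(self) -> None:
        self.notfirst = False
        self.text = ""


# in the lexer:
#     self.state = CurrentData()
#     ...
#     self.state.clear()   # instead of self.currdata = {}
```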
I'm pleasantly surprised to see this code almost passing ruff's auto-formatter, modulo quotes and line length in one place. Were you using something like blue with a wider (120?) line length?
Consider also running flake8 with plugins or ruff on your code to catch some common problems. That would have prevented some (or even all) of my style comments at the end of this answer.
Finally, there's no way to ensure that your code does what is intended. Uhm. Its output takes more than 5 screens; I'm physically unable to check that in any way other than freezing the output, making changes and diffing the files. If any diff is found, I'm left trying to figure out what could have caused it. Instead, prefer a testing framework (the built-in unittest or the most popular pytest) and smaller tests, exercising all corner cases of the intended behaviour. This will make everyone much more confident when refactoring - I won't provide a "here's how I would write this" snippet in this answer, because I can't verify its correctness.
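For example, a couple of pytest tests might look like this (the Lexer, TokenType, parse and tokens names are assumed from the question, so adjust them to your code):

```python
from lexer import Lexer, TokenType


def test_lexes_a_single_number():
    lexer = Lexer("123")
    lexer.parse()
    assert [token.type for token in lexer.tokens] == [TokenType.NUMBER]


def test_skips_whitespace():
    lexer = Lexer("  123  ")
    lexer.parse()
    assert [token.type for token in lexer.tokens] == [TokenType.NUMBER]
```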
Yes, your values are now unique, great. Consider asserting that with @enum.unique or dropping numeric values altogether and using enum.auto().
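For example (the members shown are only illustrative):

```python
import enum


@enum.unique          # raises ValueError at import time if two members share a value
class TokenType(enum.Enum):
    NUMBER = 1
    STRING = 2
```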
or
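With auto-generated values, so duplicates cannot happen in the first place:

```python
import enum


class TokenType(enum.Enum):
    NUMBER = enum.auto()
    STRING = enum.auto()
```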
A lot was said about naming, but nothing about this bit. Constants STRCHARS and STRINGCHARS look like aliases, but in fact they have nothing in common. Please rename one to QUOTE_CHARS, for example.