Intermediate Form

POTW: Python2html

This post is the first of what I hope will be a new weekly feature, the program of the week. To give myself something to write about, I'm going to be posting a program to this blog once a week, and writing about it. The idea here is to get me writing more, and at the same time to get me to release some of the many programs and scripts I've written for my own purposes.

The programs I release are ones I use on a regular basis. Error checking, especially in cases where the arguments are invalid, may be non-existent, although I try to have the program just crash in such a case, rather than actually damaging a system. Still, these programs come with no warranty. If you find the programs useful, let me know.

When I write a program intending to post it to POTW, I will probably also write about why I made the choices I did. Some weeks, I may just post a program I have already written, in which case the post will be much shorter.

→ http://tom.idealog.info/potw/python2html.1.py
→ http://tom.idealog.info/potw/python2html.1.py.html

Python2Html is a script that converts python source code to html. The first link above is to the raw source code, while the second is to the output of python2html when run on itself. I find the latter far more pleasant to look at than the former, which is good because my motivation for writing python2html was to be able to place nice-looking code on this web site.

Usage

Because I use newer features like generators and optparse, python2html requires python 2.3 to run. In general, the python code I write needs the latest version of python to run. Newer versions of the python language and libraries are significantly better than older ones, and I don't see a good reason for me to write new code in a way that is ugly, just to be backwards-compatible.

usage: python2html.1.py <files>

For each python source file listed in <files>, generates file.html
containing colorized html, anchors, and optional line numbers.

options:
  --version           show program's version number and exit
  -h, --help          show this help message and exit
  -n, --line-numbers  Prefix each line with its line number.

Running python2html is fairly easy, as one can see from the help above. Just give it a list of source files as command-line arguments, and for each file, it leaves the html version of that file in file.html. There's a single option, -n, which enables the generation of line numbers. I always use it when colorizing source code, but I decided to make it an option so python2html can be used to colorize fragments of source code, where line numbers may not make as much sense.
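For example, to colorize a hypothetical source file foo.py with line numbers (the file name here is just for illustration):

    python python2html.1.py -n foo.py

This writes the colorized html to foo.py.html, just as the links above pair python2html.1.py with python2html.1.py.html.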

Motivation and Alternatives

As I mentioned above, python2html was written to let me post code to my blog in a nicely colored manner. I try to avoid reinventing the wheel where possible, so I first looked for other packages that I could use.

Py2html was apparently a program that could do this. I say was, because I couldn't find a web page to download it from. I also saw some examples of its output, and it wasn't what I wanted, so I passed. I likewise decided against enscript's html conversion: it wasn't all that close to what I wanted, and it seemed more difficult to change than it would be for me to write a new program.

I also spent a little bit of time looking for things that would make my life easier. The tokenize module in the python standard library seemed like it could be of benefit. Unfortunately, I found on experimentation that it tends to throw away much of the formatting information found in the actual source code. It seems to be designed for writing compilers more than colorizers, even though the documentation claims it is suitable for the latter task. After a few minutes of experimentation, I rejected it and decided to start from scratch.

Goals

Having decided to start from scratch, I had to figure out exactly what I wanted. I started off with the following ideas and goals:

Lexical analysis is enough. The first thing I decided was that for the types of syntax highlighting I want to do, lexical analysis is powerful enough. (Here, I use lexical analysis to mean I match the first thing, then match the second without regard to the first, and so on.) Originally, I wanted to apply colors to strings, comments, and keywords. I figured I could do this using regular expressions, similar to the way in which a lexer works, and avoid having to keep much in the way of state. I definitely wanted to avoid a complex state machine, or having to write up a grammar using a parser generator.

Later, I wanted to add highlighting of function and class definitions, so a straight-up lexer was actually not powerful enough. I had to cheat a bit and keep some state, but I think the final code is much simpler than it would have been had I used a parsing approach.

I want the output to stand alone. Early on, I decided not to use a stylesheet, because I want to be able to copy and paste the output into a web page, and have it work. Later, I decided that I wanted to be able to copy and paste individual lines of code as units, and not have to worry about opening and closing the tags that are needed to make it work.

I want line numbers. I like the line numbers, since it means that I can tell people what to look at. Line numbers were the last thing I thought of, though, after I was mostly done with the coding.

It's interesting that the last two goals are somewhat in conflict with the first. Using a lexing approach, along with some of the large python constructs like multi-line strings, implies that tokens can span multiple lines. But the other two goals, especially the last, require that lines be handled individually. I didn't recognize this tension until I was most of the way through the coding process.

Coding

When I started coding, only the first goal and the first half of the second one were firm. My initial thought was that I would have a list of (regular expression, function) pairs. A main loop would try each regular expression at the start of the string, find the first one that matches, and would then call the corresponding function with information about the match. That function would return formatted html, which would then be printed. We would then try to do this again at the end of the first match, at the end of the second, and so on until the entire source string was matched.

The general concept here was sound. patterns is indeed a list of such pairs (actually, it starts out as a list of (string, function) pairs, but is then immediately compiled into (regex, function) pairs). Likewise, tokenize does perform the matching and the function calling.

Having the various tok_ functions return html seemed problematic, however. For source code to become html, it needs to have some quoting done to it. As many people know from blogging, the character "<" needs to be written as "&lt;", and "&" needs to be written as "&amp;". There is one function for each of the 6 token types the program handles, and one of those functions can return three different things, which would mean duplicating the quoting code 8 times. Possible, but ugly.
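The quote function called in the snippets below is not shown in this post; a minimal sketch of what it might look like (a guess at the body, since only the call site appears here):

    def quote(text):
        """Escape the characters that are special in html.  '&' must be
        replaced first, or the ampersands introduced by the other
        replacements would themselves get escaped."""
        return text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')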

By moving the html out of the tok_ functions, the following code:

    def tok_comment(m, state):
        """Returns a formatted comment."""
        return '<span style="color: #800000;">%s</span>' \
            % quote(m.group(0))

is simplified to:

    def tok_comment(m, state):
        """Returns a formatted comment."""
        return [ ('comment', m.group(0)) ]

Another problem would have surfaced later: to implement the second half of the second goal, the style needs to be applied to each line separately, which is tough to do once the html markup is already in place.

I decided to change the way the tok_ functions work. Instead of returning HTML, I had them return lists of (style, text) pairs. The style and quoting can then occur in a single place, rather than being scattered throughout the program.
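To make the moving parts concrete, here is a minimal sketch of how patterns and tokenize might fit together once the tok_ functions return (style, text) pairs. The pattern strings, style names, and function bodies are illustrative stand-ins, not the actual python2html source, which handles six token types and threads a smarter state object through the tok_ functions:

    import re

    def tok_comment(m, state):
        return [('comment', m.group(0))]

    def tok_string(m, state):
        return [('string', m.group(0))]

    def tok_other(m, state):
        return [('plain', m.group(0))]

    # (string, function) pairs, compiled immediately into (regex, function) pairs.
    patterns = [
        (r'#[^\n]*',   tok_comment),   # comments run to the end of the line
        (r"'[^'\n]*'", tok_string),    # deliberately naive; see stringpat below
        (r'.|\n',      tok_other),     # catch-all: one character at a time
    ]
    patterns = [(re.compile(p), f) for p, f in patterns]

    def tokenize(source):
        """Try each pattern at the current position; call the function
        paired with the first one that matches, yield its (style, text)
        pairs, and advance past the match."""
        state = {}
        pos = 0
        while pos < len(source):
            for regex, func in patterns:
                m = regex.match(source, pos)
                if m:
                    for pair in func(m, state):
                        yield pair
                    pos = m.end()
                    break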

At this point, python2html is able to turn a program into a series of tokens, each with an associated style. Some tokens may be longer than one line, most are shorter, with a few being exactly one line in length. The style needs to be applied to the token text on a per-line basis, with line numbers inserted after newlines.

There were three ways this could be done. The first, and the one I found least attractive, was to place the code to do it into tokenize. tokenize is already complicated, and adding the code there would just make it more so.

I strongly considered putting this functionality into an object. If I wasn't programming in python, this would probably be the best choice, although it would be somewhat annoying to have to create a new class for something this simple. It also requires the user to think about what state needs to be saved between method invocations. Not hard, but not ideal either.

Thankfully, I am programming in python, and python has something called generators. A generator is a function that can be iterated over, like a list. It runs until a yield statement is reached (like the one on line 255, in tokenize), then saves its state until it is asked to yield another value. This lets us place the code that handles styling into another function, format, which repeatedly gets a token (in the loop on line 261), splits it into newlines and token bodies, styles the bodies, and outputs anchors and line numbers at the start of each line.
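A sketch of that structure, reusing the tokenize and quote sketches above (the anchor and numbering markup here is illustrative, not lifted from python2html):

    STYLES = {
        'comment': '<span style="color: #800000;">%s</span>',
        'string':  '<span style="color: #008000;">%s</span>',
        'plain':   '%s',
    }

    def format(source, line_numbers=True):
        """Yield html fragments, restarting the markup on every physical
        line so each line is a self-contained unit."""
        lineno = 1
        if line_numbers:
            yield '<a name="l%d"></a>%4d  ' % (lineno, lineno)
        for style, text in tokenize(source):
            pieces = text.split('\n')
            for i in range(len(pieces)):
                if i > 0:
                    # The token ran past a newline: end this line and begin
                    # the next one with its own anchor and number.
                    lineno += 1
                    yield '\n'
                    if line_numbers:
                        yield '<a name="l%d"></a>%4d  ' % (lineno, lineno)
                if pieces[i]:
                    yield STYLES[style] % quote(pieces[i])

Joining the fragments with ''.join(format(source)) then yields the html for the whole file.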

The result is a program that's actually fairly clean, with each function doing one thing well. I did wind up adding in a state object which is used in tok_name to highlight function and class names, when they are defined.

The Best and Worst Parts

My favorite part of this program is the patterns data structure. It glues together a slew of functions in a way that allows all of them to stay clean. It would be hard to do this in a language that can't treat regular expressions as first-class values.

I also like the way generators provide a clean interface between the portion of the program that works on tokens and the part that works on lines.

The worst part was probably the regular expression used to match strings, which is built up for each kind of string token in stringpat. It's a mess that tries to match strings that do not have an odd number of backslashes before the end delimiter. I had to write it that way because a saner regexp caused python's regular expression engine to die with a segfault after an insane level of recursion. (I understand this will be fixed in python 2.4.)
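For the curious: the standard way to express "no odd number of backslashes before the delimiter" without deep recursion is the unrolled-loop form, in which runs of ordinary characters alternate with single escaped characters. A sketch for one kind of string token (a guess at the general shape; the real stringpat builds a variant of this for each quote style):

    import re

    # Matches a single-quoted string whose closing quote is not preceded
    # by an odd number of backslashes.  Runs of ordinary characters
    # alternate with escaped characters (a backslash plus whatever
    # follows), so no character can ever be matched two different ways
    # and backtracking stays shallow.
    single_quoted = re.compile(r"'[^'\\]*(?:\\.[^'\\]*)*'", re.DOTALL)

    m = single_quoted.match(r"'a \'quoted\' word'")
    # m.group(0) is the whole string, escaped quotes included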

Overall, I like the way in which python (and other scripting languages, but especially python) lets me write the program incrementally. It's very easy to make changes, even large changes like turning a function into a generator. I think this is because when a scripting language makes a program shorter, it also makes the amount of work needed to change that program correspondingly less, allowing an incremental style of development that is harder in Java or C++.

Sometime next week, expect a (hopefully) shorter post about another program of the week. It probably won't be as long as the program was. :-)

- Tom | Last updated: 2003-12-05 01:04

Comments

Posted on Tuesday, December 09, 2003 by Jareeedo:

Hey Tom, Just letting you know that I enjoyed this post. Take care, -Jared

Posted on Sunday, January 09, 2005 by Thomas SMETS:

Cool ! ! ! I finally made a set of scripts to "publish" the code I write from home, so I can reuse it ... Of course python is now one of my languages ...

Just one detail... making your code 2.3-only is fine, but this means it is not usable from Jython (still at 2.2 compliance), and Jython is in fact my scripting language to do E2E testing of my Java code.

In the list of nice-to-haves:

- Specify an output directory

- Search directories recursively

Just 'coz I am lazy ... !
