Towards fast PGN Parsing

PGN is the de-facto standard for chess games, especially when it comes to interoperability. Unfortunately, PGN is somewhat misdesigned, and apparently it’s not just me who thinks so. PGN is designed to make it easy for humans to read PGN files and to edit or write them manually with a text editor. This comes at a cost, namely that it’s difficult to parse with a computer.

The odd thing here is that PGN files are rarely created manually: almost everyone uses a chess program to enter or edit moves and then saves the game afterwards. That’s the basic misdesign.

The main difficulty in parsing PGN files – aside from a lot of ambiguities – is SAN notation, i.e. the source square of a move is missing. For a human it’s easy to spot the source square, but for a computer it means that the program has to know the rules of chess to figure out which move is being executed (there are shortcuts, as described below).

Nevertheless, PGN is here to stay, and so my goal is to quickly parse huge PGN files in order to be able to search for arbitrary positions. Even though Jerry is not primarily a database program, such basic functionality should be present.

Version                                                         Time
Jerry 3.0.1                                                   120593
New parser, compute only pseudo-legal moves                    28791
above plus -O3 optimization turned on                          24000
above plus pool allocation for nodes                           16000
above plus piece lists                                         14016
above plus no pseudo-legal move generation if only one piece   11842
above plus no QString uci as class member of Move               7352

PGN parse speed (KingBaseLite2016-03-E60-E99.pgn)

When I originally implemented PGN parsing, I took a look at python-chess – really excellent, easy-to-read code, but not optimized for speed. The algorithm was like this:

read each line in the PGN file
split the line into tokens (like a move token, annotations, comments etc.)
for each move token
    extract the target square and piece type (and any rank/file disambiguation markers)
    generate all legal moves in the position
    find the unique move among all legal moves of that piece type and
    target square

I benchmarked this by reading each game from a subset of the KingBaseLite database and constructing the full underlying game tree for each game. That’s the first line in the table above. Incredibly slow. Can we go faster?

Initially I tried to reuse an existing fast parser from other open-source projects, but realized after some time that this would essentially result in a complete rewrite of Jerry, due to very different underlying data structures and game representations. Realizing this cost me approximately three months, which was quite annoying. In the end, I decided to optimize the existing implementation.

So why is the above version slow? I split each line into tokens using QRegularExpression matches. After tokenizing a line, I again made heavy use of QRegularExpressions to get the target square. It’s likely faster to write the parser manually, checking character substrings in a line and branching accordingly.
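To illustrate the idea – this is only a simplified sketch, not Jerry’s actual parser, and the struct and function names are made up – a SAN move token can be classified by looking at its characters directly instead of matching regular expressions:

#include <QString>

// Simplified SAN token info; handles plain moves only (no castling, no
// promotions) to keep the sketch short.
struct SanToken {
    QChar pieceType;        // 'N', 'B', 'R', 'Q', 'K', or 'P' for pawn moves
    int   targetFile = -1;  // 0..7 for files a..h
    int   targetRank = -1;  // 0..7 for ranks 1..8
};

bool parseSanToken(QString s, SanToken &out)
{
    // strip check/mate/annotation suffixes like '+', '#', '!', '?'
    while (!s.isEmpty() && QString("+#!?").contains(s.back()))
        s.chop(1);
    if (s.length() < 2)
        return false;

    const QChar first = s.at(0);
    if (QString("NBRQK").contains(first))
        out.pieceType = first;                    // piece move, e.g. "Nbd7"
    else if (first >= QLatin1Char('a') && first <= QLatin1Char('h'))
        out.pieceType = QLatin1Char('P');         // pawn move, e.g. "exd5"
    else
        return false;                             // castling etc. handled elsewhere

    // the target square is the last file/rank pair of the token
    const QChar fileChar = s.at(s.length() - 2);
    const QChar rankChar = s.at(s.length() - 1);
    if (fileChar < QLatin1Char('a') || fileChar > QLatin1Char('h') ||
        rankChar < QLatin1Char('1') || rankChar > QLatin1Char('8'))
        return false;
    out.targetFile = fileChar.unicode() - 'a';
    out.targetRank = rankChar.unicode() - '1';
    return true;
}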

Another point is that we don’t need all legal moves. First note that for pawns we can directly determine the source square of a pawn move (e.g. for e5 the pawn has to be on e4; for e4 it is either on e3 or, if e3 is empty, on e2, and so on). So we can concentrate on piece moves.
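As a sketch (illustrative only; the piece codes and the offset are assumptions about a 10x12 mailbox layout, not Jerry’s actual values), resolving a white pawn push looks roughly like this:

#include <cstdint>

const int     DOWN       = -10;  // one rank towards White in a 10x12 mailbox
const uint8_t EMPTY      = 0;    // illustrative piece codes
const uint8_t WHITE_PAWN = 1;

// Returns the source square of a white pawn push to 'target', or -1 if the
// PGN is malformed. Pawn captures are even simpler, since SAN already names
// the source file (e.g. "exd5").
int pawnPushSource(const uint8_t board[120], int target)
{
    const int oneBehind = target + DOWN;
    if (board[oneBehind] == WHITE_PAWN)
        return oneBehind;                        // single push, e.g. e4-e5
    if (board[oneBehind] == EMPTY && board[oneBehind + DOWN] == WHITE_PAWN)
        return oneBehind + DOWN;                 // double push, e.g. e2-e4
    return -1;
}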

Suppose that we can already uniquely identify a piece move by computing only the pseudo-legal moves. Then either it’s that move, or the PGN is incorrect. Either way, it makes sense to just execute the move; it’s our best bet. The next version (line 2) was thus:

manually parse each line and check for move tokens
for each move token
    if it's a pawn move, directly create a move by considering the target 
    square of the pawn
    if it's a piece move, then create all pseudo-legal moves, taking into
    account the piece type and the target square.
    if we uniquely find one move among the pseudo-legal moves, execute it, 
    otherwise generate all legal moves. If we still can't identify the move,
    abort
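
In C++ the flow for a piece move might look roughly like this (a sketch only: Board, Move, pseudoLegalMoves() and legalMoves() are assumed interfaces, not Jerry’s real API):

#include <QVector>
#include <stdexcept>

Move resolvePieceMove(const Board &board, QChar pieceType, int targetSquare)
{
    auto matches = [&](const QVector<Move> &moves) {
        QVector<Move> hits;
        for (const Move &m : moves)
            if (m.to == targetSquare)        // disambiguation checks omitted
                hits.append(m);
        return hits;
    };

    // fast path: pseudo-legal moves of the given piece type
    QVector<Move> hits = matches(board.pseudoLegalMoves(pieceType));
    if (hits.size() == 1)
        return hits.first();                 // unique match: just play it

    // slow path: fall back to fully legal moves
    hits = matches(board.legalMoves(pieceType));
    if (hits.size() == 1)
        return hits.first();

    throw std::invalid_argument("cannot resolve SAN move");   // abort
}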

A factor-of-four improvement, but still: can we go faster?

A simple speedup is to turn on compiler optimization (-O3), cf. line 3 in the table. OK, but can we go faster?

In the implementation of Jerry, a game is a tree structure where each node represents the current board position, the move leading to it, and comments. Children of a node are the main line as well as other variations. When reading each game I had a loop like this:

while (pgnReader.hasNextGame()) {
    Game *g = pgnReader.read();
    // do something with the game, e.g. search for a position
    delete g;
}

and the PgnReader would of course always create a new game with Game *g = new Game() and then add nodes to it while parsing. A new in C++ ultimately means a malloc, and that’s costly. Thus I considered a custom allocator. There are fancy ways of doing this, but since it’s limited to nodes here, a quick-and-dirty approach was to create a FreeList of nodes and allocate a decent number of them up front (thus reserving memory once at program startup). Whenever a node is needed, it is taken from the FreeList. Instead of calling delete after reading a game, the nodes are put back on the FreeList and reused. I call this pool allocation, even though it’s not a real pool allocator.
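
A minimal sketch of that free-list idea (illustrative only, not Jerry’s actual allocator; GameNode is heavily simplified):

#include <QVector>

struct GameNode {
    // board position, move leading here, comments, children ... (simplified)
    GameNode *parent = nullptr;
};

class NodePool
{
public:
    explicit NodePool(int capacity)
    {
        m_storage.resize(capacity);          // one big allocation at startup
        for (GameNode &n : m_storage)
            m_free.append(&n);
    }

    GameNode *acquire()
    {
        if (m_free.isEmpty())
            return new GameNode();           // pool exhausted: fall back to new
        GameNode *n = m_free.takeLast();
        *n = GameNode();                     // reset to a clean state
        return n;
    }

    void release(GameNode *n)                // called instead of delete
    {
        m_free.append(n);                    // node is reused by the next game
    }

private:
    QVector<GameNode>   m_storage;           // owns the pre-allocated nodes
    QVector<GameNode *> m_free;              // nodes currently available
};

The reader then takes nodes from the pool while parsing and puts them back once the game has been processed, instead of allocating and deleting per game.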

This again almost halved the time required. But – can we go faster?

Jerry internally uses a mailbox structure (i.e. a 120-element array) to represent the board. (Pseudo-)legal move generation for a particular piece type then works as follows. Suppose we are looking for all white knight moves:

for each of the 64 squares (a subset of the 120 array):
    if the square is not empty, and the square contains a white knight: 
        create all (pseudo)legal moves for that knight
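
For reference, the knight part of such a mailbox generator typically looks like this (a generic sketch with the usual 10x12 offsets and made-up piece codes, not Jerry’s actual code):

#include <cstdint>

const int     KNIGHT_OFFSETS[8] = { -21, -19, -12, -8, 8, 12, 19, 21 };
const uint8_t OFFBOARD     = 0xFF;   // sentinel stored in the border squares
const uint8_t EMPTY        = 0;
const uint8_t WHITE_KNIGHT = 2;      // illustrative piece code

// Collects the target squares of all pseudo-legal white knight moves from
// 'from'; captures of black pieces are omitted for brevity.
int knightTargets(const uint8_t board[120], int from, int targets[8])
{
    int n = 0;
    for (int off : KNIGHT_OFFSETS) {
        const int to = from + off;
        if (board[to] == OFFBOARD)
            continue;                // jump would land outside the 8x8 board
        if (board[to] == EMPTY)
            targets[n++] = to;       // quiet move
    }
    return n;
}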

Another approach is to use piece lists in addition to the board array. The loop then simplifies to:

  • get the piece-list for white knights
  • for each entry in that list: create all legal moves

The penalty is that whenever a move is applied, in addition to updating the board array we also have to update the piece lists. However, if we implement the piece lists carefully as fixed-size arrays (i.e. no new, and thus no malloc) and update only the piece that moved when applying a move to the board, this gives another small improvement (line 5).
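
A sketch of such fixed-size piece lists (illustrative; not Jerry’s actual data structures):

#include <cstdint>

// One list per piece type and color, e.g. one for the white knights. A fixed
// array plus a counter means no new/malloc when pieces come and go.
struct PieceList {
    uint8_t squares[10];   // 10 is a safe upper bound for one piece type
    int     count = 0;

    void add(uint8_t sq) { squares[count++] = sq; }

    void remove(uint8_t sq)                 // e.g. when the piece is captured
    {
        for (int i = 0; i < count; ++i) {
            if (squares[i] == sq) {
                squares[i] = squares[--count];   // swap with the last entry
                return;
            }
        }
    }

    void moveTo(uint8_t from, uint8_t to)   // update after applying a move
    {
        for (int i = 0; i < count; ++i) {
            if (squares[i] == from) {
                squares[i] = to;
                return;
            }
        }
    }
};

Move generation for white knights then iterates over squares[0..count-1] directly instead of scanning all 64 board squares.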

Another small optimization is that with piece lists it’s easy to check whether there is only a single piece of a kind (i.e. a single white knight) in the position. If so, we can even skip pseudo-legal move generation. Either the PGN is correct and it’s this only white knight that is moving to the target square, or the PGN contains an illegal move, and then it’s still better to just continue than to abort reading the PGN altogether.
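
As a sketch, reusing the PieceList from the previous snippet (whiteKnights, targetSquare and playMove are made-up names):

if (whiteKnights.count == 1) {
    // only one white knight exists: either the PGN is correct and it is this
    // knight that goes to the target square, or the PGN is broken anyway, so
    // just play the move without generating any pseudo-legal moves
    playMove(whiteKnights.squares[0], targetSquare);
} else {
    // several candidates: fall back to pseudo-legal move matching as before
}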

That’s the version in line 6. OK, but can we go faster?

Well, after a lot of performance analysis with valgrind/massif, it turned out there was a hidden malloc all along. I should have spotted this earlier, but it is tricky. Jerry relies heavily on Qt data structures, and especially on QString, which simplifies handling encodings (such as UTF-8) a lot. My data structure for representing a move was basically like this:

class Move
{
public:
    uint8_t from;             // source square
    uint8_t to;               // target square
    uint8_t promotion_piece;
    QString uci_string;       // UCI representation of the move, e.g. "g1f3"
    // ...
};

Move generation is then similar to this:

QVector<Move> pseudo_legals;
for (...) {
    pseudo_legals.append(Move(...));
}

But QString uses implicit sharing, which means its content is stored on the heap. So every Move construction, even if it seemingly happens on the stack, comes with a hidden malloc when the uci_string class member is created!

Same thing with a QMap that I used to store a transposition table as part of the game-board class.

After removing the uci_string from the Move class (we can create UCI strings on the fly when required; that’s no real slowdown, as UCI strings are only needed for engine interaction), we arrive at the last version (line 7).
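
The resulting Move is plain data, and the UCI string is built only when an engine actually needs it. A sketch (assuming 0..63 square indices for brevity; 120-based mailbox indices would need a small conversion):

#include <QString>
#include <cstdint>

struct Move {
    uint8_t from = 0;
    uint8_t to = 0;
    uint8_t promotion_piece = 0;   // 0 = no promotion
};                                 // trivially copyable: no hidden malloc

QString toUci(const Move &m)
{
    auto squareName = [](uint8_t sq) {
        QString s;
        s.append(QChar(QLatin1Char(char('a' + (sq % 8)))));
        s.append(QString::number(sq / 8 + 1));
        return s;
    };
    QString uci = squareName(m.from) + squareName(m.to);
    if (m.promotion_piece != 0)
        uci.append(QChar(QLatin1Char('q')));   // simplified: map the real piece
    return uci;
}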

There are other small optimizations that I applied along the way, e.g. using QString::fromLatin1(), better code flow, etc., but all in all, I am out of ideas.

Let’s see how we compare. The test is to take millionbase and, depending on the functionality of each program/library, either just load each game from the PGN or search for a particular position. The current implementation isn’t optimal and will likely never beat Scid, but I guess it’s sufficient to work with in practice.

Program            Time (min:sec)
Scid                         0:50
Fritz 17                     4:02
Jerry, optimized             4:12
ChessX                    > 30:00
chesslib (Java)           > 30:00

PGN parse speed (millionbase-2.22.pgn)

Still, there are lots of cleanups and edge cases w.r.t. the PGN format to consider, but we’re getting there.

All in all, this profiling and optimization took quite some time and resources.

I recently read an excellent series of blog posts; this is a good entry point. Basically, Raymond Chen (the "Mr. Windows" Raymond Chen) created a small Chinese dictionary application in C++. Rico Mariani did a quick port to C#. It turned out that the quick port was significantly faster than the C++ version. Raymond Chen had to do some heavy optimization (including writing a custom allocator), rewriting, and profiling to get a faster version. A good overview is here.

This looked very similar to my problem, and so I often get envious when looking at managed code, i.e. languages like C# and Java. On the other hand, there is either no decent cross-platform GUI toolkit (C#) or quite problematic deployment (JavaFX). This is why I added chesslib to the comparison above: it seems to be clean, straightforward, but non-optimized code. Just for comparison, and to reassure myself that writing Jerry in C++ was the right decision.

Or was it? 😉
