Adventures building a spreadsheet engine in Python

Spreadsheets are a fascinating tool. With a spreadsheet, you can both store and structure data, and include formulas that run computations on the contents of a sheet. Every so often, I ask myself how a spreadsheet works. How do spreadsheets turn data and a list of formulas into a complete table of data? This week, I set out to build a spreadsheet engine in Python.

In this blog post, I am going to share some of my experiences and notes from building the engine. My spreadsheet code is open source and available on GitHub.

Spreadsheets as graphs

When I set out to build a spreadsheet engine, I started to think about how I can efficiently represent the data in the document. How do I model the tables and rows? I realised that I should represent a spreadsheet as a graph (for which the most related data type in Python is a dictionary). I created an example data structure in the following form:

{
	"A1": 5,
	"A2": "=A3",
  	"A3": 10
}

In this data structure, the keys are the names of cells and the values are the values of those cells. Keys must be one or more letters followed by one or more numbers. If a value starts with an equals sign, it indicates that the cell contains a value that must be computed. If a value does not start with an equals sign, it means the value can be represented literally (i.e. as an integer, a floating point number, or a string).

Retrieving dependencies

Every value may depend on one or more values. Spreadsheets need to be able to resolve dependencies to figure out the value for each cell. For example, in the above “spreadsheet”, the value of A1 is 5. The value of A3 is 10. The value of A2 is A3. This means that A2 should be equal to 10, because A3 is equal to 10.

Every cell could either be:

A literal value, which will not be evaluated as a formula, or;
A formula, which will be turned into a tree and evaluated.

I wrote a grammar that would read a formula and turn it into a tree. A grammar is a set of formal rules that indicate the expected structure of a string. A grammar can be used by a parser to construct an Abstract Syntax Tree, which is structured according to the rules of the grammar.

To write the grammar, I used the Lark Python package, which I have used to write similar grammars in the past.

Before evaluating a tree, the grammar is used to process a string and find its dependents. This is essential because I need to know the dependents of a formula before I can evaluate it.

Using Lark, I wrote a Visitor that would visit all cells in a formula, which is what I called an individual reference (i.e. A1, B7):

class GetDependencies(Visitor):
    dependencies = set()

    def cell(self, tree):
        self.dependencies.add(tree.children[0].value)

To compute the dependencies for a whole spreadsheet, this Visitor is applied individually to every cell.

At this point, I knew the dependents for a cell – what needed to be computed in order to evaluate a cell. The next problem was to figure out in what order dependents should be evaluated. For example, if I have five cells that each depend on five other cells which depend on five other cells, how can I know in what order I should evaluate each cell to retrieve its response?

Calculating cell evaluation order

As I reasoned with my spreadsheet as a graph, I realised there was an algorithm that could help: the topological sort. I used this algorithm to calculate what pages depend on each other when building my static site generator.

A topological sort takes in a structure that connects an item to all of its dependents, then creates an ordered list. Every value in the list appears after all of its direct dependents. The result of a topological sort is the order in which cells need to be evaluated.

Here is an example of the representation of all cells and their dependents:

{
	"A1": [],
	"A2": ["A3"],
	"A3": []
}

After running a topological sort, we get the evaluation order:

A3 appears before A2 because A2 depends on A3.

The topological sort can process any spreadsheet and calculate its dependencies, unless there is a circular reference. This happens when two cells depend on each other. In this case, the spreadsheet will need to skip evaluating the cells with a circular reference and any cells that depend on them. In spreadsheet software like Google Sheets, there is an error like #REF! that tells you that there is a circular reference. Users need to reconcile these by removing a dependency that causes the loop.

Evaluating cells

The results of the topological sort dictate in what order cells must be evaluated. The next step was to write evaluation logic. For this, I used the Lark Transformer class. This class reads a tree bottom up, evaluating each cell according to functions defined in a Python class whose names match the tokens in a grammar.

To get started, I coded the ability to retrieve the results from another cell. This involved a bit of recursive logic, since a cell can depend on another cell that itself depends on another cell. When an individual cell is referenced, the following recursive function is called:

def recursively_get_cell_value(self, cell, depth=0):
    if depth > 100:
        raise Exception("Nested too deep.")

    if cell in self.cells.keys():
        return self.recursively_get_cell_value(
            str(self.cells[cell]).strip("="), depth + 1
        )

    return str(cell).strip("=")

def cell(self, args):
    return float(self.recursively_get_cell_value(args[0]))

self.cells contains all of the cells in the spreadsheet. The function recursively goes through cells until the final value is identified, then that value is propagated back up the function to be assigned as the value corresponding with a cell.

With the ability to evaluate cells, I could write a formula like:

And return the following result:

If A1 was equal to the value of A2, the recursive function would get the value of A2, then return it as the result for the formula.

Adding formulas

At this point, I had the ability to reference the contents of one cell in another cell. From here, I started to work on adding some formulas to the grammar. I started with four essential mathematical expressions: addition, subtraction, multiplication, and division. I updated the grammar to follow this structure:

Where value is either a cell reference (i.e. A1), a number, or another mathematical expression (i.e. ((A1 + 1) * 2). OPERATORis the function that is applied. To limit complexity, I constrained the first implementation to take two terms, so5 * 5 * 5` would not be valid.

First, I declared a list of Python functions that map to each operator:

operators = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
    "https://jamesg.blog/": lambda x, y: x / y,
    ">": lambda x, y: x > y,
    "<": lambda x, y: x < y,
}

Then, I defined a function that would accept the child nodes from a part of a formula (which is a full statement like (value OPERATOR value)) and apply the operator function to those nodes:

def part(self, args):
    if len(args) == 1:
        return args[0]

    left = args[0]
    operator = args[1]
    right = args[2]

    return operators[operator](left, right)

Consider the formula =A1 + 1. args would contain:

The number 5. This is the value of A1, which has already been evaluated since it is further down in the tree.
+, the operator.
The number 1.

This function could later be updated to support multiple terms. Work would need to be done to ensure that the correct mathematical evaluation order is followed (brackets before order, etc; BODMAS is the order acronym for this that I learned in school).

With the above formula logic, I could write:

And retrieve the following value:

Conclusion

Building a spreadsheet has been a fascinating exercise. The breakthrough that turned my field of view from I don’t know where to begin to I’m ready to build this was when I saw this as a graph problem.

My experience working on my static site generator helped me realise that if I reason with a spreadsheet as a graph, I can use topological sorts to calculate dependencies. Then, I would be on to the next challenges: building a grammar that decides what a valid formula is and how formulas work, and writing an evaluation engine for the grammar.

There are many intricacies of spreadsheets that I am encountering; every new formula to add brings new challenges through which to think. For example, I implemented a range query, which lets you set the value of mutiple cells to the value of other cells. For example, =A1:A3 would set the value of the cell with that formula to A1, then the cell below it to A2, then the cell below it to A3. To build this logic, I needed to write code that would find all cells in the range, then set those values.

As I write, I realise I have a lot more work to do on range queries. I want to add support for multi-column range queries (i.e. =A1:B2). This also complicates dependency computation. For example, if A3 depends on B2, and B2 is part of a range query, I need to know that before I start evaluating any cells. I can use the code I have already written to expand a range query into its cells (i.e. A1:A3 becomes ['A1', 'A2', 'A3']), except it will need to be applied during the dependency calculation as well as at runtime.

If you are interested in reading my code for this project, the code is available on GitHub. My intent is not to build a complete spreadsheet. Rather, I want to learn more about reasoning with trees and graphs!

Source link