Recalculation algorithm

Overview

The current recalculation algorithm consists of marking cells and their dependents dirty based on volatility and written cells, and then iteratively enqueueing dirty cells onto a priority queue ordered by their static dependencies, calculating each cell’s value in order off the front of the queue, correcting the queue order as dynamic dependencies are discovered by cell evaluation, and detecting and handling circular dependencies.

This algorithm guarantees that:

a cell does not get recalculated unless its formula is (a) volatile, (b) depends in some way on a volatile cell, or (c) depends in some way on a cell whose value changed since the last recalculation
all cell values potentially impacted by (a) cell value changes since the last recalculation, and (b) volatile cell values, will get recalculated, taking the up-to-date values of their dependencies correctly into account
no cell value is updated more than once (but formulas containing dynamic dependencies may be partially evaluated more than once)
the order in which cell values are updated is a topological ordering of all cells by all cell-to-cell dependencies, static and dynamic

(That last item is probably necessarily implied by the previous two.)

Definitions

Static dependency: a cell A depends statically on cell B if A has a formula which references B directly, by defined name or A1 address

Dynamic dependency: a cell A depends dynamically on cell B if A has a formula which does not reference B directly (or does so only in the first parameter of an OFFSET function call) but which contains a call to a reference-valued function (OFFSET or INDIRECT) which at evaluation time resolves to the address of cell B or a range containing it.

Indirect dependency: a cell A depends indirectly on cell B if there exists a path of (direct, i.e. cell-to-cell) dependencies from A to B.

Indirect static dependency: an indirect dependency in which every dependency in the path is static.

Circular dependency or Dependency cycle: a cell A depends circularly on itself if there exists a path consisting of static dependencies from it that leads back to it.

Static circular dependency: a circular dependency in which every dependency along the path is static

Dynamic circular dependency: a circular dependency in which one or more dependencies along the path are dynamic

Written cell or Pending recalculation root: a cell which has been written (had its value changed) since the last recalculation.

Algorithm design and implementation

In each cell we track the cell recalculation state, which is one of UPTODATE, DIRTY, ENQUEUED and ENQUEUEING. On initialization, and after each recalculation, all cells are in state UPTODATE.

Implement recalculation with two phases: the mark phase and the evaluation phase.

Mark phase

Set state to DIRTY on each cell that either

has a direct or indirect static dependency on a written cell
or is volatile, or depends directly or indirectly on a volatile cell

Notes:

the “direct or indirect” here means the DIRTY state is propagated along static dependency relationships to every cell that has an indirect static dependency on a volatile cell or a written cell. This is done by recursing on the “incoming” graph edges of each cell, to the direct dependents of that cell.
the DIRTY state is not propagated along dynamic dependency relationships, because they are not known until evaluation time. Instead, any cell whose formula calls a function that can yield a dynamic dependency (OFFSET or INDIRECT) is always marked volatile, and thus is always “implicitly dirty” and will always be enqueued and evaluated, such that its dynamic dependencies get discovered in the evaluation phase, see below.
written cells are never themselves marked DIRTY, and if they have a formula, it is disabled; thus they are not volatile even if their formula has a volatile function call. Thus, dependency relationships will never cause a written cell to be overwritten by recalculation, as the cell will never be enqueued for evaluation (see below).

Evaluation phase

Maintain a queue, which is initially populated by invoking the enqueueCellPrecededByDirtyDependencies operation (see below) on:

(a) each direct dependent of a written cell.

(b) each volatile cell.

Process the queue in order: for each cell in the queue, evaluate its formula, set its value to the result, mark it UPTODATE, and then invoke the enqueueCellPrecededByDirtyDependencies operation (see below) on any DIRTY cells that statically depend directly on it.

This constitutes a topological sort of the static dependency graph: a breadth-first traversal of static dependent relationships, at each layer propagating recalculation to all direct static dependents of any cells already updated.

The enqueueCellPrecededByDirtyDependencies operation on a cell C consists of:

putting C in state ENQUEUEING,
invoking enqueueCellPrecededByDirtyDependencies recursively on each of C’s DIRTY direct dependencies
if C is still in state ENQUEUEING, put it in state ENQUEUED and append it to the back of the queue.

The state ENQUEUEING is only ever used during this operation on a particular cell — so if the recursion encounters a cell that is already in state ENQUEUEING, then a static circular reference has been detected. It is treated the way Excel does: leave the cell’s existing value if it has one, else set it to 0, and either way mark the cell UPTODATE. (See more about circular dependency handling in a subsection below.)

The evaluation order resulting from the above is a breadth-first traversal of the static dependency graph. Only the static dependencies determine the order in which cells are placed onto the queue, (a) initially, and (b) at the end of processing each cell off the front of the queue.

That leaves dynamic dependencies, which are handled as follows.

A dynamic dependency is discovered upon evaluating a formula containing a “dynamic-reference” function: one that produces a cell/range reference based on its parameters. Those parameters themselves can be the results of formula evaluation, so the resulting dependency (edge in the dependency graph) cannot be known until evaluation time. These functions are OFFSET and INDIRECT.

These functions output a Reference object with the dynamic property set to true. When such a reference is about to be passed as an argument to another function, the cells it references are checked and if any of them is not UPTODATE, then the cell evaluation is aborted by throwing EvaluationOrderException. Control thus passes back up the stack to the evaluation queue loop, with information about the discovered dynamic dependency.

This dynamic-dependency check is not performed when the dynamic reference is to be passed as:

the first argument to an OFFSET function, as then it’s not the reference that will eventually form a dynamic dependency for this cell. The output of that OFFSET function call is (unless that in turn is also passed as the first argument to an OFFSET function call)
the THEN argument of an IF call whose condition argument evaluated as false
the ELSE argument of an IF call whose condition argument evaluated as true

An EvaluationOrderException is handled in the evaluation queue loop by injecting the discovered dependency and all its DIRTY static dependencies at the front of the queue, as follows:

store the existing queue contents to the side, and empty out the queue
enqueue each cell of the dynamic dependency range using the enqueueCellPrecededByDirtyDependencies operation
append the aborted cell onto the end of the queue (so its evaluation will be retried after its discovered dynamic dependency has been evaluated)
append onto the end of the queue all cells that were previously in the queue and did not get re-enqueued ahead of the current cell by the above.

Finally, this dynamic-dependency handling keeps a record, dynamicDependenciesAlreadySeen, of any dynamic dependencies already encountered during this recalculation. If any of them are encountered again (with the same from-cell and same to-cell), then a dynamic circular dependency has been discovered. (It wasn’t discovered earlier, in the initial recursive enqueueing of DIRTY cells, because at least one of the dependencies in the chain was dynamic.) It is handled, like static circular dependencies, by keeping the cell value if there is one, else assigning the value zero, and marking the cell UPTODATE and carrying on.

When the queue is empty, the recalculation is complete.

The end result is that the order in which cell values have been evaluated is a topological ordering of the full dependency graph including dynamic dependencies (filtered to the set of cells impacted by volatile and written cells), made gradually correct by the above reordering in response to discovery of dynamic dependencies.

Notes on circular dependency handling

The “circular dependency handling”, for a given cell on which a circular dependency is detected, is:

set the cell value to 0 if it does not already have a value
set the cell state to UPTODATE
add a note of a circular dependency at this cell, at NOTICE level, to the workbook details (visible to authors).

Excel and Google Sheets support, and the Office Open XML spec describes, an alternative mode of dealing with circular dependencies, called “Iterative Calculation”. We do not currently support that, and do not detect whether a workbook has that enabled. We unconditionally apply the above, default, mode of circular dependency handling.

Any circular dependency, static or dynamic, is shared among all the cells in the circular path of dependency relationships: no one cell is more “the circular dependency cell” than the others.

But the circular dependency is encountered at exactly one cell in that circular path, and only that cell gets the “circular dependency handling” applied to it. Which cell that is depends pretty much arbitrarily on the path along which updates propagate into the circular dependency path, and thus can depend, even for identical workbooks, on:

which cells have been written (and in which order, if more than one was written since the last recalculation, such as when applying initial state on load, or UI state on workbook refresh)
the order in which volatile cells get evaluated.

This arbitrariness may lead to different results between GRID, Excel and Google Sheets. Such inconsistency is hard to eliminate (or impossible, if Excel and Google Sheets themselves differ on this; we haven’t checked whether they do), and it is probably acceptable for almost all use cases, under the assumption that users rarely rely on this happening at a specific cell in the cycle.

Finally, we do not currently reproduce one behavior of Excel, which is to expose the full path of circular dependencies. Excel presents this visually, with arrows pointing between the cells, if the path is within one worksheet. We don’t try to collect the full path and thus do not show it in any way. Google Sheets does not appear to support this either.

Shortlist of optimizations to do

Reuse dynamic dependency information

ENGINE-74

At the end of recalculation, use the dynamicDependenciesAlreadySeen information to populate a model attribute previousDynamicDependencies, keyed by cell ID. The value for each key should be the set of cells (or cell IDs) that were discovered to be dynamic dependencies of that cell, in that recalculation. (And probably omit those that yielded a circular dependency.)

In enqueueCellPrecededByDirtyDependencies for a cell C, in addition to any static dependencies of the cell, also refer to previousDynamicDependencies to find any cells that were discovered to be dynamic dependencies of C in the previous recalculation.

For each such cell D that is now DIRTY and not already ENQUEUED, recurse into D --- but with this twist: if a circular dependency is detected, then don’t handle it (don’t set value to 0 and don’t set state to UPTODATE). Instead just abort the recursion on D, ignore D and keep going. This is because enqueueing the dependency is a heuristic guess: we don’t actually know that the dynamic dependency applies this time around, … and if it doesn’t, then triggering circular dependency handling would be incorrect. (And if it does, then it’s fine to let that be handled with an EvaluationOrderException, as usual.)

This proactively orders the evaluation queue so to minimize EvaluationOrderException throwing (and thus minimize throwaway effort on partial evaluations that get aborted, and on exception handling stack labor) in the case where dynamic dependencies are the same as last time — which is probably most common.

This will particularly help for workbooks with a lot of OFFSET calls, and consequently a lot of dynamic dependencies.

The case in which this might hurt performance is where the reused (and putatively recurring) dynamic dependency will lead to a circular dependency — so we end up doing extra graph traversal and then just aborting it. I expect this is case is neither very common nor very impactful (but that’s just a hunch).

Propagate updates only along paths that lead to cells that get used

ENGINE-68

Collect the output cells, i.e. the set of cells that belong to either

the target range of an output element in the GRID document
or, in edit mode, the active sheet, or even just the scrolled-into-view subrange of the active sheet, in the workbook area

Determine a set of cells whose value will not have any effect on any of these cells (i.e. whose set of direct and indirect dependents has no overlap with the output cells) for the user. And then simply don’t enqueue any of those cells.

This will particularly help for large workbooks that haven’t been pruned down to what the GRID model needs — a common case where people “try GRID out” with a workbook they already have. The workbook will still be slow to load and parse formulas and construct the static dependency graph, but recalculation can be significantly quicker if much of the model does not need to be calculated at all.

In edit mode we will need to recalculate again when the user switches the workbook area to another sheet (or scrolls in it).

We might even be able to optimize the recalculation that occurs when the set of output cells changes (at least when the edit mode workbook area is scrolled or flipped to a different sheet — maybe it doesn’t make sense when adding or retargeting output elements) by:

starting the recalculation at just those cells that we previously didn’t enqueue because they wouldn’t be needed
not re-evaluating volatile cells

… i.e. this could amount to just lazily evaluating cells whose value we didn’t need before (but which will evaluate exactly as they would have if we had evaluated them before, because none of them are volatile).

So this “lazy-evaluation” recalculation may be, most of the time, a pretty lightweight operation.

Two possible complications:

workbook errors may be encountered during the lazy evaluation, which may surprise the user when scrolling the workbook area. Probably not a big deal, just means that this lazy-evaluation optimization probably cannot be completely hidden from the user.
this lazy-evaluation approach may not interact well with the ever-increasing integer cell state approach to “Propagate updates to dependents only if cell value changed” described above. At least care would need to be taken to make sure these work well together.

Store range nodes in dependency graph in an R-tree

ENGINE-154

Currently range nodes in the dependency graph are stored in a list which is simply sequentially scanned to look for range nodes containing a given cell address, when looking for dependents or dependees of that cell.

This means we need to store only rather large ranges as range nodes, in order to keep this list small so that the O(N) sequence of range lookups doesn’t get too expensive. And that means we still materialize a bunch of range dependencies into all their single-cell dependencies, so the dependency graph is bigger than it needs to be and takes a longer time to construct at init time.

We can address this by collecting the ranges into an R-tree, for very efficient lookups. This ought to significantly bring down the optimal threshold of range size for whether to treat dependencies per-cell or per-range, letting us take less memory for large dependency ranges and initialize more quickly — particularly later on when we get to view mode optimization so that the R-tree construction has already been performed ahead of time, so it doesn’t even have an init-time cost.