Optimisation
Simulating and displaying massively-parallel hardware systems - such as cellular
automata - on conventional, serial von Neumann-architecture computers can easily mop up
CPU cycles, and performance difficulties can appear rather rapidly.
This article looks at some methods that have been developed historically in
an attempt to cope with such problems.
It attempts to concentrate on issues that transcend programming languages - and
tries to avoid giving too much optimisation advice that is not specific to
cellular automata.
Some of the methods described are simple and obvious. Others require
significant programming skill to implement well.
Before we start:
Cautionary notes
Knuth
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
- Donald Knuth
I'm sure you've heard this before, but I have to say it again.
Optimisation is probably the last thing you should be doing. Get your program
to work properly first, and then profile it, and see if it needs further work.
The Pareto Principle
This is often known as the 80/20 rule. Some think it should be
called the 90/10 rule.
There are various statements of it that relate to computer programming:
- 80% of the bugs lie in 20% of the code;
- 80% of the budget is spent on 20% of the project;
- 80% of the code is written by 20% of the programmers;
...I'm sure you get the idea.
In the context of optimisation we have:
- 80% of the time is spent in 20% of the code;
- Moral: profile your code - and concentrate your efforts where they can do some good;
- 80% of the improvement comes in with the first 20% of the effort;
- Moral: give up early - optimisation can rapidly run into diminishing returns;
Maintenance
Optimisation is the sworn enemy of maintainable, comprehensible code. In some
cases these aims coincide - but this occurs more by chance than anything else.
If you think you (or anyone else) is going to want to maintain, modify or
understand your code in the future, don't optimise it too far.
Meddling
Optimisation can introduce bugs that were not present before. To a first
approximation, if your program works, leave it well alone.
Optimising as-you-go
The general rule of thumb is don't optimise as you go along.
You might think this will avoid a costly rewrite when you find the program does
not perform - but in practice optimisation can be applied more effectively and
efficiently to an existing working program.
It's harder to predict where the bottlenecks are going to be than it is to
use tools to measure what the code is doing once it has been written.
There is sometimes a place for optimising as you go - but it is usually
not a good idea.
Cost
Programmer time is expensive. Hardware is often relatively cheap - and Moore's
law suggests it will get cheaper as time passes. Make sure you don't make the
wrong tradeoff here.
Proverbs
- The best is the enemy of the good.
- Working toward perfection may prevent completion.
- Complete it first, then perfect it.
- The part that needs to be perfect is usually small.
OK - if you have digested that lot and still want to proceed, then
we'll begin.
Optimisations
- Naive implementation
- Buffer swap
- Bit packing
- Lookup table (LUT)
- Single buffer
- Greedy reading
- Gigantic lookup table (GLUT)
- Sparse matrix techniques
- Only redraw changed regions
- Update as you go along
- Don't redraw too fast
- Adaptive runtime optimisation
- Avoid unnecessary boundary checks
- Run directly in the display buffer
- Avoid simple boundary implementations
- Use low-level programming
- Take advantage of the hardware
- Naive implementation
- Buffer swap
- Most obviously, copying the contents of the new buffer over the old one is
an unnecessary step. All you really need to do is swap the buffers around by
exchanging references to their storage. This eliminates a time-consuming array
copy operation.
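As a rough illustration, here is a minimal sketch in C (all the names are mine, and the 1-D rule-110 automaton is there purely as a stand-in update rule). The point is just that the two buffers trade roles through a pointer exchange rather than an array copy:

#include <stdio.h>

#define N 64

int main(void)
{
    /* two buffers; `cur` holds the current generation, `nxt` receives the next */
    static unsigned char a[N], b[N];
    unsigned char *cur = a, *nxt = b;

    cur[N / 2] = 1;                       /* a single live cell in the middle */

    for (int gen = 0; gen < 32; gen++) {
        for (int x = 0; x < N; x++) {
            int l = cur[(x + N - 1) % N]; /* wrap-around neighbourhood */
            int c = cur[x];
            int r = cur[(x + 1) % N];
            nxt[x] = (110 >> ((l << 2) | (c << 1) | r)) & 1;   /* rule 110 */
        }

        /* swap the buffers - no array copy needed */
        unsigned char *tmp = cur;
        cur = nxt;
        nxt = tmp;

        for (int x = 0; x < N; x++)       /* crude display of the new generation */
            putchar(cur[x] ? '#' : '.');
        putchar('\n');
    }
    return 0;
}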
- Bit packing
- Lookup table (LUT)
- Generally a cell takes on a new value which is computed on the basis
of information from the cell's neighbourhood. Consider packing the
information in the neighbourhood into an integer, and using that integer to
index a look-up table containing the new value of the target cell.
This way the computation corresponding to the update rule only needs to
be done once - at the start of the run - and the results can be cached
and used repeatedly while updating the grid.
This approach can burn memory. Think about how big the LUT will be before you
embark on this route.
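As an illustration, here is a sketch of how this might look for Conway's Life, where the 3x3 binary neighbourhood packs into 9 bits and a 512-entry table - built once at start-up - replaces the per-cell rule evaluation (the names are mine, not from any particular library):

static unsigned char lut[512];

/* build the table once, before the main update loop runs */
void build_life_lut(void)
{
    for (int idx = 0; idx < 512; idx++) {
        int centre = (idx >> 4) & 1;      /* bit 4 is the middle cell */
        int live = 0;
        for (int bit = 0; bit < 9; bit++)
            if (bit != 4)
                live += (idx >> bit) & 1; /* count live neighbours */
        lut[idx] = (live == 3 || (live == 2 && centre)) ? 1 : 0;
    }
}

/* during the update, pack the neighbourhood into an index and look the
   new value up instead of re-evaluating the rule */
unsigned char next_state(unsigned char n[3][3])
{
    int idx = 0;
    for (int y = 0; y < 3; y++)
        for (int x = 0; x < 3; x++)
            idx = (idx << 1) | (n[y][x] & 1);   /* centre lands in bit 4 */
    return lut[idx];
}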
- Single buffer
- Typically, using two buffers to manage your automaton is wasteful of memory.
This chews up your cache and slows down execution.
Once cells have been processed, their new values can generally be written into
areas which have already been processed. This sort of technique typically halves the memory
requirements.
When you are processing cells, you generally can't put the results back
exactly where you found them - since the original values are part of the
neighbourhood of other cells. The trick is to shift the entire grid
sideways as you process it.
An example should help clarify things.
First, take a copy of the cells along the top and left edges:
Edge cells need preserving
Then process cells and place the results back in the same array one cell up - and
one cell to the left:
Neighbourhood -> Result
Repeat this on every grid cell, working downwards and to the right:
Raster downwards and to the right
Assuming periodic boundary conditions are employed you will need to deal with
the bottom and right edges separately - using the information you initially
stored about the top and left edges.
Note that, with this scheme, the cells read by each neighbourhood are never
ones that have already been overwritten with previous results.
The result is that the new grid contains the correct version of the next
generation - but shifted up and to the left by one cell. You will need to
compensate for that translation in your display routines.
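Here is a one-dimensional sketch of the same scheme in C (my own illustration - rule 150 is just an example rule). Each cell's new value is written one slot to the left, so every neighbourhood is read before any of its cells is overwritten, and the saved edge copies handle the wrap-around at the ends:

#include <stdio.h>

#define N 16

/* example rule: XOR of the three-cell neighbourhood (rule 150) */
static unsigned char rule(unsigned char l, unsigned char c, unsigned char r)
{
    return l ^ c ^ r;
}

static void step_in_place(unsigned char buf[N])
{
    /* save the left edge - the analogue of the top/left copy in 2-D */
    unsigned char edge0 = buf[0];
    unsigned char edge1 = buf[1];

    /* interior cells: cell x's new value goes into slot x-1 */
    for (int x = 1; x <= N - 2; x++)
        buf[x - 1] = rule(buf[x - 1], buf[x], buf[x + 1]);

    /* the two wrap-around cells, using the saved copies */
    buf[N - 2] = rule(buf[N - 2], buf[N - 1], edge0);  /* cell N-1 */
    buf[N - 1] = rule(buf[N - 1], edge0, edge1);       /* cell 0   */
}

int main(void)
{
    unsigned char buf[N] = {0};
    buf[N / 2] = 1;

    for (int gen = 0; gen < 8; gen++) {
        step_in_place(buf);
        for (int x = 0; x < N; x++)        /* note: no compensation for the */
            putchar(buf[x] ? '#' : '.');   /* one-cell drift in this sketch */
        putchar('\n');
    }
    return 0;
}

Note that the output drifts one cell to the left each generation - a real program would compensate for that in its display code, as described above.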
- Greedy reading
- This is a generic term used to refer to preventing multiple reads from the
same neighbourhood during the update cycle - by caching previous results.
Ideally you munch on as large an area at a time as you can easily fit
into fast local storage.
You should usually attempt to maximise the number of cells you deal with at once
while minimising the size of the neighbourhood you need to consider in order to
calculate the results. This is similar to maximising area while minimising the
perimeter - and results in chunking space into roundish regions.
Historically, this technique has been called "the flywheel". That term came
from a circular bitshift used when processing a CA by working in one
dimension.
I'm not sure the name is still appropriate, given that a raster-like
scanning pattern actually works rather badly with it.
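In one dimension the idea boils down to a rolling window - something like the following sketch, where each cell is fetched from memory only as the window slides past it (the XOR is just a stand-in for the real update rule):

void row_update(const unsigned char *src, unsigned char *dst, int n)
{
    unsigned char l = src[n - 1];       /* wrap-around left neighbour */
    unsigned char c = src[0];

    for (int x = 0; x < n; x++) {
        unsigned char r = src[(x + 1) % n];
        dst[x] = l ^ c ^ r;             /* stand-in for the real rule */
        l = c;                          /* slide the window: no re-reads */
        c = r;
    }
}

In two dimensions the same idea applies to blocks or tiles of cells rather than to three single cells - hence the area-versus-perimeter argument above.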
- Gigantic lookup table (GLUT)
- This is a generalisation of the look-up table approach.
Rather than processing one cell at a time, you combine multiple cells into a
block, combine their neighbourhoods, and use that to index into an array.
The same comments that were given about cache usage apply (in spades) here.
As in the greedy reading approach you
should ideally attempt to maximise the number of cells you deal with at once
while minimising the size of the neighbourhood you need to calculate the results
- all while remaining within your memory constraints. This results in dealing
with near-round regions in one chunk - rather than dealing with long thin
ones.
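As a small illustration, here is a sketch for a binary 1-D automaton that produces two output cells per lookup: the combined neighbourhood of cells x and x+1 is the four cells x-1..x+2, so a 16-entry table built once from the single-cell rule (rule 110 here, purely as an example) yields both new values with a single indexing operation. The wrap-around at the ends is left out for brevity, and all the names are mine:

static unsigned char pair_lut[16];   /* maps 4 input bits -> 2 output bits */

/* the underlying single-cell rule (rule 110, as an example) */
static unsigned char rule(unsigned l, unsigned c, unsigned r)
{
    return (110u >> ((l << 2) | (c << 1) | r)) & 1u;
}

/* build the table once, at the start of the run */
void build_pair_lut(void)
{
    for (unsigned idx = 0; idx < 16; idx++) {
        unsigned a = (idx >> 3) & 1;   /* cell x-1 */
        unsigned b = (idx >> 2) & 1;   /* cell x   */
        unsigned c = (idx >> 1) & 1;   /* cell x+1 */
        unsigned d = idx & 1;          /* cell x+2 */
        pair_lut[idx] = (unsigned char)((rule(a, b, c) << 1) | rule(b, c, d));
    }
}

/* update two cells at a time (n even; the end cells are handled elsewhere) */
void step_pairs(const unsigned char *src, unsigned char *dst, int n)
{
    for (int x = 2; x + 2 < n; x += 2) {
        unsigned idx = (src[x - 1] << 3) | (src[x] << 2)
                     | (src[x + 1] << 1) |  src[x + 2];
        dst[x]     = (pair_lut[idx] >> 1) & 1;
        dst[x + 1] =  pair_lut[idx] & 1;
    }
}

A fuller implementation would typically also bit-pack the cells, so that a whole byte or word of output cells comes from each lookup - but the principle is the same.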
- Sparse matrix techniques
- Sparse matrix techniques involve only processing cells in the vicinity of
active regions.
You keep track of which regions are likely to change and only update cells in the
immediate vicinity.
The most common use of this is to avoid processing empty cells with all empty
neighbours.
However, in principle sophisticated AI techniques could be used to
determine what constitutes an active region:
- Still lifes could be identified - and marked as unchanging;
- Periodic oscillators could be dealt with in a similar way - writing in their states automatically if nothing has changed;
- Even gliders might in principle be dealt with in this manner;
Then there is the fact that - in many automata - the probability of a region
becoming active depends strongly on whether it has just become inactive. As a
result, it can be worth keeping recently-inactive regions allocated for a
while - letting them sit idle occasionally, rather than repeatedly allocating
and deallocating them.
Alan Hensel describes his own techniques along such lines [here].
George Maydwell briefly discusses some related techniques [here].
There are some resources related to optimising the Game of
Life along these lines [here] and [here].
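To give a flavour of the basic bookkeeping, here is a sketch of the simplest version (my own illustration, not the code behind any of the links above): keep a list of the cells that changed last generation and only re-evaluate cells in their neighbourhoods, here with Conway's Life as the example rule. Seeding the initial active list - every live cell and its neighbours - is left out:

#define W 256
#define H 256
#define MAX_ACTIVE (W * H)

typedef struct { int x, y; } Coord;

static unsigned char cur[H][W], nxt[H][W], queued[H][W];
static Coord active[MAX_ACTIVE], next_active[MAX_ACTIVE];
static int n_active, n_next;

/* Conway's Life, used here purely as a concrete example rule */
static unsigned char rule_at(int x, int y)
{
    int live = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            if (dx || dy)
                live += cur[(y + dy + H) % H][(x + dx + W) % W];
    return (live == 3 || (live == 2 && cur[y][x])) ? 1 : 0;
}

void sparse_step(void)
{
    n_next = 0;

    /* only cells in the neighbourhood of something that changed last
       generation can possibly change this generation */
    for (int i = 0; i < n_active; i++)
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                int x = (active[i].x + dx + W) % W;
                int y = (active[i].y + dy + H) % H;
                if (queued[y][x])
                    continue;                   /* already examined this pass */
                queued[y][x] = 1;

                nxt[y][x] = rule_at(x, y);
                if (nxt[y][x] != cur[y][x] && n_next < MAX_ACTIVE) {
                    next_active[n_next].x = x;  /* it changed - stays active */
                    next_active[n_next].y = y;
                    n_next++;
                }
            }

    /* commit the results and clear the bookkeeping for the next pass */
    for (int i = 0; i < n_active; i++)
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                int x = (active[i].x + dx + W) % W;
                int y = (active[i].y + dy + H) % H;
                cur[y][x] = nxt[y][x];
                queued[y][x] = 0;
            }

    for (int i = 0; i < n_next; i++)
        active[i] = next_active[i];
    n_active = n_next;
}

Anything not reached from the active list is guaranteed not to change, because its entire neighbourhood was unchanged last generation.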
- Only redraw changed regions
- Often, redraw code is some of the most expensive. By keeping a note of what
has changed and what remains the same, you can often avoid redrawing significant
areas - by only redrawing those that have changed since the last redraw.
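One common way to do this is to divide the grid into tiles and repaint only the tiles that were touched. A sketch (names mine), with draw_tile() standing in for whatever the real rendering code does:

#define W 512
#define H 512
#define TILE 16
#define TW (W / TILE)
#define TH (H / TILE)

static unsigned char dirty[TH][TW];

/* called from the update loop whenever cell (x, y) changes value */
void mark_dirty(int x, int y)
{
    dirty[y / TILE][x / TILE] = 1;
}

void draw_tile(int tx, int ty);   /* the actual rendering, defined elsewhere */

void redraw_changed(void)
{
    for (int ty = 0; ty < TH; ty++)
        for (int tx = 0; tx < TW; tx++)
            if (dirty[ty][tx]) {
                draw_tile(tx, ty);
                dirty[ty][tx] = 0;   /* clean again until something changes */
            }
}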
- Update as you go along
- Don't redraw too fast
- If you are reading this page, too-rapid redraw may not be your problem ;-)
However, you can sometimes speed up the perceived operation of your CA by
redrawing only alternate frames - or by even less frequent updates.
The human eye can work acceptably well at fairly low frequencies - and you might
be better off spending your cycles in the update code, rather than the redraw
code.
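In code this amounts to little more than running several generations per redraw - something like the sketch below, where the figure of four generations per frame is entirely arbitrary and would normally be tunable:

#define GENS_PER_FRAME 4

void update(void);   /* the CA update code, defined elsewhere */
void redraw(void);   /* the display code, defined elsewhere   */

void main_loop(void)
{
    for (;;) {
        for (int i = 0; i < GENS_PER_FRAME; i++)
            update();    /* spend the cycles here...           */
        redraw();        /* ...and repaint comparatively rarely */
    }
}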
- Adaptive runtime optimisation
- Sometimes different optimisation techniques apply best in different
circumstances at runtime.
The most common place where this happens is when using sparse matrix techniques.
These work well when the CA is relatively empty - but if it fills up with
material the overhead of keeping a list of the active cells comes to outweigh
the benefits associated with ignoring any empty space - and the automaton
actually runs slower than it would if you just mechanically churned through
every cell.
A common solution to this problem is to monitor the "sparseness" of your
automaton - and to apply different update strategies depending on the results.
Levels of sophistication run from keeping a cell-count and switching on some
hard-wired threshold to doing runtime profiling - occasionally timing how long
it takes to perform updates with various techniques, and then choosing the
fastest dynamically (at runtime) based on the findings.
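Here is a sketch of the hard-wired-threshold end of that spectrum (the 20% figure and all the names are purely illustrative):

#include <stddef.h>

extern size_t live_cell_count;   /* maintained by the update code */
extern size_t total_cells;

void sparse_step(void);          /* sparse-list update            */
void full_sweep_step(void);      /* plain sweep over every cell   */

void adaptive_step(void)
{
    /* the sparse bookkeeping only pays off while the grid is mostly empty */
    if (live_cell_count * 5 < total_cells)
        sparse_step();
    else
        full_sweep_step();
}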
- Avoid unnecessary boundary checks
- You may find you have code that reads something like:
loop(x from 0 to n-1) {
  c = buf[x]
  if (x > 0)   l = buf[x - 1] else l = buf[n - 1]
  if (x < n-1) r = buf[x + 1] else r = buf[0]
  out[x] = process(l, c, r)
}
Those if statements inside the loop can be dispensed with:
loop(x from 1 to n-2) {
  c = buf[x]
  l = buf[x - 1]
  r = buf[x + 1]
  out[x] = process(l, c, r)
}
...and handle the cases at either end separately.
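In C the whole pattern might look something like this, with process() standing in for the real update rule and a periodic boundary assumed:

unsigned char process(unsigned char l, unsigned char c, unsigned char r);

void step(const unsigned char *buf, unsigned char *out, int n)
{
    /* interior cells: no boundary tests needed in the hot loop */
    for (int x = 1; x <= n - 2; x++)
        out[x] = process(buf[x - 1], buf[x], buf[x + 1]);

    /* the two end cells, handled once, outside the loop */
    out[0]     = process(buf[n - 1], buf[0], buf[1]);
    out[n - 1] = process(buf[n - 2], buf[n - 1], buf[0]);
}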
- Run directly in the display buffer
- Sometimes, you can combine the buffer used to hold the CA state and the
image used to display it into a single data structure.
This can sometimes economise on your read cache (sometimes at the expense of
your write cache).
Unfortunately, it's often hard to change the display format much if you do this.
- Avoid simple boundary implementations
- Use low-level programming
- Take advantage of the hardware
- One piece of commonly-available hardware can be of
interest to CA programmers - namely the graphics card.
GLLife
suggests that there are possibilities here for hardware
acceleration of cellular automata.
There are also the CPU's graphics-acceleration instructions
to consider - which can include matrix operations and the
like. Such resources may be available to you if you use
low-level programming - or a particularly smart
environment.
If you have got this far, you have probably noticed that the list got a bit
anal towards the end.
I list some of these techniques for the sake of
completeness - not because anyone should necessarily be performing them.
I wish you happy optimisations.
Don't forget to count to ten before embarking.