Skip List

Files:  skip_list.c, skip_list.h


Binary trees work well when elements are inserted in random order.  In the
average case, find(), insert(), and delete() take O(log n time) in a binary
tree.  However some sequences of operations, such as inserting the elements in
order, produce degenerate tree structures which perform poorly.  In such cases,
find(), insert() or delete() may take O(n) time.  Balanced trees, such as the
AVL tree and 2-3 tree, overcome this problem by maintaining the trees balance,
so that find(), insert(), and delete() all have O(log n) worst case time.

A skip-list is a probabilistic alternative to balanced trees.  A skip-list is
balance probabilistically, by using a random number generator, so on average
find(), insert(), and delete() operations are take O(log n) time.  The
probability of encountering a performance significantly worse than the average
of O(log n) is very small.  Because balance is chosen randomly, the chance of
encountering the O(n) worst case performance in a large list is very small.
Also, no input sequence into a skip list will consistently produce the worst
case performance.


The Structure of a Skip List
---------------------------

In a standard linked list nodes hold a data item, and have a pointer, next,
which points to the next node in the list.  For this simple linked-list
structure, on average O(n) time is spent traversing the list when searching for
a particular data item.  The skip-list extends this idea, by allowing nodes to
have several forward pointers.  The `height' of a node is defined as the number
of forward pointers it has.  The additional forward pointers are used to `skip'
over nodes to nodes further ahead in the list.  This allows the find()
operation to proceed more efficiently compared to the simple linked list.  The
node structure type used in the file skip_list.h is shown below:

typedef struct skip_node {
    void *item;                  /* The item held by the node. */
    struct skip_node **forward;  /* An array of pointers to other nodes. */
    int size;                    /* The size of the forward[] array. */
} skip_node_t;

Forward pointers are stored in an array which is allocated to the node when the
node is created.  In the following description,  the pointers are denoted as
forward[0], forward[1], ... , forward[h-1], where h is the height of the node.
The smallest node height allowed is 1, where the node has just one forward
pointer, forward[0].  The forward[0] pointer points to the next node in the
list with height 1 or higher (which is just the next node in the list).  The
forward[1] pointer points to the next node in the list of height 2 or higher.
In general forward[i] points to the next node in the list with a with height
i-1 or higher.  Thus, the pointer forward[i] can be used to skip over all nodes
that have a height less than i-1.

At the head of the list is a set of head pointers head[0], head[1], ... , head
[h_max-1], where h_max is the height of the tallest node in the list.  In an
actual implementation, an appropriate value for h_max can be calculated when
the list is created and node heights be capped so that no node is allocated a
height greater than h_max.  The head[] pointers can be implemented using the
forward[] pointers of a special node of height h_max at the head of the list.

When a forward or head pointer is unused, depending on the implementation, its
value is set to NULL or a special NULL node at the end of the list can be used.
When a special NULL node is used, it must have a key value greater than that of
any other node in the list.  The use of a special NULL node can give more
efficient code, by preventing the need for checking for NULL pointers during a
find() operation.  The skip list implementation provided by the source files
skip_list.c and skip_list.h just use NULL pointers for flexibility.


Allocation of Node Heights
--------------------------

A nodes height is determined when it is created, and remains the same
throughout the lifetime of the node.  Node heights are allocated with a
probability p, defined as follows.  A fraction, p of the nodes with forward[i]
pointers also have forward[i+1] pointers.  This allows a node heights, h, to be
chosen randomly, as follows:

    Set h = 1 initially;
    Repeat with probability p {
        h = h + 1;
    }

Suppose p = 0.5, then a 1/2 of all nodes will have height 1, 1/4 height 2, 1/8
height 3, and so on.  It can be seen that there are fewer nodes with larger
node sizes.  This allocation of node heights, allows efficient searching, by
following the higher forward pointers in order to skip many smaller nodes at
once.

In practice, when implementing a skip-list, the maximum node height needs to be
known.  For simplicity it is best to limit node heights to a maximum height
allowed, h_max.  If we know the expected maximum number of nodes, n_max, to be
stored in the skip_list, then the value to use for maximum height can be
calculated as follows:

       h_max = -log(n_max)/log(p)

    or

       log(n_max)   (where the log base is 1/p)


The capped node heights are generated as follows:

    Set h = 1 initially;
    Repeat with probability p {
        h = h + 1;
        if h > h_max then stop;
    }


The find() Operation
--------------------

In the following pseudo-code description, h_max denotes the maximum node height
for the list, and k denotes the key value (or value of the data item) being
searched for.  This description of the find() operation assumes a special head
node, head, but does not use a special NULL node.
        

    forward = head->forward;
    i = head->size - 1;
    loop {

	/* Ignore NULL forward pointers. */
	while forward[i] == NULL do {
	    i = i - 1;
            if i < 0 then goto exit_loop;
	}

	/* Don't traverse towards nodes which are not smaller than the
	 * item being searched for.
	 */
        while key(forward[i]->item) >= key(k) do {
	    i = i - 1;
	    if i < 0 then goto exit_loop;
	}

	forward = forward[i]->forward;
    }
  exit_loop:


In the code above, the variable forward points to the current nodes forward
array, and is reassigned each time a link in the skip-list is traversed.  When
the above code exits the loop, forward[0]->item will point to the first item in
the list with a key greater than or equal to k.  When the loop exits, it must
be checked that forward[0] is not NULL.  Then it can be checked if
key(forward[0]->item) == k, to determine if a matching item was found.

For each cycle of the main loop, a link is followed to reach a node with a key
smaller than k.  The index i specifies which forward pointer is being
traversed.  Within the main loop, the first `while' loop is used to bypass
forward pointers which are NULL.  A pointer forward[i] is NULL when there are
no more nodes of height i or greater in the list.  This check for NULL pointers
could be avoided if the skip_list is implemented using a special NULL
terminator node at the end of the list.  In such an implementation, the NULL
terminator node would need to have a value greater than any other node in the
list.  The second `while' loop is used to bypass pointers to nodes with items
that have a key which is greater than or equal to k.  Within both of the
`while' loops, the index, i, is decreased.  When i reaches -1, the search is
finished.

The find() operation takes O(log n) time on average.  In the worst case, the
time complexity is O(n).  However, the probability that the time taken by a
find() operation is significantly worse than O(log n) is very small.


The insert() Operation
----------------------

The insert() operation inserts an item, x, into the list.  The insert()
operation must locate the insertion position for x, using the same search
process as the find() operation does.  During this find() process it is
necessary to keep track of which pointers will need to be updated after x is
inserted into the list.  This is achieved by maintaining an array update[]
during the find process.  Entry update[i] points to the node whose forward[i]
pointer will be updated.  A pseudo-code description of the find part of the
insert() operation is given below:

    /* Locate insertion position. */
    node_ptr = head;
    forward = node_ptr->forward;
    i = node_ptr->size - 1;
    loop {

	/* Ignore NULL pointers at the top of the forward pointer array. */
	while forward[i] == NULL do {
	    update[i] = node_ptr;
	    i = i - 1;
            if i < 0 then goto end_find_loop;
	}

	/* Don't traverse toward nodes which are not smaller than the
	 * item being searched for.
	 */
        while key(forward[i]->item) >= key(x) {
	    update[i] = node_ptr;
	    i = i - 1;
	    if i < 0 then goto end_find_loop;
	}

	node_ptr = forward[i];
	forward = node_ptr->forward;
    }
  end_find_loop:


When the insertion position has been found, a new node, new_node can be created
to hold x.  The new node is allocated a random size for its forward[] array
(see "Allocation of Node Heights" above).  Then the appropriate pointers are
updated as follows:

    /* Update pointers in the list. */
    forward = new_node->forward;
    for(i = 0; i < new_node->size; i++) {

        /* new_node takes over another nodes forward pointer. */
	forward[i] = update[i]->forward[i];

        /* the other nodes forward pointer is updated to point to new_node. */
	update[i]->forward[i] = new_node;

    }

As for the find() operation, the insert operation takes O(log n) time on
average.


The delete() Operation
----------------------

The delete() operation removes the item with key k from the list.  To locate
the item to be deleted, the delete() operation uses the same find process as
the insert operation.  The update[] array is maintained the same as before.
After the find process, the node which holds the item with key k is to be
removed.  Let remove_node be the node to be removed.  Then pointers are updated
as follows:

    forward = remove_node->forward;
    for(i = 0; i < remove_node->size; i++) {
        update[i]->forward[i] = forward[i];
    }

In each cycle of this loop, the forward[i] pointer which points to remove_node
is updated to the value of remove_node's forward[i] pointer.  When these
pointer updates have completed, remove_node can be removed from the list.

As for the find() operation, the delete operation take O(log n) time on
average.


The find_min() and delete_min() operations
------------------------------------------

The find_min() returns the minimum item, which is always the first item in the
list.  The first item in the list is easily accessible, so find_min() operation
has O(1) worst case time complexity.

Although the minimum node is located in O(1) time, the delete_min() operation
uses time proportional to the size of the minimum node for updating the head
pointers of the list after the minimum node is removed.  This will take O(1)
time on average, and O(log n) time at worst.