Skip List Files: skip_list.c, skip_list.h Binary trees work well when elements are inserted in random order. In the average case, find(), insert(), and delete() take O(log n time) in a binary tree. However some sequences of operations, such as inserting the elements in order, produce degenerate tree structures which perform poorly. In such cases, find(), insert() or delete() may take O(n) time. Balanced trees, such as the AVL tree and 2-3 tree, overcome this problem by maintaining the trees balance, so that find(), insert(), and delete() all have O(log n) worst case time. A skip-list is a probabilistic alternative to balanced trees. A skip-list is balance probabilistically, by using a random number generator, so on average find(), insert(), and delete() operations are take O(log n) time. The probability of encountering a performance significantly worse than the average of O(log n) is very small. Because balance is chosen randomly, the chance of encountering the O(n) worst case performance in a large list is very small. Also, no input sequence into a skip list will consistently produce the worst case performance. The Structure of a Skip List --------------------------- In a standard linked list nodes hold a data item, and have a pointer, next, which points to the next node in the list. For this simple linked-list structure, on average O(n) time is spent traversing the list when searching for a particular data item. The skip-list extends this idea, by allowing nodes to have several forward pointers. The `height' of a node is defined as the number of forward pointers it has. The additional forward pointers are used to `skip' over nodes to nodes further ahead in the list. This allows the find() operation to proceed more efficiently compared to the simple linked list. The node structure type used in the file skip_list.h is shown below: typedef struct skip_node { void *item; /* The item held by the node. */ struct skip_node **forward; /* An array of pointers to other nodes. */ int size; /* The size of the forward[] array. */ } skip_node_t; Forward pointers are stored in an array which is allocated to the node when the node is created. In the following description, the pointers are denoted as forward[0], forward[1], ... , forward[h-1], where h is the height of the node. The smallest node height allowed is 1, where the node has just one forward pointer, forward[0]. The forward[0] pointer points to the next node in the list with height 1 or higher (which is just the next node in the list). The forward[1] pointer points to the next node in the list of height 2 or higher. In general forward[i] points to the next node in the list with a with height i-1 or higher. Thus, the pointer forward[i] can be used to skip over all nodes that have a height less than i-1. At the head of the list is a set of head pointers head[0], head[1], ... , head [h_max-1], where h_max is the height of the tallest node in the list. In an actual implementation, an appropriate value for h_max can be calculated when the list is created and node heights be capped so that no node is allocated a height greater than h_max. The head[] pointers can be implemented using the forward[] pointers of a special node of height h_max at the head of the list. When a forward or head pointer is unused, depending on the implementation, its value is set to NULL or a special NULL node at the end of the list can be used. When a special NULL node is used, it must have a key value greater than that of any other node in the list. The use of a special NULL node can give more efficient code, by preventing the need for checking for NULL pointers during a find() operation. The skip list implementation provided by the source files skip_list.c and skip_list.h just use NULL pointers for flexibility. Allocation of Node Heights -------------------------- A nodes height is determined when it is created, and remains the same throughout the lifetime of the node. Node heights are allocated with a probability p, defined as follows. A fraction, p of the nodes with forward[i] pointers also have forward[i+1] pointers. This allows a node heights, h, to be chosen randomly, as follows: Set h = 1 initially; Repeat with probability p { h = h + 1; } Suppose p = 0.5, then a 1/2 of all nodes will have height 1, 1/4 height 2, 1/8 height 3, and so on. It can be seen that there are fewer nodes with larger node sizes. This allocation of node heights, allows efficient searching, by following the higher forward pointers in order to skip many smaller nodes at once. In practice, when implementing a skip-list, the maximum node height needs to be known. For simplicity it is best to limit node heights to a maximum height allowed, h_max. If we know the expected maximum number of nodes, n_max, to be stored in the skip_list, then the value to use for maximum height can be calculated as follows: h_max = -log(n_max)/log(p) or log(n_max) (where the log base is 1/p) The capped node heights are generated as follows: Set h = 1 initially; Repeat with probability p { h = h + 1; if h > h_max then stop; } The find() Operation -------------------- In the following pseudo-code description, h_max denotes the maximum node height for the list, and k denotes the key value (or value of the data item) being searched for. This description of the find() operation assumes a special head node, head, but does not use a special NULL node. forward = head->forward; i = head->size - 1; loop { /* Ignore NULL forward pointers. */ while forward[i] == NULL do { i = i - 1; if i < 0 then goto exit_loop; } /* Don't traverse towards nodes which are not smaller than the * item being searched for. */ while key(forward[i]->item) >= key(k) do { i = i - 1; if i < 0 then goto exit_loop; } forward = forward[i]->forward; } exit_loop: In the code above, the variable forward points to the current nodes forward array, and is reassigned each time a link in the skip-list is traversed. When the above code exits the loop, forward[0]->item will point to the first item in the list with a key greater than or equal to k. When the loop exits, it must be checked that forward[0] is not NULL. Then it can be checked if key(forward[0]->item) == k, to determine if a matching item was found. For each cycle of the main loop, a link is followed to reach a node with a key smaller than k. The index i specifies which forward pointer is being traversed. Within the main loop, the first `while' loop is used to bypass forward pointers which are NULL. A pointer forward[i] is NULL when there are no more nodes of height i or greater in the list. This check for NULL pointers could be avoided if the skip_list is implemented using a special NULL terminator node at the end of the list. In such an implementation, the NULL terminator node would need to have a value greater than any other node in the list. The second `while' loop is used to bypass pointers to nodes with items that have a key which is greater than or equal to k. Within both of the `while' loops, the index, i, is decreased. When i reaches -1, the search is finished. The find() operation takes O(log n) time on average. In the worst case, the time complexity is O(n). However, the probability that the time taken by a find() operation is significantly worse than O(log n) is very small. The insert() Operation ---------------------- The insert() operation inserts an item, x, into the list. The insert() operation must locate the insertion position for x, using the same search process as the find() operation does. During this find() process it is necessary to keep track of which pointers will need to be updated after x is inserted into the list. This is achieved by maintaining an array update[] during the find process. Entry update[i] points to the node whose forward[i] pointer will be updated. A pseudo-code description of the find part of the insert() operation is given below: /* Locate insertion position. */ node_ptr = head; forward = node_ptr->forward; i = node_ptr->size - 1; loop { /* Ignore NULL pointers at the top of the forward pointer array. */ while forward[i] == NULL do { update[i] = node_ptr; i = i - 1; if i < 0 then goto end_find_loop; } /* Don't traverse toward nodes which are not smaller than the * item being searched for. */ while key(forward[i]->item) >= key(x) { update[i] = node_ptr; i = i - 1; if i < 0 then goto end_find_loop; } node_ptr = forward[i]; forward = node_ptr->forward; } end_find_loop: When the insertion position has been found, a new node, new_node can be created to hold x. The new node is allocated a random size for its forward[] array (see "Allocation of Node Heights" above). Then the appropriate pointers are updated as follows: /* Update pointers in the list. */ forward = new_node->forward; for(i = 0; i < new_node->size; i++) { /* new_node takes over another nodes forward pointer. */ forward[i] = update[i]->forward[i]; /* the other nodes forward pointer is updated to point to new_node. */ update[i]->forward[i] = new_node; } As for the find() operation, the insert operation takes O(log n) time on average. The delete() Operation ---------------------- The delete() operation removes the item with key k from the list. To locate the item to be deleted, the delete() operation uses the same find process as the insert operation. The update[] array is maintained the same as before. After the find process, the node which holds the item with key k is to be removed. Let remove_node be the node to be removed. Then pointers are updated as follows: forward = remove_node->forward; for(i = 0; i < remove_node->size; i++) { update[i]->forward[i] = forward[i]; } In each cycle of this loop, the forward[i] pointer which points to remove_node is updated to the value of remove_node's forward[i] pointer. When these pointer updates have completed, remove_node can be removed from the list. As for the find() operation, the delete operation take O(log n) time on average. The find_min() and delete_min() operations ------------------------------------------ The find_min() returns the minimum item, which is always the first item in the list. The first item in the list is easily accessible, so find_min() operation has O(1) worst case time complexity. Although the minimum node is located in O(1) time, the delete_min() operation uses time proportional to the size of the minimum node for updating the head pointers of the list after the minimum node is removed. This will take O(1) time on average, and O(log n) time at worst.