SORTING: AN EXPLORATIVE ARTICLE
Writer: Mike James (with modifications by Matthew R.Knight)
This tutorial deals with methods of sorting data into order. It is something of a shame that this subject has been largely neglected by the QB community because, apart from being a subject that every programmer should know a little about, it also contains enough surprising and ingenious programming methods to keep any programmer amused!
It is something of a puzzle why so many programmers have trouble with sorting... By examining the ways that humans carry out sorting in everyday life, you can begin to write programs to do the very same thing! How well does it sort?
If you have a program that sorts a list into order then obviously it is important that it does its job fast. There are some sorting algorithms that work very well with small amounts of data but the time that they take increases rapidly if you increase the amount of data to be sorted. It is important to know how well a sorting algorithm works but how do you measure this? The two main things influencing how long a sorting method takes are the actual set of data you give it to sort as well as how much of it there is. For example, if the list happens to already be in order then nearly all sorting methods will notice this and stop after doing very little work.
Similarly, the more the list is already in order, the faster a sorting method works. An obvious measure of how good a sorting algorithm is, is how fast it will perform on average with a list of a particular size. There will be lists that are sorted faster, and then there will be lists that take much longer. In fact, some lists will take much longer than the average time - for example, a list that is ordered in the entirely reverse order to that which is desired. Such lists are often called pathological. When comparing sorting methods there are two things to be taken into account - the average time to sort a list and the longest time to sort a list. For instance, if you find a sorting method which is fast on average but incredibly slow on a pathological list then make sure all your lists are well behaved or be prepared for a long wait every now and again! Obviously, an ideal sorting algorithm should not get much slower as the length of the list increases; it should be fast on average and not much worse for a pathological list.
Now that we know what we want it has to be admitted that most sorting methods are far from ideal!
Some simple sorting methods:
If you were given a pile of marked exam papers and were asked to put them in order you might proceed as follows:
1. Scan through all the papers and find the one with the highest mark.
2. Place this paper at the front of the pile. (This paper is now sorted and takes no part in any further sorting.)
3. Repeat (1) and (2) with the unsorted remainder of the pile, putting the highest one in front of the remainder, until all the papers are sorted. This is known as a selection sort because it is the repeated application of selecting the highest mark from the remaining papers that is used to sort the pile. In BASIC this is:
10 N = 10 20 DIM L(N) 30 GOSUB 1000 40 GOSUB 2000 50 GOSUB 3000 1000 FOR I = 1 TO N 1010 L(I) = RND 1020 NEXT I 1030 RETURN 2000 FOR I = 1 TO N-1 2010 M = I 2020 FOR J = I+1 TO N 2030 IF L(M) < L(J) THEN M = J 2040 NEXT J 2050 T = L(I) 2060 L(I) = L(M) 2070 L(M) = T 2080 NEXT I 2090 RETURN 3000 FOR I = 1 TO N 3010 PRINT L(I) 3020 NEXT I 3030 RETURN
Subroutine 1000 constructs a list of random numbers to be sorted and subroutine 3000 prints the list. Both of these subroutines will be used in the following examples without being printed again. Subroutine 2000 is the actual sort part of the program. It works in much the same way as the human method except that it uses a single array rather than a pile of papers. Lines 2010-2040 find the largest value in the array, starting from L(I) and ending with L(10). Lines 2050-2070 carry out a swap between L(I) and the largest element. Finding the largest value and swapping it is repeated until the array is sorted by the FOR loop starting at line 2000.
The selection sort is not a fast method. Just by looking at the way the two FOR loops are nested you should be able to see that, if L has N elements, the complete sort will take roughly N*N/2 comparisons and 3*N swaps. This figure however, is not affected at all by any order already present in the list so the selection sort is equally bad for all cases!
Another method based on the way a human might do it is the insertion sort. If you were given a hand of cards to arrange in order you might proceed along the following lines:
1. Place the first two cards in their correct order.
2. Then place the third card correctly within the first two.
3. Then place the fourth card in its correct place within the first three and continue like this until all the cards have been sorted. This idea is easy to convert into BASIC by using two arrays. The first holds the items to be sorted and the second is used to place the items one at a time in order.
10 N = 10 20 DIM L(N) 30 DIM A(N) 40 GOSUB 1000 50 GOSUB 2000 60 GOSUB 3000 70 STOP 2000 A(1) = L(1) 2010 FOR I = 2 TO N 2020 J = 1 2030 IF L(I) <= A(J) THEN GOTO 2090 2040 FOR K = I-1 TO J STEP -1 2050 A(K+1) = A(K) 2060 NEXT K 2070 A(J) = L(I) 2080 GOTO 2130 2090 LET J = J + 1 2100 IF J <= I -1 THEN GOTO 2030 2120 A(J+1) = L(I) 2130 NEXT I 2130 RETURN
And change line 3010 to: 3010 PRINT L(I), A(I)
This sort is once again carried out by subroutine 2000. Line 2000 takes the first item in the list L and places it in the array where the sorted list will be stored (i.e. A). The FOR loop starting at 2010 is responsible for inserting each of the items from L into A. Line 2020 to 2120 are responsible for actually carrying out the swap. The new item is compared with the items already sorted in the array by line 2030. When the correct position for the new item is found, lines 2040-2060 make space for it by moving all the elements that are smaller than it up the array by one place. Then the new item is inserted by line 2070. If the new item is smaller than any already in the array A, it is just appended to the end by line 2120 without having to make any space.
This procedure looks as though it will be slower than a selection sort because of all the moving about of already sorted items to make room for new items. However, things are not what they seem. The number of comparisons needed to sort N items is only N*N/4, which is better than a selection sort, but the number of moves is also N*N/4 which is worse than the 3N swaps of the selection sort except when N is tiny. In light of this, the insertion sort usually performs only slightly better than a selection sort, yet it uses twice as much memory... Draw your own conclusions.
The next sorting method, the bubble sort, is pherhaps a little too well known. This is really the first method that is not based on the way a human would sort things but tries to make use of the speed with which a computer can do simple things. The basic bubble sort algorithm is easy to understand. All you have to do is make a scan through the array comparing items that are next door to each other. If they are in the correct order then nothing is done. If they are in the wrong order then they are swapped. This scan is repeated until the list is fully sorted and there are no swaps made on a scan. The BASIC for this is:
10 N = 10 20 DIM L(N) 30 GOSUB 1000 40 GOSUB 2000 50 GOSUB 3000 60 STOP 2000 F = 0 2010 GOSUB 2500 2020 IF F = 0 THEN RETURN 2030 GOTO 2000 2500 FOR I = 1 TO N-1 2510 IF L(I) > L(I+1) THEN GOTO 2560 2520 T = L(I) 2530 L(I) = L(I + 1) 2540 L(I+1) = T 2550 F = 1 2560 NEXT I 2570 RETURN
The structure of this program is very simple to understand. Subroutine 2000 first sets a flag F to zero and calls subroutine 2500 which performs a scan and sets the flag F to 1 if a swap occurs. Line 2020 repeats the scan until F is zero indicating that array has been completely sorted. Subroutine 2500 performs the scan using a FOR loop and compares L(I) with L(I+1). The swap is carried out by lines 2520-2540 and the flag is set in 2550. The bubble sort is so easy to understand and quick to program because it is far from a good method of sorting! You need about N*n/2 comparisons and 3*n*n/4 swaps to sort a list of N items and this is worse than an insertion sort! The interesting thing about the bubble sort is that you can find it works very quickly on some lists - for example, a list that has only a few well sepparated pairs of items in the wrong order will be sorted in one scan. There are, however, also lists on which it does very badly - for example, a list in reverse order. In conclusion, the bubble sort is slow and very sensitive to the initial order of the list. There is an improvement to the simple bubble sort that makes it very good indeed. The main trouble with the bubble sort comes from the fact that items move by only one place at a time. If an item is at the bottom of the list but should be at the top then it will take N scans to move it to its correct place. If we extend the bubble method to comparing and swapping items that are further apart than one, then we might improve the speed with which items reach their final resting place. If the distance between items compared and possibly swapped is D we call the resulting sort a D-sort. In other words, the bubble sort can also be called a 1-sort because it compares items that are next to each other - that is, 1 apart. The only complication is that a list can be sorted as far as a 4-sort, say, is concerned but still not sorted as far as a 1-sort is concerned. In other words, all the items 4 elements apart can be in the correct order but adjacent items can still be the wrong way round. In practise, we employ a sequence of sorts with decreasing distances between the items compared until a 1-sort finally gives the fully sorted array. This method is often called a Shell sort after the person who invented it. A Shell sort in BASIC is not much more difficult than a bubble sort:
10 N = 10 20 DIM L(N) 30 GOSUB 1000 40 GOSUB 2000 50 GOSUB 3000 60 STOP 2000 D = 8 2010 IF D < 1 THEN RETURN 2020 F = 0 2030 GOSUB 2500 2040 IF F = 1 THEN GOTO 2020 2050 D = D / 2 2060 GOTO 2010 2500 FOR I = 1 TO N-D 2510 IF L(I) > L(I+D) THEN GOTO 2560 2520 T = L(I) 2530 L(I) = L(I + D) 2540 L(I + D) = T 2550 F = 1 2560 NEXT I 2570 RETURN
The way that this program works can be understood by comparing it with the bubbble sort given earlier. The distance between items compared is stored in D and starts out at 8. After each scan at a particular D returns without making any swaps, the value of D is halved and a new series of scans is initiated. This continues until a 1-sort returns without making any swaps. Thus, the sequence of sorts is 8-sort, 4-sort, 2-sort, 1-sort. It is difficult to say how good the shell sort is on average as it depends on the initial value given to D but it is one of the simplest of the efficient sorting methods and is a good practical choice unless you are going to be sorting very large lists of data. Certainly, there is never a good reason for using a bubble sort when you can use a Shell sort instead. This examination of sorting methods could continue almost endlessly. However, this tutorial has introduced fundamental ideas that will enable you to understand the problems and principles of sorting and appreciate some alternative approaches. There are plenty of more elaborate methods, any of which would need a great deal of explanation. If you need to sort very large lists of data then you will need a method that is even better than any of those presented in this tutorial. Some names to look up include tree sort, heap sort and quick sort.