****************************************************************************************** * SORT WITHIN CASES: IMPLEMENTATION OF SOME POPULAR SORTING ALGORITHMS * ****************************************************************************************** *Kirill Orlov. kior@comtv.ru * 2007/03/16 *Below is SPSS syntax of some well-known sorting algorithms. *In this case, the algos are used to sort values within cases. *Replace number 13 with the number of variables you are going to submit to sorting, and run *the corresponding syntax of this or that algo. *The following algos considered: *SIMPLE SORTING ALGORITHMS: BUBBLE, SELECTION, INSERTION. *ALGORITHMS FOR QUICK SORTING OF LARGE VECTORS: SHELL's INSERTION, HEAPSORT, QUICKSORT. *What's a "stable" sorting algorithm: *If in the data (vector) there are identical values ("ties") then a "stable" sorting algo *guarantees that original order among them will be kept, that is, the value occupying initially earlier (more left) *position in the vector than another value, equal to it, will stay more left than it in the output vector. *"Non-stable" sorting algo does not quarantee it. ****************************************************************************************** *Create data. input prog. vector v (13 f2). loop #c= 1 to 20. -loop #v= 1 to 13. - comp v(#v)= rnd(rv.uniform(1,13)). -end loop. -end case. end loop. end file. end input prog. exec. *************** BUBBLE (=EXCHANGE) algorithm, STABLE VERSION ******************. *Vars are looked through in pairs and swap their values when necessary. *The procedure repeates until no more swap is needed. *This algo is regarded as the most slow among simple sorting algorithms. It is "stable". vector v= v1 to v13. loop #it= 1 to 13. /*It can take maximum as much iterations of the procedure as there're vars -comp #sorted= 1. /*That's signal that the vector is sorted, let's think meanwhile it's sorted -loop #i= 2 to 13-#it+1. /*Run through vars, taking pair: this var and preceding one (or could take this and the following - doesn't matter) /*Could write just 'to 13', but 'to 13-#it+1' is time-thrifty (the thing is that at each iteration, the largest [smallest, for desc sort] value /*of all the series will definitely find itself at the vector's tail, therefore we need not to touch this tail during next iteration) - do if v(#i) instead <] -end loop. /*so that on exit from the run m is indeed the pos of min (max) value -comp #temp= v(#i). /*Now do swap: -comp v(#i)= v(#m). /*put min (max) value in pos i -comp v(#m)= #temp. /*and the value having been in i - put it in pos m end loop. /*Take next value (i+1) as supposed min (max) exec. *If data contain missing values then they will stay in their positions. To sort them all to the vector's tail *add to the condition of the IF command (that is inside loop): 'or miss(v(#m))'. ******** SELECTION (=EXTREMUM) algorithm, SHIFT VERSION (STABLE) *********. *This version is based on the previous one except that instead of swapping 2 values (current and the extremum) *it undertakes additional run-through to move all values that are on the left of the extremun to the right by 1 position. *At a cost of this prolongation (slowing) of the work "stability" is achieved. vector v= v1 to v13. loop #i= 1 to 13-1. -comp #m= #i. -loop #j= #i+1 to 13. - if v(#j)#val. /*And we must find a place within them where to insert our value /*Therefore, if the next value at the left to it is greater [lesser - for descending sort] then run through the sorted subvector from right to left - comp v(#i+1)= v(#i). /*shifting values there 1 pos to the right until we run into value which is not greater than ours -end loop. -comp v(#i+1)= #val. /*There, we insert our value beside it, at the right; now the left, sorted subvector expanded by one value end loop. /*Take next value in order, etc exec. *If data contain missing values then they will stay in their positions and serve as points of division in "subvectors": sort will go *within subvectors only. E.g., if a case is 5 4 . 5 6 4, the output is 4 5 . 4 5 6. *If you want to escape this division and want to sort all missing to the vector's tail, to if-condition inside loop add: 'or miss(v(#i))'. *************************** SHELL's INSERTION algorithm ****************************. *Shell (1959) offered how to speed up Insertion algo in situations of large as well as far-from-sortedness vectors. *The main point is that values "on the left" - among which a place to insert a taken value is searched - are scanned through *with large step (i.e. step not 1 but larger) which accelerates finding a place for insertion. After all values have been thus inserted *and the vector becomes closer to sordetness the procedure repeates with a lesser step, then again with still lesser step. In the end, with step 1 *(i.e. usual Insertion algo is applied when the vector is almost sorted). *For instance, there's a vector of 13 values. Initially take such a step so that the vector split in subvectors of 2 values, *distanced from each other by step: 13/2=6.5, i.e. step'll be 6. These are values with positions 1-7, 2-8, 3-9, etc. *Within these subvectors - within pairs of values - insertion sorting will be done. After that, another splitting of vector is done, 4 values in subvector, *separated by step 3 (13/4=3.25). These are positions 1-4-7-10, 2-5-8-11, 3-6-9-12. And these subvectors undergo sorting. *At the end, the whole vector will be sorted. *Shell's sort is not "stable", nor it is "online" sort. vector v= v1 to v13. loop #n= 1 to 13. /*We'll do "splittings" of vector (number of splittings in fact will be less then 13, we'll be over before we reach n=13) -comp #n= 2**#n. /*with n elements (values) in a subvector: 1-st splitting n=2, at 2-nd splitting n=4, at 3-rd n=8 etc -comp #step= trunc(13/#n). /*Elements in subvector are not adjacent, rather, they are separated by gap step which's therefore equals the number of full (with n elements) subvectors -if #step=1 #n= 13. /*If vector contains only one full subvector - take the whole vector instead -loop #sv= 1 to #step. /*For each "splitting", undertake Insertion sorting in each of its subvectors, sv is the number of a subvector *****Highlighted section is Insertion algo, the only nuance being that it goes with step step instead of step 1. - loop #ins= #sv+#step to #sv+#step*(#n-1) by #step. - comp #val= v(#ins). - loop #i= #ins-#step to #sv by -#step if v(#i)>#val. /*[for desc sort: < instead of >] - comp v(#i+#step)= v(#i). - end loop. - comp v(#i+#step)= #val. - end loop. *****. -end loop. end loop if #step=1. /*If the subvector was the only one (and it is the whole vector) that's signal to end job, sorting is done exec. *If there are missing values in your data then to if-condition in 'loop #i' add: 'or miss(v(#i))', and they will be sorted to vector's tail. *Without this addition, missing values will produce absurd sorting result. ************************ HEAPSORT algorithm ******************************. *This is a quick algo for large vectors. It is not "stable". *The job comprises 2 stages. On the 1st the vector is sorted into partly sorted vector called heap. *On the 2nd stage the heap is further sorted into the result. *Heap is a hierarchical structure of "parent with 2 children" type where a parent element is greater than its children *(such heap is called max-heap) or a parent element is lesser than its children (heap called min-heap). *Heap is stored as vector wherein for each element on position p *there are corresponding children elements on positions 2*p and 2*p+1. An example of max-heap in form of figure and vector: * 10 * 8 6 * 3 6 5 1 * 3 2 3 5 4 * *posit 1 2 3 4 5 6 7 8 9 10 11 12 *value 10 8 6 3 6 5 1 3 2 3 5 4 * *On both stages, the algo is the same, swapping: elements on parental positions are compared with their children, *and if it occures one of the children is greater (lesser, for building min-heap) than its parent the child swaps positions with the parent. *The difference between the stages is in regulations for the algo: on the 1st stage all parents are being checked which lead to heap where *the 1st element - root parent - is the largest (smallest) of all. On the 2nd stage that parent is withdrawn from swapping and the rest of vector *undergoes heapifying, then the root parent is withdrawn again, one more shortening the part which's left for heapifying, etc.: *each time that part becomes shorter and, besides, needs less ans less further sorting. *It may seem paradoxal at first glance but it is bulding of max-heap that results in ascending sort, and it is bulding of mix-heap that *brings descending sort. vector v= v1 to v13. comp #end= 13. loop #stage= 1 to 2. /*Two stages: create heap on the 1st, sort heap on the 2nd -loop #p= trunc(13*#stage/2) to 1 by -1. /*On the 1st stage p is parent's position: scan all parents (i e all elements from vector's mid to the left) /*On the 2nd stage value p runs through all elements from vector's end to the left - do if #stage=2. /*On the 2nd stage there're differences with the 1st: - comp #temp= v(1). /*Swap 1st element (most grand parent) with last element (rightmost child) in the vector - comp v(1)= v(#p). - comp v(#p)= #temp. - comp #end= #p-1. /*And "withdraw" the last element from the vector, for it's already sorted: it's the largest (smallest, for desc sort) of all currently; /*And in so doing the still-not-sorted part of the vector will gradually shorten - comp #p= 1. /*As for parent - to compare with children - take always currently the 1st element of vector - end if. *****Highlighted is comparison/swap of parent and child ("sifting"="percolating"). - loop #i= 1 to 13 if #p*2<=#end. /*Will repeat while a parent has a child (i e parent's pos p is such that 2*p doesn't exit vector) /*i will be not used (introduced only to escape SET MXLOOPS) and cycles will terminate before i=13 - comp #c= #p*2. /*Position of the child (left one of the two) - do if #c+1<=#end. /*When there're 2 children - if v(#c+1)>v(#c) #c= #c+1. /*then, if the right child is greater [lesser, for desc sort: < instead >] than the left, take the right one into comparison - end if. - do if v(#p) instead <] than the parent - comp #temp= v(#p). - comp v(#p)= v(#c). - comp v(#c)= #temp. - comp #p= #c. /*so that the parent is now on the child's pos (and so we'll then compare it with its futher children: its former grand-children) - else. /*And if parent is greater than the largest [lesser than the smallest] child, exit - break. - end if. - end loop. /*to take next parent in a queue for parent-children comparison sift *****. -end loop. end loop. exec. *If there are missing values the sort will be with errors. To ward it off and sort all missing to the vector's tail, *to condition 'if v(#c+1)>v(#c)...' add: 'or miss(v(#c+1))', and to condition 'do if v(#p)0. /*r is position in vectors s and e to read from: read s (start pos) and e (end pos) of the input subvector; the whole job lasts while there's something written in vector s (and e) -comp #s= #s(#r). /*Copy start pos because it will change (while e, end pos, will remain for a while) -loop #w= #w to 13 if #s<#e(#r). /*The task with the taken subvector is to repeatedly divide it into left and right wings with recoil from left (i e divide, then divide its right wing, /*then divide the right wing of the right wing, etc) while there remains smth to divide: end pos is beyond start pos; /*and w being pos in vectors s รจ e on which to write the whereabouts of emergent left wings, for we'll deal with them later ****Hightlited section is the division of subvector****. - comp #p= trunc((#s+#e(#r))/2). /*Select pivot value: let it be the middle element in the subvector - comp #temp= v(#p). /*Carry it to the head of the subvector (it is taking it away to a safe place) - comp v(#p)= v(#s). /*while the 1st element goes to pivot's former pos (swap) - comp v(#s)= #temp. - comp #p= #e(#r). /*Now, p will be the reception pos for values being thrown over to the right wing (let p be the end of the subvector initially) - loop #i= #e(#r) to #s+1 by -1. /*Do throwing over of values: take value one by one starting from the end (do not touch the pivot value kept on pos s) - do if v(#i)>v(#s). /*If the value is greater [lesser, for desc sort: < instead >] than the pivot value - comp #temp= v(#i). /*then put it on the right wing on the reception pos p - comp v(#i)= v(#p). /*(swap with the value occupying it) - comp v(#p)= #temp. - comp #p= #p-1. /*And move the reception pos by 1 to the left - end if. - end loop. /*Take next value to compare it with the pivot value and to through over to the right if necessary - comp #temp= v(#s). /*After throwings-over are done, put the pivot value on the pos p, wherever it is now - comp v(#s)= v(#p). /*(swap with the value occupying it) - comp v(#p)= #temp. /*Two wings are ready: all values on the right of the pivot value (its pos p) are larger [smaller], while on the left of it are not larger [not smaller] than it ****. - comp #s(#w)= #s. /*Start pos of left wing: write it to our memorandum - comp #e(#w)= #p-1. /*Start pos of left wing: write it to our memorandum - comp #s= #p+1. /*Start pos of right wing: on next loop we'll devide subvector which starts here and ends on pos e -end loop. /*And now we turn to do it end loop. /*When there's nothing to divide anymore, next subvector to divide is taken by reading its whereabouts from s(r) e(r) exec. *In the presence of missing values sort is not affected but positions of missing values move. If you don't like it and want *to sort all missing to the vector's tail, to 'do if' condition add: 'or miss(v(#i))'. *Though a pivot element may be selected arbitrary, speed depends on its selection. *If the input vector is already partly sorted then choosing a pivot element by an edge of (sub)vector considerably slows down the job. *Therefore it's most optimal and universal to take a pivot element from a middle part of (sub)vector, like the above syntax does. *Or you might choose it randomly.