Sorting Within Cases | Raynald's SPSS Tools

******************************************************************************************
* SORT WITHIN CASES: IMPLEMENTATION OF SOME POPULAR SORTING ALGORITHMS *
******************************************************************************************
*Kirill Orlov. kior@comtv.ru
* 2007/03/16

*Below is SPSS syntax of some well-known sorting algorithms.
*In this case, the algos are used to sort values within cases.
*Replace number 13 with the number of variables you are going to submit to sorting, and run
*the corresponding syntax of this or that algo.

*The following algos considered:
*SIMPLE SORTING ALGORITHMS: BUBBLE, SELECTION, INSERTION.
*ALGORITHMS FOR QUICK SORTING OF LARGE VECTORS: SHELL's INSERTION, HEAPSORT, QUICKSORT.

*What's a "stable" sorting algorithm:
*If in the data (vector) there are identical values ("ties") then a "stable" sorting algo
*guarantees that original order among them will be kept, that is, the value occupying initially earlier (more left)
*position in the vector than another value, equal to it, will stay more left than it in the output vector.
*"Non-stable" sorting algo does not quarantee it.
******************************************************************************************

*Create data.
input prog.
vector v (13 f2).
loop #c= 1 to 20.
-loop #v= 1 to 13.
- comp v(#v)= rnd(rv.uniform(1,13)).
-end loop.
-end case.
end loop.
end file.
end input prog.
exec.


*************** BUBBLE (=EXCHANGE) algorithm, STABLE VERSION ******************.

*Vars are looked through in pairs and swap their values when necessary.
*The procedure repeates until no more swap is needed.
*This algo is regarded as the most slow among simple sorting algorithms. It is "stable".

vector v= v1 to v13.
loop #it= 1 to 13. /*It can take maximum as much iterations of the procedure as there're vars
-comp #sorted= 1. /*That's signal that the vector is sorted, let's think meanwhile it's sorted
-loop #i= 2 to 13-#it+1. /*Run through vars, taking pair: this var and preceding one (or could take this and the following - doesn't matter)
                  /*Could write just 'to 13', but 'to 13-#it+1' is time-thrifty (the thing is that at each iteration, the largest [smallest, for desc sort] value
                  /*of all the series will definitely find itself at the vector's tail, therefore we need not to touch this tail during next iteration)
- do if v(#i)<v(#i-1). /*If the var's value is lesser than in the previous [to sort in desc order - greater than in it]
-  comp #sorted= 0. /*Then, marking that sorting is needed,
-  comp #temp= v(#i). /*Swap our two vars by their values (temp is temporary storage)
-  comp v(#i)= v(#i-1).
-  comp v(#i-1)= #temp.
- end if.
-end loop.
end loop if #sorted. /*Stop procedure (iterations) if, having looked over all pairs, find no more need to sort
exec.

*If data contain missing values then they will stay in their positions and serve as points of division in "subvectors": sort will go
*within subvectors only. E.g., if a case is 5 4 . 5 6 4, the output is 4 5 . 4 5 6.
*If you want to escape this division and want to sort all missing to the vector's tail, to 'do if' condition add: 'or miss(v(#i-1))'.


*************** BUBBLE (=EXCHANGE) algorithm, NON-STABLE VERSION ******************.

*This version of Bubble sort is somewhat quicker than the previous but it is not "stable".
*Here, a pair of vars are not adjacent vars, rather, a traditional combinatorical pairing is performed.
vector v= v1 to v13.
-loop #i= 1 to 13-1. /*Take one var
- loop #j= #i+1 to 13. /*Take the other var
-  do if v(#j)<v(#i). /*If the value in the 2nd is lesser than in the 1st (greater - for desc sort)
-   comp #temp= v(#j).
-   comp v(#j)= v(#i). /*swap the values
-   comp v(#i)= #temp.
-  end if.
- end loop.
-end loop.
exec.

*If there are missing values, sort will be incorrect. To avoit this,
*to 'do if' condition add: 'or miss(v(#i))', and all missing will be sorted to the vector's tail.


******** SELECTION (=EXTREMUM) algorithm, SWAP VERSION (NON-STABLE) *********.

*Selection algo looks for extremal value of the vector (minimum or maximum - depending on whether ascending or descending
*sorting is needed) and moves it to the beginning of the vector,
*then the same is repeated on the remaining portion of the vector, - and so forth till it all be sorted.
*It is the most intuitively plain sorting algorithm.
*This version is not "stable". Among the simple algorithms Selection algo is quicker than Bubble algo but
*usually slower than Insertion algo.
*Note: though in-built functions MIN (MAX) naturally offer themselves here, actually their usage will be not economical
*(if you like strict approach) because the functions imply covert running along the vector by SPSS processor, - but in every way we need
*to run along the vector to know the positions of the min (max) values. So it would be an extra run here.

vector v= v1 to v13.
loop #i= 1 to 13-1. /*Take value by value
-comp #m= #i. /*m will be position of min (max, if sort is desc) value; suppose so far it is on pos #i
-loop #j= #i+1 to 13. /*Run through the rest of the values to the right
- if v(#j)<v(#m) #m= #j. /*comparing them with supposedly min (max) and remembering the pos ot the lesser of them [greater - for desc sort, use > instead <]
-end loop. /*so that on exit from the run m is indeed the pos of min (max) value
-comp #temp= v(#i). /*Now do swap:
-comp v(#i)= v(#m). /*put min (max) value in pos i
-comp v(#m)= #temp. /*and the value having been in i - put it in pos m
end loop. /*Take next value (i+1) as supposed min (max)
exec.

*If data contain missing values then they will stay in their positions. To sort them all to the vector's tail
*add to the condition of the IF command (that is inside loop): 'or miss(v(#m))'.


******** SELECTION (=EXTREMUM) algorithm, SHIFT VERSION (STABLE) *********.

*This version is based on the previous one except that instead of swapping 2 values (current and the extremum)
*it undertakes additional run-through to move all values that are on the left of the extremun to the right by 1 position.
*At a cost of this prolongation (slowing) of the work "stability" is achieved.

vector v= v1 to v13.
loop #i= 1 to 13-1.
-comp #m= #i.
-loop #j= #i+1 to 13.
- if v(#j)<v(#m) #m= #j.
-end loop.
-comp #temp= v(#m).
-loop #j= #m-1 to #i by -1. /*Launch shifting: take value by value on the left side of the extremum
- comp v(#j+1)= v(#j). /*and move to 1 position right
-end loop.
-comp v(#i)= #temp. /*so that in the end pos i is vacant - put our extremum to it
end loop.
exec.


*************************** INSERTION algorithm ****************************.

*Consists of that it takes values one after another (starting from the 2nd) and, by scanning
*values which are on the left of it, finds the due place there to insert it
*(cards in card-index are usually sorted in this way by hand). By insertion it expands that left-hand subvector to the right by one position.
*Insertion value after value from right subvector to left, sorted, subvector, in this fashion grows the latter.
*Of the simple algorithms, this one is reputed quite quick; it is quicker on the average than Selection and much quicker than Bubble algos.
*However, its advantage over Selection holds only for data that are initially already not far from sortedness in the wanted direction.
*The algo is "stable". One more plus is that it permits data to enter (for example, be generated) gradually, value by value ("online" sorting).

vector v= v1 to v13.
loop #ins= 2 to 13. /*Take value by value starting from 2nd (ins is its sequential number, number of the var)
-comp #val= v(#ins). /*and memorize it; to the left of it lie the values that are already sorted by the algo
-loop #i= #ins-1 to 1 by -1 if v(#i)>#val. /*And we must find a place within them where to insert our value
                       /*Therefore, if the next value at the left to it is greater [lesser - for descending sort] then run through the sorted subvector from right to left
- comp v(#i+1)= v(#i). /*shifting values there 1 pos to the right until we run into value which is not greater than ours
-end loop.
-comp v(#i+1)= #val. /*There, we insert our value beside it, at the right; now the left, sorted subvector expanded by one value
end loop. /*Take next value in order, etc
exec.

*If data contain missing values then they will stay in their positions and serve as points of division in "subvectors": sort will go
*within subvectors only. E.g., if a case is 5 4 . 5 6 4, the output is 4 5 . 4 5 6.
*If you want to escape this division and want to sort all missing to the vector's tail, to if-condition inside loop add: 'or miss(v(#i))'.


*************************** SHELL's INSERTION algorithm ****************************.

*Shell (1959) offered how to speed up Insertion algo in situations of large as well as far-from-sortedness vectors.
*The main point is that values "on the left" - among which a place to insert a taken value is searched - are scanned through
*with large step (i.e. step not 1 but larger) which accelerates finding a place for insertion. After all values have been thus inserted
*and the vector becomes closer to sordetness the procedure repeates with a lesser step, then again with still lesser step. In the end, with step 1
*(i.e. usual Insertion algo is applied when the vector is almost sorted).
*For instance, there's a vector of 13 values. Initially take such a step so that the vector split in subvectors of 2 values,
*distanced from each other by step: 13/2=6.5, i.e. step'll be 6. These are values with positions 1-7, 2-8, 3-9, etc.
*Within these subvectors - within pairs of values - insertion sorting will be done. After that, another splitting of vector is done, 4 values in subvector,
*separated by step 3 (13/4=3.25). These are positions 1-4-7-10, 2-5-8-11, 3-6-9-12. And these subvectors undergo sorting.
*At the end, the whole vector will be sorted.
*Shell's sort is not "stable", nor it is "online" sort.

vector v= v1 to v13.
loop #n= 1 to 13. /*We'll do "splittings" of vector (number of splittings in fact will be less then 13, we'll be over before we reach n=13)
-comp #n= 2**#n. /*with n elements (values) in a subvector: 1-st splitting n=2, at 2-nd splitting n=4, at 3-rd n=8 etc
-comp #step= trunc(13/#n). /*Elements in subvector are not adjacent, rather, they are separated by gap step which's therefore equals the number of full (with n elements) subvectors
-if #step=1 #n= 13. /*If vector contains only one full subvector - take the whole vector instead
-loop #sv= 1 to #step. /*For each "splitting", undertake Insertion sorting in each of its subvectors, sv is the number of a subvector
*****Highlighted section is Insertion algo, the only nuance being that it goes with step step instead of step 1.
- loop #ins= #sv+#step to #sv+#step*(#n-1) by #step.
-  comp #val= v(#ins).
-  loop #i= #ins-#step to #sv by -#step if v(#i)>#val. /*[for desc sort: < instead of >]
-   comp v(#i+#step)= v(#i).
-  end loop.
-  comp v(#i+#step)= #val.
- end loop.
*****.
-end loop.
end loop if #step=1. /*If the subvector was the only one (and it is the whole vector) that's signal to end job, sorting is done
exec.

*If there are missing values in your data then to if-condition in 'loop #i' add: 'or miss(v(#i))', and they will be sorted to vector's tail.
*Without this addition, missing values will produce absurd sorting result.


************************ HEAPSORT algorithm ******************************.

*This is a quick algo for large vectors. It is not "stable".
*The job comprises 2 stages. On the 1st the vector is sorted into partly sorted vector called heap.
*On the 2nd stage the heap is further sorted into the result.
*Heap is a hierarchical structure of "parent with 2 children" type where a parent element is greater than its children
*(such heap is called max-heap) or a parent element is lesser than its children (heap called min-heap).
*Heap is stored as vector wherein for each element on position p
*there are corresponding children elements on positions 2*p and 2*p+1. An example of max-heap in form of figure and vector:
*          10
*     8          6
*   3   6      5   1
*  3 2 3 5    4 
*
*posit   1   2   3   4   5   6   7   8   9  10 11 12
*value  10   8   6   3   6   5   1   3   2   3  5  4
*
*On both stages, the algo is the same, swapping: elements on parental positions are compared with their children,
*and if it occures one of the children is greater (lesser, for building min-heap) than its parent the child swaps positions with the parent.
*The difference between the stages is in regulations for the algo: on the 1st stage all parents are being checked which lead to heap where
*the 1st element - root parent - is the largest (smallest) of all. On the 2nd stage that parent is withdrawn from swapping and the rest of vector
*undergoes heapifying, then the root parent is withdrawn again, one more shortening the part which's left for heapifying, etc.:
*each time that part becomes shorter and, besides, needs less ans less further sorting.
*It may seem paradoxal at first glance but it is bulding of max-heap that results in ascending sort, and it is bulding of mix-heap that
*brings descending sort.

vector v= v1 to v13.
comp #end= 13.
loop #stage= 1 to 2. /*Two stages: create heap on the 1st, sort heap on the 2nd
-loop #p= trunc(13*#stage/2) to 1 by -1. /*On the 1st stage p is parent's position: scan all parents (i e all elements from vector's mid to the left)
                                         /*On the 2nd stage value p runs through all elements from vector's end to the left
- do if #stage=2. /*On the 2nd stage there're differences with the 1st:
-  comp #temp= v(1). /*Swap 1st element (most grand parent) with last element (rightmost child) in the vector
-  comp v(1)= v(#p).
-  comp v(#p)= #temp.
-  comp #end= #p-1. /*And "withdraw" the last element from the vector, for it's already sorted: it's the largest (smallest, for desc sort) of all currently;
                    /*And in so doing the still-not-sorted part of the vector will gradually shorten
-  comp #p= 1. /*As for parent - to compare with children - take always currently the 1st element of vector
- end if.
*****Highlighted is comparison/swap of parent and child ("sifting"="percolating").
- loop #i= 1 to 13 if #p*2<=#end. /*Will repeat while a parent has a child (i e parent's pos p is such that 2*p doesn't exit vector)
                                  /*i will be not used (introduced only to escape SET MXLOOPS) and cycles will terminate before i=13
-  comp #c= #p*2. /*Position of the child (left one of the two)
-  do if #c+1<=#end. /*When there're 2 children
-   if v(#c+1)>v(#c) #c= #c+1. /*then, if the right child is greater [lesser, for desc sort: < instead >] than the left, take the right one into comparison
-  end if.
-  do if v(#p)<v(#c). /*Compare parent and the larger [smaller] child and swap their pos if the child is greater [lesser, for desc sort: > instead <] than the parent
-   comp #temp= v(#p).
-   comp v(#p)= v(#c).
-   comp v(#c)= #temp.
-   comp #p= #c. /*so that the parent is now on the child's pos (and so we'll then compare it with its futher children: its former grand-children)
-  else. /*And if parent is greater than the largest [lesser than the smallest] child, exit
-   break.
-  end if.
- end loop. /*to take next parent in a queue for parent-children comparison sift
*****.
-end loop.
end loop.
exec.

*If there are missing values the sort will be with errors. To ward it off and sort all missing to the vector's tail,
*to condition 'if v(#c+1)>v(#c)...' add: 'or miss(v(#c+1))', and to condition 'do if v(#p)<v(#c)' add: 'or miss(v(#c))'.


************************ QUICKSORT algorithm *****************************.

*This is a quick algo for large vectors, quicker than Heapsort but occupying more memory (because it writes notes for itself).
*It's not "stable". The idea is that vector is sorted into "halves" more and more fractionally, "devide and conquer" principle.
*All values greater than some selected value collect on one side of it and all values lesser than it collect on the other side of it.
*In either of the two subvectors some dividing (pivot) value is then selected again and the procedure repeates, and so on.
*Thus, the essense of the method is the division of a (sub)vector into left and right wings relatively to some selected pivot value.
*There exist a number of versions of the algo, here's one.

vector v= v1 to v13 /#s #e (13). /*Vectors s and e are to store "whereabouts" of subvectors (left wings only) emerging from division of larger subvectors
recode #s1 to #s13 (else=0). /*(Insure ourselves against carry-over from previous case because we deal with scratch vars)
comp #s(1)= 1. /*Begin with
comp #e(1)= 13. /*the subvector that is the whole vector: s - its start pos, e - its end pos
comp #w= 2. /*So, writing further whereabouts will start from here (w is position in vectors s and e to write to)
loop #r= 1 to 13 if #s(#r)>0. /*r is position in vectors s and e to read from: read s (start pos) and e (end pos) of the input subvector; the whole job lasts while there's something written in vector s (and e)
-comp #s= #s(#r). /*Copy start pos because it will change (while e, end pos, will remain for a while)
-loop #w= #w to 13 if #s<#e(#r). /*The task with the taken subvector is to repeatedly divide it into left and right wings with recoil from left (i e divide, then divide its right wing,
                                /*then divide the right wing of the right wing, etc) while there remains smth to divide: end pos is beyond start pos;
                                /*and w being pos in vectors s è e on which to write the whereabouts of emergent left wings, for we'll deal with them later
****Hightlited section is the division of subvector****.
- comp #p= trunc((#s+#e(#r))/2). /*Select pivot value: let it be the middle element in the subvector
- comp #temp= v(#p). /*Carry it to the head of the subvector (it is taking it away to a safe place)
- comp v(#p)= v(#s). /*while the 1st element goes to pivot's former pos (swap)
- comp v(#s)= #temp.
- comp #p= #e(#r). /*Now, p will be the reception pos for values being thrown over to the right wing (let p be the end of the subvector initially)
- loop #i= #e(#r) to #s+1 by -1. /*Do throwing over of values: take value one by one starting from the end (do not touch the pivot value kept on pos s)
-  do if v(#i)>v(#s). /*If the value is greater [lesser, for desc sort: < instead >] than the pivot value
-   comp #temp= v(#i). /*then put it on the right wing on the reception pos p
-   comp v(#i)= v(#p). /*(swap with the value occupying it)
-   comp v(#p)= #temp.
-   comp #p= #p-1. /*And move the reception pos by 1 to the left
-  end if.
- end loop. /*Take next value to compare it with the pivot value and to through over to the right if necessary
- comp #temp= v(#s). /*After throwings-over are done, put the pivot value on the pos p, wherever it is now
- comp v(#s)= v(#p). /*(swap with the value occupying it)
- comp v(#p)= #temp. /*Two wings are ready: all values on the right of the pivot value (its pos p) are larger [smaller], while on the left of it are not larger [not smaller] than it
****.
- comp #s(#w)= #s. /*Start pos of left wing: write it to our memorandum
- comp #e(#w)= #p-1. /*Start pos of left wing: write it to our memorandum
- comp #s= #p+1. /*Start pos of right wing: on next loop we'll devide subvector which starts here and ends on pos e
-end loop. /*And now we turn to do it
end loop. /*When there's nothing to divide anymore, next subvector to divide is taken by reading its whereabouts from s(r) e(r)
exec.

*In the presence of missing values sort is not affected but positions of missing values move. If you don't like it and want
*to sort all missing to the vector's tail, to 'do if' condition add: 'or miss(v(#i))'.

*Though a pivot element may be selected arbitrary, speed depends on its selection.
*If the input vector is already partly sorted then choosing a pivot element by an edge of (sub)vector considerably slows down the job.
*Therefore it's most optimal and universal to take a pivot element from a middle part of (sub)vector, like the above syntax does.
*Or you might choose it randomly.
...
Navigate from here