# Sorting Within Cases

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 | ****************************************************************************************** * SORT WITHIN CASES: IMPLEMENTATION OF SOME POPULAR SORTING ALGORITHMS * ****************************************************************************************** *Kirill Orlov. kior@comtv.ru * 2007/03/16 *Below is SPSS syntax of some well-known sorting algorithms. *In this case, the algos are used to sort values within cases. *Replace number 13 with the number of variables you are going to submit to sorting, and run *the corresponding syntax of this or that algo. *The following algos considered: *SIMPLE SORTING ALGORITHMS: BUBBLE, SELECTION, INSERTION. *ALGORITHMS FOR QUICK SORTING OF LARGE VECTORS: SHELL's INSERTION, HEAPSORT, QUICKSORT. *What's a "stable" sorting algorithm: *If in the data (vector) there are identical values ("ties") then a "stable" sorting algo *guarantees that original order among them will be kept, that is, the value occupying initially earlier (more left) *position in the vector than another value, equal to it, will stay more left than it in the output vector. *"Non-stable" sorting algo does not quarantee it. ****************************************************************************************** *Create data. input prog. vector v (13 f2). loop #c= 1 to 20. -loop #v= 1 to 13. - comp v(#v)= rnd(rv.uniform(1,13)). -end loop. -end case. end loop. end file. end input prog. exec. *************** BUBBLE (=EXCHANGE) algorithm, STABLE VERSION ******************. *Vars are looked through in pairs and swap their values when necessary. *The procedure repeates until no more swap is needed. *This algo is regarded as the most slow among simple sorting algorithms. It is "stable". vector v= v1 to v13. loop #it= 1 to 13. /*It can take maximum as much iterations of the procedure as there're vars -comp #sorted= 1. /*That's signal that the vector is sorted, let's think meanwhile it's sorted -loop #i= 2 to 13-#it+1. /*Run through vars, taking pair: this var and preceding one (or could take this and the following - doesn't matter) /*Could write just 'to 13', but 'to 13-#it+1' is time-thrifty (the thing is that at each iteration, the largest [smallest, for desc sort] value /*of all the series will definitely find itself at the vector's tail, therefore we need not to touch this tail during next iteration) - do if v(#i)<v(#i-1). /*If the var's value is lesser than in the previous [to sort in desc order - greater than in it] - comp #sorted= 0. /*Then, marking that sorting is needed, - comp #temp= v(#i). /*Swap our two vars by their values (temp is temporary storage) - comp v(#i)= v(#i-1). - comp v(#i-1)= #temp. - end if. -end loop. end loop if #sorted. /*Stop procedure (iterations) if, having looked over all pairs, find no more need to sort exec. *If data contain missing values then they will stay in their positions and serve as points of division in "subvectors": sort will go *within subvectors only. E.g., if a case is 5 4 . 5 6 4, the output is 4 5 . 4 5 6. *If you want to escape this division and want to sort all missing to the vector's tail, to 'do if' condition add: 'or miss(v(#i-1))'. *************** BUBBLE (=EXCHANGE) algorithm, NON-STABLE VERSION ******************. *This version of Bubble sort is somewhat quicker than the previous but it is not "stable". *Here, a pair of vars are not adjacent vars, rather, a traditional combinatorical pairing is performed. vector v= v1 to v13. -loop #i= 1 to 13-1. /*Take one var - loop #j= #i+1 to 13. /*Take the other var - do if v(#j)<v(#i). /*If the value in the 2nd is lesser than in the 1st (greater - for desc sort) - comp #temp= v(#j). - comp v(#j)= v(#i). /*swap the values - comp v(#i)= #temp. - end if. - end loop. -end loop. exec. *If there are missing values, sort will be incorrect. To avoit this, *to 'do if' condition add: 'or miss(v(#i))', and all missing will be sorted to the vector's tail. ******** SELECTION (=EXTREMUM) algorithm, SWAP VERSION (NON-STABLE) *********. *Selection algo looks for extremal value of the vector (minimum or maximum - depending on whether ascending or descending *sorting is needed) and moves it to the beginning of the vector, *then the same is repeated on the remaining portion of the vector, - and so forth till it all be sorted. *It is the most intuitively plain sorting algorithm. *This version is not "stable". Among the simple algorithms Selection algo is quicker than Bubble algo but *usually slower than Insertion algo. *Note: though in-built functions MIN (MAX) naturally offer themselves here, actually their usage will be not economical *(if you like strict approach) because the functions imply covert running along the vector by SPSS processor, - but in every way we need *to run along the vector to know the positions of the min (max) values. So it would be an extra run here. vector v= v1 to v13. loop #i= 1 to 13-1. /*Take value by value -comp #m= #i. /*m will be position of min (max, if sort is desc) value; suppose so far it is on pos #i -loop #j= #i+1 to 13. /*Run through the rest of the values to the right - if v(#j)<v(#m) #m= #j. /*comparing them with supposedly min (max) and remembering the pos ot the lesser of them [greater - for desc sort, use > instead <] -end loop. /*so that on exit from the run m is indeed the pos of min (max) value -comp #temp= v(#i). /*Now do swap: -comp v(#i)= v(#m). /*put min (max) value in pos i -comp v(#m)= #temp. /*and the value having been in i - put it in pos m end loop. /*Take next value (i+1) as supposed min (max) exec. *If data contain missing values then they will stay in their positions. To sort them all to the vector's tail *add to the condition of the IF command (that is inside loop): 'or miss(v(#m))'. ******** SELECTION (=EXTREMUM) algorithm, SHIFT VERSION (STABLE) *********. *This version is based on the previous one except that instead of swapping 2 values (current and the extremum) *it undertakes additional run-through to move all values that are on the left of the extremun to the right by 1 position. *At a cost of this prolongation (slowing) of the work "stability" is achieved. vector v= v1 to v13. loop #i= 1 to 13-1. -comp #m= #i. -loop #j= #i+1 to 13. - if v(#j)<v(#m) #m= #j. -end loop. -comp #temp= v(#m). -loop #j= #m-1 to #i by -1. /*Launch shifting: take value by value on the left side of the extremum - comp v(#j+1)= v(#j). /*and move to 1 position right -end loop. -comp v(#i)= #temp. /*so that in the end pos i is vacant - put our extremum to it end loop. exec. *************************** INSERTION algorithm ****************************. *Consists of that it takes values one after another (starting from the 2nd) and, by scanning *values which are on the left of it, finds the due place there to insert it *(cards in card-index are usually sorted in this way by hand). By insertion it expands that left-hand subvector to the right by one position. *Insertion value after value from right subvector to left, sorted, subvector, in this fashion grows the latter. *Of the simple algorithms, this one is reputed quite quick; it is quicker on the average than Selection and much quicker than Bubble algos. *However, its advantage over Selection holds only for data that are initially already not far from sortedness in the wanted direction. *The algo is "stable". One more plus is that it permits data to enter (for example, be generated) gradually, value by value ("online" sorting). vector v= v1 to v13. loop #ins= 2 to 13. /*Take value by value starting from 2nd (ins is its sequential number, number of the var) -comp #val= v(#ins). /*and memorize it; to the left of it lie the values that are already sorted by the algo -loop #i= #ins-1 to 1 by -1 if v(#i)>#val. /*And we must find a place within them where to insert our value /*Therefore, if the next value at the left to it is greater [lesser - for descending sort] then run through the sorted subvector from right to left - comp v(#i+1)= v(#i). /*shifting values there 1 pos to the right until we run into value which is not greater than ours -end loop. -comp v(#i+1)= #val. /*There, we insert our value beside it, at the right; now the left, sorted subvector expanded by one value end loop. /*Take next value in order, etc exec. *If data contain missing values then they will stay in their positions and serve as points of division in "subvectors": sort will go *within subvectors only. E.g., if a case is 5 4 . 5 6 4, the output is 4 5 . 4 5 6. *If you want to escape this division and want to sort all missing to the vector's tail, to if-condition inside loop add: 'or miss(v(#i))'. *************************** SHELL's INSERTION algorithm ****************************. *Shell (1959) offered how to speed up Insertion algo in situations of large as well as far-from-sortedness vectors. *The main point is that values "on the left" - among which a place to insert a taken value is searched - are scanned through *with large step (i.e. step not 1 but larger) which accelerates finding a place for insertion. After all values have been thus inserted *and the vector becomes closer to sordetness the procedure repeates with a lesser step, then again with still lesser step. In the end, with step 1 *(i.e. usual Insertion algo is applied when the vector is almost sorted). *For instance, there's a vector of 13 values. Initially take such a step so that the vector split in subvectors of 2 values, *distanced from each other by step: 13/2=6.5, i.e. step'll be 6. These are values with positions 1-7, 2-8, 3-9, etc. *Within these subvectors - within pairs of values - insertion sorting will be done. After that, another splitting of vector is done, 4 values in subvector, *separated by step 3 (13/4=3.25). These are positions 1-4-7-10, 2-5-8-11, 3-6-9-12. And these subvectors undergo sorting. *At the end, the whole vector will be sorted. *Shell's sort is not "stable", nor it is "online" sort. vector v= v1 to v13. loop #n= 1 to 13. /*We'll do "splittings" of vector (number of splittings in fact will be less then 13, we'll be over before we reach n=13) -comp #n= 2**#n. /*with n elements (values) in a subvector: 1-st splitting n=2, at 2-nd splitting n=4, at 3-rd n=8 etc -comp #step= trunc(13/#n). /*Elements in subvector are not adjacent, rather, they are separated by gap step which's therefore equals the number of full (with n elements) subvectors -if #step=1 #n= 13. /*If vector contains only one full subvector - take the whole vector instead -loop #sv= 1 to #step. /*For each "splitting", undertake Insertion sorting in each of its subvectors, sv is the number of a subvector *****Highlighted section is Insertion algo, the only nuance being that it goes with step step instead of step 1. - loop #ins= #sv+#step to #sv+#step*(#n-1) by #step. - comp #val= v(#ins). - loop #i= #ins-#step to #sv by -#step if v(#i)>#val. /*[for desc sort: < instead of >] - comp v(#i+#step)= v(#i). - end loop. - comp v(#i+#step)= #val. - end loop. *****. -end loop. end loop if #step=1. /*If the subvector was the only one (and it is the whole vector) that's signal to end job, sorting is done exec. *If there are missing values in your data then to if-condition in 'loop #i' add: 'or miss(v(#i))', and they will be sorted to vector's tail. *Without this addition, missing values will produce absurd sorting result. ************************ HEAPSORT algorithm ******************************. *This is a quick algo for large vectors. It is not "stable". *The job comprises 2 stages. On the 1st the vector is sorted into partly sorted vector called heap. *On the 2nd stage the heap is further sorted into the result. *Heap is a hierarchical structure of "parent with 2 children" type where a parent element is greater than its children *(such heap is called max-heap) or a parent element is lesser than its children (heap called min-heap). *Heap is stored as vector wherein for each element on position p *there are corresponding children elements on positions 2*p and 2*p+1. An example of max-heap in form of figure and vector: * 10 * 8 6 * 3 6 5 1 * 3 2 3 5 4 * *posit 1 2 3 4 5 6 7 8 9 10 11 12 *value 10 8 6 3 6 5 1 3 2 3 5 4 * *On both stages, the algo is the same, swapping: elements on parental positions are compared with their children, *and if it occures one of the children is greater (lesser, for building min-heap) than its parent the child swaps positions with the parent. *The difference between the stages is in regulations for the algo: on the 1st stage all parents are being checked which lead to heap where *the 1st element - root parent - is the largest (smallest) of all. On the 2nd stage that parent is withdrawn from swapping and the rest of vector *undergoes heapifying, then the root parent is withdrawn again, one more shortening the part which's left for heapifying, etc.: *each time that part becomes shorter and, besides, needs less ans less further sorting. *It may seem paradoxal at first glance but it is bulding of max-heap that results in ascending sort, and it is bulding of mix-heap that *brings descending sort. vector v= v1 to v13. comp #end= 13. loop #stage= 1 to 2. /*Two stages: create heap on the 1st, sort heap on the 2nd -loop #p= trunc(13*#stage/2) to 1 by -1. /*On the 1st stage p is parent's position: scan all parents (i e all elements from vector's mid to the left) /*On the 2nd stage value p runs through all elements from vector's end to the left - do if #stage=2. /*On the 2nd stage there're differences with the 1st: - comp #temp= v(1). /*Swap 1st element (most grand parent) with last element (rightmost child) in the vector - comp v(1)= v(#p). - comp v(#p)= #temp. - comp #end= #p-1. /*And "withdraw" the last element from the vector, for it's already sorted: it's the largest (smallest, for desc sort) of all currently; /*And in so doing the still-not-sorted part of the vector will gradually shorten - comp #p= 1. /*As for parent - to compare with children - take always currently the 1st element of vector - end if. *****Highlighted is comparison/swap of parent and child ("sifting"="percolating"). - loop #i= 1 to 13 if #p*2<=#end. /*Will repeat while a parent has a child (i e parent's pos p is such that 2*p doesn't exit vector) /*i will be not used (introduced only to escape SET MXLOOPS) and cycles will terminate before i=13 - comp #c= #p*2. /*Position of the child (left one of the two) - do if #c+1<=#end. /*When there're 2 children - if v(#c+1)>v(#c) #c= #c+1. /*then, if the right child is greater [lesser, for desc sort: < instead >] than the left, take the right one into comparison - end if. - do if v(#p)<v(#c). /*Compare parent and the larger [smaller] child and swap their pos if the child is greater [lesser, for desc sort: > instead <] than the parent - comp #temp= v(#p). - comp v(#p)= v(#c). - comp v(#c)= #temp. - comp #p= #c. /*so that the parent is now on the child's pos (and so we'll then compare it with its futher children: its former grand-children) - else. /*And if parent is greater than the largest [lesser than the smallest] child, exit - break. - end if. - end loop. /*to take next parent in a queue for parent-children comparison sift *****. -end loop. end loop. exec. *If there are missing values the sort will be with errors. To ward it off and sort all missing to the vector's tail, *to condition 'if v(#c+1)>v(#c)...' add: 'or miss(v(#c+1))', and to condition 'do if v(#p)<v(#c)' add: 'or miss(v(#c))'. ************************ QUICKSORT algorithm *****************************. *This is a quick algo for large vectors, quicker than Heapsort but occupying more memory (because it writes notes for itself). *It's not "stable". The idea is that vector is sorted into "halves" more and more fractionally, "devide and conquer" principle. *All values greater than some selected value collect on one side of it and all values lesser than it collect on the other side of it. *In either of the two subvectors some dividing (pivot) value is then selected again and the procedure repeates, and so on. *Thus, the essense of the method is the division of a (sub)vector into left and right wings relatively to some selected pivot value. *There exist a number of versions of the algo, here's one. vector v= v1 to v13 /#s #e (13). /*Vectors s and e are to store "whereabouts" of subvectors (left wings only) emerging from division of larger subvectors recode #s1 to #s13 (else=0). /*(Insure ourselves against carry-over from previous case because we deal with scratch vars) comp #s(1)= 1. /*Begin with comp #e(1)= 13. /*the subvector that is the whole vector: s - its start pos, e - its end pos comp #w= 2. /*So, writing further whereabouts will start from here (w is position in vectors s and e to write to) loop #r= 1 to 13 if #s(#r)>0. /*r is position in vectors s and e to read from: read s (start pos) and e (end pos) of the input subvector; the whole job lasts while there's something written in vector s (and e) -comp #s= #s(#r). /*Copy start pos because it will change (while e, end pos, will remain for a while) -loop #w= #w to 13 if #s<#e(#r). /*The task with the taken subvector is to repeatedly divide it into left and right wings with recoil from left (i e divide, then divide its right wing, /*then divide the right wing of the right wing, etc) while there remains smth to divide: end pos is beyond start pos; /*and w being pos in vectors s è e on which to write the whereabouts of emergent left wings, for we'll deal with them later ****Hightlited section is the division of subvector****. - comp #p= trunc((#s+#e(#r))/2). /*Select pivot value: let it be the middle element in the subvector - comp #temp= v(#p). /*Carry it to the head of the subvector (it is taking it away to a safe place) - comp v(#p)= v(#s). /*while the 1st element goes to pivot's former pos (swap) - comp v(#s)= #temp. - comp #p= #e(#r). /*Now, p will be the reception pos for values being thrown over to the right wing (let p be the end of the subvector initially) - loop #i= #e(#r) to #s+1 by -1. /*Do throwing over of values: take value one by one starting from the end (do not touch the pivot value kept on pos s) - do if v(#i)>v(#s). /*If the value is greater [lesser, for desc sort: < instead >] than the pivot value - comp #temp= v(#i). /*then put it on the right wing on the reception pos p - comp v(#i)= v(#p). /*(swap with the value occupying it) - comp v(#p)= #temp. - comp #p= #p-1. /*And move the reception pos by 1 to the left - end if. - end loop. /*Take next value to compare it with the pivot value and to through over to the right if necessary - comp #temp= v(#s). /*After throwings-over are done, put the pivot value on the pos p, wherever it is now - comp v(#s)= v(#p). /*(swap with the value occupying it) - comp v(#p)= #temp. /*Two wings are ready: all values on the right of the pivot value (its pos p) are larger [smaller], while on the left of it are not larger [not smaller] than it ****. - comp #s(#w)= #s. /*Start pos of left wing: write it to our memorandum - comp #e(#w)= #p-1. /*Start pos of left wing: write it to our memorandum - comp #s= #p+1. /*Start pos of right wing: on next loop we'll devide subvector which starts here and ends on pos e -end loop. /*And now we turn to do it end loop. /*When there's nothing to divide anymore, next subvector to divide is taken by reading its whereabouts from s(r) e(r) exec. *In the presence of missing values sort is not affected but positions of missing values move. If you don't like it and want *to sort all missing to the vector's tail, to 'do if' condition add: 'or miss(v(#i))'. *Though a pivot element may be selected arbitrary, speed depends on its selection. *If the input vector is already partly sorted then choosing a pivot element by an edge of (sub)vector considerably slows down the job. *Therefore it's most optimal and universal to take a pivot element from a middle part of (sub)vector, like the above syntax does. *Or you might choose it randomly. |

Related pages

...