ItemSort and Hyphens


Question:

I'm sorting a string containing a list of tab separated items. One item, containing a dash, is sorting before where I expect it to be. Example: ab-cd shows up before abcd.

I read in the database that hyphens and apostrophes are "ignored", so I did an ItemReplace of "-" to "|" before the sort, and an ItemReplace of "|" to "-" after the sort, to no avail. Using "{" in place of "|" produced the same results.

What can I use to get ItemSort to "not help" and sort the string "as is"?

Answer:

WinBatch uses "dictionary" sorts. Each item is placed where you would find it in a dictionary of the language you are using.

Some non-alphanumeric characters, notably the hyphen and the apostrophe, are "nearly" ignored in sorting, so ab-cd ends up very near abcd.

Most things in Windows sort this way. For example, here are some files as sorted in Windows Explorer:

abcdef.txt
abctest.wbt
abc-text.txt
abcxdef.txt
abcxxyz.txt
abcxys.txt
abcxyz.txt
abc-xyz.txt
abcxyzdef.txt
Note that abc-text.txt is nowhere near abc-xyz.txt; each one sits close to where it would be without the dash.

Anyway, it's a dictionary sort and there is no way to modify it. All the string comparison operators work on the same principle. The numeric codes that represent the characters play no part in the sort order.

User Reply:

So I wrote my own sort routine. Now there is a new problem. When doing a StrCmp (or "abcd" : "ab-d", where ":" is the Relational Operator), it turns out abcd is less than ab-d !!! Whaa??
if "abcd" < "ab-d"
   Message("test", "abcd < ab-d")
else
   if "abcd" == "ab-d"
      Message("test", "abcd == ab-d")
   else
      Message("test", "abcd > ab-d")
   endif
endif
There may be other issues such as differing string lengths. How can I do a bitwise compare ?????

Answer:

The relational operators also do "dictionary sequencing" on strings. Bit codes are meaningless.

All the sorting/comparing on strings in WinBatch works on dictionary sequencing, not on whatever underlying character codes the operating system might be using.
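
For readers who still need a character-code comparison, the following is a minimal sketch rather than a built-in feature. It assumes a WinBatch version that supports user-defined functions (#DefineFunction). The name CodeCompare is hypothetical. The sketch avoids the string relational operators and instead compares the numeric codes returned by Char2Num, one character at a time.

```winbatch
; Hypothetical helper, not part of WinBatch: compares two strings by
; character code instead of by dictionary sequence.
; Returns a negative number if a sorts first, 0 if the strings are
; identical, and a positive number if a sorts last.
#DefineFunction CodeCompare(a, b)
   lenA = StrLen(a)
   lenB = StrLen(b)
   shortest = Min(lenA, lenB)
   result = 0
   For i = 1 To shortest
      ; Char2Num returns a number, so this is a numeric comparison
      diff = Char2Num(StrSub(a, i, 1)) - Char2Num(StrSub(b, i, 1))
      If diff != 0
         result = diff
         Break
      EndIf
   Next
   ; Equal up to the shorter length: the longer string sorts last
   If result == 0 Then result = lenA - lenB
   Return result
#EndFunction

order = CodeCompare("abcd", "ab-d")
If order > 0 Then Message("CodeCompare", "abcd sorts after ab-d by character code")
```

The hyphen has ANSI code 45 and "c" has code 99, so this comparison places ab-d before abcd, the opposite of the dictionary result. A hand-written sort would call a helper like this wherever it would otherwise use StrCmp or a relational operator.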

So maybe if we knew why you need a sort method specific to a character set encoding, some long-unused brain cells might spring back to life and recall a solution to the problem.

User Reply:

I hope I am not coming across as being obsessed by this. With over 30 years of programming experience, I can't believe that something as commonly used as sorts and, especially, compares is not truly reliable. Now I've got to go back and check every place I've done a StrCmp.

This dictionary sort feature should be an option, not the norm. Since apparently WYSIWYG is instead the option (or not an option at all), do you have any code lying around that does a bitwise compare on two strings?

Answer:

Do you have some code that shows it's not reliable? It's reliable as far as I know; I have not seen any failures yet. Much effort was spent "sync'ing" up all the compares, sorts, and related functions so that they are consistent.

I can understand some desire for a character-code sort, but it is somewhat specific to the operating system and is also very US-centric. In a large number of countries, the desired sort sequences and string compare sequences do not fall into the same order as the characters appear in the ANSI character set. Even in the US, the 'a' and the 'A' characters are not anywhere near each other.

You would need an entire set of functions to sort, compare, and display strings. It's quite difficult to get *everything* sync'ed up so it's all predictable.

User Reply:

I guess let's close out this thread. I have the intention to find a workaround, but maybe I need to understand the problem better. To me, it appears to be a problem; maybe it isn't. To date, I've done COBOL, C, PLI, FORTRAN, yada, yada, and have never had a test be anything other than on exactly what is at hand.

Answer:

Check out the documentation for the strcoll C runtime function. You will see that even lowly C makes a distinction between character set order and lexicographic character order.

The following is cut and pasted from the Windows/C documentation for the lstrcmp and CompareString functions. lstrcmp is used for a lot of compares and sorts in WinBatch.

lstrcmp remarks

The lstrcmp function compares two strings by checking the first characters against each other, the second characters against each other, and so on until it finds an inequality or reaches the ends of the strings.

The function returns the difference of the values of the first unequal characters it encounters. For example, lstrcmp determines that "abcz" is greater than "abcdefg" and returns the difference of z and d.

The language (locale) selected by the user at setup time, or via the control panel, determines which string is greater (or whether the strings are the same). If no language (locale) is selected, Windows performs the comparison by using default values. In the Windows United States language functions, uppercase characters have lower values than lowercase characters.

With a double-byte character set (DBCS) version of Windows, this function can compare two DBCS strings.

The Win32 lstrcmp function uses a word sort, rather than a string sort. A word sort treats hyphens and apostrophes differently than it treats other symbols that are not alphanumeric, in order to ensure that words such as "coop" and "co-op" stay together within a sorted list. Note that in 16-bit versions of Windows, lstrcmp uses a string sort. For a detailed discussion of word sorts and string sorts, see the Remarks section of the reference page for the CompareString function.


CompareString remarks

If the two strings are of different lengths, they are compared up to the length of the shortest one. If they are equal to that point, then the return value will indicate that the longer string is greater. For more information about locale identifiers, see Locale Identifiers.

Normally, strings are compared using what is called a "word sort" technique. In a word sort, all punctuation marks and other non-alphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently than the other non-alphanumeric symbols, in order to ensure that words such as "coop" and "co-op" stay together within a sorted list.

If the SORT_STRINGSORT flag is specified, strings are compared using what is called a "string sort" technique. In a string sort, the hyphen and apostrophe are treated just like any other non-alphanumeric symbols: they come before the alphanumeric symbols. The following table shows a list of words sorted both ways:

Word Sort    String Sort    Word Sort    String Sort
billet       bill's         t-ant        t-ant
bills        billet         tanya        t-aria
bill's       bills          t-aria       tanya
cannot       can't          sued         sue's
cant         cannot         sues         sued
can't        cant           sue's        sues
con          co-op          went         we're
coop         con            were         went
co-op        coop           we're        were

The lstrcmp and lstrcmpi functions use a word sort. The CompareString and LCMapString functions default to using a word sort, but use a string sort if their caller sets the SORT_STRINGSORT flag. The CompareString function is optimized to run at the highest speed when dwCmpFlags is set to 0 or NORM_IGNORECASE, and cchCount1 and cchCount2 have the value -1. The CompareString function ignores Arabic Kashidas during the comparison. Thus, if two strings are identical save for the presence of Kashidas, CompareString returns a value of 2; the strings are considered "equal" in the collation sense, though they are not necessarily identical.

User Reply:

Okay, I think I’ve distilled the problem to where the real issue can be seen. I’ve got a list of people whose e-mails I want to preserve no matter what. Recently, one of these senders changed his address, the only differences being that 1) he added a hyphen between his first and last name, and 2) he shortened his alias. In the code example below, his first name is “a”, last name is “b”, and aliases are “xxx” and “x”. I separate senders’ names and alias with TABs, saving each sender’s info as a line (LF) in a .txt file. When I go to read my e-mail, I (Winbatch) SORT the file and look for my trusted senders’ names in it. If they are there, I mark them as “trusted”, the rest (signaled by the < TEST failing) I have to review manually.

Since going from an “external” to an “internal” sort, my hyphenated sender (no pun intended) has failed to make the trusted list. Step through the code below and see how the first array differs from the second. Unless somebody has a better idea, I am going to have to extract the names into one list and the aliases into another (some names are cryptic, so I want to keep the aliases), create a sorted “index” list based on the names, and then rearrange the names and aliases using the index list.

BTW, thanks for the input from all on this subject. The immediate use may appear overkill, but I’ve got other things that will need this type of data manipulation. Also, I found yet another statement of lstrcmp at support.microsoft.com (article 96748). It would really be nice to have an “as is” compare / sort.

Here’s da code (F11 to the Exit line, then click on ra1_itm and ra2_itm):

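; First list: "ab" carries the short alias, "a-b" the long one
str1 = StrCat("ab", @TAB, "x", @LF)
str1 = StrCat(str1, "a-b", @TAB, "xxx", @LF)
str2 = ItemSort(str1, @LF)
ra1_itm = Arrayize(str2, @LF)
; Second list: same two names with the aliases swapped
str1 = StrCat("a-b", @TAB, "x", @LF)
str1 = StrCat(str1, "ab", @TAB, "xxx", @LF)
str2 = ItemSort(str1, @LF)
ra2_itm = Arrayize(str2, @LF)
; Stop here in the debugger and compare ra1_itm with ra2_itm
Exit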

Answer:

Maybe a BinarySort?

User Reply:

I don't think the binary sort would solve the problem, as long as the record carries both the name and alias. It would speed things up though.

Actually, lstrcmp is a really neat option, but strcmp should still be available. But without an amendment, I think I can get the "index card" sort to do the job. If so, I'll post the code.

Answer:

With BinarySort you can sort on part of a record. As you're only concerned with hyphens, I think an indexed list or BinarySort is the way to go. Depending on the original format, the final format required, and the list size, you may find indexing is just as fast as using BinarySort. BinarySort requires fixed-size records, so you'd have to loop through all the records to get the maximum length of both fields, then loop through again and StrFix the fields.
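
The two-pass padding step described above might look something like the sketch below. It is only an illustration under stated assumptions: the sample records, the variable names, and the choice of a space as the pad character are all hypothetical. It stops once the fixed-size records are built and leaves out the BinarySort call itself.

```winbatch
; Sample data (hypothetical): name and alias separated by a TAB,
; one record per line
records = StrCat("ab", @TAB, "x", @LF, "a-b", @TAB, "xxx")
count = ItemCount(records, @LF)

; Pass 1: find the longest name and the longest alias
maxName = 0
maxAlias = 0
For i = 1 To count
   rec = ItemExtract(i, records, @LF)
   maxName = Max(maxName, StrLen(ItemExtract(1, rec, @TAB)))
   maxAlias = Max(maxAlias, StrLen(ItemExtract(2, rec, @TAB)))
Next

; Pass 2: pad both fields with spaces so every record has the same size
fixed = ""
For i = 1 To count
   rec = ItemExtract(i, records, @LF)
   recName = StrFix(ItemExtract(1, rec, @TAB), " ", maxName)
   recAlias = StrFix(ItemExtract(2, rec, @TAB), " ", maxAlias)
   fixed = StrCat(fixed, recName, recAlias)
Next

; Each record is now recSize characters long, with the name field first
recSize = maxName + maxAlias
```

With this layout the name occupies the first maxName characters of each record, so a sort keyed on that part alone leaves the alias out of the comparison.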
Article ID:   W16446
File Created: 2005:02:18:12:20:46
Last Updated: 2005:02:18:12:20:46