Tuesday, January 10, 2012

SQL Server Soundex Functions



What is Soundex?

One of the common problems encountered when attempting to find information in a database is the presence of incorrectly spelled words or names. When executing queries that look for matching text, misspellings will not be found. To alleviate this problem, you can use phonetic algorithms such as Soundex.

Soundex is an algorithm that generates a simple code for a string. A Soundex code consists of a letter and three digits that are based upon the way in which the word will be pronounced by an English speaker. Words that sound similar but that are spelled differently often generate the same Soundex code. This allows queries to be run that return rows of data base upon the similarity of words.

To give an example of a Soundex code, consider the word "calendar". Another, similar-sounding word is "Collendar". Both of these words have a Soundex code of "C453". A misspelled variation, "calenda" also has the same Soundex code. One pitfall of Soundex is that it does not work well with strings that contain spaces or punctuation. The encoding of "Co-lender" gives a result of "C000" even though it sounds similar to Calender. However, if you remove all non-letter characters you find that "Colender" has the same Soundex code, "C453".

Soundex Functions

The examples in this article include queries that are executed against the JoBS database. This is the database that has been created and populated throughout the course of the tutorial. If you do not have an up-to-date copy of this database, please download and execute the script using the link at the top of this page.

Soundex

T-SQL's Soundex function is used to generate the Soundex code for character data. The string to be encoded is passed as the only argument. The following script shows the Soundex codes for the words mentioned above.

PRINT soundex('calendar')   -- C453 PRINT soundex('collender')  -- C453 PRINT soundex('calenda')    -- C453 PRINT soundex('colender')   -- C453 PRINT soundex('co-lender')  -- C000

Difference

The second Soundex function allows a comparison of two strings to be made using Soundex codes. The result of theDifference function is an integer between zero and four. A value of zero indicates the worst possible match. A value of four is the best possible match. In most cases it is more appropriate to make a comparison using the Difference function than by manually comparing Soundex codes.

The following example is a query against the Customers table of the JoBS database. This represents a search for the user-entered surname, "Moons". The query returns all of the customers, with the closest phonetic matches appearing first. The closest matching surname is "Munoz".

SELECT     FirstName,     LastName,     difference('Moons', LastName) AS Difference FROM     Customers ORDER BY     difference('Moons', LastName) DESC

Another common use of phonetic algorithms is to show alternative searches when an initial search has returned few or no results. The query below shows a single alternative when searching for the surname, "Moons".

SELECT TOP 1     'Did you mean "' + LastName + '"?' AS Message FROM     Customers ORDER BY     difference('Moons', LastName) DESC  -- Did you mean "Munoz"?


No comments:

Post a Comment