Wednesday, June 9, 2010

Full-Text Indexing in SQL Server


The most commonly used indexes in a SQL Server database are clustered and nonclustered indexes that are organized in a B-tree structure. You can create these types of indexes on most columns in a table or a view, except those columns configured with large object (LOB) data types, such as text and varchar(max). Although this limitation is not a problem in many cases, there will be times when you'll want to query such column types. However, without indexes defined on the columns, the query engine must perform a full table scan to locate the necessary data. But there is a solution—full-text search.

Full-text search refers to the functionality in SQL Server that supports full-text queries against character-based data. These types of queries can include words and phrases as well as multiple forms of a word or phrase. To support full-text queries, full-text indexes must be implemented on the columns referenced in the query. The columns can be configured with character data types (such as char and varchar) or with binary data types (such as varbinary and image). A full-text index is made up of word tokens that are derived from the text being indexed. For example, if the indexed text contains the phrase "tables can include indexes," the full-text index would contain four tokens: "tables," "can," "include," and "indexes." Because the list of tokens can be easily searched, full-text queries can quickly locate the necessary records.

In this article, I explain how to implement full-text indexing in your SQL Server 2005 or 2008 database, and I provide a number of examples to demonstrate how this is done. For these examples, I used the following script to create and populate the ProductDocsAdventureWorks2008 sample database in SQL Server 2008: table in the

USE AdventureWorks2008 GO IF OBJECT_ID (N'ProductDocs', N'U') IS NOT NULL DROP TABLE ProductDocs GO CREATE TABLE ProductDocs (   DocID INT NOT NULL IDENTITY,   DocTitle NVARCHAR(50) NOT NULL,   DocFilename NVARCHAR(400) NOT NULL,   FileExtension NVARCHAR(8) NOT NULL,   DocSummary NVARCHAR(MAX) NULL,   DocContent VARBINARY(MAX) NULL,   CONSTRAINT [PK_ProductDocs_DocID] PRIMARY KEY CLUSTERED (DocID ASC) ) GO INSERT INTO ProductDocs(DocTitle, DocFilename, FileExtension, DocSummary, DocContent) SELECT Title, FileName, FileExtension, DocumentSummary, Document FROM Production.Document

GO

If you're using the AdventureWorks sample database in SQL Server 2005, simply change the name of the database in the USE statement.

To implement full-text indexing in SQL Server, you should take the following steps:

  1. Create a full-text catalog, if necessary.
  2. Create the full-text index.
  3. Modify the list of noise words (SQL Server 2005) or stop words (SQL Server 2008), if necessary.
  4. Modify the thesaurus for the language being used, if necessary.

These steps are the same in both SQL Server 2005 and 2008, although the specific features supported in each version differ. Part of that difference results from the fact that full-text search is now integrated in the SQL Server 2008 database engine. Prior to 2008, full-text search was based on the MSSearch engine, which sits outside SQL Server and is part of the Microsoft Office group.

Creating the Full-Text Catalog

A full-text catalog provides a mechanism for organizing full-text indexes. Each catalog can contain zero or more indexes, but each index can be associated with only one catalog. Catalogs are implemented differently in SQL Server 2005 and 2008:

  • SQL Server 2005: A full-text catalog is a physical structure that must reside on the local hard drive associated with the SQL Server instance. Each catalog is part of a specific filegroup. If no filegroup is specified when the catalog is created, the default filegroup is used.
  • SQL Server 2008: A full-text catalog is a logical concept that refers to a group of full-text indexes. The catalog is not associated with a filegroup.

To create a full-text catalog in either version of SQL Server, you can use the CREATE FULLTEXT CATALOG statement, as shown in the following example:

USE AdventureWorks2008 GO CREATE FULLTEXT CATALOG ProductFTS

WITH ACCENT_SENSITIVITY = OFF

Full-text catalogs are associated with specific databases. In this case, I added the catalog to the AdventureWorks2008 database, but you can create the catalog on any user-defined database.

The only required clause in a CREATE FULLTEXT CATALOG statement is the CREATE FULLTEXT CATALOG clause, which requires that you provide a name for the catalog (in this case, ProductFTS). You can also specify whether this catalog is the default catalog, the catalog owner, or, as in the example above, whether the indexes in the catalog are accent sensitive or insensitive.

If the WITH ACCENT_SENSITIVITY clause is not specified, accent sensitivity is based on the default database collation. You can determine a database's collation by querying the sys.databases catalog view, as shown in the following example:

SELECT name, collation_name FROM sys.databases

WHERE name = 'AdventureWorks2008'

The follow table shows the statement's results:

Name

collation_name

AdventureWorks2008

SQL_Latin1_General_CP1_CI_AS

Note: The results shown above, along with the results shown for other examples in this article, are based on queries issued in SQL Server 2008 against the AdventureWorks2008 database. You results may be different depending on the SQL Server version and the database used.

As mentioned above, SQL Server 2005 full-text catalogs are part of a filegroup. If you're running the CREATE FULLTEXT CATALOGON FILEGROUP option. You can also specify the directory on which to store the catalog by using the IN PATH option. Both of these options are ignored in SQL Server 2008. For details about the ON FILEGROUP and IN PATH options, see the topic "CREATE FULLTEXT CATALOG (Transact-SQL)" in SQL Server 2005 Books Online. statement against SQL Server 2005, you can specify the name of the filegroup by using the

After you create a full-text catalog, you can use the sys.fulltext_catalogs catalog view to verify that the catalog has been created:

SELECT fulltext_catalog_id, name FROM sys.fulltext_catalogs

In this case, the statement returns the results shown in the following table:

fulltext_catalog_id

name

5

AW2008FullTextCatalog

10

ProductFTS

Notice that the ProductFTS full-text catalog is included in the results. Also notice that in this case the fulltext_catalog_id value is 10. The ID is a good number to note because you can use it in other statements that are related to the full-text catalog.

Creating the Full-Text Index

After you create your full-text catalog, you're ready to create your full-text index. You can then associate the index with the new catalog. If you don't specify a catalog when you create the index, the index is associated with the database's default catalog, whether it is the system catalog or a user-defined catalog that has been configured as the default.

A full-text index is defined at the table level, and only one full-text index can be defined per table. For a table to support a full-text index, a unique index must be defined on that table. In addition, the index must be defined on a single column and be non-nullable. This column is referred to as the key index in the full-text index definition. For best performance, the key index should be defined on a column configured with an integer data type. Often, the primary key is a good candidate for a key index.

You can define a full-text index by using the CREATE FULLTEXT INDEX statement, as shown in the following example.

CREATE FULLTEXT INDEX ON ProductDocs (DocSummary, DocContent TYPE COLUMN FileExtension LANGUAGE 1033) KEY INDEX PK_ProductDocs_DocID ON ProductFTS

WITH STOPLIST = SYSTEM

The first line of the statement includes the ON clause, which specifies the table name (in this case, ProductDocs). The statement's next line is a list of the columns that should be indexed (DocSummary and DocContent).

Notice that the DocContent column definition includes the TYPE COLUMN clause and the LANGUAGE clause. The TYPE COLUMN clause is necessary when the indexed column stores binary data. The TYPE COLUMN clause specifies the name of a different column in the table that stores the file extension for the binary data. For example, the binary data might be a .doc file or .xls file. SQL Server uses the column specified in the TYPE COLUMN clause to associate the binary data with the program, such as Word or Excel.

You can use the sys.fulltext_document_types catalog view to return a list of the document types supported by full-text search, as shown in the following statement:

SELECT document_type, version, manufacturer

FROM sys.fulltext_document_types

The following table provides a sample of some of the document types (indicated by their file extension) supported by full-text search:

document_type

Version

.ascx

12.0.6828.0

.asm

12.0.6828.0

.asp

12.0.6828.0

.aspx

12.0.6828.0

.bat

12.0.6828.0

.c

12.0.6828.0

.cmd

12.0.6828.0

.cpp

12.0.6828.0

.cxx

12.0.6828.0

.def

12.0.6828.0

.dic

12.0.6828.0

.doc

2006.0.6001.16503

The second clause in the DocContent column definition is LANGUAGE. For each column that you include in your index, you can specify the language of the documents in that column. You reference the language by using the locale identifier (LCID). You can view a list of the identifiers and their associated languages by using the sys.fulltext_languages catalog view:

SELECT * FROM sys.fulltext_languages

ORDER BY lcid

The following table shows a part of the result set. As you can see, 1033, the ID I use in the LANGUAGE clause, is English.

lcid

Name

0

Neutral

1025

Arabic

1026

Bulgarian

1027

Catalan

1028

Traditional Chinese

1031

German

1033

English

1036

French

1037

Hebrew

1039

Icelandic

1040

Italian

The next line of the CREATE FULLTEXT INDEX statement in the preceding example is the KEY INDEX clause. This is the name of the unique key index (in this case, PK_ProductDocs_DocID) that is defined on the ProductDocs table. Be sure to specify the index name, and not the column name, when defining your full-text index.

Following the KEY INDEX clause in the full-text index definition is the ON clause, which specifies the name of the full-text catalog (ProductFTS) that the index will join. In SQL Server 2008, you can also specify a filegroup where the index will be stored. However, this option isn't available in SQL Server 2005 because filegroup association is at the catalog level.

The final clause in the example CREATE FULLTEXT INDEX statement is WITH STOPLIST. This option, available only in SQL Server 2008, lets you specify the name of the stoplist that will be used for this index. In this case, the system stoplist is used, but you can instead specify a user-defined stoplist. (Stoplists are covered in more detail later in the article.)

After you create a full-text index on a table, you can use several catalog views to verify the index has been created and is associated with the correct catalog. The following SELECT statement joins the sys.tables, sys.fulltext_indexes, and sys.fulltext_catalogs views:

SELECT t.name AS TableName, c.name AS FTCatalogName FROM sys.tables t JOIN sys.fulltext_indexes i   ON t.object_id = i.object_id JOIN sys.fulltext_catalogs c

  ON i.fulltext_catalog_id = c.fulltext_catalog_id

As you can see in the following results, the ProductDocs table is associated with the ProductFTS catalog. Only tables with full-text indexes are listed in the results.

TableName

FTCatalogName

ProductReview

AW2008FullTextCatalog

Document

AW2008FullTextCatalog

JobCandidate

AW2008FullTextCatalog

ProductDocs

ProductFTS

One of the advantages of SQL Server 2008 is that the index is now stored within the database. That means you can issue a query that lists the contents of the index, something you cannot do in SQL Server 2005. The following SELECT statement uses the sys.dm_fts_index_keywords dynamic management function to return the list of terms stored in the full-text index created on the ProductDocs table:

SELECT display_term, column_id, document_count FROM sys.dm_fts_index_keywords

  (DB_ID('AdventureWorks2008'), OBJECT_ID('ProductDocs'))

The results returned by the statement include the indexed terms, along with the column ID and document count (number of rows) that contain the term. The column ID is based on the order the columns are defined in the table definition. In the ProductDocs table, the two indexed columns are DocSummary (the fifth column defined) and DocContent (the sixth column defined). The following table shows the first 25 terms stored in the ProductDocs full-text index.

display_term

column_id

document_count

100

5

1

100

6

1

1000

6

1

150

6

1

16

6

2

20

6

1

2000

6

1

20w

6

1

23ft

6

1

248

6

1

250

6

1

3000

6

1

44

6

1

48

6

1

500

5

1

500

6

1

618

6

1

619

6

1

620

6

1

Above

6

2

Absorbing

6

1

Acceptable

6

1

Accessories

6

1

Add

6

1

Adding

6

1

Modifying the List of Noise Words or Stop Words

When implementing full-text indexing in SQL Server, the area in which you will probably see the greatest differences between SQL Server 2005 and 2008 is in the way each version handles noise words or stop word.

Noise Words in SQL Server 2005

Noise words are those words that are automatically removed from a full-text index when that index is created. For example, in the phrase "an apple and an orange," the words "an" and "and" are considered noise words and are not be included in the index. Only "apple" and "orange" are tokenized and added to the index.

SQL Server 2005 defines noise words in a set of text files, which by default, are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. Each text file is associated with a specific language and contains the noise words associated in that language. For example, the English noise words are stored in the noiseENG.txt file. The following list shows a sample of some of the words in the file:

about

1

after

2

all

also

3

an

4

and

5

another

6

any

7

are

8

as

9

at

0

be

$

because

been

before

being

between

both

but

by

came

can

come

could

did

do

You can edit a noise word file in any text editor. Simply add or remove words as necessary, and then save your changes. (Be sure to save a copy of the original file before changing it.)

Note: After you modify a noise word file, you must repopulate the applicable full-text indexes for the changes to take effect.

Stop Words in SQL Server 2008

SQL Server 2008 uses stop words, not noise words. Stop words are saved to stoplists, which are stored within SQL Server. You can create a stoplist by using a CREATE FULLTEXT STOPLIST statement, as shown in the following example:

CREATE FULLTEXT STOPLIST ProductSL

FROM SYSTEM STOPLIST;

In the first line, the statement creates a stoplist named ProductSL. The second line retrieves the stop words from the system stoplist and uses those stop words to populate the new stoplist. You can verify that the stoplist has been created by querying the sys.fulltext_stoplists catalog view:

SELECT stoplist_id, name FROM sys.fulltext_stoplists

As you can see in the following results, the ProductSL stoplist has been added to the database, and the stoplist ID is 5.

stoplist_id

Name

5

ProductSL

After you create a stoplist, you can view its contents by querying the sys.fulltext_stopwords catalog view, as shown in the following statement:

SELECT stopword FROM sys.fulltext_stopwords

WHERE stoplist_id = 5 AND language_id = 1033

Notice that I specified the stoplist ID and language ID to retrieve only the necessary content. The following table shows a partial list of the stop words saved to the ProductSL stoplist.

Stopword

About

After

All

Also

An

And

another

Any

Are

As

At

Be

because

been

before

being

between

both

But

By

came

Can

come

could

Did

Do

does

each

else

For

from

Get

Got

Had

Has

have

He

Her

here

Him

Himself

His

How

In SQL Server 2008, you can add words to or remove words from a stoplist by using the ALTER FULLTEXT STOPLIST statement. The following statement adds the word "nuts" to the stoplist:

ALTER FULLTEXT STOPLIST ProductSL

ADD 'nuts' LANGUAGE 1033;

When you generate a full-text index based on this stoplist, any occurrence of "nuts" will be treated as a stop word and be removed from the index.

SQL Server 2008 also includes the sys.dm_fts_parser dynamic management function. The function lets you test how SQL Server will tokenize a string based on a specific language and stoplist. In the following SELECT statement, the function parses the phrase "testing for fruit and nuts, any type of nut."

SELECT special_term, display_term FROM sys.dm_fts_parser

  (' "testing for fruit and nuts, any type of nut" ', 1033, 5, 0)

The first argument in the function is the string that will be parsed, the second argument is the language ID, the third argument is the stoplist ID, and the fourth argument specifies whether the parsing should be accent insensitive (0) or accent sensitive (1). The following table shows the query results. Notice that "nuts" is considered a noise word.

special_term

display_term

Exact Match

Testing

Noise Word

For

Exact Match

Fruit

Noise Word

And

Noise Word

Nuts

Noise Word

Any

Exact Match

Type

Noise Word

Of

Exact Match

Nut

If you want to drop "nuts" from the stoplist, you can use the following statement:

ALTER FULLTEXT STOPLIST ProductSL

DROP 'nuts' LANGUAGE 1033;

Now when you use the sys.dm_fts_parser function to view the stoplist, you will receive the following results, which show that "nuts" is now considered an exact match.

special_term

display_term

Exact Match

Testing

Noise Word

For

Exact Match

Fruit

Noise Word

And

Exact Match

Nuts

Noise Word

Any

Exact Match

Type

Noise Word

Of

Exact Match

Nut

Earlier in the article, when I created the full-text index on the ProductDocs table, I specified that the index should use the system stoplist. I can now modify the index definition to instead use the ProductSL index. In the following ALTER FULLTEXT INDEXProductSL: statement, I set the stoplist to

ALTER FULLTEXT INDEX ON ProductDocs

SET STOPLIST ProductSL

Note: After you modify a stoplist or full-text index definition, you must repopulate the applicable full-text indexes for the changes to take effect.

Modifying the Full-Text Thesaurus

Both SQL Server 2005 and SQL Server 2008 provide a set of XML thesaurus files that let you define synonyms to support full-text queries. For example, you can define a set of synonyms for "song," "tune," and "music." That way, whenever a query is issued against any one of these terms, the results include every other term defined in the set.

SQL Server includes a thesaurus file for each language supported by full-text search. The files are named according to the language they support. For instance, the thesaurus file that supports English synonyms is named tseng.xml.

In a default installation of SQL Server 2005, the thesaurus files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL.1\MSSQL\FTData. In a default installation of SQL Server 2008, the files are stored in the folder $SQL_Server_Install_Path\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\FTData.

Each thesaurus file comes preconfigured with the following XML code. For example, the following code is included in the tseng.xml file:

<XML ID="Microsoft Search Thesaurus">

 

<!--  Commented out (SQL Server 2008)

 

    <thesaurus xmlns="x-schema:tsSchema.xml">

   <diacritics_sensitive>0</diacritics_sensitive>

        <expansion>

            <sub>Internet Explorer</sub>

            <sub>IE</sub>

            <sub>IE5</sub>

        </expansion>

        <replacement>

            <pat>NT5</pat>

            <pat>W2K</pat>

            <sub>Windows 2000</sub>

        </replacement>

        <expansion>

            <sub>run</sub>

            <sub>jog</sub>

        </expansion>

    </thesaurus>

-->

</XML>

As you can see, most of the code is commented out, but you can modify the file as necessary. The first element to note is <diacritics_sensitive>. This element determines whether the thesaurus file is accent sensitive (1) or insensitive (0). By default, a thesaurus file is accent insensitive.

The other two elements worth noting are <expansion> and <replacement>. The first lets you define expansion sets, and the second lets you define replacement sets:

  • Expansion set: A set of terms that can be substituted for each other. For example, in the sample <expansion> element in the tseng.xml file, three terms are included: "Internet Explorer," "IE," and "IE5." As a result, if a full-text query includes the term "Internet Explorer," the query results can include content that contains any of the three terms.
  • Replacement set: A set of terms in which one term is substituted for another. For example, in the example <replacement> element in the tseng.xml file, the terms "NT5" and "W2K" are replaced by "Windows 2000." As a result, if a full-text query includes the term "NT5" or "W2K," the search results include only content that contains "Windows 2000."

You can edit a thesaurus file in any XML or text editor. After you save your changes, you must take steps to ensure that the thesaurus is applied to your full-text searches. In SQL Server 2005, this means you must restart the full-text search engine. However, in SQL Server 2008, you need only run the sys.sp_fulltext_load_thesaurus_file system stored procedure, as shown in the following example:

EXEC sys.sp_fulltext_load_thesaurus_file 1033

As you can see, when you run the stored procedure, you must specify the LCID. The stored procedure then parses and loads the data from the applicable thesaurus file, without having to restart the full-text engine.


No comments:

Post a Comment