Wednesday, May 5, 2010

Creating CSV Files Using BCP and Stored Procedures

This article focuses on the use of the Bulk Copy Program (BCP) to create CSV files. Although it is possible to create a CSV file using DTS or SSIS, using BCP is often simpler and more efficient.

I use master..sysobjects as an example table to extract.


Create a simple CSV file

The simplest way to copy data from a database table to file is to use the basic BCP command:


BCP master..sysobjects out c:\sysobjects.txt -c -t, -T –S<servername>


bcp accdb.dbo.CSVTest out c:\temp\author.bcp -n -Usa -P -S AW4\SQLEXPRESS


The basic format for the BCP command for creating a CSV file is as follows:

BCP <table> out <filename> <switches>

The switches used here are:

  • -c Output in ASCII with the default field terminator (tab) and row terminator (crlf)
  • -t override the field terminator with ","
  • -T use a trusted connection. Note that U –P may be used for username/password
  • -S connect to this server to execute the command

Note that, like DTS/SSIS, BCP is a client utility, hence you need to supply the connection information.

For transfer of data between SQL servers, in place of –c, use –n or -N for native data format (-N = Unicode). This is much faster and avoids data conversion problems. Please refer to the previous BOL link for the complete format of the BCP command.

As BCP is a command line utility it is executed from T-SQL using xp_cmdshell. Create a directory called BCP on your c: drive and execute:

declare @sql varchar(8000)

select @sql = 'bcp master..sysobjects out
c:\bcp\sysobjects.txt -c -t, -T -S'
+ @@servername

exec master..xp_cmdshell @sql

Other field and row delimiters

Often, character data includes commas which will be interpreted in the file as a field terminator. A simple way to cater for this is to use a different field delimiter which does not appear in the data. Commonly used characters are pipe (|):

declare @sql varchar(8000)

select @sql = 'bcp master..sysobjects out
c:\bcp\sysobjects.txt -c –t| -T -S'
+ @@servername

execmaster..xp_cmdshell@sql

And caret (^):

declare @sql varchar(8000)

select @sql = 'bcp master..sysobjects out
c:\bcp\sysobjects.txt -c –t^ -T -S'
+ @@servername

exec master..xp_cmdshell @sql

The terminators are not limited to a single character, so if necessary you can use (|^):

declare @sql varchar(8000)

select @sql = 'bcp master..sysobjects out
c:\bcp\sysobjects.txt -c –t|^ -T -S'
+ @@servername

exec master..xp_cmdshell @sql

Note that this will increase the size of the file and so slow down the import/export. Another way to cater for embedded commas is to "quote encapsulate" character data – more about that later.

It is unusual to need to change the row terminator from the default of crlf (carriage return, line feed) but occasionally you will need the cr or lf on its own.

To do this use the hex value cr = 13 = 0x0D, lf = 10 = 0x0A. If you can't remember these values they are easily obtained, as follows:

select cr = ascii('

')

select lf = ascii(right('

',1))

To use these in the BCP command for the row terminator the –r switch is used:

declare @sql varchar(8000)

select @sql = 'bcp master..sysobjects out
                 c:\bcp\sysobjects.txt -c -t, -r0x0D -T -S'
+ @@servername

exec master..xp_cmdshell @sql

When the resulting file is opened in notepad the row terminators will not end the line – the character should appear as a non-ASCII character (a little oblong). If opened or copied to query analyser (or management studio) these will be interpreted as line breaks and the file will be more readable.

The terminator characters can be encapsulated with double quotes in the command to allow for space. I like to do this for anything other than a single character delimiter. The following example gives a crlf row terminator and | ^ field terminator.

declare @sql varchar(8000)
select @sql = 'bcp master..sysobjects out
                 c:\bcp\sysobjects.txt -c -t"| ^" -r"0x0D0A" -T -S'
+ @@servername
exec master..xp_cmdshell @sql

Formatting the extracted data

If you do not require all the columns/rows, or you wish to alter the data, a simple method is to create a view. This can be used in the BCP command in the same way as a table. The following view allows you to extract just the name, the formatted create date and time and then order the results according the date created (most recent first):

use tempdb
go
create view vw_bcpMasterSysobjects
as
  select top 100 percent
      name ,
      crdate = convert(varchar(8), crdate, 112) ,
      crtime = convert(varchar(8), crdate, 108)
   from master..sysobjects
   order by crdate desc
go
declare @sql varchar(8000)
select @sql = 'bcp tempdb..vw_bcpMasterSysobjects out
                 c:\bcp\sysobjects.txt -c -t, -T -S'
+ @@servername
exec master..xp_cmdshell @sql

Now we can quote encapsulate the extracted data by including the formatting in the view:

use tempdb
go
create view vw_bcpMasterSysobjects
as
   select top 100 percent
      name = '"' + name + '"' ,
      crdate = '"' + convert(varchar(8), crdate, 112) + '"' ,
      crtime = '"' + convert(varchar(8), crdate, 108) + '"'
   from master..sysobjects
   order by crdate desc
go
declare @sql varchar(8000)
select @sql = 'bcp tempdb..vw_bcpMasterSysobjects out
                 c:\bcp\sysobjects.txt -c -t, -T -S'
+ @@servername
exec master..xp_cmdshell @sql

Note:
Quote encapsulation can also be performed by use of a format file.

Complete control – stored procedures

If you examined the BCP syntax in BOL, you might have noticed that it is possible to extract from a query by using the queryout keyword in place of out.

So, for example, an equivalent but neater version of the previous code extract would place the ORDER BY clause in the BCP statement rather than the view:

use tempdb
go
create view vw_bcpMasterSysobjects
as
   select
      name = '"' + name + '"' ,
      crdate = '"' + convert(varchar(8), crdate, 112) + '"' ,
      crtime = '"' + convert(varchar(8), crdate, 108) + '"'
   from master..sysobjects
go
declare @sql varchar(8000)
select @sql = 'bcp "select * from tempdb..vw_bcpMasterSysobjects
order by crdate desc, crtime desc"
queryout c:\bcp\sysobjects.txt -c -t, -T -S'
+ @@servername
exec master..xp_cmdshell @sql

Of course, strictly speaking, the view was not necessary at all as the query could have been included in the BCP statement but that can get difficult to maintain.

The queryout method allows you to BCP from the result of a stored procedure, which opens up a lot of possibilities and offers a lot of control over the file format. For anything other than a simple table extract I would tend to use this method rather than a view. I would also format each line within the stored procedure. This means that the formatting can be tested independently from the file creation.

Employing this technique, the above extract becomes:

use tempdb
go
create proc s_bcpMasterSysobjects
as
   select   '"' + name + '"'
            + ',' + '"' + convert(varchar(8), crdate, 112) + '"'
            + ',' + '"' + convert(varchar(8), crdate, 108) + '"'
   from master..sysobjects
   order by crdate desc
go
declare @sql varchar(8000)
select @sql = 'bcp "exec tempdb..s_bcpMasterSysobjects"
queryout c:\bcp\sysobjects.txt -c -t, -T -S'
+ @@servername
exec master..xp_cmdshell @sql

More complex formatting

We can now change the format of the extracted data to anything we require. For example, to include column names at the top and a rowcount at the end:

use tempdb
go
create proc s_bcpMasterSysobjects
as
      set nocount on
     
      create table #a (seq int, crdate datetime, s varchar(1000))
      -- header - column headers
      insert      #a (seq, crdate, s)
      select      1, null,
                  '"name","crdate","crtime"'
     
      -- data
      insert      #a (seq, crdate, s)
      select      2, crdate,
                          '"' + name + '"'
                  + ',' + '"' + convert(varchar(8), crdate, 112) + '"'
                  + ',' + '"' + convert(varchar(8), crdate, 108) + '"'
      from master..sysobjects

      -- trailer - rowcount
      insert      #a (seq, crdate, s)
      select      3, null,
                  'rowcount = ' + convert(varchar(20),count(*)-1)
      from #a

      select      s
      from  #a
      order by seq, crdate desc
go

This stored procedure will format the data as required, but when it is run via the BCP command it receives an error:

"Invalid object name '#a'"

Using the profiler you can see that when using queryout, the query is called twice. BCP tries to get the result set format by calling the stored procedure with fmtonly. This does not execute the stored procedure but returns the format of the resultset. Unfortunately, it does not work if the stored procedure creates and uses a temp table. This problem can be circumvented by including a set fmtonly off command in the execution:

declare @sql varchar(8000)
select @sql = 'bcp "set fmtonly off exec tempdb..s_bcpMasterSysobjects"
queryout c:\bcp\sysobjects.txt -c -T -S'
+ @@servername
exec master..xp_cmdshell @sql

Note that the column terminator is not used since the resultset is a single column.

---Editor's Note---
This was the only piece of code that I failed to get working. The stored procedure executed fine outside the BCP, but when I tried to run the above command I received an error:

SQLState = HY010, NativeError = 0
Error = [Microsoft][SQL Native Client]Function sequence error
NULL

The author was unable to replicate this error. Is anyone aware of what might be causing this on my machine?
---End Editor's Note---

Look at this execution using the profiler and you will see set fmtonly off followed by set fmtonly on. There is a potential problem with this though: it means that the stored procedure is executed twice, once for the format and once to extract the data – due to the fmtonly setting, both these calls will produce a resultset. This needs to be taken into account when considering how long the process will take and its impact on your system.

As the stored procedure is executed twice it should not contain any updates of permanent data. Particularly if data is flagged as exported then the flagging will be performed on the first call and no data will be extracted for the second call. All in all, this method should be used with caution and comments (warnings) placed in the stored procedure for the aid of future maintenance.

To get round all of these problems, use the stored procedure to insert into a table then, from the BCP, extract from that table. This gives you added features that can be valuable on systems that are performing a lot of extracts:

  • It gives a record of the data extracted – if the file is lost it can be recreated.
  • The data extracted can be presented to a user for viewing from the database.
  • If there are problems with the file, the support can see the data without needing access to the file.

For this process, the format stored procedure, s_bcpMasterSysobjects, will insert into the table using an identifier and the BCP will extract using the same identifier. This process can be controlled be (a control) stored procedure which would allocate the identifier and pass it to the format stored procedure which inserted the extracted data into a table with that identifier. It then calls another stored procedure or in-line code to create files (via BCP) for each identifier which is not marked as complete. The identifier is flagged as complete after the file creation. In this way the file creation becomes independent of the extract of the data.

This identifier can be passed to both processes from a controlling stored procedure or the format stored procedure can get the next available identifier and the extract can flag the identifier as extracted – so it extracts everything that has not been extracted, allowing the export to be independent of the extract.

Which technique you use will depend on your system. The following code demonstrates the use of a controlling stored procedure:

use tempdb
go
create table Extract
      (
      Extract_ID  int ,
      Seq1        varchar(100) null ,
      Seq2        varchar(100) null ,
      Seq3        varchar(100) null ,
      Data        varchar(8000) ,
      InsertDate  datetime default getdate()
      )
go
create proc s_bcpMasterSysobjects
@ExtractID int
as
declare
@rowcount int
      set nocount on
     
      -- header - column headers
      insert      Extract (Extract_ID, Seq1, Data)
      select      @ExtractID,
                  '01' ,
                  '"name","crdate","crtime"'
     
      -- data
      insert      Extract (Extract_ID, Seq1, Seq2, Data)
      select      @ExtractID,
                  '02' ,
                  + convert(varchar(100), '99990101' - crdate, 121) ,
                          '"' + name + '"'
                  + ',' + '"' + convert(varchar(8), crdate, 112) + '"'
                  + ',' + '"' + convert(varchar(8), crdate, 108) + '"'
      from master..sysobjects

      select @rowcount = @@rowcount
     
      -- trailer - rowcount
      insert      Extract (Extract_ID, Seq1, Data)
      select      @ExtractID,
                  '03' ,
                  'rowcount = ' + convert(varchar(20),@rowcount)
go
create proc s_Extract
@ExtractID int
as
      select      Data
      from  Extract
      where Extract_ID = @ExtractID
      order by Seq1, Seq2, Seq3
go

Now the data is extracted via:

Create table ExportLog (Export_id int, Status varchar(20))

Insert ExportLog select 25, 'Extracting'
exec
tempdb..s_bcpMasterSysobjects 25
update
ExportLog set Status = 'Exporting' where Export_id = 25
declare
@sql varchar(8000)
select @sql = 'bcp "exec tempdb..s_Extract 25"
queryout c:\bcp\sysobjects.txt -c -T -S'
+ @@servername
exec master..xp_cmdshell @sql
update
ExportLog set Status = 'complete' where Export_id = 25

You can view the data extracted via:

exec tempdb..s_Extract 25

Alternatively, you can use the following, which will also give the date that the data was extracted:

select * from tempdb..Extract order by Seq1, Seq2, Seq3

Note:
For the extract I have given three sort columns Seq1, Seq2, Seq3 which are all order ascending. This means that the extracting stored procedure needs to place the values in ascending order (see the date manipulation). This could have ascending and descending sort columns or the use could be dependent on the extract type.

For a system which is performing a lot of small exports I will usually implement this process, and a similar process for imports.

Extracting all tables from a database

This is usually performed to transfer data between servers/databases and so should use the native data format. The easiest way is to create the BCP statements from a query then execute them:

select 'exec master..xp_cmdshell'
            + ' '''
            + 'bcp'
            + ' ' + TABLE_CATALOG + '.' + TABLE_SCHEMA + '.' + TABLE_NAME
            + ' out'
            + ' c:\bcp\'
            + TABLE_CATALOG + '.' + TABLE_SCHEMA + '.' + TABLE_NAME + '.bcp'
            + ' -N'
            + ' -T'
            + ' -S' + @@servername
            + ''''
from INFORMATION_SCHEMA.TABLES
where TABLE_TYPE = 'BASE TABLE'

The result of this will be a series of BCP statements of the form:

exec master..xp_cmdshell 'bcp tempdb.dbo.Extract out
c:\bcp\tempdb.dbo.Extract.bcp -N -T –S<servername>'

which will extract all tables in the database. To import just change the "out" to "in".

Summary

We have seen how to use BCP to perform a simple extract of a table to a CSV file, how to perform slightly more complex extracts using a view, extracting using a stored procedure to format the data and a system which allows the data to be extracted and saved in a table and viewed/exported from there.


No comments:

Post a Comment