SQLSolutions and SSIS: Using the Script Task in SSIS to Process Data Files When They Arrive

Introduction

In two previous articles, Using the WMI Event Watcher Task in SSIS to Process Data Files and Using the Konesans File Watcher Task in SSIS to Process Data Files, I demonstrated two techniques that allowed us to process a data file into a database table as soon as it arrived in a directory. Both techniques allow us to complete our task yet both have advantages and disadvantages. Neither will be right for every project, and in some cases neither will be useful.
In this article I will demonstrate a third technique for completing the same task, this time using only an SSIS Script Task that leverages the .NET Framework 4.0 class System.IO.FileSystemWatcher. This technique aims to address some of the disadvantages of the other two techniques, but as with most things there are tradeoffs. We will walk through the development of a package that uses this technique, and at the end we'll compare all three. Hopefully this will prepare you to choose the technique that best suits your project.
To recap our problem case, we want our SSIS package to wait for an Excel file and then load that file into a table as soon as it arrives. Like the techniques that use the WMI Event Watcher Task or the Konesans File Watcher Task, this technique produces far less system activity, in terms of starting and stopping the SSIS package, than techniques where an SSIS package is run every minute (or every few minutes) to check for a file and exit if one is not available. Using the Script Task to alert our package that it can begin processing also significantly increases our chances of seeing only a very short delay between when a file becomes available and when processing of that file begins.

Requirements

Here are some facts about the scenario we will be following in this article:

A directory will be designated as the “drop directory”. This is the directory (i.e. Windows folder) where Excel files that need to be processed by our SSIS package will be delivered (i.e. dropped) by a business user, or by any automated process.
Each Excel file will have a Worksheet (also known as a Tab) named “Products” containing a fixed set of columns. Each file may contain a different number of data rows but the format will remain consistent.
The name of the Excel file will change from day to day, but it will follow a pattern: BusinessData.YYYYMMDD.xlsx, where YYYYMMDD is the date on which the file is delivered (e.g. BusinessData.20120903.xlsx).
A minimum of zero files and a maximum of one file will be delivered for processing per business day.
The data file needs to be processed as soon as it arrives.
The SSIS package should wait indefinitely for a file to arrive. *
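
To make the naming convention concrete, here is a minimal Python sketch (it is not part of the SSIS package) that checks a file name against the BusinessData.YYYYMMDD.xlsx pattern and extracts the delivery date. The helper name delivery_date and the choice to match case-insensitively are assumptions made purely for illustration.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of the SSIS package: it checks a name against
# the BusinessData.YYYYMMDD.xlsx convention described in the requirements.
FILE_NAME_PATTERN = re.compile(r"^BusinessData\.(\d{8})\.xlsx$", re.IGNORECASE)


def delivery_date(file_name):
    """Return the date embedded in the file name, or None if it does not match."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date, e.g. 20120230.
        return None


print(delivery_date("BusinessData.20120903.xlsx"))  # 2012-09-03
print(delivery_date("BusinessData.20120230.xlsx"))  # None
print(delivery_date("Report.20120903.xlsx"))        # None
```

A check like this is stricter than a wildcard mask, because it also rejects names whose eight digits are not a valid date.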

Here is a list of the primary technologies used to produce this demo:

Windows 7
SQL Server 2012 Evaluation Edition (Database Engine, Management Tools Complete, Integration Services and SQL Server Data Tools (SSDT) selected during install)
SQL Server 2012 Data Tools (SSDT) for SSIS development
ACE OLE DB Driver 12.0 (install Microsoft Access Database Engine 2010 -or- Access or Excel 2007/2010)

* A note about allowing an SSIS package to run continuously: there is nothing inherently wrong with setting up an SSIS package to behave in this way. The Script Task offers us a very lightweight way (in terms of resources) to watch for new files being added to a directory.
Regarding CPU, the Script Task uses effectively no CPU while waiting for a file to arrive. Regarding memory, an SSIS package that runs continuously stays loaded in memory while it is running and watching for files, just like any other running program. For discussion purposes, however, it occupies effectively the same amount of memory as a package built to run once per minute.
The purpose of running a package continuously is to avoid loading and unloading the package from memory each time we need to check whether a file has arrived. The tradeoff is between the constant memory use but low overall CPU use of a package that runs continuously and the CPU and memory allocation and de-allocation overhead of loading and unloading a package many times in the course of a day.

Design

We stated our requirements above, decided to use SSIS 2012 to process our Excel files, and now need a high-level outline for how we will accomplish the task using the chosen technology. Generically this will be our processing logic:

When a file arrives, move it from the “drop directory” to a “processing directory” to reduce the chances of anything interfering with the file while it is being processed.
Clear the database staging table where the Excel data will be loaded.
Load the file from the “processing directory” into the staging table.
Move the file from the “processing directory” to an “archive directory”.

In terms of SSIS the above outline translates into the following:

Use a Script Task to watch for new files in the “drop directory”.
Move the file from the “drop directory” to the “processing directory” using a File System Task.
Clear the database staging table using an Execute SQL Task.
Load the Excel file into the staging table from the “processing directory” using a Data Flow.
Move the file from the “processing directory” to the “archive directory” using a File System Task.
Exit.
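
The move, clear, load and archive sequence can also be sketched outside of SSIS. The Python below is only an illustration under stated assumptions: an in-memory SQLite table stands in for the SQL Server staging table, the function name process_file is hypothetical, and the Excel read is replaced by a caller-supplied read_rows function (here a small CSV reader) so that the sketch needs no Excel library.

```python
import csv
import shutil
import sqlite3
import tempfile
from pathlib import Path


def process_file(drop_file, processing_dir, archive_dir, connection, read_rows):
    """Move, clear, load, archive: the four steps from the outline above."""
    # 1. Move the file out of the drop directory before touching it.
    processing_file = Path(processing_dir) / Path(drop_file).name
    shutil.move(str(drop_file), str(processing_file))

    # 2. Clear the staging table, then 3. load the file into it.
    #    The "with" block commits on success and rolls back on an exception.
    with connection:
        connection.execute("DELETE FROM ProductStaging")
        connection.executemany(
            "INSERT INTO ProductStaging (ProductName, PricePerUnit) VALUES (?, ?)",
            read_rows(processing_file),
        )

    # 4. Archive the file once the load has committed.
    archived_file = Path(archive_dir) / processing_file.name
    shutil.move(str(processing_file), str(archived_file))
    return archived_file


def read_csv_rows(path):
    # Stand-in for the Excel source: one "name,price" pair per line.
    with open(path, newline="") as handle:
        return [(name, float(price)) for name, price in csv.reader(handle)]


# Demo in a temporary directory; the file holds CSV text despite its name.
root = Path(tempfile.mkdtemp())
for name in ("Processing", "Archive"):
    (root / name).mkdir()
sample = root / "BusinessData.20120903.xlsx"
sample.write_text("Widget,1.50\nGadget,20.00\n")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ProductStaging (ProductName TEXT, PricePerUnit REAL)")
archived = process_file(sample, root / "Processing", root / "Archive", db, read_csv_rows)
row_count = db.execute("SELECT COUNT(*) FROM ProductStaging").fetchone()[0]
print(archived.name, row_count)  # BusinessData.20120903.xlsx 2
```

Moving the file before loading it mirrors the design goal above: once the file is in the processing directory, nothing that writes to the drop directory can interfere with the load.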

Here is a preview of the finished package:

Step-By-Step Development

In this section we will walk through the creation of the SSIS package, step-by-step.
Before we begin, ensure you have an Excel driver capable of connecting to Excel 2007/2010 (xlsx) documents. You can check the Drivers tab in the ODBC Data Source Administrator, accessible via the Control Panel, to see if you have the ACE ODBC Driver. If you have this driver, then you should also have the OLE DB Driver, which is what SSIS will actually use to connect to the Excel file.

If you have neither Excel 2007 (or above) nor a driver that lists *.xlsx as a supported Excel file extension, you can obtain the ACE drivers by installing the Microsoft Access Database Engine 2010, a free download. See the References section for a link to the download.
Let's get started with our development.
1. Create a new directory named ExcelDrop. I used C:\@\ for this demo (e.g. C:\@\ExcelDrop\).
2. Under ExcelDrop create two directories, Processing and Archive. Your directory tree should then contain C:\@\ExcelDrop\Processing\ and C:\@\ExcelDrop\Archive\.
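
If you prefer to script the two directory steps, the following minimal Python sketch creates the same layout. It assumes a temporary directory as a stand-in for C:\@\ so that it can run on any machine; substitute your own base path.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for C:\@\ so the sketch runs anywhere.
base = Path(tempfile.mkdtemp())
drop = base / "ExcelDrop"
for child in ("Processing", "Archive"):
    # parents=True also creates ExcelDrop itself on the first pass.
    (drop / child).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in drop.iterdir()))  # ['Archive', 'Processing']
```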

3. Stage file BusinessData.YYYYMMDD.xlsx (available in the download attached to this article) in the Processing directory. During the development phase the SSIS Excel Connector will need the file to be present in order to generate the metadata necessary to define the file import within the SSIS package. Once development is complete the file can be moved, and the package will have no trouble operating properly at runtime if the file is not present.
4. Create a new database and staging table in your SQL Server 2012 instance:

USE [master]
GO
CREATE DATABASE [Inventory]
GO
USE [Inventory]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE dbo.ProductStaging
(
    ProductStagingId INT IDENTITY(1,1) NOT NULL,
    ProductName NVARCHAR(255) NULL,
    PricePerUnit DECIMAL(12, 2) NULL,
    CONSTRAINT PK_ProductStaging
        PRIMARY KEY CLUSTERED (ProductStagingId ASC)
)
GO

5. Create a new SQL Server 2012 SSIS Package using SQL Server Data Tools (SSDT) and name it LoadExcelWorkbook. See the References section at the end of this article if you need assistance getting started with SSDT and SSIS 2012.
6. Add the following Variables to the SSIS package. Note that some expressions build on other variables, so create all of the variables first and then go back and set the Expression on each variable that needs one.

Variable Name	Type	Value	Expression
ArchivePath	String	<derived from Expression>	@[User::WatcherInputDropPath] + "Archive\\"
ProcessingFile	String	<derived from Expression>	@[User::ProcessingPath] + @[User::WatcherOutputFileName]
ProcessingPath	String	<derived from Expression>	@[User::WatcherInputDropPath] + "Processing\\"
WatcherInputDropPath	String	C:\@\ExcelDrop\	n/a
WatcherInputFileMask	String	BusinessData*.xlsx	n/a
WatcherInputFindExistingFiles	Boolean	False	n/a
WatcherInputIncludeSubdirectories	Boolean	False	n/a
WatcherInputTimeoutAsWarning	Boolean	False	n/a
WatcherInputTimeoutSeconds	Int32	0	n/a
WatcherOutputFileFullName	String	<empty string>	n/a
WatcherOutputFileName	String	<empty string>	n/a

For additional information about SSIS Expressions see the References links at the end of this article.
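
To show what the three derived variables evaluate to, here is a small Python sketch that performs the same string concatenations as the Expression column. The Python names are hypothetical stand-ins for the SSIS variables, and the file name is the example from the requirements, as if the watcher had just reported it.

```python
# Hypothetical stand-ins for the SSIS variables; the file name is the example
# from the requirements, as if the watcher had just reported it.
watcher_input_drop_path = "C:\\@\\ExcelDrop\\"
watcher_output_file_name = "BusinessData.20120903.xlsx"

# Same concatenations as the Expression column ("\\" is one backslash).
archive_path = watcher_input_drop_path + "Archive\\"
processing_path = watcher_input_drop_path + "Processing\\"
processing_file = processing_path + watcher_output_file_name

print(archive_path)     # C:\@\ExcelDrop\Archive\
print(processing_file)  # C:\@\ExcelDrop\Processing\BusinessData.20120903.xlsx
```

Because the expressions simply append to WatcherInputDropPath, the trailing backslash on that variable matters: without it the derived paths would run the directory names together.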
The variables prefixed with Watcher control the behavior of the Script Task that will watch for files. No other configuration or code changes are necessary outside of setting the values for these variables, i.e. you do not need to know or code any C# to begin using this solution.
Here is a description of each variable:

Variable Name	Description
WatcherInputDropPath	Path to watch. This can be a path on a local drive letter, a mapped drive letter or a UNC path. The path must have a trailing backslash, e.g. C:\@\ExcelDrop\
WatcherInputFileMask	Pattern of file to watch. Windows file name wildcards are allowed, e.g. *
WatcherInputFindExistingFiles	If True then existing files will be detected. If False only newly created files will be detected.
WatcherInputIncludeSubdirectories	If True subdirectories of WatcherInputDropPath are watched. This variable can be used in conjunction with WatcherInputFindExistingFiles to achieve the desired behavior.
WatcherInputTimeoutAsWarning	If True and the timeout period is reached before a file is detected then only a Warning is raised. Set to True and use in conjunction with an OnWarning Event Handler to react to timeouts without an exception being raised. Variable value is ignored if WatcherInputTimeoutSeconds = 0.
WatcherInputTimeoutSeconds	Number of seconds to watch for a file. Set to 0 to wait indefinitely.
WatcherOutputFileFullName	This variable is used by the Script Task to store the full file name (including path) of the detected file, e.g. C:\@\ExcelDrop\BusinessData.YYYYMMDD.xlsx
WatcherOutputFileName	This variable is used by the Script Task to store the file name of the detected file, e.g. BusinessData.YYYYMMDD.xlsx
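
The way these settings interact can be illustrated with a short Python sketch. Note that this is a swapped-in technique: it polls the directory instead of using the event-driven .NET FileSystemWatcher that the Script Task relies on, and it returns None on timeout rather than raising an error or warning. The function name watch_for_file, its parameters, the poll interval and the mask BusinessData*.xlsx are all assumptions chosen to mirror the Watcher variables.

```python
import fnmatch
import os
import tempfile
import time


def watch_for_file(drop_path, file_mask, find_existing=False,
                   include_subdirectories=False, timeout_seconds=0,
                   poll_interval=0.5):
    """Return the full name of the first matching file, or None on timeout."""

    def matching_files():
        found = set()
        for folder, _subfolders, names in os.walk(drop_path):
            for name in names:
                # fnmatch applies shell-style wildcards such as * and ?.
                if fnmatch.fnmatch(name, file_mask):
                    found.add(os.path.join(folder, name))
            if not include_subdirectories:
                break  # only the top-level directory is inspected
        return found

    # Files already present are ignored unless find_existing is set.
    ignored = set() if find_existing else matching_files()
    # A timeout of 0 means wait indefinitely, as with WatcherInputTimeoutSeconds.
    deadline = (time.monotonic() + timeout_seconds) if timeout_seconds > 0 else None

    while True:
        candidates = matching_files() - ignored
        if candidates:
            return sorted(candidates)[0]
        if deadline is not None and time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


# Demo: one file is already present in a temporary drop directory.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "BusinessData.20120903.xlsx"), "w").close()

existing = watch_for_file(demo_dir, "BusinessData*.xlsx", find_existing=True)
timed_out = watch_for_file(demo_dir, "BusinessData*.xlsx", timeout_seconds=1)
print(os.path.basename(existing), timed_out)  # BusinessData.20120903.xlsx None
```

The returned full name corresponds to WatcherOutputFileFullName, and os.path.basename of it corresponds to WatcherOutputFileName. The second call times out because the only matching file already existed when watching began and find_existing was left False.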

7. Drag a new Script Task from the Common section in the Toolbox onto the Control Flow design surface.

8. Double-click the Script Task to open the Script Task Editor. First go to the General page and name the task Watch for Incoming Excel File:

Criterion	WMI Event Watcher Task	Konesans File Watcher Task	Script Task as implemented in this article using the .NET FileSystemWatcher class
Included in SSIS	Yes	No	Yes
Must Run Third-party Installer on Each Server and Developer Workstation	No	Yes	No
Support Provided By	Microsoft	Konesans	Microsoft
Code and Configuration Provided by Microsoft or Viewable	WMI Event Watcher Task provided by Microsoft.	Konesans File Watcher Task not provided by Microsoft and is closed source.	Script Task provided by Microsoft. Script content is custom written and is open source.
User Base	Worldwide	Relatively Limited	Worldwide
Support for Watching Network Location	Yes	Yes	Yes
Permissions Required for Watching Network Location	Elevated permission on remote OS	Permission to Network Location	Permission to Network Location
Can Recognize Existing Files	No	Yes	Yes
Can Recognize Files in Subdirectories	Yes via carefully crafted and slow-running WQL	Yes	Yes
Waits Until Exclusive Access to File is Achieved Before Returning Control	No	Yes	Yes
Skills Additional to SSIS Required to Effectively Support or Extend	WQL	None	C# Scripting

Tuesday, September 2, 2014

