Monday, December 16, 2013

Performance tuning in SSIS

http://sqlmag.com/sql-server-integration-services/designing-ssis-packages-high-performance

The process for designing SQL Server Integration Services (SSIS) packages is typically iterative. You start by getting the components working individually or in small sets, then concentrate on ensuring that the components will work in the correct sequence. During later iterations, you add in more components or adjust properties to perform error handling. Then, in a final pass, you might add in abstractions, taking advantage of variables and expressions to enable runtime changes for your package. But your work is not yet complete at this stage.
Before you put your package into production, you need to take some more time to review your package with an eye toward preventing, or at least mitigating, performance problems. Sooner or later, the business environment is likely to change, requiring you to perform the same work within a shorter time span or to process higher data volumes than originally planned. I’ll describe areas within a package that can cause performance problems and offer suggestions for modifying the package design to get better performance.
Bear in mind that there are various factors that can affect the performance of SSIS packages. Several of these factors are completely external to SSIS, such as the source systems that you’re using for data extractions, disk speed and configuration, NICs, network bandwidth, and the amount of memory available on the server executing the packages. I won’t address how to resolve bottlenecks caused by external factors, but rather concentrate on specific strategies you can take within the design of your packages. I assume that you already have a general understanding of SSIS architecture and package development practices.

Understanding Control Flow Performance

Every SSIS package has at least one task in the control flow. If you add multiple tasks to the control flow, you can direct the sequence of tasks by adding precedence constraints to connect a series of tasks from beginning to end. You can even group these tasks together in containers. Apart from the Data Flow Task, which I’ll describe in more detail later, the performance of each individual task depends on the external systems with which the task must communicate to get its job done. Therefore, the only way within the package to speed up control flow processing is to have tasks or containers (collectively called executables) run in parallel. Figure 1 illustrates parallelism in Container A and sequential workflow in Container B. In this example, the containers themselves also execute in parallel.
Figure 1: Running executables in parallel to speed up control flow processing

The package property that determines how many executables can run in parallel is MaxConcurrentExecutables, as shown in Figure 2. The default is -1, which means that the control flow engine will use the number of logical processors on the server plus 2. For example, if you’re executing the package with the default setting on a quad-core server, you can have up to six executables running in parallel.
Figure 2: Using the MaxConcurrentExecutables property to configure how many executables can run in parallel

In some cases, you won’t see much change if you try to increase the MaxConcurrentExecutables value. However, one scenario with potential for improvement is when you have a package with tasks that must wait for a response from external systems and your server is dedicated to executing packages. In that case, you might be able to boost parallelism by replacing the MaxConcurrentExecutables value with a higher number. Start by increasing the value to the number of processors plus 3, then test the package in Business Intelligence Development Studio (BIDS) to see whether the number of tasks executing in parallel increased. If so, continue to increment the property value by 1, test, and repeat until no further parallelization is possible, or you have no more tasks to run in parallel.

Understanding Data Flow Performance

The Data Flow Task is used to retrieve data from one or more sources, optionally change the structure or content of the data in transit, and send the data to one or more destinations. In business intelligence (BI) applications, this task is used to perform extraction, transformation, and loading (ETL) operations.
SSIS uses a pipeline engine with an in-memory buffer architecture to efficiently manage Data Flow Task operations. Performance of the pipeline engine largely depends on the number of records moving through the pipeline and the number of buffers required to accomplish the transformations and to move the data to the destinations. Therefore, to optimize performance, you need to understand how data flow components and the Data Flow Task properties affect pipeline throughput and buffer requirements.

Finding Bottlenecks and Establishing Baselines

Before you start making changes to the data flow, you should establish some baselines so that you can identify the slower components in the data flow and better assess the impact of any subsequent changes. Let’s assume that your package has a Data Flow Task with a single source, one transformation, and one destination. You should execute the Data Flow Task multiple times to calculate an average for various conditions, which you record in a table. Here are the steps:
  1. Create a table with the column and row headings shown in Table 1. (A sketch of such a tracking table appears after these steps.)
  2. Execute the package a total of 10 times, adding the times to column A. This will determine the time required to execute the package when all three components are intact.
  3. Replace the destination component with a Multicast transformation, which allows the task to complete successfully without adding any noticeable overhead to the task execution time. Execute the package a total of 10 times, adding the times to column B.
  4. Remove the first transformation in the data flow and connect the source directly to the Multicast transformation to measure only the data extraction process. Execute the package a total of 10 times, recording the times in column C.
  5. Calculate the average for each of the three base measurements.
  6. Derive the processing time of the transformation component by subtracting the averaged value in column C from the averaged value in column B.
  7. Derive the time required to load the destination by subtracting the averaged value in column B from the averaged value in column A.
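Table 1 isn't reproduced here, but a minimal sketch of a tracking table and the derivation queries, using purely illustrative names, might look like this in T-SQL:

-- Hypothetical tracking table for the baseline timings (names are illustrative)
CREATE TABLE dbo.DataFlowBaseline (
    RunNumber            int          NOT NULL, -- executions 1 through 10
    FullPackageSeconds   decimal(9,2) NULL,     -- column A: source + transformation + destination
    NoDestinationSeconds decimal(9,2) NULL,     -- column B: destination replaced with a Multicast
    SourceOnlySeconds    decimal(9,2) NULL      -- column C: source connected directly to the Multicast
);

-- Average each measurement and derive the component times (steps 5 through 7)
SELECT AVG(FullPackageSeconds)                               AS AvgA,
       AVG(NoDestinationSeconds)                             AS AvgB,
       AVG(SourceOnlySeconds)                                AS AvgC,
       AVG(NoDestinationSeconds) - AVG(SourceOnlySeconds)    AS TransformationSeconds, -- B minus C
       AVG(FullPackageSeconds)   - AVG(NoDestinationSeconds) AS DestinationSeconds     -- A minus B
FROM dbo.DataFlowBaseline;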
Most packages have more than one transformation, and some packages have multiple sources or destinations. For these packages, you simply add more columns to the table as needed and work through combinations of components to isolate processing time for each component.
Based on the test results in the table, you can identify the component that consumes most of the processing time and focus your tuning efforts on it. Furthermore, as you implement changes, you’ll be able to quantify the difference in performance and determine whether the resulting difference justifies altering your package permanently.

Fixing Source Component Bottlenecks

Bringing data into the data flow pipeline is naturally constrained by the speed with which the source can actually deliver the data requested by the source component. The time required simply to extract the data therefore represents the fastest possible execution time for the Data Flow Task, but it's pointless to extract data without doing something with it. Any subsequent transformation or destination that you add to the data flow pipeline adds performance overhead, which I'll discuss separately.
Assuming that you optimized the external factors (e.g., disk and server resources) and identified the source component that’s the bottleneck, what can you do within the data flow design to improve that component’s performance? Here are some guidelines:
Reduce the number of columns. Regardless of the type of data source you’re using, select only the columns that are needed for transformations and columns that must be loaded into the destination. That way, memory isn’t wasted on data that won’t be used downstream.
Reduce the number of rows. When working with a relational source, use a WHERE clause to filter the data to the rows you need rather than rely on a transformation in the pipeline to eliminate rows. When working with large flat files, break up the source file where possible and process the resulting files using sequential Data Flow Tasks to minimize I/O contention.
Reduce column width. The column width is affected by the data type. You might need to recast data types in a relational source query or manually edit column widths for other sources to reduce the column width to the smallest possible size that can still accommodate the data coming from the source. SSIS works more efficiently with smaller data types.
Use the SQL Command option instead of the Table or View option for relational sources. The OLE DB source lets you retrieve data from a table or view in the OLE DB data source or use a SQL command to retrieve data from the OLE DB data source. When you use the Table or View option to retrieve data from a table or view, the data flow pipeline engine issues a command using the OPENROWSET function to get the data. When you use the SQL Command option, the engine uses the faster sp_executesql procedure.
Use the fast parsing mode for Flat File sources. If the data environment doesn't need to support locale-specific formats for integer, date, and time data types, you can boost performance by setting the FastParse property to True for the relevant columns in the Advanced Editor for the Flat File source.
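As a rough illustration of the first four of these guidelines, a source query used with the SQL Command option might look like the following sketch; the table and column names are hypothetical, not from the article:

-- Select only the needed columns, filter rows at the source, and narrow wide types
SELECT OrderID,
       CAST(CustomerName AS varchar(50)) AS CustomerName, -- recast to the smallest width that still fits the data
       OrderDate
FROM dbo.SalesOrders
WHERE OrderDate >= '20131201';                            -- filter here rather than in the pipeline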

Fixing Transformation Bottlenecks

The performance of transformations in the data flow depends on the efficient use of buffers. High-performance transformations, known as synchronous transformations (see Figure 3), operate on data in an existing buffer, thereby avoiding the overhead of copying data from one buffer to another. Synchronous transformations can either manipulate data in place (e.g., Derived Column transformation) or add a new column (e.g., Audit transformation). But even this group of transformations consists of two subgroups: streaming transformations, which are the fastest of all the transformations, and row-based transformations, which must perform an operation row by row and consequently are slower.
Figure 3: Categorizing transformations into the synchronous and asynchronous groups

After a row-based transformation finishes processing the rows in a buffer, that buffer becomes available to the next transformation in the data flow. If the next transformation is also synchronous, the data stays in the same buffer; performance of the data flow is optimal in this case. If the next transformation in the data flow is an asynchronous transformation, a new buffer is required (which takes time to create and load) and a new thread is introduced into the data flow.
Moving from one memory buffer to another is still faster than the I/O operations required to store data on disk between transformations, which is why SSIS in general performs very well. The problem arises when the data flow engine must handle large data volumes, causing it to run out of memory to allocate for buffers. When memory is no longer available, the data flow engine swaps data to disk and consequently slows the data flow performance.
There are two types of asynchronous transformations—partially blocking transformations and blocking transformations—as you can see in Figure 3. Partially blocking transformations have multiple inputs, but a single output. After the transformation writes a row to the new buffer, the row is available to the next transformation even while the partially blocking transformation continues processing remaining rows. Blocking transformations, by contrast, must read all rows in the pipeline before writing rows to the new buffer, which prevents downstream components from starting. Consequently, blocking transformations are most likely to be the cause of performance problems in the data flow when large volumes of data are involved.
When baseline performance testing identifies transformations as the bottleneck, there might be another way to design the data flow to accomplish the same goal. Consider the following design alternatives:
Perform transformations in the source query. When you're using a relational source, performing transformations in the source query lets you take advantage of the database engine's capabilities while freeing up resources on the package execution server if it's separate from your source server. Data type casting, cleansing (using ISNULL or TRIM functions, for example), aggregations, and sorts are common operations that you can perform in the source query. This approach works only with relational sources. If you perform the sort in the source query, you also need to set the IsSorted property of the source component's output to True and configure the SortKeyPosition property for each sort-key column using the Advanced Editor for the source component, so that downstream components know the data is already sorted.
Separate aggregate operations. If you need to perform different aggregations (each of which is associated with a separate destination) but you can’t perform the aggregations in the source query, you can create a separate Aggregate transformation for each destination rather than create multiple outputs for the transformation. That way, downstream processing for the more summarized aggregations (with fewer rows) can continue without waiting for the less summarized aggregations (with more rows) to finish. In addition, you can configure the AutoExtendFactor property to tune memory separately for each Aggregate transform.
Remove unneeded columns. After asynchronous operations, there might be columns that were used for a transformation and left in the pipeline even though they won’t be loaded into the destination. By eliminating these columns, you can minimize the memory requirements for the new buffer.
Reduce the number of rows. If the data flow doesn’t have a relational source, you won’t be able to use a WHERE clause in the source query to eliminate rows. However, row reduction is still an important goal for performance tuning, so add a Conditional Split transformation immediately after the source component to filter data as early as possible in the data flow.
Optimize the cache for Lookup transformations. Using the Lookup transformation without a cache is slow, so having all rows available in memory using the Full Cache option is a better approach. However, pipeline processing is blocked until the cache is loaded. So, you should follow some of the same recommendations discussed in the "Fixing Source Component Bottlenecks" section: reduce the number of columns and rows to manage memory usage, and use the SQL Command option to retrieve data more quickly.
Preload the cache for Lookup transformations. If you’re using SQL Server 2008 or later, you can use the Cache Transform transformation to preload the cache in a separate data flow. You can even use a separate package to load the cache as a special type of Raw File that persists until you delete it. This cache file loads into memory much more quickly than loading the cache directly from an OLE DB source.
Replace the Slowly Changing Dimension (SCD) transformation with Merge Join and Conditional Split transformations. The SCD transformation is notoriously slow with large dimensions because it has to perform a row-by-row lookup to check whether a dimension row in the pipeline already exists in the target dimension. You could use a Lookup transformation instead, but the memory and time required to load the cache might still create a performance bottleneck. An alternative design is to use a Merge Join transformation with a LEFT JOIN to match sorted source records on the left with sorted dimension records on the right. You then add a Conditional Split transformation to evaluate columns added from the dimension source. If those columns are null, there is no match, which means the row is a new record. If there’s a match, you use the Conditional Split transformation to separate Type 1 from Type 2 processing.
Use an Execute SQL task instead of an OLE DB Command transformation. Another commonly slow component is the OLE DB Command transformation, which performs operations row by row. If the purpose of the OLE DB Command transformation is to update a table with a value obtained in the pipeline, it might be faster to load the pipeline data into a staging table, then use an Execute SQL task in the control flow to perform a set-based UPDATE operation.
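For example, rather than letting an OLE DB Command transformation issue one UPDATE per pipeline row, the data flow could write the rows to a staging table and an Execute SQL task could then run a single set-based statement along these lines (the table and column names are assumptions for illustration):

-- Set-based alternative to a row-by-row OLE DB Command (object names are hypothetical)
UPDATE t
SET    t.CurrentFlag = s.CurrentFlag,
       t.LastUpdated = GETDATE()
FROM   dbo.DimCustomer AS t
JOIN   dbo.Staging_CustomerUpdates AS s
       ON s.CustomerKey = t.CustomerKey;

TRUNCATE TABLE dbo.Staging_CustomerUpdates; -- clear the staging table for the next run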

Fixing Destination Component Bottlenecks

Sometimes the performance bottleneck in a package is caused by the destination component. Although external factors can affect the performance of destinations, there are some recommendations for improving the data flow design that relate to destinations:
Optimize the OLE DB destination. When using a table or view as the target for the data access method, you can choose whether to use the Fast Load option. This option performs a bulk insert, which is much faster than the row-by-row insert that would otherwise occur. Also, enabling the Table Lock option will help improve performance. If you have a lot of rows to insert into a table with a clustered index, the data flow engine must first sort all rows in the pipeline before performing the insert. You can get better performance if you specify the number of rows per batch, which will reduce the volume of data to be sorted at one time. Another alternative is to drop the index before loading the data into the destination and rebuild the index after the load is complete.
Use a SQL Server destination instead of an OLE DB destination. If the target is a SQL Server database, you can get up to 25 percent faster performance from the SQL Server destination because the pipeline engine can bypass the network layer during the data load. However, the package must execute on the same server as the target database and all data types in the pipeline must match the data types in the target table.
Set data types explicitly. An OLE DB destination can recast data types using a Data Conversion transformation, but there’s a performance cost. You can avoid adding a Data Conversion transformation to the data flow by casting the data type explicitly in the source query for a relational source or by manually editing the data type for each column in the Flat File Connection Manager for a Flat File source.

Evaluating Buffer Efficiency

After working through any bottlenecks to optimize performance, your next step is to evaluate how efficiently SSIS can place data in buffers. To do this, you must enable the BufferSizeTuning logging event, then execute your package and examine the log. You’ll be able to see how many rows were put into the buffer. You can compare this value to the DefaultBufferMaxRows property, which has a default value of 10,000 rows, as shown in Figure 4. If the buffer is actually getting 1,000 rows instead of 10,000, you can adjust DefaultBufferSize to make the buffer larger so that it can hold more rows. Alternatively, you can eliminate columns or reduce column widths by changing data types to try to get more rows into the buffer.
Figure 4: Checking the DefaultBufferMaxRows property's value
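To put rough numbers on this (purely as an illustration): the default DefaultBufferSize is about 10MB (10,485,760 bytes). If the estimated row size coming out of the source is around 1,200 bytes, only roughly 8,700 rows fit in a buffer, well short of the 10,000-row DefaultBufferMaxRows ceiling. Trimming the row to about 500 bytes, or raising DefaultBufferSize, would let each buffer carry the full 10,000 rows.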

Another property that you can adjust for performance is EngineThreads. This property’s default value is 10 in SQL Server 2008 and later, but only 5 in SQL Server 2005. If you have a server with multiple processors, you can increase this value. You can test your package by incrementally changing the value to determine whether the additional threads improve performance.

Designing in High Performance

As you develop your SSIS packages, it’s important to consider how the design choices you make can affect package performance. Often, there’s more than one way to accomplish an objective in SSIS, but attaining high performance requires you to understand the performance differences in possible approaches.

Practices to increase the performance of SSIS Packages

http://www.codetails.com/bbc172038/increasing-the-performance-of-ssis-package-best-practices/20121107

SSIS (SQL Server Integration Services) is an improved version of DTS (Data Transformation Services) and is widely used for ETL (Extract, Transform and Load) operations. It provides an excellent set of tasks and features that can be used in all data-related activities, such as extracting data and applying transformations, with great performance. Best of all, SSIS, which is widely used by Database Developers and Database Administrators, comes free with a SQL Server installation, so if you are using SQL Server you do not need to buy any other license for SSIS.
This article describes best practices to enhance and boost the performance of an SSIS package. You may find several articles on the Internet about boosting performance, but this one aims to give you a holistic picture and represents my own views.

Handling Indexes During Data Transformation

When working with data in an RDBMS, indexing plays a vital role in speeding up search operations. Clustered and non-clustered indexes may both exist on the source and/or destination tables. When the amount of data is small, it doesn't matter much whether indexes are left in place during the load or managed separately. With a large amount of data involved in the transformation, however, unmanaged indexes hurt performance greatly; you can see a drastic difference in performance when you load a large amount of data into an un-indexed table instead.
To handle this, first make the target table a heap by dropping all the indexes created on it. Then transfer the data into the heap. At the end of the transformation, re-create the indexes on the target table. The following flowchart summarizes this activity, and a T-SQL sketch appears after the legend.

Legend: NCI = Non Clustered Indexes; CI = Clustered Indexes
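A minimal T-SQL sketch of the same pattern, using hypothetical table and index names, could run in Execute SQL tasks before and after the data flow:

-- Before the data flow: make the target a heap (object names are hypothetical)
DROP INDEX IX_TargetTable_BusinessKey ON dbo.TargetTable; -- drop non-clustered indexes first
DROP INDEX CI_TargetTable ON dbo.TargetTable;             -- then the clustered index, if the data volume justifies it

-- ... the Data Flow task loads dbo.TargetTable here ...

-- After the data flow: re-create the indexes
CREATE CLUSTERED INDEX CI_TargetTable ON dbo.TargetTable (TargetKey);
CREATE NONCLUSTERED INDEX IX_TargetTable_BusinessKey ON dbo.TargetTable (BusinessKey);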

Avoid using SELECT * in the Data Flow task

The Data Flow task is one of the most commonly used tasks in Integration Services. Within a Data Flow task, SSIS uses buffers to transfer and transform the data. The size of each buffer depends on several factors, one of which is the estimated row size. The estimated row size is determined by summing the maximum sizes of all the columns in a row:
Estimated Row Size = sum of the maximum sizes of all columns in a row
So the more columns you bring in from the source, the fewer rows fit into each buffer.
Hence, it is recommended to select at the source only the columns that are required at the destination. If you use "select *" on the source connection, every column is carried through the pipeline, which reduces the number of rows each buffer can hold. Even if you need all the columns from the source, it is better to write out the column names rather than use the asterisk (*); when you use the asterisk, the source makes an extra round trip to gather the metadata about the columns.
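As a purely illustrative example, a row made up of an INT (4 bytes), a DATETIME (8 bytes), and an NVARCHAR(100) column (up to 200 bytes) has an estimated row size of roughly 212 bytes; letting SELECT * also drag along an unused NVARCHAR(4000) column would add up to 8,000 bytes to that estimate and shrink the number of rows per buffer accordingly.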

Avoid using Table or View name in Variables

SSIS provides a great feature: you can use variables and expressions within Script tasks, connection managers, and the like. However, it is recommended that you avoid putting just a table name or view name in a variable for a source or destination connection, because SSIS then internally queries the object named in the variable with a "select *" clause. As described earlier, this can reduce your performance.
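If you do need run-time flexibility, one common alternative is to store a complete query in the variable and use the SQL command from variable data access mode instead, so that only the required columns are requested. The query below is a hypothetical example of such a variable value:

-- Hypothetical value of a string variable such as User::SourceQuery
SELECT OrderID, CustomerName, OrderDate FROM dbo.SalesOrders WHERE OrderDate >= '20131201'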

Destination Connection settings

While setting up the OLE DB Destination connection, consider the following settings.
Data Access Mode: This setting provides the 'fast load' option, which internally uses a BULK INSERT statement to load data into the destination table instead of the simple row-by-row INSERT statements used by the other options. Unless you have a reason to change it, keep the default of fast load. If you select the 'fast load' option, there are also a couple of other settings you can use, as discussed below.
Keep Identity: By default this setting is unchecked, which means the destination table (if it has an identity column) will generate identity values on its own. If you check this setting, the dataflow engine ensures that the source identity values are preserved and the same values are inserted into the destination table.
Keep Nulls: Again, by default this setting is unchecked, which means that if a NULL value comes from the source for a column that has a default constraint defined on the destination, the default value is inserted. If you check this option, the default constraint on the destination column is ignored and the NULL from the source column is preserved in the destination.
Table Lock: By default this setting is checked, and the recommendation is to leave it checked unless the same table is being used by some other process at the same time. It specifies that a table lock will be acquired on the destination table instead of multiple row-level locks, which could otherwise escalate into locking problems.
Check Constraints: Again, by default this setting is checked, and the recommendation is to uncheck it if you are sure the incoming data will not violate the constraints of the destination table. This setting specifies that the dataflow pipeline engine will validate the incoming data against the constraints of the target table; unchecking it improves the performance of the data load.

Setting up Rows Per Batch and Maximum Insert Commit Size Settings

Rows per batch: The default value for this setting is -1, which means all incoming rows are treated as a single batch. You can change this default behavior and break the incoming rows into multiple batches; the only allowed value is a positive integer that specifies the maximum number of rows in a batch.
Maximum insert commit size: The default value for this setting is 2147483647 (the largest value for a 4-byte integer), which means all incoming rows are committed once, on successful completion. You can specify a smaller positive value to commit after that many rows. You might be wondering whether committing several times puts extra overhead on the dataflow engine. It does, but at the same time it relieves the pressure on the transaction log and tempdb, which would otherwise grow tremendously during high-volume data transfers.
These two settings are very important for the health of tempdb and the transaction log. For example, if you leave Maximum insert commit size at its default, the transaction log and tempdb keep growing during the load, and if you are transferring a high volume of data, tempdb can run out of space and your load will fail. So it is recommended that you set these values to optimum values based on your environment.
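As a rough worked example: if a load moves 10 million rows and Maximum insert commit size is set to 500,000, the destination commits roughly 20 times during the load; each commit releases log space that can be reused (under the simple recovery model) instead of a single open transaction holding all 10 million rows until the end. The figures here are only illustrative, not recommendations.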

Avoid Asynchronous Transformation wherever possible

The SSIS runtime executes every task other than the Data Flow task in the defined sequence. Whenever the runtime engine encounters a Data Flow task, it hands execution of that task over to the data flow pipeline engine.
The data flow pipeline engine breaks the execution of a Data Flow task into one or more execution trees and may execute two or more execution trees in parallel to achieve high performance.
A synchronous transformation gets a record, processes it, and passes it to the next transformation or destination in the sequence; the processing of a record does not depend on the other incoming rows.
An asynchronous transformation, by contrast, requires additional buffers for its output and does not reuse the incoming input buffers. A blocking asynchronous transformation also waits for all incoming rows to arrive before processing, which is why asynchronous transformations perform more slowly and should be avoided wherever possible. For example, instead of using the Sort transformation, you can get sorted results from the source itself by using an ORDER BY clause, as sketched below.
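If you push the sort into the source query, remember (as noted in the first article above) to mark the source output as sorted so that downstream components such as Merge Join recognize the order. A minimal sketch with hypothetical names:

-- Sort at the source instead of using the blocking Sort transformation
SELECT CustomerKey, CustomerName, City
FROM dbo.DimCustomer
ORDER BY CustomerKey;
-- then set IsSorted = True on the source output and SortKeyPosition = 1 on CustomerKey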

Use SSIS performance counters where required

SSIS has a great event-logging mechanism that helps you analyze the performance of the package and its components. SSIS also provides a set of Windows performance counters for monitoring the SSIS runtime and data flow pipeline engines.
For example,
  • The SSIS Package Instance counter indicates the number of SSIS packages running on the system;
  • The Rows read and Rows written counters indicate the total number of rows coming from the sources and the total number of rows provided to destinations;
  • The Buffers in use and Buffer memory counters indicate the total number of buffers created and the amount of memory they use;
  • Buffers spooled is a very important counter; it reports the number of buffers (not currently in use) that have been written to disk because physical memory ran low;
  • The BLOB bytes read, BLOB bytes written, and BLOB files in use counters describe BLOB data transfer: the number of BLOB bytes read and written, and the total number of files the data flow engine is currently using to spool BLOB data.

Thursday, December 12, 2013

5 Tips for Developing SQL Server Integration Services Packages

http://sqlmag.com/sql-server-integration-services/5-tips-developing-sql-server-integration-services-packages

SQL Server Business Intelligence Development Studio (BIDS) provides a powerful environment for developing SQL Server Integration Services (SSIS) packages. Once you get the hang of how to implement tasks in your control flow and configure your data flow, most efforts are fairly straightforward. Even so, you can sometimes run into situations that slow down development or stop progress altogether. Often these situations are easy to resolve if you know how to proceed. Here are five tips to help you avoid a few frustrations that might arise when developing SSIS packages so that you can keep your development efforts moving forward.

Tip #1: Control Design Time Validation

When you open an SSIS package in BIDS, SSIS Designer tries to connect to the defined data sources and verify the metadata associated with those data sources. If the data sources are unavailable, the database objects are locked, or the connect times are slow, you might have to wait an inordinate amount of time for the validation to complete or to learn that the validation failed. This can be particularly annoying if you have to reopen your packages often.
You can also run into validation problems if you try to add a component that relies on a database object not yet created. For example, suppose a control flow includes an Execute SQL task that creates a table. If you add a Data Flow task that populates the table with data, you'll receive a validation warning because SSIS can't validate the table -- it doesn't yet exist. And if you try to run the package, you'll receive a validation error.
Fortunately, SSIS lets you override the default behavior by allowing you to work offline or to configure properties that affect validation. To work offline, you simply enable the Work Offline option on the SSIS menu. You can access this option while you're working on the package or before you open it. (To do the latter, a different BIDS solution must be open so that the SSIS menu is displayed.) When you're ready for your package to be back online, you just disable the Work Offline option, and SSIS Designer returns to its default behavior.
The second method you can use to override the default validation behavior is to set one of the two validation-related properties in your package. First, you can set the DelayValidation property to True for a specific task. For instance, Figure 1 shows the property set to True for a Data Flow task. After you've configured the property, you can work with the task and run your package without receiving those validation warnings or errors.
Figure 1: Setting the DelayValidation Property on a Data Flow Task

Note, however, that the DelayValidation property is available only in executables (e.g., Control Flow tasks, the package itself). The property isn't available in data flow components. If you want to change the behavior of a specific data flow component, you must set its ValidateExternalMetadata property to False. Figure 2 shows the property set to False for an OLE DB Source component in the data flow.
Figure 2: Setting the ValidateExternalMetadata Property on an OLE DB Source Component

Setting the ValidateExternalMetadata property to False provides more granular control over your data flow than setting the DelayValidation property to True. However, when the ValidateExternalMetadata property is set to True (the default), it helps avoid locking issues when your package uses transactions. For this reason, you might want to stick with using DelayValidation at the task level, unless you have some overriding need to work at the component level.

Tip #2: Create a Package Template

For many SSIS developers, it's not uncommon to create multiple packages that contain many of the same components. Your packages might share variables, connection managers, event handlers, property settings, or tasks. Rather than implementing these same components each time you create a package, you can create a package template that includes all these components and base each new package on the template.
Although creating a template in BIDS is relatively simple, it's not particularly intuitive how you go about doing it. Basically, you create a package in SSIS Designer as you would any other package, although you should give it a name that makes it easily recognizable as a template, such as SsisPkgTemplate.
After you've added the components that you want to include in the template, save and close the package. Then go to the Windows Explorer folder where you store your SSIS project files and copy the template package you just created to the BIDS template folder. The location of the template folder can vary depending on the OS, whether you're working in a 32-bit or 64-bit environment, and whether you've installed BIDS or Visual Studio in its default location. On my system, I'm running 64-bit versions of Windows 7 and SQL Server 2008 R2, both installed on the C drive, with BIDS installed in its default location. As a result, the location of my template folder is C:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\PrivateAssemblies\ProjectItems\DataTransformationProject\DataTransformationItems.
After you copy your template package to the BIDS template folder, you can start using the template to create packages. In Solution Explorer, right-click the project name, point to Add, and click New Item. When the Add New Item dialog box appears, select the name of the template you just created, provide a name for the new package, and click Add.
After you create the new package, you need to generate a unique ID for it. To do so, locate the package's ID property, click the down arrow in the property value text box, then click Generate New ID. At this point, the new package will already include the components in the template. If needed, you can add more components.

Tip #3: Implement Package Configurations

When an SSIS package is moved to a new environment, the data source connections in the package often need to be updated. For example, if you develop an SSIS package using a test database as the data source, you need to update the data source connection so that it points to the production database when you deploy the package.
One way of performing the necessary updates is to manually open the package, modify the applicable property values (such as the SQL Server instance), and redeploy the package. Another method is to use package configurations that automatically supply those values to the package. Although the latter is the preferable approach, it's often not implemented.
Package configurations are easy to set up and modify, which makes supporting changing property values a lot simpler than modifying packages directly. A package configuration is a stored property value that can be updated outside of the package. If the package has been configured to reference package configurations, the package looks for the configurations at runtime, then applies the defined values to the applicable properties. You can modify the package configurations as often as you like, without having to open the package, and the package will use the most current values that have been defined.
Package configurations are useful not only for storing data source information but also for any situation in which property values might change. For instance, you might want to change the value of a variable that's used in an expression in a Script task or Execute SQL task.
To use package configurations, you must enable them on each package for which you want to implement them. With the package open, select the Package Configurations option from the SSIS menu. This launches the Package Configurations Organizer shown in Figure 3. To enable package configurations, select the Enable package configurations option.
Figure 3: Enabling Package Configurations in the Package Configurations Organizer

Next, you need to add the package configurations. To do so, click the Add button to launch the Package Configuration Wizard. When the wizard appears, click Next to skip the Welcome page and go to the Select Configuration Type page shown in Figure 4.
Figure 4: Selecting the Type of Configuration in the Package Configuration Wizard

On the Select Configuration Type page, select one of the following five types of package configurations from the Configuration type drop-down list:
  • XML configuration file. You can use an XML configuration file to store one or more property values for one or more packages. You can have the wizard generate a new XML configuration file for the current package's property values or add the package's property values to an existing XML configuration file. Either way, you need only select the properties you want to include. The wizard will then write the properties' current values to the XML configuration file.
  • Environment variable. You can use an environment variable to store the value of a package property. You can associate this type of configuration with only one property, so you need to create an environment variable for each property you want to update. (Unlike the XML configuration file, the environment variable isn't created for you by the wizard. You must create it yourself.)
  •  Registry entry. You can use a registry entry to store the value of a package property. The key must exist under the registry hive HKEY_CURRENT_USER. You can associate this type of configuration with only one property, so you must create a registry entry for each property you want to update. (The wizard doesn't create the registry entry for you. You must create it yourself.)
  • Parent package variable. Within a child package, you can use a variable that exists in the parent package to store the property value you want to pass to the child package. As with environment variables and registry entries, you must specify a variable for each property you want the parent package to update.
  • SQL Server. You can use a SQL Server table to store one or more property values for one or more packages. The table must include columns for the property value, the SSIS data type associated with that value, the path that points to the property, and a configuration filter that specifies which package configuration the row is associated with. The wizard can create the table for you, or you can create it before running the wizard. In either case, you must specify the connection manager that the package should use to connect to the database where the table will be located. As with the XML configuration file, you must also select the properties you want to include. The wizard will then write the properties' current values to the table.
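For reference, the table that the wizard generates is commonly named [SSIS Configurations] and has a structure along the lines of the sketch below; treat this as an approximation and verify it against what the wizard actually creates in your environment:

CREATE TABLE [dbo].[SSIS Configurations] (
    ConfigurationFilter nvarchar(255) NOT NULL, -- groups the rows that belong to one configuration
    ConfiguredValue     nvarchar(255) NULL,     -- the property value applied at runtime
    PackagePath         nvarchar(255) NOT NULL, -- path to the property, such as \Package.Variables[User::ServerName].Properties[Value]
    ConfiguredValueType nvarchar(20)  NOT NULL  -- SSIS data type of the value, such as String
);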
After you select the package configuration type, follow the steps in the Package Configuration Wizard to finish creating the package configuration. You can then update the package configuration as necessary in order to pass in the correct property value at runtime.
Note that you're not limited to one configuration type. You can combine types or use more than one type. If you plan to update only one or two properties at runtime, you might find that an environment variable or registry entry is the easiest way to go. For multiple properties, using an XML configuration file or SQL Server table might be a better solution.
You also want to take into account security issues when determining the best solution for your environment. For instance, an XML configuration file might not provide the security you need. A SQL Server database might provide a more secure environment than a file share that's out of your control. However, you'd have to make sure that the SQL Server instance would always be available when needed.

Tip #4: Remove Leading and Trailing Spaces

When developing the data flow of an SSIS package, you might be unaware that the source data is padded with spaces until you try to run your package and problems appear. For instance, suppose your package retrieves string values from a comma-separated value (CSV) file, then uses a Lookup transformation to compare those values. You might not know that many of the string values have spaces before or after them until a lot of comparisons evaluate to False during runtime. Because the Lookup transformation is space-sensitive, "book" doesn't equal "book ", even though they appear to be the same value.
Extra spaces can also be a problem if you're trying to insert data into a column configured with a character data type and the data type's length is too small to accommodate the source value and its spaces. For instance, you can insert the value "one two three" into a VARCHAR(15) column but not the value "one two three ". As a result, when you're retrieving data that might include extra spaces, you might need to remove those spaces, no matter whether the data is coming from text files, columns configured with the CHAR data type, or another type of data source.
If you're retrieving data from a database, you can include the logic to remove spaces in your T-SQL code. For instance, the Production.Document table in the AdventureWorks2008R2 database includes the Revision column, which is defined with the NCHAR(5) data type. To remove the extra spaces, you can use a T-SQL statement similar to the following one in your OLE DB Source component:
SELECT Title, FileName, RTRIM(Revision) AS Revision
FROM Production.Document
This statement uses the RTRIM function to remove any trailing spaces from the Revision values. If you want to remove leading spaces as well, you can modify the statement to include the LTRIM function:
SELECT Title, FileName, LTRIM(RTRIM(Revision)) AS Revision
FROM Production.Document
Now when the Revision values are passed down the data paths, they won't include extra spaces.
You can take this approach only if your source is a database. If it's a text file or some other source that you have less control over, you can add a Derived Column transformation to the data flow and trim the extra spaces there. Figure 5 shows how to trim spaces from the Revision values in the Derived Column Transformation Editor. As the figure shows, you can create a simple expression that uses the TRIM function to trim both leading and trailing spaces. The rest of the data flow can then use the derived column in place of the original Revision column in order to work with data that contains no leading or trailing spaces.
Figure 5: Creating a Derived Column to Trim Spaces

Tip #5: Run Packages with the 32-Bit Runtime

If you're working with a 64-bit version of SSIS and BIDS, you might come across limitations in some of the features available in the development environment. For example, suppose you set a breakpoint within the C# code of a Script task. If you run the package in debugging mode, the package will pause at all the breakpoints you set outside of the Script task, but not at the breakpoint within the Script task. An examination of the Progress tab in SSIS Designer will reveal a warning message that indicates Script tasks can't be debugged while running under the 64-bit version of the SSIS runtime.
The problem is that some SSIS features don't work properly under the 64-bit version of the SSIS runtime. If you want to use those features, you must configure the project to load the 32-bit environment. To do so, right-click the package in Solution Explorer and click Properties to open the properties for that package. Then, in the Property Pages dialog box, navigate to the Debugging page and set the Run64BitRuntime property to False, as Figure 6 shows.
Figure 6: Setting the Run64BitRuntime Property to False

By setting the Run64BitRuntime property to False, you're specifying that packages in the project should start in the 32-bit runtime rather than the 64-bit runtime. Note, however, this applies only to the BIDS environment. To run a package as 32-bit outside of BIDS, you have to take specific steps when you execute the package. For example, if you create a SQL Server Agent job to run the package, you should specify that the 32-bit runtime be used.
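Outside of BIDS, running in 32-bit mode typically means either invoking the 32-bit copy of dtexec.exe (installed under the Program Files (x86) SQL Server folder) instead of the 64-bit one, or selecting the Use 32 bit runtime option on the Execution options tab of the SQL Server Agent job step; the exact paths and option labels can vary by SQL Server version, so check your environment.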

Keep Your Development Efforts Moving Forward

In addition to the five tips mentioned here, there are many other considerations to take into account when developing an SSIS package. Even so, knowing how to open packages without validation, create templates, use package configurations, handle leading and trailing spaces, and run your packages with the 32-bit runtime can, in the long run, save you time developing and troubleshooting your packages. That's not to say each of these tips carries the same weight or that they all apply to your circumstances, but at least some of them should prove useful in your development efforts. For any of the tips that do apply, be sure to check out SQL Server Books Online (BOL) for more information.

Friday, December 6, 2013

Transaction in SSIS

http://sqlage.blogspot.in/

SSIS - Load All Records From Source To Destination OR Load No Record [ SSIS Transactions]

Scenario:

We have to load data from a source table to a destination table, and we want to load either all of the records or none of them. That means if any error happens, we want to roll back the transaction.


Solution : 

First, let's look at the default behavior of an SSIS package with default settings.
I have created two tables (definitions below) and inserted two rows into the source table.

CREATE TABLE [dbo].[SourceTable](
[Name] [varchar](100) NULL,
[SaleDate] [varchar](50) NULL
)

insert into dbo.SourceTable 
values ('Aamir','2013-12-03 10:19:56.887'),
('Raza','Test Date')

--Destination Table
CREATE TABLE [dbo].[DestinationTable](
[Name] [varchar](100) NULL,
[SaleDate] [datetime] NULL
)

Notice that I have created the SaleDate column in the source table as VARCHAR(50), while in the destination table that column has the DATETIME data type.

Let's create our SSIS package and try to load the records from the source into the destination table without using any transformation. In other words, we are not converting the data type of the source column before inserting into the destination table.

Bring a Data Flow task onto the control flow pane, then add an OLE DB Source and an OLE DB Destination. Use the source table in the OLE DB Source and the destination table in the OLE DB Destination as shown, and map both columns.


Let's execute our package and see how many records, if any, get inserted.

After execution, the package failed because it was not able to insert the 'Test Date' value into the DATETIME column of the destination table.


Let's check the destination table for the records inserted.
Even though the package failed, it inserted the row that was in the correct format. But we want to load all of the records or none of them.
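A quick way to confirm what actually landed in the destination after each run is a simple query, for example:

-- Check what was inserted after the run
SELECT COUNT(*) AS RowsInserted FROM dbo.DestinationTable;
SELECT * FROM dbo.DestinationTable;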
Let's change the TransactionOption property to Required instead of Supported and rerun our package. Before running the package, truncate the destination table.
To change TransactionOption to Required, right-click the Data Flow task, go to its properties, and set TransactionOption to Required as shown below.


Let's run our package now and see what we get

The package failed, but this time we got the error below:
"SSIS package "Package5.dtsx" starting.
Information: 0x4004300A at Data Flow Task, SSIS.Pipeline: Validation phase is beginning.
Information: 0x4001100A at Data Flow Task: Starting distributed transaction for this container.
Error: 0xC001401A at Data Flow Task: The SSIS Runtime has failed to start the distributed transaction due to error 0x8004D01B "The Transaction Manager is not available.". The DTC transaction failed to start. This could occur because the MSDTC Service is not running.
Warning: 0x80019002 at Package5: SSIS Warning Code DTS_W_MAXIMUMERRORCOUNTREACHED.  The Execution method succeeded, but the number of errors raised (2) reached the maximum allowed (1); resulting in failure. This occurs when the number of errors reaches the number specified in MaximumErrorCount. Change the MaximumErrorCount or fix the errors.
SSIS package "Package5.dtsx" finished: Failure."

To enable transactions in an SSIS package, the MSDTC service must be running. Let's enable the service by going to Services.

You will not find a service named MSDTC or DTC; the service is listed as Distributed Transaction Coordinator. Let's start the service.


Run the package one more time and see the results.
The package still failed because of the invalid data, but this time all the records were rolled back. None of the records were inserted into the destination.


Note:
To enable transactions, the Distributed Transaction Coordinator (MSDTC) service must be running.