The package needed to import a number of people into an employee table. Like most employee tables, this table had references on each employee to their manager. Since the managers weren’t already in the database I was going to be loading them in the same import that their employees were being imported. Since I needed to reference each employee’s manager, the managers needed to be loaded before their employees.
One of the restrictions for the package was that I couldn’t write to the destination database or server prior to completing the data ordering. This might seem like a silly requirement. The goal though was to have the data completely prepared for loading prior to hitting the production server.
RECURSIVE SOLUTION
While designing the package it occurred to me that this was a classic example of recursion. There was a root level with some people related to it. And then people related to those people and so on. If I loaded each level of the hierarchy before the lower level, the employees could be associated with their managers as they were loaded into the database.
In envisioning the solution, I considered the same types of operations that would occur when doing recursion through common tables expressions. In this scenario, the root level items are identified within dataset. We’ll call this Level 0.
After Level 0 is selected, it is joined back to the data to find out what data is related to it. This new group of people would be those that report to people in Level 0. This new group will be Level 1.
Then Level 1 is given the same treatment as Level 0 to find the data that relates to Level 1. The new set of data is then Level 2. This continues until there is no more data to join together and the process ends.
This is what we need to do in as SSIS package. But we are going to do it within the flow without accessing a database. And we’ll be using something awesome… Raw Files.
PACKAGE CONTROL FLOW
Now that we’ve defined the solution that is needed let’s consider what is needed from a Control Flow perspective. First, we’ll need to identify the rows that comprise the root level. From there we need to compare that level and subsequent levels to the source data to build the new levels. When there is no more data to process the package should complete.
- DF – Data Setup: This task is included to create a flat file that can be used as the source for the process. Since the original package didn’t touch any databases, I wanted to maintain that spirit as well as possible. This would be deleted if you opt to use this package.
- DF – Build Hierarchy Root: Data Flow task that selects the first level from the data to create the root level of the employee hierarchy.
- FLC – Cycle Through Levels: A For Loop container that loops across the next tasks as long as the row count from DF – Build Hierarchy Levels does not return 0.
- FST – Copy File: File System Copy task that copies the file created by the data flow tasks to create a source raw file for the DF – Validate Output.
- ST – Increment Hierarchy Level: Script task that increments the HierarchyLevel variable by 1 for every cycle in the loop.
- DF – Build Hierarchy Levels: Data Flow task that adds each of the non-root levels the employee hierarchy.
- DF – Validate Output: This step is only included so that we can see the final output without having to open up any files. This would be deleted if you opt to use this package.
PACKAGE VARIABLES
There are a number of variables that I added to the package to help control how everything processed.
The variables from above were added for the following purposes:
- HierarchyLevel: The current level of the hierarchy. Default value is 0.
- RawFileHierarchy: Location of the raw file that is the destination for each level of the hierarchy output.
- RawFileSource: Location of the raw file that contains all previous levels of the hierarchy.
- RowCount: Count of the number of rows last added to the hierarchy. Default value is –1.
- WorkFolder: This is my one variable to rule them all that sets the RawFileHierarchy and RawFileSource path.
PACKAGE DATA FLOWS
Now we’ll go into the core data flows of the package. I’m going to ignore the Data Flows that are just there because this is a demo. They aren’t important to the core of this package.
DF – BUILD HIERARCHY ROOT
This is the first data flow of SSIS package the implements the solution. It will build the first level of the hierarchy. With the data that is being used for this package anyone without a manager is a root level person. Those that match that criteria are sent to the raw file destination.
- FF – Employee Source: Flat File Source that extracts unsorted rows for the package.
- CS – Identify Current Hierarchy Level: Conditional Split task that tests for the rules the identify the root level. In this case that logic was whenever the ManagerID was NULL. Only the rows that meet the condition are passed forward in the data flow.
- DC – Add HierarchyLevel Column: A Derived Column task that adds the HierarchyLevel variable in as a column in the data flow.
- RF – Employee Destination: Raw File Destination that receives all of root level rows.
DF – BUILD HIERARCHY LEVELS
Once the root level has been created, it can be used to find the subsequent levels and that logic is implemented in the DF – Hierarchy Levels Data Flow. Each level is passed into this data flow and checked to see if there are any employees in the source data (flat file) that report to this level. Those that match are sent to the raw file destination.
- RF – Hierarchy Source: Raw File Source that is a copy of the previous levels Raw File Destination. This contains all levels of the hierarchy from the root to the most recent level added.
- CS – Get Previous Level: Conditional Split that returns the most recent level of the hierarchy based on the HierarchyLevel column. Only the rows that meet the condition are passed forward in the data flow.
- ST – Sort EmployeeID: Sort that orders the employees from the previous hierarchy level by their employee ID.
- FF – Employee Source: Flat File Source that extracts unsorted rows for the package.
- ST – Sort ManagerID: Sort that orders the employees from the source file by their manager ID.
- MJ – Match Current Managers: Merge Join that joins the employees from the previous hierarchy level with those in the source level. The output from this operation is the employees that are in the current hierarchy level. This is an INNER JOIN operation.
- DC – Add HierarchyLevel Column: A Derived Column task that adds the HierarchyLevel variable in as a column in the data flow.
- RC – Count Level Members: Row Count task to count the number of employees found at the current hierarchy level. When this sets the variable to 0 the For Loop Container will terminate.
- RF – Hierarchy Destination: Raw File Destination that receives all of root level rows.
PACKAGE OUTPUT
If you’ve followed along (or more likely downloaded the package), you should be able to fire it off and get the results below. Each level has been color coded and shows that in the output the rows are sorted by hierarchy level.
RECURSIVE WRAP-UP
Hopefully the detail that I’ve provided above is sufficient to describe the solution. If more detail is needed, feel free to leave some comments. I’ve been pretty happy with the solution and what I’ve been able to do with it. With volumes up to a million rows it’s done well with processing data.
While this package didn’t place the employees into that database, it wouldn’t be hard to branch the data flow and add the employees into a table. However the data needs to be handled can be done while maintaining the ability to segment each level of the hierarchy.
No comments:
Post a Comment