I want to insert a new record into a table with an Identity field and return the new Identify field value back to the data stream (for later insertion as a foreign key in another table).
What is the most direct way to do this in SSIS?
TIA,
barkingdog
P.S. Or should I pass the identity value back in a variable and not make it part of the data stream?
If you need to do this for every row, then using identities with an SSIS is not a good idea. You cannot get the new identity back until the row is committed, but that would mean committing one row at a time in SSIS. Even then, you only get back the last identity - which may not be what you expect if a parallel process has added a row between you commiting your row and asking for it's identity.
The best way is to use a script to generate a key in the data flow. In that way, you will know what the key value for each row is in advance and it can be inserted (thanks to multicast) into different tables at once, guaranteeing referential integrity.
Donald
|||Donald,
When you wrote "You cannot get the new identity back until the row is committed, but that would mean committing one row at a time in SSIS."
When I run a normal SSIS package that reads from a file and writse to a database isn't one row being committed at a time? Or does SSIS save as many rows as possible in, say a memory buffer, and then commit then all at once?
TIA,
barkindog
|||Strictly speaking it is the provider that handles commits, not SSIS.
The Fastload option on the OLEDB provider allows you to set batch sizes from 1 to "the entire data load in one batch."
If you do not use Fast Load, then one row at a time is sent.
The OLEDB command component also processes one row at a time.
However, in all these cases, the problem is not the performance of handling one row at a time (although that is a real factor) - it is also that you cannot get back the identity for the row you have just committed.
The pattern in SQL Server (and in most rdbms's) is that you can get the last identity issued. It is tempting to think that having just posted a row, the last identity issued must be for that row. Many a design has foundered on that assumption, as just the teensiest smidgin of parallelism soon throws that process out of synchronization.
I much prefer issuing keys in advance in the ETL process - you can do so much with them, with great performance and guaranteed integrity.
Donald
|||
Regarding "Many a design has foundered on that assumption, as just the teensiest smidgin of parallelism soon throws that process out of synchronization."
1. If my job is the only one updating the table with the Identity column , and I'm not running multiple copies of my job, then I presume that parallellism can't happen to me. Or does SSIS do things "in the background" that could cause a smidgin of parallelism, even for my particular case?
2. Later on I will need to re-run my job with new data. Then I have to read the current value of the Identity from the table, add 1 to it, and begin with that value. Your argument about parallelism makes me wonder if the only way to accurately read the identity value from a table is to make sure no other app updates that table. (That sure puts a dent in the possibility of scaling out horizontally with servers.)
TIA,
barkingdog
|||1. The OLEDB command destination may send a command for the second row before the first has completed. Our buffer architecture is designed to maximise the potential for pipeline parallelism.
2. The only way to guarantee that the last identity you read is the last one you inserted, is to be able to guarantee that no process has written to the table since your process.
We do have a design pattern for highly parallel key generation that may (but may not) be in the next version . Either way there will be a paper on this at some point.
The best strategy is to know your keys in advance - by generating them in your data integration process. That way, you have complete control.
Donald