Answered by:
inner join date >= start_date and date <= end_date in SSIS

Question
-
Good morning,
I got a list of days going 365 days that I generates from a query I've found on Jamie's excellent blog SSIS junkie which looks as follow:
datelist
with mycte as
(
select convert(smalldatetime, '14/06/2009',103) DateValue union all
select DateValue + 1
from mycte
where DateValue + 1 < '2010-06-14 00:00:00'
)
select DateValue
from mycte
OPTION (MAXRECURSION 0)
I use it in an OLE DB source component and it produces a one column datevalue starting 2009/06/14 all the way to 2010/06/14.
Then I got another query:
Hits
SELECT a.search_id,date_from,date_until,isnull(e.city_id,0)as city_id,a.hotel_id,isnull([hit-id],1)as [hit-id],isnull(available,0)as available
FROM Search.dbo.search_summary a
LEFT JOIN Search.dbo.hit_count b
ON (b.search_id=a.search_id and hotel_id=a.hotel_id and b.searchdate=@searchdate)
LEFT JOIN main.dbo.hotels e ON (e.hotel_id=a.hotel_id)
WHERE a.searchdate=@searchdate and a.hotel_id>0
I need to be able to inner join those 2 sources so that I get data as follow:
SELECT a.city_id,a.hotel_id,c.day_date,count(*)as hits,
count(CASE WHEN available=1 THEN [hit-id] ELSE null END)as hit_available
FROM Hits a
INNER JOIN datelist c ON (c.day_date>=a.date_from and c.day_date<a.date_until)
GROUP BY a.city_id,a.hotel_id,c.day_date
So basically, if someone book a room in a hotel for 5 days, I want to have a row for everyday in datelist which is between date_from and date_until.
That procedure exists already but I'm trying to switch it to SSIS because the aggregated tables will soon be sitting on a different server. So I want to grab raw data, aggregate on the fly and stick it into the data warehouse server.
Is there a way to do that in SSIS without having to write a huge query combining it all in an OLE DB source?Thursday, June 18, 2009 3:53 AM
Answers
-
One way to do it would be to:
1. Add both those as sources in a Data Flow.
2. Modify each SQL statement to add a "dummy" column - "1 AS dummy".
3. Use the advanced editor of both sources to tell SSIS that the output is sorted, and the SortOrderKey is 1 for the "dummy" column.
4. Use a Merge Join - this will essentially be a cartesian product of the two inputs.
5. Use a Conditional Split to filter out the rows using your "c.day_date>=a.date_from and c.day_date<a.date_until" clause.
The above won't be terribly efficient - because it's making that cartesian product...- Marked as answer by NicoinThailand Friday, June 19, 2009 2:22 AM
Thursday, June 18, 2009 4:05 AM -
Ok.
There's no easy way in SSIS.
Ended up using a script component and add the same row in a loop between date_from and date_until.
Code looks like this:
for (DateTime day = Row.datefrom; day <= Row.dateuntil; day = day.AddDays(1))
{
dailyoutputBuffer.AddRow();
dailyoutputBuffer.cityid = Row.cityid;
dailyoutputBuffer.hotelid = Row.hotelid;
dailyoutputBuffer.date = day;
if (Row.available == false)
dailyoutputBuffer.hits = Row.hitid;
else
dailyoutputBuffer.hitsavailable = Row.hitid;
}
There should be an addRow component.- Marked as answer by NicoinThailand Friday, June 19, 2009 2:22 AM
Thursday, June 18, 2009 7:40 AM
All replies
-
One way to do it would be to:
1. Add both those as sources in a Data Flow.
2. Modify each SQL statement to add a "dummy" column - "1 AS dummy".
3. Use the advanced editor of both sources to tell SSIS that the output is sorted, and the SortOrderKey is 1 for the "dummy" column.
4. Use a Merge Join - this will essentially be a cartesian product of the two inputs.
5. Use a Conditional Split to filter out the rows using your "c.day_date>=a.date_from and c.day_date<a.date_until" clause.
The above won't be terribly efficient - because it's making that cartesian product...- Marked as answer by NicoinThailand Friday, June 19, 2009 2:22 AM
Thursday, June 18, 2009 4:05 AM -
Yes. That's what I was looking at. Cartersian product in my situation could kill my server.
We have a big lot of search daily. (about 70 millions a day these days)
Considering the amount of search we can have in a day this can potentially be fatal.
it produces a lot of row only to filter them out after.
It's nicely thought though.
What about this:
On control flow level:
2 SQL tasks in a sequence container one for each query. Use SELECT INTO ##dates and ##hits
Then link the sequence container to a data flow.
In the data flow: 1 OLE DB source using inner join with the condition above.
From there proceed.
Just need to find out how to use temp tables with SSIS now.Thursday, June 18, 2009 5:59 AM -
Ok.
There's no easy way in SSIS.
Ended up using a script component and add the same row in a loop between date_from and date_until.
Code looks like this:
for (DateTime day = Row.datefrom; day <= Row.dateuntil; day = day.AddDays(1))
{
dailyoutputBuffer.AddRow();
dailyoutputBuffer.cityid = Row.cityid;
dailyoutputBuffer.hotelid = Row.hotelid;
dailyoutputBuffer.date = day;
if (Row.available == false)
dailyoutputBuffer.hits = Row.hitid;
else
dailyoutputBuffer.hitsavailable = Row.hitid;
}
There should be an addRow component.- Marked as answer by NicoinThailand Friday, June 19, 2009 2:22 AM
Thursday, June 18, 2009 7:40 AM -
Yes. That's what I was looking at. Cartersian product in my situation could kill my server.
We have a big lot of search daily. (about 70 millions a day these days)
Considering the amount of search we can have in a day this can potentially be fatal.
it produces a lot of row only to filter them out after.
It's nicely thought though.
What about this:
On control flow level:
2 SQL tasks in a sequence container one for each query. Use SELECT INTO ##dates and ##hits
Then link the sequence container to a data flow.
In the data flow: 1 OLE DB source using inner join with the condition above.
From there proceed.
Just need to find out how to use temp tables with SSIS now.
I guess, it can be done, as you have suggested, through temp tables.. and frankly when i read you first post, first thing came to my mind was to use temp tables.
Now to use temp tables you would need to Set RetainSameConnection=TRUE on the connection manager and set DelayValidation=TRUE on all tasks that use the temp table.
But to create your package you will need metadata ( for that you can create a dummy table of same structure as temp tables)
hope it works for you!!
Rahul Kumar, MCTS, India, http://sqlserversolutions.blogspot.com/- Edited by Rahul Sherawat Thursday, June 18, 2009 9:06 AM
Thursday, June 18, 2009 9:03 AM -
One thing to remember though is that SSIS isn't going to store the entire result set (the cartesian product) in memory, or even on disk in a temp table like SQL Server would. That's one of the benefits of the SSIS pipeline. Once on e input has been completely received by the Merge Join (just one of the inputs - likely the smaller one), then the Merge Join will start outputting the joined rows. If, at this point, memory is getting anywhere close to tight, then SSIS will pause the second (larger) source from introducing more rows into the pipeline until pressure is reduced. And that's exactly what the Conditional Split component will be doing - throwing out a vast majority of rows, effectively reducing memory pressure, and allowing the pipeline to ingest more rows.
I can't tell you what the memory impact of this will be in SSIS - you'll have to try it out, and I recommend you do. I'd bet it's not half as bad as you think it'll be.Thursday, June 18, 2009 4:08 PM -
Hi Todd,
thanks for the explanation.
I'm going to try both approach to complete the picture.
The script works pretty well already. It's going to be the time to beat!
I think I can mark both solution as answer.
NicoFriday, June 19, 2009 2:22 AM