locked
Insert distinct based on ID and Time RRS feed

  • Question

  • Hi, I will be getting millions of rows daily and out of that I need to insert new records into a table called @DistinctDetails shown in the below DDL:

    DECLARE @DWDetail TABLE (ID int not null,String_text varchar(300),object varchar(2048),code int not null) DECLARE @Detail TABLE (ID int not null,String_text varchar(300),object varchar(2048),code int not null) DECLARE @DistinctDetail TABLE (String_text varchar(300),object varchar(2048),code int not null,LoadTime datetime null) INSERT INTO @DWDetail VALUES(1,'Apple','www.Apple.com',0) INSERT INTO @DWDetail VALUES(2,'Samsung','www.Samsung.com',0) INSERT INTO @DWDetail VALUES(3,'Nokia','www.Nokia.com',0) INSERT INTO @DWDetail VALUES(4,'Apple','www.Apple.com',0) INSERT INTO @DWDetail VALUES(5,'Apple','www.Apple.com',2) INSERT INTO @DWDetail VALUES(6,'dell','www.dell.com',0) INSERT INTO @DWDetail VALUES(7,'Apple','www.Apple.com',0) INSERT INTO @DistinctDetail VALUES ('Apple','www.Apple.com',0,'2013-05-07 10:53:22:000') INSERT INTO @DistinctDetail VALUES('Samsung','www.Samsung.com',0,'2013-05-07 10:53:22:000') INSERT INTO @DistinctDetail VALUES('Nokia','www.Nokia.com',0,'2013-05-08 10:53:22:000') INSERT INTO @DistinctDetail VALUES('Apple','www.Apple.com',2,'2013-05-09 10:53:22:000') INSERT INTO @DistinctDetail VALUES('dell','www.dell.com',0,'2013-05-09 10:53:22:000') INSERT INTO @Detail VALUES(1,'Apple','www.Apple.com',0) INSERT INTO @Detail VALUES(2,'Samsung','www.Samsung.com',0) INSERT INTO @Detail VALUES(3,'Nokia','www.Nokia.com',0) INSERT INTO @Detail VALUES(4,'Apple','www.Apple.com',0) INSERT INTO @Detail VALUES(5,'Apple','www.Apple.com',0) INSERT INTO @Detail VALUES(6,'dell','www.dell.com',0) SELECT * FROM @Detail SELECT * FROM @DistinctDetail

    <-------Insert new records by using the ID field on @Detail table-----> SELECT mpd.String_text,mpd.object,mpd.code FROM @DWDetail mpd WHERE ID > 6 AND NOT EXISTS (SELECT 1 FROM @DistinctDetail pd WHERE mpd.String_text = pd.String_text AND LTRIM(RTRIM( mpd.object)) = LTRIM(RTRIM( pd.object)) AND mpd.code=pd.code AND LoadTime > '2013-05-09 10:53:22.000' ) GROUP BY mpd.String_text,mpd.object,mpd.code Declare @ID int = (SELECT MAX(ID) FROM @Detail) --INSERT into @Detail SELECT * FROM @DWDetail WHERE ID>@ID

    I want the query which compare with the existing and insert new records, If I compare this directly, it is never ending process, as it has to go through millions of records and do the comparision, Adding any index does not work here. I want the shortest way to insert only new records.

    Having issue in this query:

    SELECT mpd.String_text,mpd.object,mpd.code FROM @DWDetail mpd WHERE ID > 6 AND NOT EXISTS (SELECT 1 FROM @DistinctDetail pd WHERE mpd.String_text = pd.String_text AND LTRIM(RTRIM( mpd.object)) = LTRIM(RTRIM( pd.object)) AND mpd.code=pd.code AND LoadTime > '2013-05-09 10:53:22.000' ) GROUP BY mpd.String_text,mpd.object,mpd.code


    • Edited by SQL_Gun Tuesday, May 14, 2013 5:19 PM edit
    Tuesday, May 14, 2013 5:16 PM

Answers

  • I suggest to always add aliases to your tables when you select from more than 1 table - this simplifies the maintenance. In your case there is not much we can do to optimize the query - index on (code, object, string_text) may help assuming that these fields are not too big. Put the field with most selectivity first in this index.

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    • Proposed as answer by Kalman Toth Tuesday, May 21, 2013 7:24 AM
    • Marked as answer by Allen Li - MSFT Tuesday, May 21, 2013 8:09 AM
    Tuesday, May 14, 2013 6:06 PM

All replies

  • The LTRIM(RTRIM slow down this query and also what is LoadTime criteria and why it is in EXISTS portion? Do you only want to compare rows for the certain LoadTime? Which table this column belongs to?

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    Tuesday, May 14, 2013 5:30 PM
  • If you have million of records then using a temp table might yield better performance than a table variable.  Typically, table variable are more performant for smaller table and temp tables are faster for larger table.

    Additionally, indexes cannot be created on table variable but can be created on temp table.  So if your choice of a table variable is the reason you cannot use index then this would be an additional reason to consider it.

    Finally, if your value do not have spaces then you would not need ltrim and rtrim.

    Good luck!

    Tuesday, May 14, 2013 5:37 PM
  • Ok, I will remove the LTRIM and RTRIM from the query, The LoadTime field belongs to the DistinctDetail table, I was trying different queries.

    Do you only want to compare rows for the certain LoadTime?
    No, I dont want to compare for certain loadtime, I want to do on complete table. I will remove the LoadTime.

    Tuesday, May 14, 2013 5:44 PM
  • I was just giving an example by creating the DDl using the variable, instead I have physical tables.
    Tuesday, May 14, 2013 5:45 PM
  • I suggest to always add aliases to your tables when you select from more than 1 table - this simplifies the maintenance. In your case there is not much we can do to optimize the query - index on (code, object, string_text) may help assuming that these fields are not too big. Put the field with most selectivity first in this index.

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    • Proposed as answer by Kalman Toth Tuesday, May 21, 2013 7:24 AM
    • Marked as answer by Allen Li - MSFT Tuesday, May 21, 2013 8:09 AM
    Tuesday, May 14, 2013 6:06 PM