none
Find missing IDs using high performance query

    Question

  • I have two tables. The first table contains IDs (unique records), say, 100,000.

    The second table does not contain all IDs of first table but may contain duplicate values.

    First Table

    ID  int     P.K

    Name varchar(50)

    Second Table

    ID    int   F.K

    Salary   decimal(10,2)

    I want to get all IDs from first table which are not in the second table using a high performance query.

    Initially, I thought a poor way like:

    select ID from table1 where ID not in (select distinct ID from table2)

    Later, I worked out something like:

    select A.ID from table1 A
    left join table2 B
    on A.ID = B.ID
    and salary is null
    Is there any other way to get results with a high performance query?

    Saturday, March 29, 2014 10:33 PM

Answers

  • There should be an index on ID in the second table. That's the most important thing for performance.

    I would write the query as:

    SELECT ID
    FROM   table1 t1
    WHERE  NOT EXISTS (SELECT *
                       FROM   table2 t2
                       WHERE  t2.ID = t1.ID)

    Which I expect to produce the same plan as your NOT IN query (which does not need the DISTINCT). There are two advantages with NOT EXISTS:

    1) You can't get unexpected results due to NULL values.
    2) It works with a multi-column condition too.

    The query with LEFT JOIN is more likely to generate a different plan, and many claims that this gives better performance. I would suggest that this depends on the situation. I prefer NOT EXISTS in most cases, for the simple reason of clarity.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    • Marked as answer by OldEnthusiast Sunday, March 30, 2014 7:08 AM
    Saturday, March 29, 2014 10:48 PM

All replies

  • select ID from table1
    Except
    select ID from  table2

    Saturday, March 29, 2014 10:44 PM
  • There should be an index on ID in the second table. That's the most important thing for performance.

    I would write the query as:

    SELECT ID
    FROM   table1 t1
    WHERE  NOT EXISTS (SELECT *
                       FROM   table2 t2
                       WHERE  t2.ID = t1.ID)

    Which I expect to produce the same plan as your NOT IN query (which does not need the DISTINCT). There are two advantages with NOT EXISTS:

    1) You can't get unexpected results due to NULL values.
    2) It works with a multi-column condition too.

    The query with LEFT JOIN is more likely to generate a different plan, and many claims that this gives better performance. I would suggest that this depends on the situation. I prefer NOT EXISTS in most cases, for the simple reason of clarity.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    • Marked as answer by OldEnthusiast Sunday, March 30, 2014 7:08 AM
    Saturday, March 29, 2014 10:48 PM
  •  >> I have two tables. The first table contains IDs (unique records), say, 100,000. <<

    Rows are not records; this is a fundamental concept. There is no such thing as a generic magical “id” in RDBMS. That is Kabbalah magic. And we do not use numeric data types for identifiers; what math do you do on them? NONE! What you do with an identifier is validate and verify it. That means strings, regular expression, check digits, etc. 

    Likewise, a name is an attribute property; it has to be the name of something in particular. Tables have specific names, not a vague,useless narrative. You have no idea how to write a proper schema, do you? 

    >> The second table does not contain all IDs of first table but may contain duplicate values. <<

    Why would it contain any of them? The goal of any database, not just SQL, is to reduce redundancy. But you increase it! Why? SQL programmers reference identifiers. The referenced table is the sole occurrence in the schema. 

    CREATE TABLE Personnel 
    (emp_id CHAR(10) NOT NULL PRIMARY KEY,
     emp_name VARCHAR(50) NOT NULL);

    CREATE TABLE Salaries
    (emp_id CHAR(10) NOT NULL PRIMARY KEY
      REFERENCES Personnel(emp_id) 
      ON DELETE CASCADE,
     salary_amt DECIMAL(10,2) NOT NULL
     CHECK (salary_amt >= 0.00));

    >> I want to get all IDs from first table which are not in the second table using a high performance query. <<

    What is a “high performance query”? We never used that term in ANSI> this means we are looking for employees that we are not paying. 

    >> Initially, I thought a poor way like:
    SELECT emp_id 
     FROM Personnel  
     WHERE emp_id 
           NOT IN (SELECT DISTINCT emp_id FROM Salaries);

    Yup, that stinks :) The SELECT DISTINCT is silly and useless. The IN() predicate also has problems. 


    >> Later, I worked out something like:

    SELECT A.emp_id 
      FROM Personnel AS P
           LEFT OUTER JOIN 
           Salaries AS S
           ON P.emp_id = B.emp_id
              AND salary_amt IS NULL; <<

    probably about as bad. Think in sets and use higher level ops. 

    SELECT X.emp_id 
      FROM (SELECT emp_id FROM Personnel
            EXCEPT 
            SELECT emp_id FROM Salaries)
            AS X(emp_id);

    --CELKO-- Books in Celko Series for Morgan-Kaufmann Publishing: Analytics and OLAP in SQL / Data and Databases: Concepts in Practice Data / Measurements and Standards in SQL SQL for Smarties / SQL Programming Style / SQL Puzzles and Answers / Thinking in Sets / Trees and Hierarchies in SQL

    Sunday, March 30, 2014 1:40 AM