locked
Generate Random Row RRS feed

  • Question

  • Hi, I want to know why when i use "order by newid()" in a select statment i get a randoms rows? as i know newid() generate uniqueidentifier number (GUID) and order by get columns name, for sorting rows.
    Wednesday, November 11, 2009 8:04 AM

Answers

  •  as i know newid() generate uniqueidentifier number (GUID) and order by get columns name, for sorting rows.

    Kamran,

    Interesting question.

    BOL: "The ORDER BY clause can include items that do not appear in the select list."

    You could include newid() as column in the select list. Therefore you can use it in ORDER BY even when not in the SELECT list.

    select col1 = NEWID(), * from Production.Product
    order by col1
    /*
    ProductID	Name
    747	HL Mountain Frame - Black, 38
    915	ML Touring Seat/Saddle
    853	Women's Tights, M
    855	Men's Bib-Shorts, S
    ....
    */
    select  * from Production.Product
    order by NEWID()
    /*
    ProductID	Name
    349	Flat Washer 7
    451	Lock Nut 11
    328	Mountain End Caps
    368	Thin-Jam Hex Nut 4
    967	Touring-1000 Blue, 50
    .....
    */

    Kalman Toth SQL SERVER 2012 & BI TRAINING
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012


    Wednesday, November 11, 2009 9:25 AM

  • The key here is the NEWID function, which generates a globally unique identifier (GUID) in memory for each row. By definition, the GUID is unique and fairly random; so, when you sort by that GUID with the ORDER BY clause, you get a random ordering of the rows in the table. Taking the top 10 percent (or whatever percentage you want) will give you a random sampling of the rows in the table.

    Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables. However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:

    1. The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
    2. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.

    FROM http://msdn.microsoft.com/en-us/library/cc441928.aspx

    good reads regarding your Question
    http://msdn.microsoft.com/en-us/library/aa175776%28SQL.80%29.aspx


    Best Regards,
    Melissa Suciadi


    If you have found this post helpful, please click the 'Vote as Helpful' link (the green triangle and number on the top-left).

    If this post answers your question, click the 'Mark As Answered' link below. It helps others who experience the same issue in future to find the solution.

    Wednesday, November 11, 2009 8:34 AM
  • Thanks, Abdshall. but key of my question is that why does generate different numbers for each rows, when i use NewId() function instead of rand() function. how NewId() function designs so can generates different numbers for each rows, how does this function understand generates different numbers for each rows? what is different between functions that generates one value, with functions that generates different values for each rows?


    SQL Server must choose a seed value for RAND() when a seed value is not specified.  I believe the Query Optimizer will only do this once per batch.  Assume the seed value remains static for the duration of the batch.  Since the seed value is static, the return value from RAND() will remain the same for every row it is used in.

    NEWID() on the other hand, by definition, must return a new value.
    Phil Brammer | http://www.ssistalk.com | Twitter: http://twitter.com/PhilBrammer
    Tuesday, November 17, 2009 4:42 PM
  • NEWID() and RAND() are two different things, and this is by design from Microsoft not us. NEWID() generates a unique value while RAND() generates a random number. If you notice one thing NEWID() doesn't take any seeds, yet RAND() does. If a seed is specified, all random numbers will be the same.


    Abdallah, PMP, ITIL, MCTS
    Tuesday, November 17, 2009 4:33 PM

All replies


  • The key here is the NEWID function, which generates a globally unique identifier (GUID) in memory for each row. By definition, the GUID is unique and fairly random; so, when you sort by that GUID with the ORDER BY clause, you get a random ordering of the rows in the table. Taking the top 10 percent (or whatever percentage you want) will give you a random sampling of the rows in the table.

    Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables. However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:

    1. The sorting operation usually has a high cost associated with it. Sorting can use a lot of disk I/O and can run for a long time.
    2. In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.

    FROM http://msdn.microsoft.com/en-us/library/cc441928.aspx

    good reads regarding your Question
    http://msdn.microsoft.com/en-us/library/aa175776%28SQL.80%29.aspx


    Best Regards,
    Melissa Suciadi


    If you have found this post helpful, please click the 'Vote as Helpful' link (the green triangle and number on the top-left).

    If this post answers your question, click the 'Mark As Answered' link below. It helps others who experience the same issue in future to find the solution.

    Wednesday, November 11, 2009 8:34 AM
  •  as i know newid() generate uniqueidentifier number (GUID) and order by get columns name, for sorting rows.

    Kamran,

    Interesting question.

    BOL: "The ORDER BY clause can include items that do not appear in the select list."

    You could include newid() as column in the select list. Therefore you can use it in ORDER BY even when not in the SELECT list.

    select col1 = NEWID(), * from Production.Product
    order by col1
    /*
    ProductID	Name
    747	HL Mountain Frame - Black, 38
    915	ML Touring Seat/Saddle
    853	Women's Tights, M
    855	Men's Bib-Shorts, S
    ....
    */
    select  * from Production.Product
    order by NEWID()
    /*
    ProductID	Name
    349	Flat Washer 7
    451	Lock Nut 11
    328	Mountain End Caps
    368	Thin-Jam Hex Nut 4
    967	Touring-1000 Blue, 50
    .....
    */

    Kalman Toth SQL SERVER 2012 & BI TRAINING
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012


    Wednesday, November 11, 2009 9:25 AM
  • Melissa - GUIDs are no longer *guaranteed* to be unique as NEWID() uses GUID v4, which is based on a random number generator and collisions may occur. 

    It is supposed to be unique, but others have "testified" they've seen collisions on large tables.

    Phil Brammer | http://www.ssistalk.com | Twitter: http://twitter.com/PhilBrammer
    Wednesday, November 11, 2009 2:07 PM
  • Yes, you will get random rows ordering by NEWID(). But a better approach is to order by the expression CHECKSUM(NEWID()) which provides better random distribution.
    Plamen Ratchev
    Wednesday, November 11, 2009 2:25 PM
  • I have a few questions on this subject If I may:

    Lets say I have a table Exercises and table Exam. And in table Exam_Exercises I have information on what Exercises belong in each Exam. Now If I fetch Exam's Exercises like this:

    SELECT 
    e.ID
    ,e.Name
    ,CHECKSUM(NEWID())  AS Index
    FROM
    Exercise e
    INNER JOIN Exam_Exercises ee ON e.ID=ee.ExerciseId
    ORDER BY
    Index

    a) Will those exercises will be selected in random order?
    b) Do I have to worry about tempdb running out of space if Exam every Exam can have up to 100 exercises asociated with it (although the actual number of entries in Exam and Exercise tables can reach 10 000)?




    Cheers!
    Thursday, November 12, 2009 1:47 AM
  • Answers:

    a) Yes, this will provide random order for exercises.

    b) I doubt with 100 exercises per exam you can run our of space in tempdb (even if you multiply 10 000 entries by 100 you will end up with 1,000,000 which is not a large result set by today's standards).
    Plamen Ratchev
    Thursday, November 12, 2009 3:01 AM
  • Thanks!
    Cheers!
    Thursday, November 12, 2009 5:32 AM
  • Thanks cheers. :) you answered me like a teacher.
    Thursday, November 12, 2009 6:51 AM
  • Thanks Melissa. your answered was very helpful:)
    Thursday, November 12, 2009 6:52 AM
  • Yes, you will get random rows ordering by NEWID(). But a better approach is to order by the expression CHECKSUM(NEWID()) which provides better random distribution.
    Plamen Ratchev

    Plamen, explanation?
    Kalman Toth, SQL Server & BI Training, SSAS, SSIS, SSRS; http://www.SQLUSA.com
    Thursday, November 12, 2009 7:03 AM
  • Itzik found out this was a better approach to get good random distribution. He noted this on the following article (as well as in his book Inside SQL Server 2008: T-SQL querying):
    http://www.sqlmag.com/Articles/ArticleID/93966/93966.html
    Plamen Ratchev
    Thursday, November 12, 2009 3:28 PM
  • Good to know.

    CHECKSUM(NEWID()) generates a signed integer random number.

    Thanks Plamen.


    -- RANDOM NUMBER GENERATOR SAMPLE OUTPUT
    -- CHECKSUM (NEWID())
    SELECT rnd=CHECKSUM('EAEEC419-8124-4764-829C-9C42588FE23B')
    SELECT rnd=CHECKSUM('EAEEC419-8124-4764-829C-9C42588FE23C')
    SELECT rnd=CHECKSUM('DAEEC419-8124-4764-829C-9C42588FE23C')
    SELECT rnd=CHECKSUM('DAEEC419-8124-4764-828C-9C42588FE23C')
    -- -710262848
    -- -710262817
    -- -710275105
    -- -727052321

    Kalman Toth, SQL Server & BI Training, SSAS, SSIS, SSRS; http://www.SQLUSA.com
    Thursday, November 12, 2009 4:46 PM
  • SELECT 
    e.ID
    ,e.Name
    ,CHECKSUM(NEWID())  AS Index
    FROM
    Exercise e
    INNER JOIN Exam_Exercises ee ON e.ID=ee.ExerciseId
    ORDER BY
    Index


    Can somebody tell me how shoud i change this in order for it to only return say first 5 rows instead of all exercise rows that are assigned to exam? Is It even possible (possible=not painful) or do i have to do that in application.

    Cheers!
    Thursday, November 12, 2009 11:18 PM
  • It is not very clear what you need as you did not provide sample data, but try this:

    SELECT e.ID, e.Name
    FROM Exercise AS e
    CROSS APPLY(SELECT TOP(5) ee.ExerciseId
                FROM Exam_Exercises AS ee
                WHERE e.ID = ee.ExerciseId
                ORDER BY CHECKSUM(NEWID())) AS ee2;

    Plamen Ratchev
    Thursday, November 12, 2009 11:45 PM
  • Hi, I have another question about your answred. why nothing happen when is use RAND() function instead of
    NEWID() function? as i know rand()  generates random number but why is it constants for each rows?
    Sunday, November 15, 2009 4:37 PM
  • Why do you want to use RAND to generate new ID's. It only generates random numbers between 0 and 1, but it doesn't guarantee that these numbers will not be repeated. Also, if you sepcify a seed in the RAND function, for example RAND(1), the number will always be the same.
    moreover, as Plamen suggested, you should use CHECKSUM(NEWID()) if you need to guarantee a unique ID.
    Abdallah, PMP, ITIL, MCTS
    Sunday, November 15, 2009 4:48 PM
  • Hi Abdshall, your answred is a good plan when i want to generate randoms rows, but my question is that why when i use RAND() function generates a constant number for each rows. for example if i choose 5 rows on a table. when i used this statment SELECT Top 5 *,RAND() as RandomG FROM Customer . Result is:

    cid  fname  lname  email                               RandomG
    -----------------------------------------------------------------------------
    1    Erivin     Antony ErvinAn12@msn.com          0.0083936506376151
    2    Sam      Smith  maHtabKarimi@yahoo.com   0.0083936506376151
    3    david     sheen  david_sh@yahoo.com          0.0083936506376151
    4    keyvan  satrabi keyvansat@gmail.com         0.0083936506376151
    5    sonya    simva    sonisim@yahoo.com           0.0083936506376151
     
    As you see i got 0.0083936506376151 for all of rows, i didn't get  random number for each rows.
    Tuesday, November 17, 2009 2:44 PM
  • Yes, that will give you the same result because the way this works is RAND is being called once and with the same seed. Check out this link
    http://msdn.microsoft.com/en-us/library/ms177610.aspx

    If you want to generate different results, you can do the following

    SELECT Top 5 *,RAND(Customer.cid) as RandomG FROM Customer

    This should give you different results.


    Abdallah, PMP, ITIL, MCTS
    Tuesday, November 17, 2009 2:52 PM
  • Thanks, Abdshall. but key of my question is that why does generate different numbers for each rows, when i use NewId() function instead of rand() function. how NewId() function designs so can generates different numbers for each rows, how does this function understand generates different numbers for each rows? what is different between functions that generates one value, with functions that generates different values for each rows?

    Tuesday, November 17, 2009 4:23 PM
  • NEWID() and RAND() are two different things, and this is by design from Microsoft not us. NEWID() generates a unique value while RAND() generates a random number. If you notice one thing NEWID() doesn't take any seeds, yet RAND() does. If a seed is specified, all random numbers will be the same.


    Abdallah, PMP, ITIL, MCTS
    Tuesday, November 17, 2009 4:33 PM
  • Thanks, Abdshall. but key of my question is that why does generate different numbers for each rows, when i use NewId() function instead of rand() function. how NewId() function designs so can generates different numbers for each rows, how does this function understand generates different numbers for each rows? what is different between functions that generates one value, with functions that generates different values for each rows?


    SQL Server must choose a seed value for RAND() when a seed value is not specified.  I believe the Query Optimizer will only do this once per batch.  Assume the seed value remains static for the duration of the batch.  Since the seed value is static, the return value from RAND() will remain the same for every row it is used in.

    NEWID() on the other hand, by definition, must return a new value.
    Phil Brammer | http://www.ssistalk.com | Twitter: http://twitter.com/PhilBrammer
    Tuesday, November 17, 2009 4:42 PM
  • Thanks Phil.
    Saturday, November 21, 2009 10:32 AM
  • Kamran,

    Adding again to the posts above in this very interesting thread.

    IDENTITY and NEWID systems functions are designed for row versioning, so naturally they give new values for new rows.

    RAND() behaves the same way as GETDATE() in a query: replaced with a literal which is the result of a single function evaluation.  You would have to do a WHILE loop line by line to get different values for RAND() & GETDATE(). Demo follows.


    -- RAND and NEWID differences
    -- First Batch
    SELECT TOP 5 RAND()    AS RandomNo, 
                 GETDATE() AS ModDate, 
                 NEWID()   AS GlobalUniqueID, 
                 * 
    FROM   Northwind.dbo.Customers 
    /* RandomNo			ModDate					GlobalUniqueID
    0.192356139076468	2009-11-22 09:03:46.470	E349E750-5FEF-45FA-9361-486732A42BA8
    0.192356139076468	2009-11-22 09:03:46.470	2D3FFE14-AAB1-4416-938F-C3FB57BF04DD
    0.192356139076468	2009-11-22 09:03:46.470	5EF9B3EA-400D-426E-9914-E10F3A3D8EDC
    0.192356139076468	2009-11-22 09:03:46.470	D54D5A00-7E9A-4517-8FD5-E13F7FBA7FE9
    0.192356139076468	2009-11-22 09:03:46.470	27D11869-1B4E-4AB0-AC68-FEEB5D28D5A5 */
    select RAND(1)  -- 0.713591993212924
    select RAND(1)  -- 0.713591993212924
    select RAND(2)  -- 0.713610626184182
    select RAND(2)  -- 0.713610626184182
    select RAND()   -- 0.217821139260039
    select RAND()   -- 0.570956802191052
    select RAND(),RAND(),RAND() -- 0.213689267406154 0.68442381362486 0.201630044636296
    GO
    -- Second Batch
    select RAND(1)  -- 0.713591993212924
    select RAND(2)  -- 0.713610626184182
    select RAND(3)  -- 0.71362925915544
    select RAND()   -- 0.963400850719992
    GO

    Kalman Toth, SQL Server & BI Training, SSAS, SSIS, SSRS; http://www.SQLUSA.com
    Sunday, November 22, 2009 2:11 PM
  •  but my question is that why when i use RAND() function generates a constant number for each rows. for example if i choose 5 rows on a table. when i used this statment SELECT Top 5 *,RAND() as RandomG FROM Customer . Result is:

    cid  fname  lname  email                               RandomG
    -----------------------------------------------------------------------------
    1    Erivin     Antony ErvinAn12@msn.com          0.0083936506376151
    2    Sam      Smith  maHtabKarimi@yahoo.com   0.0083936506376151
    3    david     sheen  david_sh@yahoo.com          0.0083936506376151
    4    keyvan  satrabi keyvansat@gmail.com         0.0083936506376151
    5    sonya    simva    sonisim@yahoo.com           0.0083936506376151
     
    As you see i got 0.0083936506376151 for all of rows, i didn't get  random number for each rows.
    You can force different random number for each row with the combination of rand() & newid() functions. Demo follows:

    SELECT TOP (5) CustomerID, TerritoryID, AccountNumber,
    		RAND(CHECKSUM(NEWID())) AS RandomG
    FROM AdventureWorks2008.Sales.Customer
    /*
    CustomerID	TerritoryID	AccountNumber	RandomG
    1	1	AW00000001	0.848030620551453
    2	1	AW00000002	0.857290604235395
    7	1	AW00000007	0.415018172139368
    19	1	AW00000019	0.520703127830369
    20	1	AW00000020	0.671764827482506
    */


    Kalman Toth, SQL Server & BI Training, SSAS, SSIS, SSRS; http://www.SQLUSA.com
    Friday, November 27, 2009 12:19 AM
  • Hi, Mr Khanjenouri. lol

    I just found it online.

    The RAND() function only generate the random number at the evaluation time, not the execution.

    http://www.sqlhacks.com/Retrieve/Real-Random-Number

    • Proposed as answer by AmirMohammad Friday, March 15, 2013 8:02 PM
    Friday, March 15, 2013 8:02 PM