none
INT math is slow in SQL server, but only under VMware - really weird! RRS feed

  • Question

  • Hi

    I had noticed a really strangle thing. INT calculation is 2-3-5 times slower in SQL server running in VMware than on physical machines. From VMware side, I had eliminated all possible things: balooning, CPU overconsumption, "ready" time etc. INT CPU tests (link 'Pi' digits calculation programs) show the same performance, while SQL server math operations are MUCH slower. 

    Please check the test of few results on different machines. I am DBA for >20y, but I am really puzzled.

    Test is simple, no parallelism, no IO, limited memory use (L2,L3 caching is irrelevant)

    Results

    -- 31403 ms, one PROD VM, SQL 13.0.5337 (why is it so slow??? )
    -- 9227 ms,  AWS RDS 14.0.3035.2
    -- 7733 ms,  very old physical machine, 10.50.6220.0, very old
    -- 13666 ms,  my really old desktop, physical, 13.0.4446.0
    -- 6346 ms,   desktop of my colleague, physical, 14.1000.169

    create function dbo.isPrime (@n bigint)
    returns int
    as
      begin
      if @n = 1 return 0
      if @n = 2 return 1
      if @n = 3 return 1
      if @n % 2 = 0 return 0
      declare @sq int
      set @sq = sqrt(@n)+1 -- check odds up to sqrt
      declare @dv int = 1
      while @dv < @sq 
        begin
    	set @dv=@dv+2
    	if @n % @dv = 0 return 0
    	end
      return 1
      end
    GO
    declare @dt datetime set @dt=getdate()
    select dbo.isPrime(1000000000000037)
    select datediff(ms,@dt,getdate()) as ms
    GO



    • Edited by tzimie Monday, December 9, 2019 2:08 PM
    Monday, December 9, 2019 2:04 PM

All replies

  • Hi Tzimie,

    Did you compare the execution plans between the machines? What about enabling the old cardinality estimator? Did you try that?

    regards!

    Monday, December 9, 2019 4:16 PM
  • lol. There are no tables involved!!! 

    exec plans are identical - it is a simple loop with 3 variables. 

    Monday, December 9, 2019 4:19 PM
  • Just in case... Are we talking about the same resources on the Vm number of Vcpus/Sockets/Numas or CPU model?
    Monday, December 9, 2019 4:49 PM
  • This test is single thread, so on 2 vCPU VM it used 50% of the total CPU, on 4 - 25% etc. VMs are really small so they are single Numa. I even have 1 vm with a single vCPU, it consumes `100% of the CPU, but execution time is the same. 


    Monday, December 9, 2019 7:27 PM
  • I did a very quick test. I ran the function twice on SQL 2019 RTM on my physical machine running Win 10 1909. The executions were in 7-9 seconds. I then ran the function on a VM which runs on the same physical machine. On this VM I have SQL 2019 RC1. The execution time here was eight seconds. That is, I saw dramatic difference as you do.

    My VM is running under VMWare Workstation 15.5.1.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, December 9, 2019 10:47 PM
  • Hi Erland, Yes, there is a new improvement for inline fuctions in sql2019 where SQL server will execute functions like normal batches instead of assuming that the fuction return 0 or 100 rows. Anyway good point! Maybe that is the problem.
    Tuesday, December 10, 2019 1:39 AM
  • Hi tzimie,

    Here are some similar threads which might help:

    https://stackoverflow.com/questions/9290291/sql-script-in-vm-taking-long-time-for-execution

    https://www.sqlservercentral.com/forums/topic/will-virtual-machines-slow-down-t-sql-queries

    Hope this could help you.

    Best Regards,

    Amelia Gu


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    Tuesday, December 10, 2019 6:55 AM
  • I did a very quick test. I ran the function twice on SQL 2019 RTM on my physical machine running Win 10 1909. The executions were in 7-9 seconds. I then ran the function on a VM which runs on the same physical machine. On this VM I have SQL 2019 RC1. The execution time here was eight seconds. That is, I saw dramatic difference as you do.

    Erland - you did, or you did not, see dramatic differences?

    The numbers you posted look about the same, right?

    OP, I was going to suggest that just because you have a VM, doesn't mean you have an entire physical processor.  There's something called "the noisy neighbor" problem that can happen when the same CPU chip, cache, and main memory, are shared by other VMs that are much busier than yours.  I suspect this is the source of some problems I've seen with SQL Server performance generally, crazy performance variations that just cannot be tracked down.

    Josh

    Tuesday, December 10, 2019 7:57 AM
  • Thank you all for your replies.

    1. I was also confused by (did/did not)

    2. I am going to make a test at home with VMware player soon. 

    3. Josh, physical blades are running on 20-25% CPU, so there is no CPU contention, also, results stay the same, time is 31s - +/- 1 sec for any time I run it

    4. What puzzles me is that pure CPU test should not suffer from any virtualization overhead at all (!!!) - because it is virtualization, not emulation! And tests like CalcPi confirm that. 

    5. Amazon people know something I dont know, AWS RDS machines are as fast on my test as physical ones.

    Still experimenting, will let you know

    Tuesday, December 10, 2019 8:56 AM
  • ok, an update.

    Tried to check latch waits, nothing that can reveal the mystery.

    However, natively compiled isPrime is 50x faster.

    I tend to believe that when SQL server executes the "traditional" code, is does something between the statements (checks if an event needs to be added to SQL profiler trace, or user cancels a query, may be other things), and that "something" somehow calls OS, which, under VMware, goes thru virtualization stack.

    Tuesday, December 10, 2019 10:22 AM
  • Erland if famous for forgetting "not"s. :-)

    I think in the end you need a machine where you can run the samt test both natively and under VMWare, on the same machine. And perhaps also throw in a Hyper-V in case you did see a difference with VMWare (just for fun).

    Without seeing a native test from your slow VMWare environment, we are just guessing. Perhaps you have something in the noisy env that is noisy *enough* to mess with you. One thing you might try with is if there is a difference (improvement) if you allocate just one core to the machine? (Again, just as a test.)


    Tibor Karaszi, SQL Server MVP (Web Blog)

    Tuesday, December 10, 2019 10:23 AM
    Moderator
  • Erland if famous for forgetting "not"s. :-)

    Blush!

    Sorry about that.

    Gonzalo makes a good point about scalar functions being inlined in SQL 2019. However, I don't think this function is inlined, since there is a loop.

    But as Tibor says, if you are comparing a VM one physical box with a non-virtual environment on some other machine,  you are comparing apples to oranges. Or at least you are not comparing virtual vs. physical.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Tuesday, December 10, 2019 11:52 AM
  • The OP is also testing multiple SQL versions.

    I ran the same test on the same physical machine with SQL 2014 , 2016, 2017 and 2019 and got the same results for all versions.  I don't have VMs anymore to test.

    SQL ms
    SQL 2014 12.0.6329.1 for 7766
    SQL 2016 13.0.4522.0 for 7590
    SQL 2017 14.0.3223.3 for 7544
    SQL 2019 15.0.2070.41 for 8680

    Tuesday, December 10, 2019 1:05 PM
    Moderator
  • SQL 2019 15.0.2070.4  for 18680ms

    or

    SQL 2019 15.0.2070.41 for 8680ms?

    Tuesday, December 10, 2019 1:16 PM
  • SQL 2019 15.0.2070.41 for 8680ms

    The forum reformatted it and removed the spaces.


    Tuesday, December 10, 2019 1:28 PM
    Moderator
  • Just got an access to VMware farm where there is almost no activity, and got almost no overhead, which confirms Erlands findings for a spherical VM in a vacuum.

    So when I calculate isPrime(1000000000000037), I run the loop below sqrt(1000000000000037)/2 = 15'811'388 times. Assuming that in the code below there are 3 statements (I don't know the sql internals, so not sure, but lets not count 'begin' and 'end' as statements), we can estimate an overhead as 20 extra sseconds divided by 15811388 and then divided by 3.

      while @dv < @sq 
        begin
    	set @dv=@dv+2
    	if @n % @dv = 0 return 0
    	end
    

    I get 420ns. I'd like to believe that SQL performs some syscall after every statement, but just from time to time. (if (stmtcnt++ mode 255) == 0)... So from time to time it goes into syscall, which triggers Hypervisor to reschedule, or at least to re-analyze the queue. Even if there is no CPU contention, but many active VMs, queue is long and every micro second can be critical.

    In 99% SQL statements are "Big" and IO bound, so 1 extra microsecond is irrelevant. Except such extreme cases of pure calculations. SQL server is not designed for that type of workload, but still, I have to deal with some poorly written code. 

    Natively compiled UDFs as 50x faster and don't suffer from that problem. But It is quite strange that in our days, when even the last lazy interpreter has it's own JIT, SQL server continues to operate on a "naive" level, and "JITiing" must be done explicitly by wrapping the code into natively compiled block. 

    Also, it is not clear what should be long term strategy regarding UFDs: putting them into native code make them faster, but prevents new optimizer of SQL 2019 to "expand" them so they can work better in relational queries without running many times for every record.

    Tuesday, December 10, 2019 1:51 PM
  • Natively compiled UDFs as 50x faster and don't suffer from that problem. But It is quite strange that in our days, when even the last lazy interpreter has it's own JIT, SQL server continues to operate on a "naive" level, and "JITiing" must be done explicitly by wrapping the code into natively compiled block. 

    It is always easier when you make a fresh design.

    I don't know if Microsoft have considered to change their model, but it would be quite a bit work work, and it could have effects that some customers do not like.

    Also, it is not clear what should be long term strategy regarding UFDs: putting them into native code make them faster, but prevents new optimizer of SQL 2019 to "expand" them so they can work better in relational queries without running many times for every record.

    As often with such question, the answer is: it depends. This particular function is a prime candidate for being rewritten as natively compiled. At the same time, it is not a candidate for inlining thanks to the loop. But for a UDF that only performs a point lookup in in a table, there is not much point with rewriting it as natively compiled, but inlining can be a big success.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Tuesday, December 10, 2019 10:15 PM
  • Just got an access to VMware farm where there is almost no activity, and got almost no overhead, which confirms Erlands findings for a spherical VM in a vacuum.

    OK then.  You could be write about hidden syscalls.  Also, you don't happen to have a server-side trace running just on that machine, do you?  Just asking ...

    Josh

    Wednesday, December 11, 2019 7:41 AM
  • I would like to find a solid proof.

    Researching the tools which might help, anything you know?

    Also I have a meeting today with my colleague with a very strong Windows perf. expertise.

    Wednesday, December 11, 2019 8:21 AM
  • You said you looked at latch waits, did you look at other waits?

    If it doesn't show in waits, or IO stats, then the VM remains the prime suspect.

    I've had very limited success talking to VM sysadmins and getting anywhere with them.  They hoard their own information, and seldom know anything more than the charts show.

    Josh

    Thursday, December 12, 2019 3:33 AM
  • No other waits, and, of course, no IO (why it should be?)

    Now my primary suspect is getting system time call. I believe this is why SQL server does between the statements, as (at least before) it was based on cooperative multitasking, so threads should yield from time to time, so they have to check time to understand if they have to yield.

    Preparing a program to check that hypothesis

    Thursday, December 12, 2019 9:09 AM
  • WFIW, you can hook up a debugger to SQL Server and see what is going on at the win API level. I've never had the need to do it myself, but I'm sure that you can find some starting point with a bit of searching.

    Tibor Karaszi, SQL Server MVP (Web Blog)

    Thursday, December 12, 2019 10:31 AM
    Moderator
  • Since this is primarily a VMWare question, I would suggest you ping them and make sure you have all the current VMWare patches installed.

    Thursday, December 12, 2019 1:23 PM
    Moderator
  • Now my primary suspect is getting system time call. I believe this is why SQL server does between the statements, as (at least before) it was based on cooperative multitasking, so threads should yield from time to time, so they have to check time to understand if they have to yield.

    Preparing a program to check that hypothesis

    But it isn't doing that in some of your VM scenarios …

    Can it be some obscure VM setting that virtualizes that call differently?  Or even on an older processor that doesn't know how to virtualize it efficiently?

    Josh

    Thursday, December 12, 2019 8:00 PM
  • I was able to observe the same (performance drops on VMware under heavy load) with the following code:

    using System;
    using System.Globalization;
    
    public class Tester
    {
       public static void Main()
       {
          for (int i=0; i<16; i++) {
            DateTime starting = DateTime.Now;
            int sec = starting.Second;
            int counter = 0;
            while (DateTime.Now.Second == sec) { counter++; }
            if (i > 0) { Console.WriteLine("{0}", counter); }
          }
       }
    }
    


    Monday, December 16, 2019 8:59 AM
  • I was able to confirm that SQL server is getting time nearly "all the time", lol

    Between every 2 statements.

    ntoskrnl.exe!KeSynchronizeExecution+0x5bf6
    ntoskrnl.exe!KeWaitForMultipleObjects+0x109d
    ntoskrnl.exe!KeWaitForMultipleObjects+0xb3f
    ntoskrnl.exe!KeWaitForSingleObject+0x377
    ntoskrnl.exe!KeQuerySystemTimePrecise+0x881  <-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ntoskrnl.exe!ObDereferenceObjectDeferDelete+0x28a
    ntoskrnl.exe!KeSynchronizeExecution+0x2de2
    sqllang.dll!CDiagThreadSafe::PxlvlReplace+0x1a20
    sqllang.dll!CStmtCurState::CStmtCurState+0x11a4
    sqllang.dll!CStmtCurState::CStmtCurState+0xf8f
    sqllang.dll!CSQLObject::~CSQLObject+0x34f
    sqllang.dll!CDiagThreadSafe::PxlvlReplace+0x7f8
    sqllang.dll!CSQLSource::Execute+0x53c
    sqllang.dll!CSQLObject::Parse+0x90b
    sqllang.dll!TVUDFInvoke+0xa73
    sqllang.dll!GetTraceFlagProviderImpl+0x3acc
    sqlTsEs.dll!CDefaultCollation::IHashW+0x2c7
    sqlmin.dll!SMD::LockObjectAndCheckOnlineVersion+0x984
    sqlmin.dll!SMD::LockObjectAndCheckOnlineVersion+0x8c4
    sqlmin.dll!SETruncateRowset+0x3569
    sqlmin.dll!HadrSinglePartId<1>::GetUniqueId+0xae
    sqllang.dll!CDiagThreadSafe::PxlvlReplace+0x1527
    sqllang.dll!CDiagThreadSafe::PxlvlReplace+0x1709
    sqllang.dll!CDiagThreadSafe::PxlvlReplace+0xdb5
    sqllang.dll!CDiagThreadSafe::PxlvlReplace+0x7f8
    sqllang.dll!CSQLSource::Execute+0x53c
    sqllang.dll!SNIGetInfo+0xb5b
    sqllang.dll!SNIPacketRelease+0x785
    sqllang.dll!SNIPacketRelease+0x55f
    sqldk.dll!SOS_OS::GetSOSProcessGlobals+0x403
    sqldk.dll!SOS_WaitStats::Reset+0x84d
    sqldk.dll!SOS_WaitStats::Reset+0x655
    sqldk.dll!SystemThread::MakeMiniSOSThread+0xc10
    sqldk.dll!SystemThread::MakeMiniSOSThread+0x50b
    sqldk.dll!SystemThread::MakeMiniSOSThread+0xa54
    KERNEL32.DLL!BaseThreadInitThunk+0x14
    ntdll.dll!RtlUserThreadStart+0x21

    Wednesday, December 18, 2019 1:33 PM
  • Good stuff.

    So, what does this mean, "VMWare under heavy load", does it show as greater than 100% demand, is the real CPU pegged, does the VMWare report impairment?

    I think we already knew that TSQL is not really cut out for heavy integer math.

    Y'know, it's odd though, at my last gig we were running a central SP that used a cursor, and this was clearly our bottleneck, and it just did a little very simple logic, but it wasn't just an in-core loop, and running on Azure we never really noticed a problem like you describe.

    So it seems limited exactly to non-IO loops, as soon as there's even logical IO the rest of the overhead must overwhelm the inter-statement time checker.

    So what was your actual app doing with all this integer math in the first place?

    Josh


    • Edited by JRStern Thursday, December 19, 2019 4:08 AM
    Thursday, December 19, 2019 4:07 AM
  • Actual app contains some loop for PnL calculation. Not a good idea, but there are multiple legacy procedures and functions here and there. In the previous example DateTime.Now was not good enough because it didn't requested the "precise" time - "regular" time takes a time OS already has, while "precise" time goes deeper.

    The code does the work much better, and from VM to VM (and from Host to Host) the difference is 30x (!!!). 

    Interesting observations:

    Some VMs are SLOW, some are FAST, and there is nothing in between. The test below does 700K loops per second or 20M, and nothing in between! 

    The same VM, Vmotioned across different hosts, can become SLOW or FAST (so it is not a problem of Windows version, patches etc).

    But (!!!) the same host can contain "SLOW" and "FAST" VMs. 

    I am puzzled. Making a demo to my boss today. But at least I can eliminate SQL server so we can open a support ticket to VMware, otherwise VMware and Microsoft would point at each other

        class Program
        {
            [DllImport("kernel32.dll")]
            static extern void GetSystemTimePreciseAsFileTime(out FILE_TIME lpSystemTimeAsFileTime);
    
            [StructLayout(LayoutKind.Sequential)]
            struct FILE_TIME
            {
                public int ftTimeLow;
                public int ftTimeHigh;
            }
    
            static void Main(string[] args)
            {
                for (int i = 0; i < 16; i++)
                {
                    int counter = 0;
    
                    var stopwatch = Stopwatch.StartNew();
    
                    while (stopwatch.ElapsedMilliseconds < 1000)
                    {
                        GetSystemTimePreciseAsFileTime(out var fileTime);
                        counter++;
                    }
    
                    if (i > 0)
                    {
                        Console.WriteLine("{0}", counter);
                    }
                }
            }
        }
    


    • Edited by tzimie Thursday, December 19, 2019 10:03 AM
    Thursday, December 19, 2019 10:02 AM
  • Perhaps one can limit the plethora (Tres Amigos, anyone? :-) ) of calls to the OS using the -x startup switch for sqlservr.exe? The documentation is a little bit thin on info for this, but you never know...? As for actually how much info it disables I can't say, and ideally you want to full range of diag info. But as one piece to the puzzle. From the documentation:

    "Disables the keeping of CPU time and cache-hit ratio statistics. Allows maximum performance."

    https://docs.microsoft.com/en-us/sql/tools/sqlservr-application?view=sql-server-ver15


    Tibor Karaszi, SQL Server MVP (Web Blog)

    Thursday, December 19, 2019 11:18 AM
    Moderator
  • "Disables the keeping of CPU time and cache-hit ratio statistics. Allows maximum performance."


    Unfortunately -x didnt help

    My current working hypothesis is: on NUMA hosts, precise time is on both nodes, but not absolutely in sync, so probably VMware is taking precise time from only one NUMA node. So the other NUMA is "unlucky" and NUMA roundtrip time is added to GetPreciseTime for all VMs running on the "unlucky" NUMA

    Still investigating.

    Tuesday, December 24, 2019 9:17 AM
  • Interesting observations:

    Some VMs are SLOW, some are FAST, and there is nothing in between. The test below does 700K loops per second or 20M, and nothing in between! 

    The same VM, Vmotioned across different hosts, can become SLOW or FAST (so it is not a problem of Windows version, patches etc).

    But (!!!) the same host can contain "SLOW" and "FAST" VMs. 

    OK, it's interesting.

    How can the same host have both slow and fast VMs?  I mean, what did you do to establish that?

    I was going to ask if some of your host servers were just old hardware.  Could also be old VMWare?  Your VM still has to talk to the VMWare on the host.  But if the *same* host shows variability, I'm stumped.

    Josh

    Wednesday, December 25, 2019 11:20 PM
  • How can the same host have both slow and fast VMs?  I mean, what did you do to establish that?

    I was going to ask if some of your host servers were just old hardware.  Could also be old VMWare?  Your VM still has to talk to the VMWare on the host.  But if the *same* host shows variability, I'm stumped.


    To check I RDPed to all VMs on the same host and tested them all. About 50% of them are fast, another 50% are slow. This is why I came to the idea that it is NUMA, I was also stumped.

    VMware is 6.0.0, not new. After Xmas we will open a ticket with VMware

    Thursday, December 26, 2019 9:02 AM