Given your performance numbers, I assume you're using the 2.0 framework, or something similar? The numbers are much better in 4.0, but the “Marshal.GetDelegateForFunctionPointer” version is still slower.
The thing is that not all delegates are created equal.
Delegates for managed code functions are essentially just a straight function call (on x86, that’s a __fastcall), with the addition of a little “switcheroo” if you’re calling a static function (but that’s just 3 or 4 instructions on x86).
Delegates created by “Marshal.GetDelegateForFunctionPointer”, on the other hand, are a straight function call into a “stub” function, which does a little extra work (marshalling and whatnot) before calling the unmanaged function. In this case there’s very little marshalling, and the marshalling for this call appears to be pretty much optimized out in 4.0 (it most likely still goes through the ML interpreter on 2.0). Even in 4.0, though, there’s a stack walk demanding unmanaged code permissions that isn’t part of your calli delegate.
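To make the two kinds of delegate concrete, here is a minimal sketch that builds one of each with the same signature and times them. It is not your benchmark: the names `DelegateKinds`, `TickFn`, `ManagedTick`, `CreateNative` and `TimeCalls` are made up for illustration, and I'm assuming Windows and using `GetTickCount` from kernel32 simply because it takes no arguments, so there is almost nothing to marshal.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

public static class DelegateKinds
{
    // Hypothetical delegate type; GetTickCount is a stdcall function returning a DWORD.
    [UnmanagedFunctionPointer(CallingConvention.StdCall)]
    public delegate uint TickFn();

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern IntPtr GetModuleHandle(string lpModuleName);

    [DllImport("kernel32.dll", CharSet = CharSet.Ansi, ExactSpelling = true, SetLastError = true)]
    static extern IntPtr GetProcAddress(IntPtr hModule, string lpProcName);

    // Plain managed static method with the same shape as the native function.
    public static uint ManagedTick()
    {
        return (uint)Environment.TickCount;
    }

    // Delegate over managed code: invoking it is close to a direct call.
    public static TickFn CreateManaged()
    {
        return new TickFn(ManagedTick);
    }

    // Delegate over a native function pointer: invoking it goes through an interop stub.
    public static TickFn CreateNative()
    {
        IntPtr module = GetModuleHandle("kernel32.dll"); // always loaded in a Windows process
        IntPtr proc = GetProcAddress(module, "GetTickCount");
        if (proc == IntPtr.Zero)
            throw new InvalidOperationException("GetTickCount not found");
        return (TickFn)Marshal.GetDelegateForFunctionPointer(proc, typeof(TickFn));
    }

    // Returns elapsed milliseconds for the given number of invocations.
    public static long TimeCalls(TickFn fn, int iterations)
    {
        fn(); // warm up so JIT and stub generation are not counted
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            fn();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        const int iterations = 10000000;
        Console.WriteLine("managed delegate: {0} ms", TimeCalls(CreateManaged(), iterations));
        Console.WriteLine("native delegate:  {0} ms", TimeCalls(CreateNative(), iterations));
    }
}
```

The absolute numbers will depend on the framework version, the bitness and the function being called, so treat the output as a relative comparison only. A calli-based version is not shown here, since that needs IL emitted by hand or through `DynamicMethod`.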
I’ve generally found that, short of knowing someone on the .NET dev team, your best bet for figuring out what’s going on with managed/unmanaged interop is to do a little digging with WinDbg and SOS.