* Reshuffle SSE #ifdefs to try to avoid compilation errors enabling SSE on 32 bit.
* Remove CUDA kernel launch size exception on Mac, is not needed.
* Make OSL file compilation quiet like c/cpp files.
texture coordinate that should automatically use the default normal or texture
coordinate appropriate for that node, rather than some fixed value specified by
the user.
* On nvidia Kepler GPUs (sm_30 and above), there are now 145 byte images available, instead of 95.
We could extend this to about 200 if needed.
Could not test this, as I don't have a Kepler GPU, so feedback on this would be appreciated.
Thanks to Brecht for review and some fixes. :)
* Add CUDA compiler version detection to cmake/scons/runtime
* Remove noinline in kernel_shader.h and reenable --use_fast_math if CUDA 5.x
is used, these were workarounds for CUDA 4.2 bugs
* Change max number of registers to 32 for sm 2.x (based on performance tests
from Martijn Berger and confirmed here), and also for NVidia OpenCL.
Overall it seems that with these changes and the latest CUDA 5.0 download, that
performance is as good as or better than the 2.67b release with the scenes and
graphics cards I tested.
On the BMW scene, this gives roughly a 10% speedup overall with clang/gcc, and 30%
speedup with visual studio (2008). It turns out visual studio was optimizing the
existing code quite poorly compared to pretty good autovectorization by clang/gcc,
but hand written SSE code also gives a smaller speed boost there.
This code isn't enabled when using the hair minimum width feature yet, need to
make that work with the SSE code still.
* Enable the Non-Progressive integrator on GPU (CUDA) for testing.
In order to compile the CUDA kernel with it, you need at least 6GB of system memory and CUDA Toolkit 5.0 or 5.5.
It should also work with CUDA Toolkit 4.2, but in this case you should have 12GB of RAM.
In case any problems arise, just change line 65 of kernel_types.h to disable Non-Progressive again.
-- #define __NON_PROGRESSIVE__
++ //#define __NON_PROGRESSIVE__
* Replaced the Brute Force version with a nice lookup table, this speeds it up a lot.
Patch by Philipp Oeser (lichtwerk) with some cleanup and changes by myself. Thanks!
ToDo:
* Temperature values between 800 and 804 Kelvin are wrong in SVM, check on this.
* First (brute force) implementation for SVM. This works and delivers the same result as OSL, but it's slow.
* Code inside svm_blackbody.h inspired by a patch by Philipp Oeser (#35698), thanks.
Ideas:
* Use a lookup table to perform the calculations on render/ level.
* Implement it as a RNA property only, and do the calculation like Sun/Sky precompute.
and sm_30 cards, so hopefully it should all work now.
Also includes some warnings fixes related to nvcc compiler arguments, should make
no difference otherwise.
Apparently, it's bad idea to rely on compiler to cast NULL
which is (void*)0 to int -- and in fact if i was a compiler
would also generate an error.
Further, couldn't see why we need to pass NULL or 0 th add_node,
argument value is defautl to 0 already.