This kernel is compiled with AVX2, FMA3, and BMI compiler flags. At the moment only Intel Haswell benefits from this, but future AMD CPUs will have these instructions as well.
This makes rendering on Haswell CPUs a few percent faster, though so far it has only been benchmarked with Clang on OS X.
Part of my GSoC 2014.
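Purely as an illustration (not necessarily how Cycles selects its kernels), host code can verify at runtime that the CPU actually provides these instruction sets before calling into a kernel built with these flags, for example with a GCC/Clang builtin:

    #include <cstdio>

    /* Runtime check for the instruction sets the AVX2 kernel is built with.
     * __builtin_cpu_supports() is a GCC/Clang builtin; the set of accepted
     * feature names depends on the compiler version. */
    static bool cpu_supports_avx2_kernel()
    {
        return __builtin_cpu_supports("avx2") &&
               __builtin_cpu_supports("fma") &&
               __builtin_cpu_supports("bmi");
    }

    int main()
    {
        printf("AVX2/FMA/BMI kernel usable: %s\n",
               cpu_supports_avx2_kernel() ? "yes" : "no");
        return 0;
    }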
Now baking does one AA sample at a time, just like final render. There is
also some code for shader antialiasing that solves T40369 but it is disabled
for now because there may be unpredictable side effects.
The kernel used for baking the world texture was the same as the one used for
regular baking. Now that is separate, which allows the kernel to reserve much
less memory.
Fixes T40027. This means we get more CPU usage again when using multiple CUDA
devices, but the performance impact is too big a problem with the current code.
Instead of 95, we can use 145 images now. This only affects Kepler and above (sm_30, sm_35 and sm_50).
This can be increased further if needed, but let's first test whether this comes with a performance impact.
Originally developed during my GSoC 2013.
Expand Cycles to use the new baking API in Blender.
It works on the selected object, and the panel can be accessed in the Render panel (similar to where it is for the Blender Internal).
It bakes for the active texture of each material of the object. The active texture is currently defined as the active Image Texture node present in the material node tree. If you don't want the baking to override an existing material, make sure the active Image Texture node is not connected to the node tree. The active texture is also the texture shown in the viewport in Rendered mode.
Remember to save your images after the baking is complete.
Note: Bake currently only works on the CPU.
Note: This is not supported by Cycles standalone because a lot of the work is done in Blender as part of the operator only, not in the engine (Cycles).
Documentation:
http://wiki.blender.org/index.php/Doc:2.6/Manual/Render/Cycles/Bake
Supported Passes:
-----------------
Data Passes
* Normal
* UV
* Diffuse/Glossy/Transmission/Subsurface/Emit Color
Light Passes
* AO
* Combined
* Shadow
* Diffuse/Glossy/Transmission/Subsurface/Emit Direct/Indirect
* Environment
Review: D421
Reviewed by: Campbell Barton, Brecht van Lommel, Sergey Sharybin, Thomas Dinges
Original design by Brecht van Lommel.
The entire commit history can be found on the branch: bake-cycles
This also updates the configurations to build kernels for compute capability
5.0 cards; when using an older CUDA toolkit version this will be skipped.
Also includes tweaks to improve performance with this version:
* Increase max registers on sm_30, sm_35 and sm_50
* No longer use texture storage on sm_30
Otherwise devices used for display will lock up the UI too much. This means
you might still get 100% CPU for the display device, but for other devices CPU
usage should still be low.
The check to see if a device is used for display may not be entirely reliable:
it checks whether there is a watchdog timeout on the device, but I'm not
entirely sure that timeout always exists for display devices or is always
disabled for non-display devices. Some tools like cuda-gdb seem to make the
same assumption, though.
Ref T39559
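For reference, a minimal standalone sketch of that heuristic using the CUDA driver API (not the actual Cycles code): a device reporting a kernel execution timeout is assumed to be driving a display.

    #include <cuda.h>
    #include <cstdio>

    int main()
    {
        cuInit(0);

        int num_devices = 0;
        cuDeviceGetCount(&num_devices);

        for(int i = 0; i < num_devices; i++) {
            CUdevice device;
            cuDeviceGet(&device, i);

            /* Devices with a watchdog/kernel execution timeout are treated as
             * display devices; others are assumed to be compute-only. */
            int timeout_enabled = 0;
            cuDeviceGetAttribute(&timeout_enabled,
                                 CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT,
                                 device);

            printf("Device %d: %s\n", i,
                   timeout_enabled ? "likely used for display" : "likely compute-only");
        }

        return 0;
    }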
This makes it easier to have a per-kernel number of registers. Also, all the
tunable parameters for this are now in kernel.cu, rather than spread over
CMake, SCons and device_cuda.cpp.
This fixes the ptxas "ACCESS_VIOLATION" error and should allow our Linux and Windows build bots to compile again.
Unfortunately this comes with a performance penalty on sm_2x cards, so this is only a workaround for now. Branched Path is still globally disabled on GPU.
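To illustrate the per-kernel register tuning that now lives in kernel.cu, here is a hedged CUDA C++ sketch of a kernel declaring its own launch bounds instead of relying on a global --maxrregcount flag in the build scripts. The kernel name and thread counts are illustrative, not the values Cycles uses.

    /* Thread count per block chosen per architecture; lowering it lets the
     * compiler use more registers per thread for this kernel only. */
    #if __CUDA_ARCH__ >= 500
    #  define CUDA_KERNEL_MAX_THREADS 256   /* hypothetical sm_50 tuning */
    #else
    #  define CUDA_KERNEL_MAX_THREADS 192   /* hypothetical sm_2x/sm_3x tuning */
    #endif

    extern "C" __global__ void
    __launch_bounds__(CUDA_KERNEL_MAX_THREADS)
    kernel_example_path_trace(float *buffer, int sample)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buffer[i] += sample;   /* placeholder body */
    }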
The issue was caused by wrong usage of the OCIO GLSL binding API. To make it
work properly on pre-GLSL-1.3 drivers, the shader has to be enabled after the
texture is bound to the OpenGL context. Otherwise it wouldn't know the
proper texture size.
This is actually a regression in 2.70 and is to be ported to 'a'.
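A minimal sketch of the binding order the fix relies on (plain OpenGL calls with illustrative names and handles, not the OCIO code itself):

    #include <GL/glew.h>

    /* Assumes an active OpenGL context and already-created texture and
     * program handles. The point is the ordering: bind textures first,
     * enable the GLSL program afterwards, so old drivers pick up the
     * proper texture size. */
    static void draw_setup(GLuint program, GLuint image_tex, GLuint lut3d_tex)
    {
        glActiveTexture(GL_TEXTURE1);
        glBindTexture(GL_TEXTURE_3D, lut3d_tex);

        glActiveTexture(GL_TEXTURE0);
        glBindTexture(GL_TEXTURE_2D, image_tex);

        glUseProgram(program);   /* enable the shader only after binding */
        glUniform1i(glGetUniformLocation(program, "image_texture"), 0);
        glUniform1i(glGetUniformLocation(program, "lut3d_texture"), 1);
    }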
All textures are currently sampled bilinearly, with the exception of OSL, where texture sampling is fixed and set to smart bicubic.
This patch adds user control for this setting.
Added:
- bits to DNA / RNA in the form of an enum to support multiple interpolation types
- changes to the image texture node drawing code (add the enum)
- changes to ImageManager (it needs to know to allocate a second texture when the interpolation type is different)
- changes to the node compiler (pass on the interpolation type)
- changes to device tex_alloc, which also needs the concept of multiple interpolation types
- an implementation of non-interpolated lookup for CUDA and CPU
- an implementation where we pass this along to OSL (this makes OSL also do linear until I add smart bicubic to the interface / DNA / RNA)
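Roughly, the plumbing described above looks like the following sketch; the enum values and the tex_alloc signature are illustrative only, not the exact Cycles API:

    /* Interpolation choice stored in DNA/RNA and passed through the node
     * compiler down to the device. */
    enum InterpolationType {
        INTERPOLATION_LINEAR = 0,   /* default bilinear sampling */
        INTERPOLATION_CLOSEST = 1,  /* non-interpolated (nearest) lookup */
    };

    struct device_memory;   /* stand-in for the real device memory wrapper */

    class Device {
    public:
        virtual ~Device() {}

        /* tex_alloc now needs to know the interpolation type, so the device
         * can set up point sampling vs. linear filtering for this texture. */
        virtual void tex_alloc(const char *name,
                               device_memory& mem,
                               InterpolationType interpolation,
                               bool periodic) = 0;
    };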
Reviewers: brecht, dingto
Reviewed By: brecht
CC: dingto, venomgfx
Differential Revision: https://developer.blender.org/D317
This switches the CUDA API usage towards using more of the asynchronous calls.
Updating only once every second is sufficiently cheap that I don't think it is worth doing it less often.
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D262
Cycles GPU Performance Regression
From my testing this (what I should have done in the first place) reduces the regression a lot.
Let's hope it is enough, or we will have to go back to busy waiting.
This patch adds a network_error() function, more like how other devices handle errors.
- It adds a check for errors in load_kernels to make sure we do not crash when rendering without a server.
- It uses the non-throwing variant of boost::asio::read (sketched below).
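For illustration, the non-throwing read looks roughly like this sketch; the network_error() helper and the buffer type shown here are assumptions, not the actual device_network code:

    #include <boost/asio.hpp>
    #include <cstdio>
    #include <string>
    #include <vector>

    static void network_error(const std::string& message)
    {
        fprintf(stderr, "Network error: %s\n", message.c_str());
    }

    static bool read_blob(boost::asio::ip::tcp::socket& socket,
                          std::vector<char>& data)
    {
        boost::system::error_code error;
        boost::asio::read(socket, boost::asio::buffer(data), error);

        if(error) {
            /* No exception is thrown; report it like other devices do. */
            network_error(error.message());
            return false;
        }
        return true;
    }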
Reviewers: brecht
Reviewed By: brecht
CC: brecht
Differential Revision: https://developer.blender.org/D86
This is my first stab at this and is based on this IRC conversation:
<mib2berlin> brecht: this is meaning as reminder only, I know you have other things to do > http://openvidia.sourceforge.net/index.php/Optimization_Notes#avoiding_busy_waits
<brecht> mib2berlin: thanks, bookmarked
Only tested on Ubuntu 14.04 / CUDA 5.0, but I'll do some more testing tomorrow.
I'm also unsure about the placement and the lifetime of the stream and the event, but creating/deleting these seems to incur a non-trivial cost.
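The idea boils down to something like this driver API sketch (simplified; in practice the stream and event would be created once and reused rather than per call):

    #include <cuda.h>
    #include <chrono>
    #include <thread>

    /* Wait for all work queued on a stream without spinning the CPU at 100%:
     * record an event and poll it with a short sleep instead of calling the
     * busy-waiting cuCtxSynchronize(). */
    static void wait_for_stream(CUstream stream)
    {
        CUevent event;
        cuEventCreate(&event, CU_EVENT_DISABLE_TIMING);
        cuEventRecord(event, stream);

        while(cuEventQuery(event) == CUDA_ERROR_NOT_READY)
            std::this_thread::sleep_for(std::chrono::milliseconds(1));

        cuEventDestroy(event);
    }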
Reviewers: brecht
Reviewed By: brecht
CC: mib2berlin, dingto
Differential Revision: https://developer.blender.org/D262
* AVX is available on Intel Sandy Bridge and newer and AMD Bulldozer and newer.
* We don't use dedicated AVX intrinsics yet, but gcc auto vectorization gives a 3% performance improvement for Caminandes. Tested on an i5-3570, Linux x64.
* No change for Windows yet, MSVC 2008 does not support AVX.
Reviewed by: brecht
Differential Revision: https://developer.blender.org/D216
After update to Mac OS X 10.9.1, OpenCL works now on my Intel CPU in the 2013 Macbook Pro (even the entire kernel).
The Intel Iris Pro GPU still segfaults here though, even when all flags are disabled (building "clay like" kernel only).
Maybe we still need the -no-missing-prototypes for AMD hardware, but I couldn't find a way to distinguish here.
This actually works somewhat now, although viewport rendering is broken and any
kind of network error or connection failure will kill Blender.
* Experimental WITH_CYCLES_NETWORK cmake option
* Networked Device is shown as an option next to CPU and GPU Compute
* Various updates to work with the latest Cycles code
* Locks and thread safety for RPC calls and tiles
* Refactored pointer mapping code
* Fix error in CPU brand string retrieval code
This includes work by Doug Gale, Martijn Berger and Brecht Van Lommel.
Reviewers: brecht
Differential Revision: http://developer.blender.org/D36
This is mostly work towards enabling the __KERNEL_SSE__ option to start using
SIMD operations for vector math. With that option this SSE 4.1 kernel performs
about 8% faster, but overall it is still slower than without the option.
WITH_CYCLES_OPTIMIZED_KERNEL_SSE41 is the cmake flag for testing this kernel.
Alignment of int3, int4, float3 and float4 to 16 bytes seems to give a slight
1-2% speedup on tested systems with the current kernel already, so it is
enabled now.
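As a rough illustration of why the 16-byte alignment matters for the SSE path (a standalone sketch with an illustrative type, not the actual Cycles float4):

    #include <xmmintrin.h>

    /* 16-byte alignment lets a 4-float struct map directly onto one SSE
     * register, so aligned loads/stores and a single _mm_add_ps replace
     * four scalar operations. */
    struct alignas(16) float4_example {
        float x, y, z, w;
    };

    static inline float4_example add(const float4_example& a,
                                     const float4_example& b)
    {
        float4_example r;
        __m128 va = _mm_load_ps(&a.x);   /* aligned load, safe due to alignas(16) */
        __m128 vb = _mm_load_ps(&b.x);
        _mm_store_ps(&r.x, _mm_add_ps(va, vb));
        return r;
    }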
* Remove support for CUDA Toolkit 4.x, only Toolkit 5.0 and above are supported now.
* Remove support for sm_1x cards (< Fermi) for good. We didn't officially support those cards for a few releases already, now remove some special code that was still there.
Use arrays instead of textures for general storage on this card (image textures
are still stored as textures). Textures were found to be faster on older cards,
but the limits on 1D texture size have not increased along with the memory size,
which meant that the full 6 GB could not be used.
The performance actually seems to be slightly better with arrays in some tests
on Titan. For older cards it seems to be a bit of a mix: some are better and
others are not. We may change those to use arrays too, but more testing is
needed; only Titan and Tesla K20 (sm_35) are changed for now.
The fact that arrays are faster is a bit surprising, as others found textures
to be faster on Kepler. However even if they were, the memory limitation is
more important to solve anyway.
https://research.nvidia.com/publication/understanding-efficiency-ray-traversal-gpus-kepler-and-fermi-addendum
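A hedged CUDA sketch of the two storage strategies being compared (names are illustrative, and the exact 1D texture width limit depends on the hardware):

    /* Old path: general storage bound to a legacy 1D texture reference. The
     * maximum width of a 1D texture bound to linear memory did not grow with
     * device memory, so very large scenes could not address all of it. */
    texture<float4, 1> legacy_storage_tex;

    __device__ float4 fetch_via_texture(int i)
    {
        return tex1Dfetch(legacy_storage_tex, i);
    }

    /* New path on sm_35: a plain global-memory array, which can span the
     * full 6 GB and, in these tests, is not slower than the texture path. */
    __device__ float4 fetch_via_array(const float4 *data, int i)
    {
        return data[i];
    }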