Commit Graph

168 Commits

Author SHA1 Message Date
Thomas Dinges
866c7fb6e6 Cycles: Add an AVX2 CPU kernel.
This kernel is compiled with AVX2, FMA3, and BMI compiler flags. At the moment only Intel Haswell benefits from this, but future AMD CPUs will have these instructions as well.

Makes rendering on Haswell CPUs a few percent faster, only benchmarked with clang on OS X though.

Part of my GSoC 2014.
2014-06-13 22:26:20 +02:00
Brecht Van Lommel
e4e58d4612 Fix T40370: cycles CUDA baking timeout with high number of AA samples.
Now baking does one AA sample at a time, just like final render. There is
also some code for shader antialiasing that solves T40369 but it is disabled
for now because there may be unpredictable side effects.
2014-06-06 15:39:04 +02:00
Brecht Van Lommel
865dfa8a7e Fix T40228: cycles CUDA multi GPU + world MIS giving error. 2014-06-05 18:10:32 +02:00
Brecht Van Lommel
69c7522b24 Fix T40379: world MIS causing too much CUDA memory usage.
The kernel for baking the world texture was the same as the one used for
baking. Now that's separate which allows the kernel to reserve much less
memory.
2014-05-27 15:11:32 +02:00
Brecht van Lommel
0075efc4d2 Fix T40306: cycles baking not distributing work among CPU cores well. 2014-05-26 13:51:11 +02:00
Brecht Van Lommel
3b53fffb77 Cycles: revert async CUDA changes, these are giving too much trouble still.
Fixes T40027. This means we get more CPU usage again when using multiple CUDA,
but the impact on performance is too big a problem with the current code.
2014-05-19 19:33:09 +02:00
Thomas Dinges
c08c931fb6 Cycles / CUDA: Increase maximum image textures on GPU.
Instead of 95, we can use 145 images now. This only affects Kepler and above (sm30, sm_35 and sm_50).

This can be increased further if needed, but let's first test if this does not come with a performance impact.

Originally developed during my GSoC 2013.
2014-05-11 03:38:39 +02:00
Thomas Dinges
fd26a32aa5 Fix T40119, CUDA Toolkit version mismatch 2014-05-10 01:26:04 +02:00
Campbell Barton
dc13969e48 Style cleanup: indentation, braces 2014-05-05 02:19:08 +10:00
Campbell Barton
1618329b00 Code cleanup: style, require ; for cuda_assert, opencl_assert 2014-05-04 03:57:50 +10:00
Brecht Van Lommel
198f5e506a Cycles: CUDA changes for kernel evaluation cancel 2014-05-02 21:19:10 -03:00
Dalai Felinto
eec3eaba08 Cycles Bake
Expand Cycles to use the new baking API in Blender.

It works on the selected object, and the panel can be accessed in the Render panel (similar to where it is for the Blender Internal).

It bakes for the active texture of each material of the object. The active texture is currently defined as the active Image Texture node present in the material nodetree. If you don't want the baking to override an existent material, make sure the active Image Texture node is not connected to the nodetree. The active texture is also the texture shown in the viewport in the rendered mode.

Remember to save your images after the baking is complete.

Note: Bake currently only works in the CPU
Note: This is not supported by Cycles standalone because a lot of the work is done in Blender as part of the operator only, not the engine (Cycles).

Documentation:
http://wiki.blender.org/index.php/Doc:2.6/Manual/Render/Cycles/Bake

Supported Passes:
-----------------
Data Passes
 * Normal
 * UV
 * Diffuse/Glossy/Transmission/Subsurface/Emit Color

Light Passes
 * AO
 * Combined
 * Shadow
 * Diffuse/Glossy/Transmission/Subsurface/Emit Direct/Indirect
 * Environment

Review: D421
Reviewed by: Campbell Barton, Brecht van Lommel, Sergey Sharybin, Thomas Dinge

Original design by Brecht van Lommel.

The entire commit history can be found on the branch: bake-cycles
2014-05-02 21:19:09 -03:00
Campbell Barton
8d16869d83 Code cleanup: Add -Werror=float-conversion to Cycles 2014-05-03 07:31:46 +10:00
Brecht Van Lommel
741f17f05b Cycles CUDA: make CUDA toolkit 6.0 the official supported version.
This also updates the configurations to build kernels for compute capability
5.0 cards, when using and older CUDA toolkit version this will be skipped.

Also includes tweaks to improve performance with this version:
* Increase max registers on sm_30, sm_35 and sm_50
* No longer use texture storage on sm_30
2014-04-30 16:07:27 +02:00
Thomas Dinges
f6abc96b6b Cleanup: Remove OpenCL __MULTI_CLOSURE__ sanity check, not needed anymore after 04a10907dc. 2014-04-21 18:08:01 +02:00
Brecht Van Lommel
39bfde674c Cycles CUDA: don't use cuLaunchGridAsync at all for display devices.
As suggested by Martijn, this is slower than cuLaunchGrid.
2014-04-17 12:18:49 +02:00
Brecht Van Lommel
18da79f471 Cycles CUDA: only do async execution for GPUs not used for display.
Otherwise devices used for display will lock up the UI too much. This means
you might still get 100% CPU for the display device, but for others CPU usage
should be low still.

The check to see if a device is used for display may not be entirely reliable,
it checks if there is a watchdog timeout on the device, but I'm not entirely
sure that always exists for display devices or is disabled for non-display
devices, though some tools like cuda-gdb seem to make the same assumption.

Ref T39559
2014-04-17 12:08:18 +02:00
Brecht Van Lommel
415e10a0ef Fix another compile error with recent commit on visual studio. 2014-04-16 21:36:19 +02:00
Brecht Van Lommel
6f1afdbbfc Cycles CUDA: enabled branched path kernel again, with more registers. 2014-04-16 21:05:04 +02:00
Brecht Van Lommel
2851ed4a55 Cycles code refactor: use __launch_bounds__ instead of -maxrregcount for CUDA.
This makes it easier to have per kernel number of registers. Also, all the
tunable parameters for this are now in kernel.cu, rather than spread over cmake,
scons and device_cuda.cpp.
2014-04-16 21:05:04 +02:00
Thomas Dinges
297a2223b5 Cycles / CUDA: Increase sm_2x registers to 40.
This fixes the ptaxs "ACCESS_VIOLATION" error and should allow our Linux and Windows build bots to compile again.
Unfortunately this comes with a performance penalty on sm_2x cards, so this is only a workaround for now. Branched Path is still globally disabled on GPU.
2014-04-08 23:25:54 +02:00
Martijn Berger
163a3212b4 OpenCL Change opencl_assert to be more like cuda assert where possible.
added some extra warnings and feedback if things go wrong
2014-04-07 16:17:20 +02:00
Thomas Dinges
d923720312 Cycles: Disable Branched Path on all GPUs for now, until we separate the cubins.
SM_20 fails now as well, reported by Zanqdo in IRC.
2014-04-03 22:18:40 +02:00
Brecht Van Lommel
a2e4ebd36a Cycles code internals: add CPU kernel support for 3D image textures. 2014-03-29 13:03:48 +01:00
Thomas Dinges
859039f732 Cycles: Raise a proper error message when using Branched Path on sm_30, this is currently still disabled. 2014-03-27 10:29:22 +01:00
Sergey Sharybin
74518b2826 Fix T39420: Cycles viewport/preview flickers, when moving mouse across editors
Issue was caused by the wrong usage of OCIO GLSL binding API. To make it
work properly on pre-GLSL-1.3 drivers shader is to be enabled after the
texture is binded to the opengl context. Otherwise it wouldn't know the
proper texture size.

This is actually a regression in 2.70 and to be ported to 'a'.
2014-03-26 15:58:53 +06:00
Martijn Berger
28c1a860e2 Fix T39247
Changes to interpolation break texture allocation on sm35 and greater.
2014-03-19 07:37:18 +01:00
Martijn Berger
dd2dca2f7e Add support for multiple interpolation modes on cycles image textures
All textures are sampled bi-linear currently with the exception of OSL there texture sampling is fixed and set to smart bi-cubic.

This patch adds user control to this setting.

Added:
- bits to DNA / RNA in the form of an enum for supporting multiple interpolations types
- changes to the image texture node drawing code ( add enum)
- to ImageManager (this needs to know to allocate second texture when interpolation type is different)
- to node compiler (pass on interpolation type)
- to device tex_alloc this also needs to get the concept of multiple interpolation types
- implementation for doing non interpolated lookup for cuda and cpu
- implementation where we pass this along to osl ( this makes OSL also do linear untill I add smartcubic to the interface / DNA/ RNA)

Reviewers: brecht, dingto

Reviewed By: brecht

CC: dingto, venomgfx

Differential Revision: https://developer.blender.org/D317
2014-03-07 23:16:33 +01:00
Martijn Berger
1d01675833 Cuda use streams and async to avoid busywaiting
This switches api usage for cuda towards using more of the Async calls.

Updating only once every second is sufficiently cheap that I don't think it is worth doing it less often.

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D262
2014-03-06 20:51:46 +01:00
Brecht Van Lommel
6b1a4fc66e Cycle CUDA: revert the f1aeb2ccf4 and 84f958754 busywait fixes for now.
It's unclear what kind of impact they have on performance at the moment, so I
rather play it safe and postpone this for 2.71.

Ref T38679, Ref T38712
2014-02-19 16:08:08 +01:00
Martijn Berger
f1aeb2ccf4 this is an attempted Fix: T38679
Cycles GPU Performance Regression

From my testing this (what i should have done in the first place) reduces the regression a lot.
Lets hope it is enough or we have to go back to busy waiting.
2014-02-17 20:11:45 +01:00
Martijn Berger
0f91f56ce3 Cycles Network rendering, remove some exception throwing, replace with saner error handling
This patch adds a network_error() function more alike how other devices handle error's

- it adds a check for errors on load_kernels to make sure we do not crash if rendering without a server.
- it uses the non throwing variation of boost::asio::read.

Reviewers: brecht

Reviewed By: brecht

CC: brecht

Differential Revision: https://developer.blender.org/D86
2014-02-05 21:55:51 +01:00
Martijn Berger
84f9587540 Cuda use streams and async to avoid busywaiting
This is my first stab at this and is based on this IRC converstation:

<mib2berlin> brecht: this is meaning as reminder only, I know you have other things to do > http://openvidia.sourceforge.net/index.php/Optimization_Notes#avoiding_busy_waits
<brecht> mib2berlin: thanks, bookmarked

only tested on Ubuntu 14.04 / cuda 5.0 but ill do some more testing tomorrow.

Also unsure about the placement and the lifetime of the stream and the event. But creating / deleting these seems to incur a non trivial cost.

Reviewers: brecht

Reviewed By: brecht

CC: mib2berlin, dingto

Differential Revision: https://developer.blender.org/D262
2014-01-28 18:40:08 +01:00
Thomas Dinges
de28a4d4b2 Cycles: Add an AVX kernel for CPU rendering.
* AVX is available on Intel Sandy Bridge and newer and AMD Bulldozer and newer.
* We don't use dedicated AVX intrinsics yet, but gcc auto vectorization gives a 3% performance improvement for Caminandes. Tested on an i5-3570, Linux x64.
* No change for Windows yet, MSVC 2008 does not support AVX.

Reviewed by: brecht
Differential Revision: https://developer.blender.org/D216
2014-01-16 17:04:11 +01:00
Brecht Van Lommel
d9e52ac98b Code cleanup: move half float functions to separate header file. 2014-01-15 15:29:22 +01:00
Thomas Dinges
5d88f7c7db Cycles: Build SSE41 kernel per default, remove build option. This hopefully also fixes some compile errors on various systems. 2014-01-14 22:04:32 +01:00
Thomas Dinges
9351ac0d85 Cycles: Skip the compilation of the dedicated SSE2 kernel on x86-64, we can assume SSE2 here, so just re-use the regular one. Saves 500kb in the blender binary.
Reviewed by: brecht
Differential Revision: https://developer.blender.org/D199
2014-01-14 20:39:54 +01:00
Brecht Van Lommel
241fccaf6a Fix T37817: cycles CUDA detection problem on Windows with non-ascii paths. 2014-01-11 00:47:58 +01:00
Thomas Dinges
ce6dce3b13 Code cleanup / Cycles: else/if for SSE41 kernel functions. 2014-01-06 03:22:14 +01:00
Thomas Dinges
ad0a3de3ce Cycles / OpenCL: Let the OpenCL runtime determine its optimal work-group size automatically, by passing a NULL pointer here.
This is recommended in the Intel OpenCL optimization docs (http://software.intel.com/en-us/vcsource/samples/optimizing-opencl) and I can confirm a small performance increase here (1-2% on nVidia OpenCL, up to 8% on Intel OpenCL).
2013-12-24 20:20:57 +01:00
Thomas Dinges
011ae78857 Cycles / OpenCL: Fix compile error on OS X
After update to Mac OS X 10.9.1, OpenCL works now on my Intel CPU in the 2013 Macbook Pro (even the entire kernel).
The Intel Iris Pro GPU still segfaults here though, even when all flags are disabled (building "clay like" kernel only).

Maybe we need the -no-missing-prototypes for AMD hardware still, but I couldn't find a way to distuinguish here.
2013-12-17 09:59:18 +01:00
Martijn Berger
85a0c5d4e1 Cycles: network render code updated for latest changes and improved
This actually works somewhat now, although viewport rendering is broken and any
kind of network error or connection failure will kill Blender.

* Experimental WITH_CYCLES_NETWORK cmake option
* Networked Device is shown as an option next to CPU and GPU Compute
* Various updates to work with the latest Cycles code
* Locks and thread safety for RPC calls and tiles
* Refactored pointer mapping code
* Fix error in CPU brand string retrieval code

This includes work by Doug Gale, Martijn Berger and Brecht Van Lommel.

Reviewers: brecht

Differential Revision: http://developer.blender.org/D36
2013-12-07 12:26:58 +01:00
Martijn Berger
e3a79258d1 Cycles: test code for sse 4.1 kernel and alignment for some vector types.
This is mostly work towards enabling the __KERNEL_SSE__ option to start using
SIMD operations for vector math operations. This 4.1 kernel performes about 8%
faster with that option but overall is still slower than without the option.

WITH_CYCLES_OPTIMIZED_KERNEL_SSE41 is the cmake flag for testing this kernel.

Alignment of int3, int4, float3, float4 to 16 bytes seems to give a slight 1-2%
speedup on tested systems with the current kernel already, so is enabled now.
2013-11-22 14:42:41 +01:00
Campbell Barton
48c1e0c0fc spelling: use American spelling for canceled 2013-10-26 01:06:19 +00:00
Brecht Van Lommel
451607630e Fix #37134: cycles viewport not displaying correct with multi GPU render
and graphics card that does not support CUDA OpenGL interop.
2013-10-18 20:11:07 +00:00
Brecht Van Lommel
9d7567d6ac Fix #37002: cycles viewport render shows white on old graphics cards with no
support for non-power-of-two textures.
2013-10-12 13:55:52 +00:00
Thomas Dinges
b5a5773fa9 Cycles / CUDA:
* Remove support for  CUDA Toolkit 4.x, only Toolkit 5.0 and above are supported now.
* Remove support for sm_1x cards (< Fermi) for good. We didn't officially support those cards for a few releases already, now remove some special code that was still there.
2013-10-08 15:29:28 +00:00
Brecht Van Lommel
cbb783f1d6 Fix cycles OpenCL compile error on AMD, and fix assert in debug builds. 2013-10-02 14:41:04 +00:00
Brecht Van Lommel
31e6181187 Fix #36873: cycles opencl render status show negative sample count. 2013-09-30 12:11:25 +00:00
Brecht Van Lommel
fa352bb749 Fix #35684: cycles unable to use full 6GB of memory on NVidia Titan GPU. We now
use arrays instead of textures for general storage on this card (image textures
are still stored as texture). Textures were found to be faster on older cards,
but the limits on 1D texture size have not increased along with the memory size,
which meant that the full 6 GB could not be used.

The performance actually seems to be slightly better with arrays in some tests
on Titan. For older cards there seems to be a bit of a mix, some are better and
others not. We may change those to use arrays too, but more testing is needed,
only Titan and Tesla K20 (sm_35) is changed for now.

The fact that arrays are faster is a bit surprising, as others found textures
to be faster on Kepler. However even if they were, the memory limitation is
more important to solve anyway.
https://research.nvidia.com/publication/understanding-efficiency-ray-traversal-gpus-kepler-and-fermi-addendum
2013-09-27 19:09:31 +00:00