563 Commits

Author SHA1 Message Date
Campbell Barton
be40389165 Merge branch 'master' into blender2.8 2018-01-03 23:44:47 +11:00
Brecht Van Lommel
c621832d3d Cycles: CUDA support for rendering scenes that don't fit on GPU.
In that case it can now fall back to CPU memory, at the cost of reduced
performance. For scenes that fit in GPU memory, this commit should not
cause any noticeable slowdowns.

We don't use all physical system RAM, since that can cause OS instability.
We leave at least half of system RAM or 4GB to other software, whichever
is smaller.

For image textures in host memory, performance was about 20-30% slower
in our tests (although this is highly hardware and scene dependent). Once
other types of data no longer fit on the GPU, performance can be e.g. 10x
slower, and at that point it's probably better to just render on the CPU.

Differential Revision: https://developer.blender.org/D2056
2018-01-02 23:50:18 +01:00
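
The host memory rule in c621832d3d (leave half of system RAM or 4GB to other software, whichever is smaller) can be sketched as below. This is a minimal illustration, not the Cycles code: the helper name `mappable_host_limit` and the byte-based interface are assumptions made for the example.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def mappable_host_limit(system_ram_bytes):
    """Hypothetical helper: host RAM that could be mapped for GPU use."""
    # Reserve half of system RAM or 4 GB for other software, whichever is smaller.
    reserved = min(system_ram_bytes // 2, 4 * GIB)
    return system_ram_bytes - reserved


# On a small machine half of RAM is reserved; on a large one only 4 GB is.
small = mappable_host_limit(4 * GIB)
large = mappable_host_limit(32 * GIB)
```

Under this reading, a machine with 4 GB of RAM keeps 2 GB for the rest of the system, while a machine with 32 GB keeps only 4 GB.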
Brecht Van Lommel
6699454fb6 Cycles: make CUDA code a bit more robust to host/device alloc failures.
Fixes a few corner cases found while stress testing host mapped memory.
2018-01-02 23:46:19 +01:00
Sergey Sharybin
9f0d067c2e Merge branch 'master' into blender2.8 2017-12-21 11:17:34 +01:00
Sergey Sharybin
5650fe77e4 Cycles: Cleanup, indentation 2017-12-20 17:42:50 +01:00
Campbell Barton
7ca8af4cc8 Merge branch 'master' into blender2.8 2017-12-06 16:51:37 +11:00
Lukas Stockner
2069102c56 Cycles: Fix constness for load_kernels in device_cpu.cpp 2017-12-06 00:00:18 +01:00
Campbell Barton
03a5eccc94 Merge branch 'master' into blender2.8 2017-11-30 18:30:41 +11:00
Lukas Stockner
fa3d50af95 Cycles: Improve denoising speed on GPUs with small tile sizes
Previously, the NLM kernels would be launched once per offset with one thread per pixel.
However, with the smaller tile sizes that are now feasible, there wasn't enough work to fully occupy GPUs, which resulted in a significant slowdown.

Therefore, the kernels are now launched in a single call that handles all offsets at once.
This has two downsides: memory accesses to accumulating buffers are now atomic, and more importantly, the temporary memory now has to be allocated for every shift at once, increasing the required memory.
On the other hand, of course, the smaller tiles significantly reduce the amount of memory needed.

The main bottleneck right now is the construction of the transformation. There is nothing to be parallelized there; one thread per pixel is the maximum.
I tried to parallelize the SVD implementation by storing the matrix in shared memory and launching one block per pixel, but that wasn't really going anywhere.

To make the new code somewhat readable, the handling of rectangular regions was cleaned up a bit and commented, so it should be easier to understand what's going on now.
Also, some variables have been renamed to make the difference between buffer width and stride more apparent, in addition to some general style cleanup.
2017-11-30 07:37:08 +01:00
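
The trade-off described in fa3d50af95 (one launch per offset versus a single launch over all offsets) can be modelled on the CPU as below. This is a rough sketch only: the function names, the 1D "image" and the squared-difference weight are all assumptions for illustration and are not the actual NLM kernels.

```python
def launch_per_offset(pixels, offsets):
    """Old scheme: one 'launch' per offset, one 'thread' per pixel."""
    n = len(pixels)
    accum = [0.0] * n
    launches = 0
    for off in offsets:                      # one kernel launch per offset
        launches += 1
        temp = [0.0] * n                     # temporary buffer, reused per offset
        for i in range(n):                   # one thread per pixel
            j = min(max(i + off, 0), n - 1)  # clamp the shifted index
            temp[i] = (pixels[i] - pixels[j]) ** 2
        for i in range(n):
            accum[i] += temp[i]
    return accum, launches, n                # peak temporary size: n


def launch_all_offsets(pixels, offsets):
    """New scheme: a single 'launch' covering every (offset, pixel) pair."""
    n = len(pixels)
    accum = [0.0] * n
    temp = [0.0] * (n * len(offsets))        # temporary memory for every shift at once
    for k, off in enumerate(offsets):        # conceptually all run in parallel
        for i in range(n):
            j = min(max(i + off, 0), n - 1)
            temp[k * n + i] = (pixels[i] - pixels[j]) ** 2
    for k in range(len(offsets)):
        for i in range(n):
            # On a GPU several threads would hit accum[i], so this add must be atomic.
            accum[i] += temp[k * n + i]
    return accum, 1, len(temp)


pixels = [1.0, 2.0, 4.0, 8.0]
offsets = [-1, 0, 1]
old_result, old_launches, old_temp = launch_per_offset(pixels, offsets)
new_result, new_launches, new_temp = launch_all_offsets(pixels, offsets)
```

Both versions accumulate the same values. The second uses one launch instead of one per offset, at the cost of a temporary buffer that grows with the number of offsets.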
Brecht Van Lommel
d992240bfa Fix unneeded legacy OpenGL call in Cycles viewport drawing. 2017-11-24 00:12:48 +01:00
Julian Eisel
7f96323cd0 Merge branch 'master' into blender2.8 2017-11-19 13:16:14 +01:00
Lukas Stockner
40f528a7da Cycles: Add per-tile render time debug pass
Reviewers: sergey, brecht

Differential Revision: https://developer.blender.org/D2920
2017-11-17 16:40:24 +01:00
Dalai Felinto
1cb6cea71c Merge remote-tracking branch 'origin/master' into blender2.8 2017-11-13 11:48:48 -02:00
Brecht Van Lommel
e568c1a975 Fix T53289: CUDA missing textures not showing pink, after recent changes. 2017-11-12 20:45:47 +01:00
Mai Lavelle
e389ae9dca Cycles: Set error if a split kernel fails to load
To help catch cases where adding a new kernel is missed for one of the
device implementations.
2017-11-11 01:01:14 -05:00
Bastien Montagne
7a6ad2901c Merge branch 'master' into blender2.8 2017-11-10 10:13:19 +01:00
Brecht Van Lommel
bd4bea3e98 Cycles: avoid reallocating tile denoising memory many times during render. 2017-11-09 20:28:00 +01:00
Sergey Sharybin
c99481b632 Merge branch 'master' into blender2.8 2017-11-09 10:59:15 +01:00
Mai Lavelle
087331c495 Cycles: Replace __MAX_CLOSURE__ build option with runtime integrator variable
Goal is to reduce OpenCL kernel recompilations.

Currently viewport renders are still set to use 64 closures as this seems to
be faster and we don't want to cause a performance regression there. Needs
to be investigated.

Reviewed By: brecht

Differential Revision: https://developer.blender.org/D2775
2017-11-09 01:04:06 -05:00
Brecht Van Lommel
7b1d707481 Merge branch 'master' into blender2.8 2017-11-08 00:20:59 +01:00
Brecht Van Lommel
f79f386731 Code refactor: rename subsurface to local traversal, for reuse. 2017-11-07 22:35:12 +01:00
Brecht Van Lommel
ff34e48911 Cycles: add an extra CUDA synchronize before rendering.
It should not be needed as far as I know, but just in case it fixes any
of the recent issues like T52572.
2017-11-07 22:35:12 +01:00
Bastien Montagne
91af8f2ae2 Merge branch 'master' into blender2.8
Conflicts:
	intern/cycles/device/device.cpp
	source/blender/blenkernel/intern/library.c
	source/blender/blenkernel/intern/material.c
	source/blender/editors/object/object_add.c
	source/blender/editors/object/object_relations.c
	source/blender/editors/space_outliner/outliner_draw.c
	source/blender/editors/space_outliner/outliner_edit.c
	source/blender/editors/space_view3d/drawobject.c
	source/blender/editors/util/ed_util.c
	source/blender/windowmanager/intern/wm_files_link.c
2017-11-06 18:02:46 +01:00
Brecht Van Lommel
5801ef71e4 Code refactor: device memory cleanups, preparing for mapped host memory. 2017-11-05 15:22:04 +01:00
Brecht Van Lommel
5475314f49 Cycles: reserve CUDA local memory ahead of time.
This way we can log the amount of memory used, and it will be important
for host mapped memory support.
2017-11-05 15:22:04 +01:00
Campbell Barton
d4fe083b35 Merge branch 'master' into blender2.8 2017-11-04 21:45:52 +11:00
Brecht Van Lommel
33b5e8daff Code refactor: replace CUDA array with linear memory for 1D and 2D textures.
This is a prerequisite for getting host memory allocation to work. There appears
to be no support for 3D textures using host memory. The original version of
this code was written by Stefan Werner for D2056.
2017-11-04 02:23:00 +01:00
Brecht Van Lommel
6ec599c682 Fix T53247: mixed CPU + GPU render wrong texture limits. 2017-11-03 20:32:29 +01:00
Campbell Barton
7eb4ef6cac Merge branch 'master' into blender2.8 2017-11-03 00:31:47 +11:00
Mai Lavelle
5cb8730689 Cycles: Add another limit to OpenCL memory usage
Some drivers may report very large allocation sizes, which could cause
unnecessary memory usage. This is now limited to 2 GB, which should
still be enough to get the needed performance benefits without waste.
2017-11-02 08:14:21 -04:00
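
The limit in 5cb8730689 amounts to clamping a driver-reported size. A minimal sketch, assuming the value is in bytes; the name `clamp_alloc_size` is hypothetical, and the commit message does not say exactly which Cycles quantity is clamped.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def clamp_alloc_size(reported_bytes, cap_bytes=2 * GIB):
    """Hypothetical helper: cap a driver-reported allocation size."""
    # Very large reported sizes are cut down to the cap; smaller ones pass through.
    return min(reported_bytes, cap_bytes)


capped = clamp_alloc_size(16 * GIB)
unchanged = clamp_alloc_size(1 * GIB)
```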
Sergey Sharybin
1e107fa514 Merge branch 'master' into blender2.8 2017-10-25 10:13:35 +02:00
Brecht Van Lommel
83877632a3 Fix one more assert being triggered due to recent changes. 2017-10-25 01:22:16 +02:00
Brecht Van Lommel
34fe3f9c06 Code refactor: remove MEM_WRITE_ONLY, always use MEM_READ_WRITE.
It's unlikely the driver can do useful optimizations with this, and if
we sum multiple samples we are reading from the memory anyway.
2017-10-24 23:53:09 +02:00
Brecht Van Lommel
ec49503a33 Fix T53146: incomplete multi GPU and CPU + GPU memory statistics.
Partly due to recent changes, partly an old bug.
2017-10-24 17:40:43 +02:00
Sergey Sharybin
7ea7fd45d0 Merge branch 'master' into blender2.8 2017-10-24 12:19:48 +02:00
Sergey Sharybin
e03df90bf3 Cycles: Fix compilation in debug mode
Please check compilation before committing refactor changes!
2017-10-24 12:09:02 +02:00
Sergey Sharybin
eccd18a91f Cycles: Fix compilation error without C++11 2017-10-24 11:14:01 +02:00
Brecht Van Lommel
a1aad1f8d1 Fix T53134: denoising with CPU + GPU render leaves some tiles noisy. 2017-10-24 04:09:48 +02:00
Brecht Van Lommel
f5456df095 Merge branch 'master' into blender2.8 2017-10-24 02:05:41 +02:00
Brecht Van Lommel
070a668d04 Code refactor: move more memory allocation logic into device API.
* Remove tex_* and pixels_* functions, replace by mem_*.
* Add MEM_TEXTURE and MEM_PIXELS as memory types recognized by devices.
* No longer create device_memory and call mem_* directly, always go
  through device_only_memory, device_vector and device_pixels.
2017-10-24 01:25:19 +02:00
Brecht Van Lommel
aa8b4c5d81 Code refactor: use device_only_memory and device_vector in more places. 2017-10-24 01:25:13 +02:00
Brecht Van Lommel
7ad9333fad Code refactor: store device/interp/extension/type in each device_memory. 2017-10-24 01:03:59 +02:00
Brecht Van Lommel
ae41f38f78 Code refactor: pass device to scene, check OSL with device info. 2017-10-24 01:03:59 +02:00
Julian Eisel
147f9585db Merge branch 'master' into blender2.8 2017-10-23 00:04:20 +02:00
Brecht Van Lommel
57a0cb797d Code refactor: avoid some unnecessary device memory copying. 2017-10-21 20:58:28 +02:00
Brecht Van Lommel
dc9eb8234f Cycles: combined CPU + GPU rendering support.
CPU rendering will be restricted to a BVH2, which is not ideal for raytracing
performance but can be shared with the GPU. Decoupled volume shading will be
disabled to match GPU volume sampling.

The number of CPU rendering threads is reduced to leave one core dedicated to
each GPU. Viewport rendering will still use only the GPU. So along
with the BVH2 usage, perfect scaling should not be expected.

Go to User Preferences > System to enable the CPU to render alongside the GPU.

Differential Revision: https://developer.blender.org/D2873
2017-10-21 20:13:44 +02:00
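
The thread reduction described in dc9eb8234f (one core left for each GPU) can be sketched as below. The function name `cpu_render_threads` is hypothetical, and flooring the result at zero is an assumption; the commit message does not say what happens when there are more GPUs than cores.

```python
def cpu_render_threads(cpu_threads, num_gpus):
    """Hypothetical helper: CPU render threads left after reserving GPU cores."""
    # Reserve one core per GPU; the floor at zero is an assumption of this sketch.
    return max(cpu_threads - num_gpus, 0)


# An 8-thread CPU rendering alongside two GPUs.
threads = cpu_render_threads(8, 2)
```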
Sergey Sharybin
0f8a57de68 Merge branch 'master' into blender2.8 2017-10-19 13:58:01 +02:00
Sergey Sharybin
910dd7fb1b Cycles: Add extra logging in CUDA device detection code 2017-10-19 11:26:10 +02:00
Campbell Barton
54f9a6e5da Merge branch 'master' into blender2.8 2017-10-18 16:40:31 +11:00
Brecht Van Lommel
92611dada6 Fix T53098, T53079: OpenCL world texture errors after recent changes. 2017-10-18 03:13:25 +02:00