2023-08-16 00:20:26 +10:00
|
|
|
/* SPDX-FileCopyrightText: 2022-2023 Blender Authors
|
2023-06-15 13:09:04 +10:00
|
|
|
*
|
|
|
|
|
* SPDX-License-Identifier: Apache-2.0 */
|
|
|
|
|
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
/** \file
|
|
|
|
|
* \ingroup mikktspace
|
|
|
|
|
*/
|
|
|
|
|
|
|
|
|
|
#pragma once
|
|
|
|
|
|
|
|
|
|
#include <cassert>
|
2025-01-26 20:08:09 +01:00
|
|
|
#include <cfloat>
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
#include <cmath>
|
2025-01-26 20:08:09 +01:00
|
|
|
#include <vector>
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
|
|
|
|
|
#ifndef M_PI_F
|
|
|
|
|
# define M_PI_F (3.1415926535897932f) /* pi */
|
|
|
|
|
#endif
|
|
|
|
|
|
|
|
|
|
namespace mikk {
|
|
|
|
|
|
|
|
|
|
inline bool not_zero(const float fX)
|
|
|
|
|
{
|
|
|
|
|
return fabsf(fX) > FLT_MIN;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* Helpers for (un)packing a 2-bit vertex index and a 30-bit face index to one integer. */
|
|
|
|
|
static uint pack_index(const uint face, const uint vert)
|
|
|
|
|
{
|
|
|
|
|
assert((vert & 0x3) == vert);
|
|
|
|
|
return (face << 2) | (vert & 0x3);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static void unpack_index(uint &face, uint &vert, const uint indexIn)
|
|
|
|
|
{
|
|
|
|
|
vert = indexIn & 0x3;
|
|
|
|
|
face = indexIn >> 2;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* From intern/cycles/util/math_fast.h */
|
|
|
|
|
inline float fast_acosf(float x)
|
|
|
|
|
{
|
|
|
|
|
const float f = fabsf(x);
|
|
|
|
|
/* clamp and crush denormals. */
|
|
|
|
|
const float m = (f < 1.0f) ? 1.0f - (1.0f - f) : 1.0f;
|
|
|
|
|
/* Based on http://www.pouet.net/topic.php?which=9132&page=2
|
2023-02-07 14:17:01 +11:00
|
|
|
* 85% accurate (ULP 0)
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
* Examined 2130706434 values of acos:
|
2023-02-07 14:17:01 +11:00
|
|
|
* 15.2000597 avg ULP diff, 4492 max ULP, 4.51803e-05 max error // without "denormal crush"
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
* Examined 2130706434 values of acos:
|
2023-02-07 14:17:01 +11:00
|
|
|
* 15.2007108 avg ULP diff, 4492 max ULP, 4.51803e-05 max error // with "denormal crush"
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
*/
|
|
|
|
|
const float a = sqrtf(1.0f - m) *
|
|
|
|
|
(1.5707963267f + m * (-0.213300989f + m * (0.077980478f + m * -0.02164095f)));
|
|
|
|
|
return x < 0 ? M_PI_F - a : a;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static uint rotl(uint x, uint k)
|
|
|
|
|
{
|
|
|
|
|
return (x << k) | (x >> (32 - k));
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static uint hash_uint3(uint kx, uint ky, uint kz)
|
|
|
|
|
{
|
|
|
|
|
uint a, b, c;
|
|
|
|
|
a = b = c = 0xdeadbeef + (2 << 2) + 13;
|
|
|
|
|
|
|
|
|
|
c += kz;
|
|
|
|
|
b += ky;
|
|
|
|
|
a += kx;
|
|
|
|
|
|
|
|
|
|
c = (c ^ b) - rotl(b, 14);
|
|
|
|
|
a = (a ^ c) - rotl(c, 11);
|
|
|
|
|
b = (b ^ a) - rotl(a, 25);
|
|
|
|
|
c = (c ^ b) - rotl(b, 16);
|
|
|
|
|
|
|
|
|
|
return c;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static uint hash_uint3_fast(const uint x, const uint y, const uint z)
|
|
|
|
|
{
|
|
|
|
|
return (x * 73856093) ^ (y * 19349663) ^ (z * 83492791);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static uint float_as_uint(const float v)
|
|
|
|
|
{
|
|
|
|
|
return *((uint *)(&v));
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static float uint_as_float(const uint v)
|
|
|
|
|
{
|
|
|
|
|
return *((float *)(&v));
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static uint hash_float3_fast(const float x, const float y, const float z)
|
|
|
|
|
{
|
|
|
|
|
return hash_uint3_fast(float_as_uint(x), float_as_uint(y), float_as_uint(z));
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static uint hash_float3x3(const float3 &x, const float3 &y, const float3 &z)
|
|
|
|
|
{
|
|
|
|
|
return hash_uint3(hash_float3_fast(x.x, x.y, x.z),
|
|
|
|
|
hash_float3_fast(y.x, y.y, y.z),
|
|
|
|
|
hash_float3_fast(z.x, z.y, z.z));
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
template<typename T, typename KeyGetter>
|
|
|
|
|
void radixsort(std::vector<T> &data, std::vector<T> &data2, KeyGetter getKey)
|
|
|
|
|
{
|
2025-01-26 20:08:09 +01:00
|
|
|
using key_t = decltype(getKey(data[0]));
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
constexpr size_t datasize = sizeof(key_t);
|
|
|
|
|
static_assert(datasize % 2 == 0);
|
2025-01-26 20:08:09 +01:00
|
|
|
static_assert(std::is_integral_v<key_t>);
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
|
2022-09-09 15:03:04 +02:00
|
|
|
uint bins[datasize][257] = {{0}};
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
|
|
|
|
|
/* Count number of elements per bin. */
|
|
|
|
|
for (const T &item : data) {
|
|
|
|
|
key_t key = getKey(item);
|
2025-01-26 20:08:09 +01:00
|
|
|
for (uint pass = 0; pass < datasize; pass++) {
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
bins[pass][((key >> (8 * pass)) & 0xff) + 1]++;
|
2025-01-26 20:08:09 +01:00
|
|
|
}
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* Compute prefix sum to find position of each bin in the sorted array. */
|
|
|
|
|
for (uint pass = 0; pass < datasize; pass++) {
|
|
|
|
|
for (uint i = 2; i < 256; i++) {
|
|
|
|
|
bins[pass][i] += bins[pass][i - 1];
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
int shift = 0;
|
|
|
|
|
for (uint pass = 0; pass < datasize; pass++, shift += 8) {
|
|
|
|
|
/* Insert the elements in their correct location based on their bin. */
|
|
|
|
|
for (const T &item : data) {
|
|
|
|
|
uint pos = bins[pass][(getKey(item) >> shift) & 0xff]++;
|
|
|
|
|
data2[pos] = item;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/* Swap arrays. */
|
|
|
|
|
std::swap(data, data2);
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
static void float_add_atomic(float *val, float add)
|
|
|
|
|
{
|
2022-09-23 14:33:43 +10:00
|
|
|
/* Hacky, but atomic floats are only supported from C++20 onward.
|
|
|
|
|
* This works in practice since `std::atomic<uint32_t>` is really just an `uint32_t` in memory,
|
Mikktspace: Optimized port to C++
This commit is a big overhaul to the Mikktspace module, which is used
to compute tangents. I'm not calling it a rewrite since it's the
result of a lot of iterations on the original code, but pretty much
everything is reworked somehow.
Overall goal was to a) make it faster and b) make it maintainable.
Notable changes:
- Since the callbacks for requesting geometry data were a big
bottleneck before, I've ported it to C++ and made it header-only,
templating on the data source. That way, the compiler generates code
specific to the caller, which allows it to inline the data source and
specialize for some cases (e.g. subd vs. non-subd in Cycles).
- The one input parameter, an optional angle threshold, was not used
anywhere. Turns out that removing it allows for considerable
algorithmic simplification, removing a lot of the complexity in the
later stages. Therefore, I've just removed the option in the new code.
- The code computes several outputs, but only one (the tangent itself)
is ever used in Blender. Therefore, I've removed the others to
simplify the code. They could easily be brought back if needed, none
of the algorithmic simplifications are conflicting with them.
- The original code had fallback paths for many steps in case temporary
memory allocation fails, but that never actually gets used anyways
since malloc() doesn't really ever return NULL in practise, so I
removed them.
- In general, I've restructured A LOT of the code to make the
algorithms clearer and make use of some C++ features (vectors,
std::array, booleans, classes), though there's still some of cleanup
that could be done.
- Parallelized duplicate detection, neighbor detection, triangle
tangent computation, degenerate triangle handling and tangent space
accumulation.
- Replaced several algorithms with faster equivalents: Duplicate
detection uses a (concurrent) hash set now, neighbor detection uses
Radixsort and splits vertices by index pairs etc.
As for results, the exact speedup depends on the scene of course, but
let's consider the file from T97378:
- Blender 3.1 (before D14675): 6.07sec
- Blender 3.2 (with D14675): 4.62sec
- rBf0a36599007d (last nightly build): 4.42sec
- With this commit: 0.90sec
This speedup will mostly be noticed at the start of Cycles renders and,
even more importantly, in Eevee when doing something that changes the
geometry (e.g. animating) on a model using normal maps.
Differential Revision: https://developer.blender.org/D15589
2022-09-03 17:21:44 +02:00
|
|
|
* so this cast lets us do a 32-bit CAS operation (which is used to build the atomic float
|
|
|
|
|
* operation) without needing any external libraries or compiler-specific builtins. */
|
|
|
|
|
std::atomic<uint32_t> *atomic_val = reinterpret_cast<std::atomic<uint32_t> *>(val);
|
|
|
|
|
for (;;) {
|
|
|
|
|
uint32_t old_v = atomic_val->load();
|
|
|
|
|
uint32_t new_v = float_as_uint(uint_as_float(old_v) + add);
|
|
|
|
|
if (atomic_val->compare_exchange_weak(old_v, new_v)) {
|
|
|
|
|
return;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
} // namespace mikk
|