/* stb_image_resize2 - v2.12 - public domain image resizing
2 by Jeff Roberts (v2) and Jorge L Rodriguez
3 http://github.com/nothings/stb
  Can be threaded with the extended API. SSE2, AVX, Neon and WASM SIMD support. Only
  scaling and translation are supported, no rotations or shears.
8 COMPILING & LINKING
9 In one C/C++ file that #includes this file, do this:
10 #define STB_IMAGE_RESIZE_IMPLEMENTATION
11 before the #include. That will create the implementation in that file.
13 EASY API CALLS:
14 Easy API downsamples w/Mitchell filter, upsamples w/cubic interpolation, clamps to edge.
16 stbir_resize_uint8_srgb( input_pixels, input_w, input_h, input_stride_in_bytes,
17 output_pixels, output_w, output_h, output_stride_in_bytes,
18 pixel_layout_enum )
20 stbir_resize_uint8_linear( input_pixels, input_w, input_h, input_stride_in_bytes,
21 output_pixels, output_w, output_h, output_stride_in_bytes,
22 pixel_layout_enum )
24 stbir_resize_float_linear( input_pixels, input_w, input_h, input_stride_in_bytes,
25 output_pixels, output_w, output_h, output_stride_in_bytes,
26 pixel_layout_enum )
28 If you pass NULL or zero for the output_pixels, we will allocate the output buffer
29 for you and return it from the function (free with free() or STBIR_FREE).
  As a special case, XX_stride_in_bytes of 0 means packed contiguously in memory.
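As a rough illustration of the stride convention above, the sketch below resolves a stride of 0 to the packed row size. The helper name effective_stride_in_bytes and its parameters are hypothetical and not part of this library; it only mirrors the rule described here.

```c
// Hypothetical helper (not part of stb_image_resize2): resolve the
// "0 means packed" stride convention for one image.
//   width           - image width in pixels
//   channels        - channels per pixel (for example 4 for RGBA)
//   bytes_per_chan  - 1 for uint8 data, 4 for float data
//   stride_in_bytes - caller-supplied stride, 0 meaning packed rows
static int effective_stride_in_bytes( int width, int channels, int bytes_per_chan, int stride_in_bytes )
{
  if ( stride_in_bytes != 0 )
    return stride_in_bytes;                    // explicit stride, used as given
  return width * channels * bytes_per_chan;    // packed: rows follow each other directly
}
```

For a 100 pixel wide RGBA uint8 image, a stride of 0 would be treated like a stride of 400 bytes under this rule.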
32 API LEVELS
33 There are three levels of API - easy-to-use, medium-complexity and extended-complexity.
35 See the "header file" section of the source for API documentation.
37 ADDITIONAL DOCUMENTATION
39 MEMORY ALLOCATION
40 By default, we use malloc and free for memory allocation. To override the
41 memory allocation, before the implementation #include, add a:
43 #define STBIR_MALLOC(size,user_data) ...
44 #define STBIR_FREE(ptr,user_data) ...
46 Each resize makes exactly one call to malloc/free (unless you use the
47 extended API where you can do one allocation for many resizes). Under
48 address sanitizer, we do separate allocations to find overread/writes.
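A minimal sketch of what such an override might look like, assuming you want to count live allocations. The names my_malloc, my_free and my_live_allocations are hypothetical; only the two macro names and their (size,user_data) / (ptr,user_data) shapes come from the text above.

```c
#include <stdlib.h>

// Hypothetical counting allocator, used only to illustrate the macro shape.
static int my_live_allocations = 0;

static void * my_malloc( size_t size, void * user_data )
{
  (void) user_data;              // unused in this sketch
  ++my_live_allocations;
  return malloc( size );
}

static void my_free( void * ptr, void * user_data )
{
  (void) user_data;
  if ( ptr ) --my_live_allocations;
  free( ptr );
}

// These two defines must appear before the implementation #include.
#define STBIR_MALLOC(size,user_data) my_malloc( (size), (user_data) )
#define STBIR_FREE(ptr,user_data)    my_free( (ptr), (user_data) )
```

Defining both macros together keeps allocation and release paired with the same allocator.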
50 PERFORMANCE
51 This library was written with an emphasis on performance. When testing
52 stb_image_resize with RGBA, the fastest mode is STBIR_4CHANNEL with
53 STBIR_TYPE_UINT8 pixels and CLAMPed edges (which is what many other resize
54 libs do by default). Also, make sure SIMD is turned on of course (default
55 for 64-bit targets). Avoid WRAP edge mode if you want the fastest speed.
57 This library also comes with profiling built-in. If you define STBIR_PROFILE,
58 you can use the advanced API and get low-level profiling information by
59 calling stbir_resize_extended_profile_info() or stbir_resize_split_profile_info()
60 after a resize.
62 SIMD
63 Most of the routines have optimized SSE2, AVX, NEON and WASM versions.
65 On Microsoft compilers, we automatically turn on SIMD for 64-bit x64 and
66 ARM; for 32-bit x86 and ARM, you select SIMD mode by defining STBIR_SSE2 or
67 STBIR_NEON. For AVX and AVX2, we auto-select it by detecting the /arch:AVX
68 or /arch:AVX2 switches. You can also always manually turn SSE2, AVX or AVX2
69 support on by defining STBIR_SSE2, STBIR_AVX or STBIR_AVX2.
  On Linux, SSE2 and Neon are on by default for 64-bit x64 or ARM64. For 32-bit,
72 we select x86 SIMD mode by whether you have -msse2, -mavx or -mavx2 enabled
73 on the command line. For 32-bit ARM, you must pass -mfpu=neon-vfpv4 for both
74 clang and GCC, but GCC also requires an additional -mfp16-format=ieee to
75 automatically enable NEON.
77 On x86 platforms, you can also define STBIR_FP16C to turn on FP16C instructions
78 for converting back and forth to half-floats. This is autoselected when we
79 are using AVX2. Clang and GCC also require the -mf16c switch. ARM always uses
80 the built-in half float hardware NEON instructions.
82 You can also tell us to use multiply-add instructions with STBIR_USE_FMA.
83 Because x86 doesn't always have fma, we turn it off by default to maintain
84 determinism across all platforms. If you don't care about non-FMA determinism
85 and are willing to restrict yourself to more recent x86 CPUs (around the AVX
86 timeframe), then fma will give you around a 15% speedup.
88 You can force off SIMD in all cases by defining STBIR_NO_SIMD. You can turn
89 off AVX or AVX2 specifically with STBIR_NO_AVX or STBIR_NO_AVX2. AVX is 10%
90 to 40% faster, and AVX2 is generally another 12%.
92 ALPHA CHANNEL
93 Most of the resizing functions provide the ability to control how the alpha
94 channel of an image is processed.
96 When alpha represents transparency, it is important that when combining
97 colors with filtering, the pixels should not be treated equally; they
98 should use a weighted average based on their alpha values. For example,
99 if a pixel is 1% opaque bright green and another pixel is 99% opaque
  black and you average them, the average will be 50% opaque, but the
  unweighted average will be a middling green color, while the weighted
102 average will be nearly black. This means the unweighted version introduced
103 green energy that didn't exist in the source image.
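The green/black example above can be worked through numerically. The following is only a sketch on a single color channel plus alpha; the type pixel_ca and both function names are made up for illustration and are not how the resizer is implemented.

```c
// One color channel plus alpha, both in the 0..1 range.
typedef struct { float color; float alpha; } pixel_ca;

// Plain average: color and alpha are averaged independently.
static pixel_ca average_unweighted( pixel_ca a, pixel_ca b )
{
  pixel_ca r;
  r.color = ( a.color + b.color ) * 0.5f;
  r.alpha = ( a.alpha + b.alpha ) * 0.5f;
  return r;
}

// Alpha-weighted average: multiply color by alpha, combine, then divide
// by the combined alpha at the end.
static pixel_ca average_alpha_weighted( pixel_ca a, pixel_ca b )
{
  pixel_ca r;
  float premult = ( a.color * a.alpha + b.color * b.alpha ) * 0.5f;
  r.alpha = ( a.alpha + b.alpha ) * 0.5f;
  r.color = ( r.alpha > 0.0f ) ? ( premult / r.alpha ) : 0.0f;
  return r;
}
```

With a green channel of 1.0 at alpha 0.01 and a green channel of 0.0 at alpha 0.99, both averages give an alpha of about 0.5, but the unweighted green is about 0.5 while the weighted green is about 0.01.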
105 (If you want to know why this makes sense, you can work out the math for
106 the following: consider what happens if you alpha composite a source image
107 over a fixed color and then average the output, vs. if you average the
108 source image pixels and then composite that over the same fixed color.
109 Only the weighted average produces the same result as the ground truth
110 composite-then-average result.)
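The thought experiment above can also be checked in code. This sketch uses one color channel and two source pixels; all function names are hypothetical and the compositing formula is the usual "over" operator for non-premultiplied colors on an opaque background.

```c
// Composite one channel of a source pixel over an opaque background value.
static float composite_over( float color, float alpha, float background )
{
  return color * alpha + background * ( 1.0f - alpha );
}

// Ground truth: composite each pixel first, then average the results.
static float composite_then_average( float c0, float a0, float c1, float a1, float bg )
{
  return ( composite_over( c0, a0, bg ) + composite_over( c1, a1, bg ) ) * 0.5f;
}

// Average first (alpha-weighted or not), then composite the single result.
static float average_then_composite( float c0, float a0, float c1, float a1, float bg, int alpha_weighted )
{
  float alpha = ( a0 + a1 ) * 0.5f;
  float color;
  if ( alpha_weighted )
    color = ( a0 + a1 > 0.0f ) ? ( c0 * a0 + c1 * a1 ) / ( a0 + a1 ) : 0.0f;
  else
    color = ( c0 + c1 ) * 0.5f;
  return composite_over( color, alpha, bg );
}
```

For the 1% green and 99% black pixels over a background value of 0.25, the ground truth and the alpha-weighted path agree (about 0.13), while the unweighted path comes out noticeably brighter.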
112 Therefore, it is in general best to "alpha weight" the pixels when applying
113 filters to them. This essentially means multiplying the colors by the alpha
114 values before combining them, and then dividing by the alpha value at the
115 end.
117 The computer graphics industry introduced a technique called "premultiplied
118 alpha" or "associated alpha" in which image colors are stored in image files
119 already multiplied by their alpha. This saves some math when compositing,
120 and also avoids the need to divide by the alpha at the end (which is quite
121 inefficient). However, while premultiplied alpha is common in the movie CGI
122 industry, it is not commonplace in other industries like videogames, and most
123 consumer file formats are generally expected to contain not-premultiplied
124 colors. For example, Photoshop saves PNG files "unpremultiplied", and web
125 browsers like Chrome and Firefox expect PNG images to be unpremultiplied.
127 Note that there are three possibilities that might describe your image
128 and resize expectation:
130 1. images are not premultiplied, alpha weighting is desired
131 2. images are not premultiplied, alpha weighting is not desired
132 3. images are premultiplied
134 Both case #2 and case #3 require the exact same math: no alpha weighting
135 should be applied or removed. Only case 1 requires extra math operations;
136 the other two cases can be handled identically.
138 stb_image_resize expects case #1 by default, applying alpha weighting to
139 images, expecting the input images to be unpremultiplied. This is what the
140 COLOR+ALPHA buffer types tell the resizer to do.
  When you use the pixel layouts STBIR_RGBA, STBIR_BGRA, STBIR_ARGB,
  STBIR_ABGR, STBIR_RA, or STBIR_AR, you are telling us that the pixels are
144 non-premultiplied. In these cases, the resizer will alpha weight the colors
145 (effectively creating the premultiplied image), do the filtering, and then
146 convert back to non-premult on exit.
  When you use the pixel layouts STBIR_RGBA_PM, STBIR_BGRA_PM, STBIR_ARGB_PM,
  STBIR_ABGR_PM, STBIR_RA_PM or STBIR_AR_PM, you are telling us that the pixels
  ARE premultiplied. In this case, the resizer doesn't have to do the
  premultiplying - it can filter directly on the input. This is about twice as
  fast as the non-premultiplied case, so it's the right option if your data is
  already set up correctly.
155 When you use the pixel layout STBIR_4CHANNEL or STBIR_2CHANNEL, you are
156 telling us that there is no channel that represents transparency; it may be
157 RGB and some unrelated fourth channel that has been stored in the alpha
158 channel, but it is actually not alpha. No special processing will be
159 performed.
161 The difference between the generic 4 or 2 channel layouts, and the
162 specialized _PM versions is with the _PM versions you are telling us that
163 the data *is* alpha, just don't premultiply it. That's important when
164 using SRGB pixel formats, we need to know where the alpha is, because
165 it is converted linearly (rather than with the SRGB converters).
167 Because alpha weighting produces the same effect as premultiplying, you
168 even have the option with non-premultiplied inputs to let the resizer
  produce a premultiplied output. Because the initially computed alpha-weighted
170 output image is effectively premultiplied, this is actually more performant
171 than the normal path which un-premultiplies the output image as a final step.
  Finally, when converting both in and out of non-premultiplied space (for
  example, when using STBIR_RGBA), we go to somewhat heroic measures to
  ensure that areas with zero alpha value pixels get something reasonable
  in the RGB values. If you don't care about the RGB values of zero alpha
  pixels, you can call the stbir_set_non_pm_alpha_speed_over_quality()
  function - this runs a non-premultiplied resize about 25% faster. That said,
  when you really care about speed, using premultiplied pixels for both in
  and out (STBIR_RGBA_PM, etc) is much faster than both of these
  non-premultiplied options.
183 PIXEL LAYOUT CONVERSION
184 The resizer can convert from some pixel layouts to others. When using the
185 stbir_set_pixel_layouts(), you can, for example, specify STBIR_RGBA
186 on input, and STBIR_ARGB on output, and it will re-organize the channels
187 during the resize. Currently, you can only convert between two pixel
188 layouts with the same number of channels.
190 DETERMINISM
191 We commit to being deterministic (from x64 to ARM to scalar to SIMD, etc).
192 This requires compiling with fast-math off (using at least /fp:precise).
193 Also, you must turn off fp-contracting (which turns mult+adds into fmas)!
194 We attempt to do this with pragmas, but with Clang, you usually want to add
195 -ffp-contract=off to the command line as well.
197 For 32-bit x86, you must use SSE and SSE2 codegen for determinism. That is,
198 if the scalar x87 unit gets used at all, we immediately lose determinism.
199 On Microsoft Visual Studio 2008 and earlier, from what we can tell there is
  no way to be deterministic in 32-bit x86 (some x87 always leaks in, even
  with /fp:strict). On 32-bit x86 GCC, determinism requires both -msse2 and
  -mfpmath=sse.
204 Note that we will not be deterministic with float data containing NaNs -
205 the NaNs will propagate differently on different SIMD and platforms.
207 If you turn on STBIR_USE_FMA, then we will be deterministic with other
208 fma targets, but we will differ from non-fma targets (this is unavoidable,
209 because a fma isn't simply an add with a mult - it also introduces a
  rounding difference compared to non-fma instruction sequences).
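To make the rounding difference concrete, here is a small sketch with inputs chosen so the two paths disagree. The function names are hypothetical, and fused_like only imitates a fused multiply-add by using a double intermediate (exact for these particular inputs); real code would use the hardware instruction or fmaf() from math.h.

```c
// Multiply then add, with a rounding step after each operation (non-fma path).
// The volatile store forces the product to be rounded to float and keeps the
// compiler from contracting the expression into a fused multiply-add.
static float mul_then_add( float a, float b, float c )
{
  volatile float product = a * b;
  return product + c;
}

// Stand-in for a fused multiply-add: a single rounding at the end.
static float fused_like( float a, float b, float c )
{
  volatile double exact = (double) a * (double) b + (double) c;
  return (float) exact;
}

// Inputs: a = 1 + 2^-12, b = 1 + 2^-13, c = -( 1 + 2^-12 + 2^-13 ).
// The exact product is 1 + 2^-12 + 2^-13 + 2^-25, and the last term is
// too small to survive rounding the product to float.
static float fma_example( int fused )
{
  float a = 1.0f + 1.0f / 4096.0f;
  float b = 1.0f + 1.0f / 8192.0f;
  float c = -( 1.0f + 1.0f / 4096.0f + 1.0f / 8192.0f );
  return fused ? fused_like( a, b, c ) : mul_then_add( a, b, c );
}
```

On IEEE-754 float targets the non-fused path returns exactly zero here, while the fused-style path keeps the 2^-25 term, which is why results cannot match bit-for-bit across fma and non-fma builds.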
212 FLOAT PIXEL FORMAT RANGE
213 Any range of values can be used for the non-alpha float data that you pass
214 in (0 to 1, -1 to 1, whatever). However, if you are inputting float values
215 but *outputting* bytes or shorts, you must use a range of 0 to 1 so that we
216 scale back properly. The alpha channel must also be 0 to 1 for any format
217 that does premultiplication prior to resizing.
219 Note also that with float output, using filters with negative lobes, the
220 output filtered values might go slightly out of range. You can define
221 STBIR_FLOAT_LOW_CLAMP and/or STBIR_FLOAT_HIGH_CLAMP to specify the range
222 to clamp to on output, if that's important.
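As a sketch of the clamp defines, assuming each macro simply holds the bound value to clamp to (an assumption based on the description above). The helper clamp_float_output is hypothetical and only shows the effect on a single output value.

```c
// Assumption: the macros hold the bound values themselves, and are
// defined before the implementation #include.
#define STBIR_FLOAT_LOW_CLAMP  0.0f
#define STBIR_FLOAT_HIGH_CLAMP 1.0f

// Hypothetical helper showing the effect on one filtered output value.
static float clamp_float_output( float v )
{
  if ( v < STBIR_FLOAT_LOW_CLAMP )  v = STBIR_FLOAT_LOW_CLAMP;
  if ( v > STBIR_FLOAT_HIGH_CLAMP ) v = STBIR_FLOAT_HIGH_CLAMP;
  return v;
}
```

A filter with negative lobes might produce something like -0.02 or 1.03 next to a sharp edge; with the bounds above those would come out as 0 and 1.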
224 MAX/MIN SCALE FACTORS
225 The input pixel resolutions are in integers, and we do the internal pointer
226 resolution in size_t sized integers. However, the scale ratio from input
227 resolution to output resolution is calculated in float form. This means
228 the effective possible scale ratio is limited to 24 bits (or 16 million
229 to 1). As you get close to the size of the float resolution (again, 16
230 million pixels wide or high), you might start seeing float inaccuracy
231 issues in general in the pipeline. If you have to do extreme resizes,
  you can usually do this in multiple stages (using float intermediate
233 buffers).
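The 24-bit limit is a property of the float type itself, which this short sketch demonstrates. The function name is hypothetical.

```c
// Returns 1 if adding one to `value` still changes it when stored as a float.
static int float_can_step_by_one( float value )
{
  volatile float next = value + 1.0f;   // volatile forces rounding to float
  return next != value;
}
```

Just below 2^24 (16777216) a float can still distinguish neighboring integers; at 2^24 adding one no longer changes the value, so positions or ratios of that magnitude lose whole-pixel accuracy.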
235 FLIPPED IMAGES
236 Stride is just the delta from one scanline to the next. This means you can
237 use a negative stride to handle inverted images (point to the final
238 scanline and use a negative stride). You can invert the input or output,
239 using negative strides.
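A minimal sketch of the negative stride idea, using a plain row copy rather than a resize so it stands alone. Both function names are hypothetical; rows are addressed as base + y * stride so no pointer is ever stepped outside the buffer.

```c
#include <stddef.h>
#include <string.h>

// Copy `height` rows of `row_bytes` bytes each. Strides are signed, so a
// negative stride walks an image from its last scanline back to the first.
static void copy_rows( unsigned char * dst, int dst_stride,
                       unsigned char const * src, int src_stride,
                       int row_bytes, int height )
{
  int y;
  for ( y = 0 ; y < height ; y++ )
    memcpy( dst + (ptrdiff_t) y * dst_stride,
            src + (ptrdiff_t) y * src_stride,
            (size_t) row_bytes );
}

// Flip vertically: point at the final source scanline and use a negative stride.
static void copy_rows_flipped( unsigned char * dst, unsigned char const * src,
                               int row_bytes, int height )
{
  unsigned char const * last_row = src + (ptrdiff_t) ( height - 1 ) * row_bytes;
  copy_rows( dst, row_bytes, last_row, -row_bytes, row_bytes, height );
}
```

The same pointer-plus-negative-stride pair is what you would hand to a resize call as input_pixels and input_stride_in_bytes to read a bottom-up image.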
241 DEFAULT FILTERS
242 For functions which don't provide explicit control over what filters to
243 use, you can change the compile-time defaults with:
245 #define STBIR_DEFAULT_FILTER_UPSAMPLE STBIR_FILTER_something
246 #define STBIR_DEFAULT_FILTER_DOWNSAMPLE STBIR_FILTER_something
248 See stbir_filter in the header-file section for the list of filters.
250 NEW FILTERS
251 A number of 1D filter kernels are supplied. For a list of supported
252 filters, see the stbir_filter enum. You can install your own filters by
253 using the stbir_set_filter_callbacks function.
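As an illustration of the callback shapes, here is a sketch of a tent (triangle) kernel and its support function. The two signatures follow the stbir__kernel_callback and stbir__support_callback typedefs in the header section; the function names are hypothetical, and ignoring the scale argument and treating the support as the distance from center at which the kernel reaches zero are assumptions of this sketch.

```c
// Kernel callback shape: float kernel( float x, float scale, void * user_data )
// This one evaluates a tent filter centered at zero.
static float my_tent_kernel( float x, float scale, void * user_data )
{
  (void) scale;        // ignored in this sketch
  (void) user_data;
  if ( x < 0.0f ) x = -x;
  return ( x <= 1.0f ) ? ( 1.0f - x ) : 0.0f;
}

// Support callback shape: float support( float scale, void * user_data )
// Assumed to return how far from the center the kernel is non-zero.
static float my_tent_support( float scale, void * user_data )
{
  (void) scale;
  (void) user_data;
  return 1.0f;
}
```

With the library included, these would be passed to stbir_set_filter_callbacks in the order declared in the header section: horizontal kernel, horizontal support, vertical kernel, vertical support.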
255 PROGRESS
  For interactive use with slow resize operations, you can use the
257 scanline callbacks in the extended API. It would have to be a *very* large
258 image resample to need progress though - we're very fast.
260 CEIL and FLOOR
261 In scalar mode, the only functions we use from math.h are ceilf and floorf,
262 but if you have your own versions, you can define the STBIR_CEILF(v) and
263 STBIR_FLOORF(v) macros and we'll use them instead. In SIMD, we just use
264 our own versions.
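A sketch of what replacement macros might look like on a platform without math.h. The functions my_floorf and my_ceilf are hypothetical and rely on an int cast, so they are only valid while the value fits in an int.

```c
// Hypothetical replacements that avoid math.h.
static float my_floorf( float v )
{
  int i = (int) v;                            // truncates toward zero
  return (float) ( i - ( v < (float) i ) );   // step down for negative fractions
}

static float my_ceilf( float v )
{
  int i = (int) v;                            // truncates toward zero
  return (float) ( i + ( v > (float) i ) );   // step up for positive fractions
}

// Define these before the implementation #include.
#define STBIR_CEILF(v)  my_ceilf( v )
#define STBIR_FLOORF(v) my_floorf( v )
```

The comparison results are 0 or 1, which is what adjusts the truncated value in the right direction.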
266 ASSERT
267 Define STBIR_ASSERT(boolval) to override assert() and not use assert.h
269 PORTING FROM VERSION 1
270 The API has changed. You can continue to use the old version of stb_image_resize.h,
271 which is available in the "deprecated/" directory.
273 If you're using the old simple-to-use API, porting is straightforward.
274 (For more advanced APIs, read the documentation.)
276 stbir_resize_uint8():
277 - call `stbir_resize_uint8_linear`, cast channel count to `stbir_pixel_layout`
279 stbir_resize_float():
280 - call `stbir_resize_float_linear`, cast channel count to `stbir_pixel_layout`
282 stbir_resize_uint8_srgb():
283 - function name is unchanged
284 - cast channel count to `stbir_pixel_layout`
285 - above is sufficient unless your image has alpha and it's not RGBA/BGRA
286 - in that case, follow the below instructions for stbir_resize_uint8_srgb_edgemode
288 stbir_resize_uint8_srgb_edgemode()
289 - switch to the "medium complexity" API
290 - stbir_resize(), very similar API but a few more parameters:
291 - pixel_layout: cast channel count to `stbir_pixel_layout`
292 - data_type: STBIR_TYPE_UINT8_SRGB
293 - edge: unchanged (STBIR_EDGE_WRAP, etc.)
294 - filter: STBIR_FILTER_DEFAULT
295 - which channel is alpha is specified in stbir_pixel_layout, see enum for details
297 FUTURE TODOS
298 * For polyphase integral filters, we just memcpy the coeffs to dupe
299 them, but we should indirect and use the same coeff memory.
300 * Add pixel layout conversions for sensible different channel counts
301 (maybe, 1->3/4, 3->4, 4->1, 3->1).
302 * For SIMD encode and decode scanline routines, do any pre-aligning
303 for bad input/output buffer alignments and pitch?
  * For very wide scanlines, we should do vertical strips to stay within
305 L2 cache. Maybe do chunks of 1K pixels at a time. There would be
306 some pixel reconversion, but probably dwarfed by things falling out
307 of cache. Probably also something possible with alternating between
308 scattering and gathering at high resize scales?
309 * Rewrite the coefficient generator to do many at once.
310 * AVX-512 vertical kernels - worried about downclocking here.
311 * Convert the reincludes to macros when we know they aren't changing.
312 * Experiment with pivoting the horizontal and always using the
313 vertical filters (which are faster, but perhaps not enough to overcome
314 the pivot cost and the extra memory touches). Need to buffer the whole
315 image so have to balance memory use.
316 * Most of our code is internally function pointers, should we compile
317 all the SIMD stuff always and dynamically dispatch?
319 CONTRIBUTORS
320 Jeff Roberts: 2.0 implementation, optimizations, SIMD
321 Martins Mozeiko: NEON simd, WASM simd, clang and GCC whisperer
322 Fabian Giesen: half float and srgb converters
323 Sean Barrett: API design, optimizations
324 Jorge L Rodriguez: Original 1.0 implementation
325 Aras Pranckevicius: bugfixes
326 Nathan Reed: warning fixes for 1.0
328 REVISIONS
329 2.12 (2024-10-18) fix incorrect use of user_data with STBIR_FREE
330 2.11 (2024-09-08) fix harmless asan warnings in 2-channel and 3-channel mode
331 with AVX-2, fix some weird scaling edge conditions with
332 point sample mode.
333 2.10 (2024-07-27) fix the defines GCC and mingw for loop unroll control,
334 fix MSVC 32-bit arm half float routines.
335 2.09 (2024-06-19) fix the defines for 32-bit ARM GCC builds (was selecting
336 hardware half floats).
337 2.08 (2024-06-10) fix for RGB->BGR three channel flips and add SIMD (thanks
338 to Ryan Salsbury), fix for sub-rect resizes, use the
339 pragmas to control unrolling when they are available.
340 2.07 (2024-05-24) fix for slow final split during threaded conversions of very
341 wide scanlines when downsampling (caused by extra input
342 converting), fix for wide scanline resamples with many
343 splits (int overflow), fix GCC warning.
344 2.06 (2024-02-10) fix for identical width/height 3x or more down-scaling
345 undersampling a single row on rare resize ratios (about 1%).
346 2.05 (2024-02-07) fix for 2 pixel to 1 pixel resizes with wrap (thanks Aras),
347 fix for output callback (thanks Julien Koenen).
348 2.04 (2023-11-17) fix for rare AVX bug, shadowed symbol (thanks Nikola Smiljanic).
349 2.03 (2023-11-01) ASAN and TSAN warnings fixed, minor tweaks.
350 2.00 (2023-10-10) mostly new source: new api, optimizations, simd, vertical-first, etc
351 2x-5x faster without simd, 4x-12x faster with simd,
352 in some cases, 20x to 40x faster esp resizing large to very small.
353 0.96 (2019-03-04) fixed warnings
354 0.95 (2017-07-23) fixed warnings
355 0.94 (2017-03-18) fixed warnings
356 0.93 (2017-03-03) fixed bug with certain combinations of heights
357 0.92 (2017-01-02) fix integer overflow on large (>2GB) images
358 0.91 (2016-04-02) fix warnings; fix handling of subpixel regions
359 0.90 (2014-09-17) first released version
361 LICENSE
362 See end of file for license information.
*/
365#if !defined(STB_IMAGE_RESIZE_DO_HORIZONTALS) && !defined(STB_IMAGE_RESIZE_DO_VERTICALS) && !defined(STB_IMAGE_RESIZE_DO_CODERS) // for internal re-includes
367#ifndef STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
368#define STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
370#include <stddef.h>
371#ifdef _MSC_VER
372typedef unsigned char stbir_uint8;
373typedef unsigned short stbir_uint16;
374typedef unsigned int stbir_uint32;
375typedef unsigned __int64 stbir_uint64;
376#else
377#include <stdint.h>
378typedef uint8_t stbir_uint8;
379typedef uint16_t stbir_uint16;
380typedef uint32_t stbir_uint32;
381typedef uint64_t stbir_uint64;
382#endif
384#ifdef _M_IX86_FP
385#if ( _M_IX86_FP >= 1 )
386#ifndef STBIR_SSE
387#define STBIR_SSE
388#endif
389#endif
390#endif
392#if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(_M_AMD64) || defined(__SSE2__) || defined(STBIR_SSE) || defined(STBIR_SSE2)
393 #ifndef STBIR_SSE2
394 #define STBIR_SSE2
395 #endif
396 #if defined(__AVX__) || defined(STBIR_AVX2)
397 #ifndef STBIR_AVX
398 #ifndef STBIR_NO_AVX
399 #define STBIR_AVX
400 #endif
401 #endif
402 #endif
403 #if defined(__AVX2__) || defined(STBIR_AVX2)
404 #ifndef STBIR_NO_AVX2
405 #ifndef STBIR_AVX2
406 #define STBIR_AVX2
407 #endif
408 #if defined( _MSC_VER ) && !defined(__clang__)
        #ifndef STBIR_FP16C // FP16C instructions are on all AVX2 cpus, so we can autoselect it here on microsoft - clang needs -mf16c
410 #define STBIR_FP16C
411 #endif
412 #endif
413 #endif
414 #endif
415 #ifdef __F16C__
416 #ifndef STBIR_FP16C // turn on FP16C instructions if the define is set (for clang and gcc)
417 #define STBIR_FP16C
418 #endif
419 #endif
420#endif
422#if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || ((__ARM_NEON_FP & 4) != 0) || defined(__ARM_NEON__)
423#ifndef STBIR_NEON
424#define STBIR_NEON
425#endif
426#endif
428#if defined(_M_ARM) || defined(__arm__)
429#ifdef STBIR_USE_FMA
430#undef STBIR_USE_FMA // no FMA for 32-bit arm on MSVC
431#endif
432#endif
434#if defined(__wasm__) && defined(__wasm_simd128__)
435#ifndef STBIR_WASM
436#define STBIR_WASM
437#endif
438#endif
440#ifndef STBIRDEF
441#ifdef STB_IMAGE_RESIZE_STATIC
442#define STBIRDEF static
443#else
444#ifdef __cplusplus
445#define STBIRDEF extern "C"
446#else
447#define STBIRDEF extern
448#endif
449#endif
450#endif
452//////////////////////////////////////////////////////////////////////////////
453//// start "header file" ///////////////////////////////////////////////////
454//
455// Easy-to-use API:
456//
457// * stride is the offset between successive rows of image data
//     in memory, in bytes. specify 0 for packed contiguously in memory
459// * colorspace is linear or sRGB as specified by function name
460// * Uses the default filters
461// * Uses edge mode clamped
462// * returned result is 1 for success or 0 in case of an error.
465// stbir_pixel_layout specifies:
466// number of channels
467// order of channels
468// whether color is premultiplied by alpha
469// for back compatibility, you can cast the old channel count to an stbir_pixel_layout
470typedef enum
471{
472 STBIR_1CHANNEL = 1,
473 STBIR_2CHANNEL = 2,
474 STBIR_RGB = 3, // 3-chan, with order specified (for channel flipping)
475 STBIR_BGR = 0, // 3-chan, with order specified (for channel flipping)
476 STBIR_4CHANNEL = 5,
478 STBIR_RGBA = 4, // alpha formats, where alpha is NOT premultiplied into color channels
479 STBIR_BGRA = 6,
480 STBIR_ARGB = 7,
481 STBIR_ABGR = 8,
482 STBIR_RA = 9,
483 STBIR_AR = 10,
485 STBIR_RGBA_PM = 11, // alpha formats, where alpha is premultiplied into color channels
486 STBIR_BGRA_PM = 12,
487 STBIR_ARGB_PM = 13,
488 STBIR_ABGR_PM = 14,
489 STBIR_RA_PM = 15,
490 STBIR_AR_PM = 16,
492 STBIR_RGBA_NO_AW = 11, // alpha formats, where NO alpha weighting is applied at all!
493 STBIR_BGRA_NO_AW = 12, // these are just synonyms for the _PM flags (which also do
494 STBIR_ARGB_NO_AW = 13, // no alpha weighting). These names just make it more clear
495 STBIR_ABGR_NO_AW = 14, // for some folks).
496 STBIR_RA_NO_AW = 15,
497 STBIR_AR_NO_AW = 16,
499} stbir_pixel_layout;
501//===============================================================
502// Simple-complexity API
503//
504// If output_pixels is NULL (0), then we will allocate the buffer and return it to you.
505//--------------------------------
507STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
508 unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
509 stbir_pixel_layout pixel_type );
511STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
512 unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
513 stbir_pixel_layout pixel_type );
515STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
516 float *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
517 stbir_pixel_layout pixel_type );
518//===============================================================
520//===============================================================
521// Medium-complexity API
522//
523// This extends the easy-to-use API as follows:
524//
525// * Can specify the datatype - U8, U8_SRGB, U16, FLOAT, HALF_FLOAT
//   * Edge wrap can be selected explicitly
527// * Filter can be selected explicitly
528//--------------------------------
530typedef enum
531{
532 STBIR_EDGE_CLAMP = 0,
533 STBIR_EDGE_REFLECT = 1,
534 STBIR_EDGE_WRAP = 2, // this edge mode is slower and uses more memory
535 STBIR_EDGE_ZERO = 3,
536} stbir_edge;
538typedef enum
539{
540 STBIR_FILTER_DEFAULT = 0, // use same filter type that easy-to-use API chooses
541 STBIR_FILTER_BOX = 1, // A trapezoid w/1-pixel wide ramps, same result as box for integer scale ratios
542 STBIR_FILTER_TRIANGLE = 2, // On upsampling, produces same results as bilinear texture filtering
  STBIR_FILTER_CUBICBSPLINE = 3,  // The cubic b-spline (aka Mitchell-Netravali with B=1,C=0), gaussian-esque
544 STBIR_FILTER_CATMULLROM = 4, // An interpolating cubic spline
  STBIR_FILTER_MITCHELL     = 5,  // Mitchell-Netravali filter with B=1/3, C=1/3
546 STBIR_FILTER_POINT_SAMPLE = 6, // Simple point sampling
547 STBIR_FILTER_OTHER = 7, // User callback specified
548} stbir_filter;
550typedef enum
551{
552 STBIR_TYPE_UINT8 = 0,
553 STBIR_TYPE_UINT8_SRGB = 1,
554 STBIR_TYPE_UINT8_SRGB_ALPHA = 2, // alpha channel, when present, should also be SRGB (this is very unusual)
555 STBIR_TYPE_UINT16 = 3,
556 STBIR_TYPE_FLOAT = 4,
557 STBIR_TYPE_HALF_FLOAT = 5
558} stbir_datatype;
560// medium api
561STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
562 void *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
563 stbir_pixel_layout pixel_layout, stbir_datatype data_type,
564 stbir_edge edge, stbir_filter filter );
565//===============================================================
569//===============================================================
570// Extended-complexity API
571//
572// This API exposes all resize functionality.
573//
574// * Separate filter types for each axis
575// * Separate edge modes for each axis
576// * Separate input and output data types
577// * Can specify regions with subpixel correctness
578// * Can specify alpha flags
579// * Can specify a memory callback
580// * Can specify a callback data type for pixel input and output
581// * Can be threaded for a single resize
582// * Can be used to resize many frames without recalculating the sampler info
583//
584// Use this API as follows:
585// 1) Call the stbir_resize_init function on a local STBIR_RESIZE structure
586// 2) Call any of the stbir_set functions
587// 3) Optionally call stbir_build_samplers() if you are going to resample multiple times
588// with the same input and output dimensions (like resizing video frames)
589// 4) Resample by calling stbir_resize_extended().
590// 5) Call stbir_free_samplers() if you called stbir_build_samplers()
591//--------------------------------
594// Types:
596// INPUT CALLBACK: this callback is used for input scanlines
597typedef void const * stbir_input_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context );
599// OUTPUT CALLBACK: this callback is used for output scanlines
600typedef void stbir_output_callback( void const * output_ptr, int num_pixels, int y, void * context );
602// callbacks for user installed filters
603typedef float stbir__kernel_callback( float x, float scale, void * user_data ); // centered at zero
604typedef float stbir__support_callback( float scale, void * user_data );
606// internal structure with precomputed scaling
607typedef struct stbir__info stbir__info;
typedef struct STBIR_RESIZE  // use the stbir_resize_init and stbir_set_* functions to set these values for future compatibility
610{
611 void * user_data;
612 void const * input_pixels;
613 int input_w, input_h;
614 double input_s0, input_t0, input_s1, input_t1;
615 stbir_input_callback * input_cb;
616 void * output_pixels;
617 int output_w, output_h;
618 int output_subx, output_suby, output_subw, output_subh;
619 stbir_output_callback * output_cb;
620 int input_stride_in_bytes;
621 int output_stride_in_bytes;
622 int splits;
623 int fast_alpha;
624 int needs_rebuild;
625 int called_alloc;
626 stbir_pixel_layout input_pixel_layout_public;
627 stbir_pixel_layout output_pixel_layout_public;
628 stbir_datatype input_data_type;
629 stbir_datatype output_data_type;
630 stbir_filter horizontal_filter, vertical_filter;
631 stbir_edge horizontal_edge, vertical_edge;
632 stbir__kernel_callback * horizontal_filter_kernel; stbir__support_callback * horizontal_filter_support;
633 stbir__kernel_callback * vertical_filter_kernel; stbir__support_callback * vertical_filter_support;
634 stbir__info * samplers;
635} STBIR_RESIZE;
637// extended complexity api
640// First off, you must ALWAYS call stbir_resize_init on your resize structure before any of the other calls!
641STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize,
642 const void *input_pixels, int input_w, int input_h, int input_stride_in_bytes, // stride can be zero
643 void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero
644 stbir_pixel_layout pixel_layout, stbir_datatype data_type );
646//===============================================================
647// You can update these parameters any time after resize_init and there is no cost
648//--------------------------------
650STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type );
651STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb ); // no callbacks by default
652STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data ); // pass back STBIR_RESIZE* by default
653STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes );
655//===============================================================
658//===============================================================
659// If you call any of these functions, you will trigger a sampler rebuild!
660//--------------------------------
662STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout input_pixel_layout, stbir_pixel_layout output_pixel_layout ); // sets new buffer layouts
663STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge ); // CLAMP by default
665STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ); // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default
666STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support );
668STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ); // sets both sub-regions (full regions by default)
669STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 ); // sets input sub-region (full region by default)
670STBIRDEF int stbir_set_output_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ); // sets output sub-region (full region by default)
672// when inputting AND outputting non-premultiplied alpha pixels, we use a slower but higher quality technique
673// that fills the zero alpha pixel's RGB values with something plausible. If you don't care about areas of
674// zero alpha, you can call this function to get about a 25% speed improvement for STBIR_RGBA to STBIR_RGBA
675// types of resizes.
676STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, int non_pma_alpha_speed_over_quality );
677//===============================================================
680//===============================================================
681// You can call build_samplers to prebuild all the internal data we need to resample.
682// Then, if you call resize_extended many times with the same resize, you only pay the
683// cost once.
684// If you do call build_samplers, you MUST call free_samplers eventually.
685//--------------------------------
687// This builds the samplers and does one allocation
688STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize );
690// You MUST call this, if you call stbir_build_samplers or stbir_build_samplers_with_splits
691STBIRDEF void stbir_free_samplers( STBIR_RESIZE * resize );
692//===============================================================
695// And this is the main function to perform the resize synchronously on one thread.
696STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize );
699//===============================================================
700// Use these functions for multithreading.
// 1) You call stbir_build_samplers_with_splits first on the main thread
// 2) Then stbir_resize_extended_split on each thread
// 3) stbir_free_samplers when done on the main thread
704//--------------------------------
706// This will build samplers for threading.
707// You can pass in the number of threads you'd like to use (try_splits).
// It returns the number of splits (threads) that you can actually use, which might
// be less than try_splits if the image resize can't be split up that many ways.
711STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_splits );
// This function does a split of the resizing (you call this function for each
// split, on multiple threads). A split is a piece of the output resize pixel space.
// Note that you MUST call stbir_build_samplers_with_splits before stbir_resize_extended_split!
// Usually, you will call stbir_resize_extended_split with split_start as the thread_index
// and "1" for the split_count.
// But, if you have a weird situation where you MIGHT want 8 threads, but sometimes
// only 4 threads, you can use 0,2,4,6 for the split_start's and use "2" for the
// split_count each time to turn it into a 4 thread resize. (This is unusual).
724STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count );
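// A minimal sketch (not part of the library) of the split_start/split_count
// arithmetic described above. The struct and helper names are hypothetical, and
// the sketch assumes the number of built splits is a multiple of the thread count.

```c
// Hypothetical result type: the two arguments a thread would pass along.
typedef struct { int split_start; int split_count; } split_range;

// Which splits should thread `thread_index` process when `built_splits` splits
// were built but only `num_threads` threads are actually running?
static split_range splits_for_thread( int built_splits, int num_threads, int thread_index )
{
  split_range r;
  r.split_count = built_splits / num_threads;   // splits handled by each thread
  r.split_start = thread_index * r.split_count; // first split owned by this thread
  return r;
}
```

// With 8 built splits and 4 threads this gives starts 0,2,4,6 with a count of 2,
// matching the case described above. Each thread would then pass its
// split_start and split_count to stbir_resize_extended_split.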
725//===============================================================
728//===============================================================
729// Pixel Callbacks info:
730//--------------------------------
732// The input callback is super flexible - it calls you with the input address
733// (based on the stride and base pointer), it gives you an optional_output
734// pointer that you can fill, or you can just return your own pointer into
735// your own data.
736//
737// You can also do conversion from non-supported data types if necessary - in
738// this case, you ignore the input_ptr and just use the x and y parameters to
739// calculate your own input_ptr based on the size of each non-supported pixel.
740// (Something like the third example below.)
741//
742// You can also install just an input or just an output callback by setting the
743// callback that you don't want to zero.
744//
// First example, progress (a callback that lets you monitor progress):
//   void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
//   {
//     percentage_done = (float) y / (float) input_height; // float divide; int / int would truncate to 0
//     return input_ptr; // use buffer from call
//   }
751//
752// Next example, copying: (copy from some other buffer or stream):
753// void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
754// {
755// CopyOrStreamData( optional_output, other_data_src, num_pixels * pixel_width_in_bytes );
756// return optional_output; // return the optional buffer that we filled
757// }
758//
759// Third example, input another buffer without copying: (zero-copy from other buffer):
760// void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
761// {
762// void * pixels = ( (char*) other_image_base ) + ( y * other_image_stride ) + ( x * other_pixel_width_in_bytes );
763// return pixels; // return pointer to your data without copying
764// }
765//
766//
767// The output callback is considerably simpler - it just calls you so that you can dump
768// out each scanline. You could even directly copy out to disk if you have a simple format
769// like TGA or BMP. You can also convert to other output types here if you want.
770//
// Simple example:
//   void my_output( void const * output_ptr, int num_pixels, int y, void * context )
//   {
//     percentage_done = (float) y / (float) output_height;
//     fwrite( output_ptr, pixel_width_in_bytes, num_pixels, output_file );
//   }
777//===============================================================
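// A self-contained sketch of the progress idea from the first example above.
// The callback signature is copied from that example; progress_ctx and
// drive_callback are hypothetical. drive_callback is only a stand-in that calls
// the callback once per scanline so the sketch can run without the resizer; it
// is NOT how the library schedules its calls.

```c
#include <stddef.h>

// Same shape as the input callback shown in the examples above.
typedef void const * input_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context );

// Hypothetical state reached through the `context` pointer.
typedef struct { int input_height; float fraction_done; } progress_ctx;

static void const * progress_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
{
  progress_ctx * ctx = (progress_ctx *) context;
  (void) optional_output; (void) num_pixels; (void) x;
  // divide as float: y / input_height in int arithmetic would truncate to 0
  ctx->fraction_done = (float)( y + 1 ) / (float) ctx->input_height;
  return input_ptr; // use the buffer we were handed, no copy
}

// Illustration-only driver: one callback invocation per input scanline.
static void drive_callback( input_callback * cb, unsigned char const * pixels, int width, int height, int stride, void * context )
{
  int y;
  for ( y = 0 ; y < height ; y++ )
    cb( NULL, pixels + (size_t) y * (size_t) stride, width, 0, y, context );
}
```

// Keeping the progress state in the context struct, rather than in globals,
// lets several resizes report progress independently.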
782//===============================================================
783// optional built-in profiling API
784//--------------------------------
786#ifdef STBIR_PROFILE
788typedef struct STBIR_PROFILE_INFO
789{
790 stbir_uint64 total_clocks;
  // how many clocks spent (of total_clocks) in the various resize routines, along with a string description
  // there are "count" number of zones (see the count field below)
794 stbir_uint64 clocks[ 8 ];
795 char const ** descriptions;
797 // count of clocks and descriptions
798 stbir_uint32 count;
799} STBIR_PROFILE_INFO;
801// use after calling stbir_resize_extended (or stbir_build_samplers or stbir_build_samplers_with_splits)
802STBIRDEF void stbir_resize_build_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize );
804// use after calling stbir_resize_extended
805STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize );
807// use after calling stbir_resize_extended_split
808STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * out_info, STBIR_RESIZE const * resize, int split_start, int split_num );
810//===============================================================
812#endif
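// A minimal sketch of turning the profile clocks into percentages. The struct
// below is a hypothetical mirror of the STBIR_PROFILE_INFO fields, declared only
// so the sketch compiles on its own; real code would use the struct from this
// header. zone_percent is likewise a hypothetical helper.

```c
// Hypothetical stand-in with the same fields as STBIR_PROFILE_INFO.
typedef struct
{
  unsigned long long total_clocks;
  unsigned long long clocks[ 8 ];
  char const ** descriptions;
  unsigned int count;
} profile_info_mirror;

// Percentage of total_clocks spent in one zone, or 0 if the zone index is out
// of range or nothing was recorded.
static double zone_percent( profile_info_mirror const * info, unsigned int zone )
{
  if ( ( zone >= info->count ) || ( info->total_clocks == 0 ) )
    return 0.0;
  return 100.0 * (double) info->clocks[ zone ] / (double) info->total_clocks;
}
```

// A caller would loop zone from 0 to count-1 and print descriptions[zone]
// next to the percentage.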
815//// end header file /////////////////////////////////////////////////////
816#endif // STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
818#if defined(STB_IMAGE_RESIZE_IMPLEMENTATION) || defined(STB_IMAGE_RESIZE2_IMPLEMENTATION)
820#ifndef STBIR_ASSERT
821#include <assert.h>
822#define STBIR_ASSERT(x) assert(x)
823#endif
825#ifndef STBIR_MALLOC
826#include <stdlib.h>
827#define STBIR_MALLOC(size,user_data) ((void)(user_data), malloc(size))
828#define STBIR_FREE(ptr,user_data) ((void)(user_data), free(ptr))
829// (we used the comma operator to evaluate user_data, to avoid "unused parameter" warnings)
830#endif
832#ifdef _MSC_VER
834#define stbir__inline __forceinline
836#else
838#define stbir__inline __inline__
840// Clang address sanitizer
841#if defined(__has_feature)
842 #if __has_feature(address_sanitizer) || __has_feature(memory_sanitizer)
843 #ifndef STBIR__SEPARATE_ALLOCATIONS
844 #define STBIR__SEPARATE_ALLOCATIONS
845 #endif
846 #endif
847#endif
849#endif
851// GCC and MSVC
852#if defined(__SANITIZE_ADDRESS__)
853 #ifndef STBIR__SEPARATE_ALLOCATIONS
854 #define STBIR__SEPARATE_ALLOCATIONS
855 #endif
856#endif
858// Always turn off automatic FMA use - use STBIR_USE_FMA if you want.
859// Otherwise, this is a determinism disaster.
860#ifndef STBIR_DONT_CHANGE_FP_CONTRACT // override in case you don't want this behavior
861#if defined(_MSC_VER) && !defined(__clang__)
862#if _MSC_VER > 1200
863#pragma fp_contract(off)
864#endif
865#elif defined(__GNUC__) && !defined(__clang__)
866#pragma GCC optimize("fp-contract=off")
867#else
868#pragma STDC FP_CONTRACT OFF
869#endif
870#endif
872#ifdef _MSC_VER
873#define STBIR__UNUSED(v) (void)(v)
874#else
875#define STBIR__UNUSED(v) (void)sizeof(v)
876#endif
878#define STBIR__ARRAY_SIZE(a) (sizeof((a))/sizeof((a)[0]))
881#ifndef STBIR_DEFAULT_FILTER_UPSAMPLE
882#define STBIR_DEFAULT_FILTER_UPSAMPLE STBIR_FILTER_CATMULLROM
883#endif
885#ifndef STBIR_DEFAULT_FILTER_DOWNSAMPLE
886#define STBIR_DEFAULT_FILTER_DOWNSAMPLE STBIR_FILTER_MITCHELL
887#endif
890#ifndef STBIR__HEADER_FILENAME
891#define STBIR__HEADER_FILENAME "stb_image_resize2.h"
892#endif
894// the internal pixel layout enums are in a different order, so we can easily do range comparisons of types
895// the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible
896typedef enum
897{
898 STBIRI_1CHANNEL = 0,
899 STBIRI_2CHANNEL = 1,
900 STBIRI_RGB = 2,
901 STBIRI_BGR = 3,
902 STBIRI_4CHANNEL = 4,
904 STBIRI_RGBA = 5,
905 STBIRI_BGRA = 6,
906 STBIRI_ARGB = 7,
907 STBIRI_ABGR = 8,
908 STBIRI_RA = 9,
909 STBIRI_AR = 10,
911 STBIRI_RGBA_PM = 11,
912 STBIRI_BGRA_PM = 12,
913 STBIRI_ARGB_PM = 13,
914 STBIRI_ABGR_PM = 14,
915 STBIRI_RA_PM = 15,
916 STBIRI_AR_PM = 16,
917} stbir_internal_pixel_layout;
919// define the public pixel layouts to not compile inside the implementation (to avoid accidental use)
920#define STBIR_BGR bad_dont_use_in_implementation
921#define STBIR_1CHANNEL STBIR_BGR
922#define STBIR_2CHANNEL STBIR_BGR
923#define STBIR_RGB STBIR_BGR
924#define STBIR_RGBA STBIR_BGR
925#define STBIR_4CHANNEL STBIR_BGR
926#define STBIR_BGRA STBIR_BGR
927#define STBIR_ARGB STBIR_BGR
928#define STBIR_ABGR STBIR_BGR
929#define STBIR_RA STBIR_BGR
930#define STBIR_AR STBIR_BGR
931#define STBIR_RGBA_PM STBIR_BGR
932#define STBIR_BGRA_PM STBIR_BGR
933#define STBIR_ARGB_PM STBIR_BGR
934#define STBIR_ABGR_PM STBIR_BGR
935#define STBIR_RA_PM STBIR_BGR
936#define STBIR_AR_PM STBIR_BGR
938// must match stbir_datatype
939static unsigned char stbir__type_size[] = {
940 1,1,1,2,4,2 // STBIR_TYPE_UINT8,STBIR_TYPE_UINT8_SRGB,STBIR_TYPE_UINT8_SRGB_ALPHA,STBIR_TYPE_UINT16,STBIR_TYPE_FLOAT,STBIR_TYPE_HALF_FLOAT
941};
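// A small sketch of how a per-type size table like the one above can be used to
// compute the byte length of a tightly packed scanline (the case the easy API
// describes as a stride of 0). The table copy and helper name are hypothetical;
// the datatype order is taken from the comment on the table above.

```c
#include <stddef.h>

// Bytes per channel, in the order UINT8, UINT8_SRGB, UINT8_SRGB_ALPHA,
// UINT16, FLOAT, HALF_FLOAT.
static unsigned char const type_size_mirror[] = { 1, 1, 1, 2, 4, 2 };

// Hypothetical helper: bytes in one packed row of `width` pixels.
static size_t packed_row_bytes( int width, int channels, int datatype_index )
{
  return (size_t) width * (size_t) channels * (size_t) type_size_mirror[ datatype_index ];
}
```

// For example, a 10 pixel wide, 4 channel float row occupies 10 * 4 * 4 bytes.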
943// When gathering, the contributors are which source pixels contribute.
944// When scattering, the contributors are which destination pixels are contributed to.
945typedef struct
946{
947 int n0; // First contributing pixel
948 int n1; // Last contributing pixel
949} stbir__contributors;
951typedef struct
952{
953 int lowest; // First sample index for whole filter
954 int highest; // Last sample index for whole filter
955 int widest; // widest single set of samples for an output
956} stbir__filter_extent_info;
958typedef struct
959{
960 int n0; // First pixel of decode buffer to write to
961 int n1; // Last pixel of decode that will be written to
962 int pixel_offset_for_input; // Pixel offset into input_scanline
963} stbir__span;
965typedef struct stbir__scale_info
966{
967 int input_full_size;
968 int output_sub_size;
969 float scale;
970 float inv_scale;
971 float pixel_shift; // starting shift in output pixel space (in pixels)
972 int scale_is_rational;
973 stbir_uint32 scale_numerator, scale_denominator;
974} stbir__scale_info;
976typedef struct
977{
978 stbir__contributors * contributors;
979 float* coefficients;
980 stbir__contributors * gather_prescatter_contributors;
981 float * gather_prescatter_coefficients;
982 stbir__scale_info scale_info;
983 float support;
984 stbir_filter filter_enum;
985 stbir__kernel_callback * filter_kernel;
986 stbir__support_callback * filter_support;
987 stbir_edge edge;
988 int coefficient_width;
989 int filter_pixel_width;
990 int filter_pixel_margin;
991 int num_contributors;
992 int contributors_size;
993 int coefficients_size;
994 stbir__filter_extent_info extent_info;
995 int is_gather; // 0 = scatter, 1 = gather with scale >= 1, 2 = gather with scale < 1
996 int gather_prescatter_num_contributors;
997 int gather_prescatter_coefficient_width;
998 int gather_prescatter_contributors_size;
999 int gather_prescatter_coefficients_size;
1000} stbir__sampler;
1002typedef struct
1003{
1004 stbir__contributors conservative;
1005 int edge_sizes[2]; // this can be less than filter_pixel_margin, if the filter and scaling falls off
1006 stbir__span spans[2]; // can be two spans, if doing input subrect with clamp mode WRAP
1007} stbir__extents;
1009typedef struct
1010{
1011#ifdef STBIR_PROFILE
1012 union
1013 {
1014 struct { stbir_uint64 total, looping, vertical, horizontal, decode, encode, alpha, unalpha; } named;
1015 stbir_uint64 array[8];
1016 } profile;
1017 stbir_uint64 * current_zone_excluded_ptr;
1018#endif
1019 float* decode_buffer;
1021 int ring_buffer_first_scanline;
1022 int ring_buffer_last_scanline;
1023 int ring_buffer_begin_index; // first_scanline is at this index in the ring buffer
1024 int start_output_y, end_output_y;
1025 int start_input_y, end_input_y; // used in scatter only
1027 #ifdef STBIR__SEPARATE_ALLOCATIONS
1028 float** ring_buffers; // one pointer for each ring buffer
1029 #else
1030 float* ring_buffer; // one big buffer that we index into
1031 #endif
1033 float* vertical_buffer;
1035 char no_cache_straddle[64];
1036} stbir__per_split_info;
1038typedef void stbir__decode_pixels_func( float * decode, int width_times_channels, void const * input );
1039typedef void stbir__alpha_weight_func( float * decode_buffer, int width_times_channels );
1040typedef void stbir__horizontal_gather_channels_func( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer,
1041 stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width );
1042typedef void stbir__alpha_unweight_func(float * encode_buffer, int width_times_channels );
1043typedef void stbir__encode_pixels_func( void * output, int width_times_channels, float const * encode );
1045struct stbir__info
1046{
1047#ifdef STBIR_PROFILE
1048 union
1049 {
1050 struct { stbir_uint64 total, build, alloc, horizontal, vertical, cleanup, pivot; } named;
1051 stbir_uint64 array[7];
1052 } profile;
1053 stbir_uint64 * current_zone_excluded_ptr;
1054#endif
1055 stbir__sampler horizontal;
1056 stbir__sampler vertical;
1058 void const * input_data;
1059 void * output_data;
1061 int input_stride_bytes;
1062 int output_stride_bytes;
1063 int ring_buffer_length_bytes; // The length of an individual entry in the ring buffer. The total number of ring buffers is stbir__get_filter_pixel_width(filter)
1064 int ring_buffer_num_entries; // Total number of entries in the ring buffer.
1066 stbir_datatype input_type;
1067 stbir_datatype output_type;
1069 stbir_input_callback * in_pixels_cb;
1070 void * user_data;
1071 stbir_output_callback * out_pixels_cb;
1073 stbir__extents scanline_extents;
1075 void * alloced_mem;
1076 stbir__per_split_info * split_info; // by default 1, but there will be N of these allocated based on the thread init you did
1078 stbir__decode_pixels_func * decode_pixels;
1079 stbir__alpha_weight_func * alpha_weight;
1080 stbir__horizontal_gather_channels_func * horizontal_gather_channels;
1081 stbir__alpha_unweight_func * alpha_unweight;
1082 stbir__encode_pixels_func * encode_pixels;
1084 int alloc_ring_buffer_num_entries; // Number of entries in the ring buffer that will be allocated
1085 int splits; // count of splits
1087 stbir_internal_pixel_layout input_pixel_layout_internal;
1088 stbir_internal_pixel_layout output_pixel_layout_internal;
1090 int input_color_and_type;
1091 int offset_x, offset_y; // offset within output_data
1092 int vertical_first;
1093 int channels;
1094 int effective_channels; // same as channels, except on RGBA/ARGB (7), or XA/AX (3)
1095 size_t alloced_total;
1096};
1099#define stbir__max_uint8_as_float 255.0f
1100#define stbir__max_uint16_as_float 65535.0f
1101#define stbir__max_uint8_as_float_inverted (1.0f/255.0f)
1102#define stbir__max_uint16_as_float_inverted (1.0f/65535.0f)
1103#define stbir__small_float ((float)1 / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20))
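// The division chain in stbir__small_float works out to 2^-120: (1 << 120) does
// not fit in an int, so the constant is built from six divisions by 2^20. The
// sketch below checks that arithmetic with two hypothetical helpers, assuming
// IEEE-754 single precision floats.

```c
// 2^-120 built by repeated halving.
static float two_to_minus_120( void )
{
  float v = 1.0f;
  int i;
  for ( i = 0 ; i < 120 ; i++ )
    v *= 0.5f; // exact at every step: 2^-120 is above the smallest normal float, 2^-126
  return v;
}

// The same expression used by the stbir__small_float macro above.
static float small_float_chain( void )
{
  return ( (float)1 / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) );
}
```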
1105// min/max friendly
1106#define STBIR_CLAMP(x, xmin, xmax) for(;;) { \
1107 if ( (x) < (xmin) ) (x) = (xmin); \
1108 if ( (x) > (xmax) ) (x) = (xmax); \
1109 break; \
1110}
1112static stbir__inline int stbir__min(int a, int b)
1113{
1114 return a < b ? a : b;
1115}
1117static stbir__inline int stbir__max(int a, int b)
1118{
1119 return a > b ? a : b;
1120}
1122static float stbir__srgb_uchar_to_linear_float[256] = {
1123 0.000000f, 0.000304f, 0.000607f, 0.000911f, 0.001214f, 0.001518f, 0.001821f, 0.002125f, 0.002428f, 0.002732f, 0.003035f,
1124 0.003347f, 0.003677f, 0.004025f, 0.004391f, 0.004777f, 0.005182f, 0.005605f, 0.006049f, 0.006512f, 0.006995f, 0.007499f,
1125 0.008023f, 0.008568f, 0.009134f, 0.009721f, 0.010330f, 0.010960f, 0.011612f, 0.012286f, 0.012983f, 0.013702f, 0.014444f,
1126 0.015209f, 0.015996f, 0.016807f, 0.017642f, 0.018500f, 0.019382f, 0.020289f, 0.021219f, 0.022174f, 0.023153f, 0.024158f,
1127 0.025187f, 0.026241f, 0.027321f, 0.028426f, 0.029557f, 0.030713f, 0.031896f, 0.033105f, 0.034340f, 0.035601f, 0.036889f,
1128 0.038204f, 0.039546f, 0.040915f, 0.042311f, 0.043735f, 0.045186f, 0.046665f, 0.048172f, 0.049707f, 0.051269f, 0.052861f,
1129 0.054480f, 0.056128f, 0.057805f, 0.059511f, 0.061246f, 0.063010f, 0.064803f, 0.066626f, 0.068478f, 0.070360f, 0.072272f,
1130 0.074214f, 0.076185f, 0.078187f, 0.080220f, 0.082283f, 0.084376f, 0.086500f, 0.088656f, 0.090842f, 0.093059f, 0.095307f,
1131 0.097587f, 0.099899f, 0.102242f, 0.104616f, 0.107023f, 0.109462f, 0.111932f, 0.114435f, 0.116971f, 0.119538f, 0.122139f,
1132 0.124772f, 0.127438f, 0.130136f, 0.132868f, 0.135633f, 0.138432f, 0.141263f, 0.144128f, 0.147027f, 0.149960f, 0.152926f,
1133 0.155926f, 0.158961f, 0.162029f, 0.165132f, 0.168269f, 0.171441f, 0.174647f, 0.177888f, 0.181164f, 0.184475f, 0.187821f,
1134 0.191202f, 0.194618f, 0.198069f, 0.201556f, 0.205079f, 0.208637f, 0.212231f, 0.215861f, 0.219526f, 0.223228f, 0.226966f,
1135 0.230740f, 0.234551f, 0.238398f, 0.242281f, 0.246201f, 0.250158f, 0.254152f, 0.258183f, 0.262251f, 0.266356f, 0.270498f,
1136 0.274677f, 0.278894f, 0.283149f, 0.287441f, 0.291771f, 0.296138f, 0.300544f, 0.304987f, 0.309469f, 0.313989f, 0.318547f,
1137 0.323143f, 0.327778f, 0.332452f, 0.337164f, 0.341914f, 0.346704f, 0.351533f, 0.356400f, 0.361307f, 0.366253f, 0.371238f,
1138 0.376262f, 0.381326f, 0.386430f, 0.391573f, 0.396755f, 0.401978f, 0.407240f, 0.412543f, 0.417885f, 0.423268f, 0.428691f,
1139 0.434154f, 0.439657f, 0.445201f, 0.450786f, 0.456411f, 0.462077f, 0.467784f, 0.473532f, 0.479320f, 0.485150f, 0.491021f,
1140 0.496933f, 0.502887f, 0.508881f, 0.514918f, 0.520996f, 0.527115f, 0.533276f, 0.539480f, 0.545725f, 0.552011f, 0.558340f,
1141 0.564712f, 0.571125f, 0.577581f, 0.584078f, 0.590619f, 0.597202f, 0.603827f, 0.610496f, 0.617207f, 0.623960f, 0.630757f,
1142 0.637597f, 0.644480f, 0.651406f, 0.658375f, 0.665387f, 0.672443f, 0.679543f, 0.686685f, 0.693872f, 0.701102f, 0.708376f,
1143 0.715694f, 0.723055f, 0.730461f, 0.737911f, 0.745404f, 0.752942f, 0.760525f, 0.768151f, 0.775822f, 0.783538f, 0.791298f,
1144 0.799103f, 0.806952f, 0.814847f, 0.822786f, 0.830770f, 0.838799f, 0.846873f, 0.854993f, 0.863157f, 0.871367f, 0.879622f,
1145 0.887923f, 0.896269f, 0.904661f, 0.913099f, 0.921582f, 0.930111f, 0.938686f, 0.947307f, 0.955974f, 0.964686f, 0.973445f,
1146 0.982251f, 0.991102f, 1.0f
1147};
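// The table above appears to hold the standard sRGB decoding curve evaluated at
// each 8-bit code value. The sketch below recomputes entries from the usual
// piecewise formula so they can be compared with the table. It is not how the
// table was generated (that is not stated here). pow( base, 2.4 ) from <math.h>
// is the usual way to write the power term; the hypothetical fifth_root helper
// is used instead only so the sketch builds without linking the math library.

```c
// x^(1/5) for x in (0,1] by Newton's method on r^5 - x = 0, starting from r = 1.
static double fifth_root( double x )
{
  double r = 1.0;
  int i;
  for ( i = 0 ; i < 64 ; i++ )
  {
    double r4 = r * r * r * r;
    r -= ( r4 * r - x ) / ( 5.0 * r4 );
  }
  return r;
}

// Standard sRGB decode for an 8-bit code value (0..255) to linear light (0..1).
static double srgb8_to_linear( int code )
{
  double c = code / 255.0;
  double r, r2, r4;
  if ( c <= 0.04045 )
    return c / 12.92;                        // linear segment near black
  r  = fifth_root( ( c + 0.055 ) / 1.055 );  // base^(1/5)
  r2 = r * r;
  r4 = r2 * r2;
  return r4 * r4 * r4;                       // (base^(1/5))^12 == base^2.4
}
```

// Code values 0 through 10 fall in the linear segment, which is why the first
// row of the table rises in nearly equal steps.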
1149typedef union
1150{
1151 unsigned int u;
1152 float f;
1153} stbir__FP32;
1155// From https://gist.github.com/rygorous/2203834
1157static const stbir_uint32 fp32_to_srgb8_tab4[104] = {
1158 0x0073000d, 0x007a000d, 0x0080000d, 0x0087000d, 0x008d000d, 0x0094000d, 0x009a000d, 0x00a1000d,
1159 0x00a7001a, 0x00b4001a, 0x00c1001a, 0x00ce001a, 0x00da001a, 0x00e7001a, 0x00f4001a, 0x0101001a,
1160 0x010e0033, 0x01280033, 0x01410033, 0x015b0033, 0x01750033, 0x018f0033, 0x01a80033, 0x01c20033,
1161 0x01dc0067, 0x020f0067, 0x02430067, 0x02760067, 0x02aa0067, 0x02dd0067, 0x03110067, 0x03440067,
1162 0x037800ce, 0x03df00ce, 0x044600ce, 0x04ad00ce, 0x051400ce, 0x057b00c5, 0x05dd00bc, 0x063b00b5,
1163 0x06970158, 0x07420142, 0x07e30130, 0x087b0120, 0x090b0112, 0x09940106, 0x0a1700fc, 0x0a9500f2,
1164 0x0b0f01cb, 0x0bf401ae, 0x0ccb0195, 0x0d950180, 0x0e56016e, 0x0f0d015e, 0x0fbc0150, 0x10630143,
1165 0x11070264, 0x1238023e, 0x1357021d, 0x14660201, 0x156601e9, 0x165a01d3, 0x174401c0, 0x182401af,
1166 0x18fe0331, 0x1a9602fe, 0x1c1502d2, 0x1d7e02ad, 0x1ed4028d, 0x201a0270, 0x21520256, 0x227d0240,
1167 0x239f0443, 0x25c003fe, 0x27bf03c4, 0x29a10392, 0x2b6a0367, 0x2d1d0341, 0x2ebe031f, 0x304d0300,
1168 0x31d105b0, 0x34a80555, 0x37520507, 0x39d504c5, 0x3c37048b, 0x3e7c0458, 0x40a8042a, 0x42bd0401,
1169 0x44c20798, 0x488e071e, 0x4c1c06b6, 0x4f76065d, 0x52a50610, 0x55ac05cc, 0x5892058f, 0x5b590559,
1170 0x5e0c0a23, 0x631c0980, 0x67db08f6, 0x6c55087f, 0x70940818, 0x74a007bd, 0x787d076c, 0x7c330723,
1171};
1173static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
1174{
1175 static const stbir__FP32 almostone = { 0x3f7fffff }; // 1-eps
1176 static const stbir__FP32 minval = { (127-13) << 23 };
1177 stbir_uint32 tab,bias,scale,t;
1178 stbir__FP32 f;
1180 // Clamp to [2^(-13), 1-eps]; these two values map to 0 and 1, respectively.
1181 // The tests are carefully written so that NaNs map to 0, same as in the reference
1182 // implementation.
1183 if (!(in > minval.f)) // written this way to catch NaNs
1184 return 0;
1185 if (in > almostone.f)
1186 return 255;
1188 // Do the table lookup and unpack bias, scale
1189 f.f = in;
1190 tab = fp32_to_srgb8_tab4[(f.u - minval.u) >> 20];
1191 bias = (tab >> 16) << 9;
1192 scale = tab & 0xffff;
1194 // Grab next-highest mantissa bits and perform linear interpolation
1195 t = (f.u >> 12) & 0xff;
1196 return (unsigned char) ((bias + scale*t) >> 16);
1197}
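// A sketch of the table indexing used in the function above, pulled out so it
// can be inspected on its own. After clamping, the input lies between 2^-13 and
// 1, which spans 13 binary exponents; the shift by 20 keeps the exponent plus
// the top 3 mantissa bits, giving 13 * 8 = 104 buckets, one per table entry.
// The helper names are hypothetical, and the sketch assumes a 32-bit unsigned
// int and IEEE-754 floats.

```c
#include <string.h>

// Reinterpret a float's bits as an unsigned integer without aliasing issues.
static unsigned int float_bits( float f )
{
  unsigned int u;
  memcpy( &u, &f, sizeof( u ) );
  return u;
}

// Table index for an input that has already passed the two range checks.
static unsigned int srgb_table_index( float in )
{
  unsigned int minval = ( 127u - 13u ) << 23; // bit pattern of 2^-13
  return ( float_bits( in ) - minval ) >> 20;
}
```

// The largest float below 1 lands on index 103, the last entry, so the lookup
// stays inside the 104-entry table.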
1199#ifndef STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT
1200#define STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT 32 // when downsampling and <= 32 scanlines of buffering, use gather. gather used down to 1/8th scaling for 25% win.
1201#endif
1203#ifndef STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS
1204#define STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS 4 // when threading, what is the minimum number of scanlines for a split?
1205#endif
1207// restrict pointers for the output pointers, other loop and unroll control
1208#if defined( _MSC_VER ) && !defined(__clang__)
1209 #define STBIR_STREAMOUT_PTR( star ) star __restrict
1210 #define STBIR_NO_UNROLL( ptr ) __assume(ptr) // this oddly keeps msvc from unrolling a loop
1211 #if _MSC_VER >= 1900
1212 #define STBIR_NO_UNROLL_LOOP_START __pragma(loop( no_vector ))
1213 #else
1214 #define STBIR_NO_UNROLL_LOOP_START
1215 #endif
1216#elif defined( __clang__ )
1217 #define STBIR_STREAMOUT_PTR( star ) star __restrict__
1218 #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr))
1219 #if ( __clang_major__ >= 4 ) || ( ( __clang_major__ >= 3 ) && ( __clang_minor__ >= 5 ) )
1220 #define STBIR_NO_UNROLL_LOOP_START _Pragma("clang loop unroll(disable)") _Pragma("clang loop vectorize(disable)")
1221 #else
1222 #define STBIR_NO_UNROLL_LOOP_START
1223 #endif
1224#elif defined( __GNUC__ )
1225 #define STBIR_STREAMOUT_PTR( star ) star __restrict__
1226 #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr))
1227 #if __GNUC__ >= 14
1228 #define STBIR_NO_UNROLL_LOOP_START _Pragma("GCC unroll 0") _Pragma("GCC novector")
1229 #else
1230 #define STBIR_NO_UNROLL_LOOP_START
1231 #endif
1232 #define STBIR_NO_UNROLL_LOOP_START_INF_FOR
1233#else
1234 #define STBIR_STREAMOUT_PTR( star ) star
1235 #define STBIR_NO_UNROLL( ptr )
1236 #define STBIR_NO_UNROLL_LOOP_START
1237#endif
1239#ifndef STBIR_NO_UNROLL_LOOP_START_INF_FOR
1240#define STBIR_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START
1241#endif
1243#ifdef STBIR_NO_SIMD // force simd off for whatever reason
1245// force simd off overrides everything else, so clear it all
1247#ifdef STBIR_SSE2
1248#undef STBIR_SSE2
1249#endif
1251#ifdef STBIR_AVX
1252#undef STBIR_AVX
1253#endif
1255#ifdef STBIR_NEON
1256#undef STBIR_NEON
1257#endif
1259#ifdef STBIR_AVX2
1260#undef STBIR_AVX2
1261#endif
1263#ifdef STBIR_FP16C
1264#undef STBIR_FP16C
1265#endif
1267#ifdef STBIR_WASM
1268#undef STBIR_WASM
1269#endif
1271#ifdef STBIR_SIMD
1272#undef STBIR_SIMD
1273#endif
1275#else // STBIR_SIMD
1277#ifdef STBIR_SSE2
1278 #include <emmintrin.h>
1280 #define stbir__simdf __m128
1281 #define stbir__simdi __m128i
1283 #define stbir_simdi_castf( reg ) _mm_castps_si128(reg)
1284 #define stbir_simdf_casti( reg ) _mm_castsi128_ps(reg)
1286 #define stbir__simdf_load( reg, ptr ) (reg) = _mm_loadu_ps( (float const*)(ptr) )
1287 #define stbir__simdi_load( reg, ptr ) (reg) = _mm_loadu_si128 ( (stbir__simdi const*)(ptr) )
1288 #define stbir__simdf_load1( out, ptr ) (out) = _mm_load_ss( (float const*)(ptr) ) // top values can be random (not denormal or nan for perf)
1289 #define stbir__simdi_load1( out, ptr ) (out) = _mm_castps_si128( _mm_load_ss( (float const*)(ptr) ))
1290 #define stbir__simdf_load1z( out, ptr ) (out) = _mm_load_ss( (float const*)(ptr) ) // top values must be zero
1291 #define stbir__simdf_frep4( fvar ) _mm_set_ps1( fvar )
1292 #define stbir__simdf_load1frep4( out, fvar ) (out) = _mm_set_ps1( fvar )
1293 #define stbir__simdf_load2( out, ptr ) (out) = _mm_castsi128_ps( _mm_loadl_epi64( (__m128i*)(ptr)) ) // top values can be random (not denormal or nan for perf)
1294 #define stbir__simdf_load2z( out, ptr ) (out) = _mm_castsi128_ps( _mm_loadl_epi64( (__m128i*)(ptr)) ) // top values must be zero
1295 #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = _mm_castpd_ps(_mm_loadh_pd( _mm_castps_pd(reg), (double*)(ptr) ))
1297 #define stbir__simdf_zeroP() _mm_setzero_ps()
1298 #define stbir__simdf_zero( reg ) (reg) = _mm_setzero_ps()
1300 #define stbir__simdf_store( ptr, reg ) _mm_storeu_ps( (float*)(ptr), reg )
1301 #define stbir__simdf_store1( ptr, reg ) _mm_store_ss( (float*)(ptr), reg )
1302 #define stbir__simdf_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), _mm_castps_si128(reg) )
1303 #define stbir__simdf_store2h( ptr, reg ) _mm_storeh_pd( (double*)(ptr), _mm_castps_pd(reg) )
1305 #define stbir__simdi_store( ptr, reg ) _mm_storeu_si128( (__m128i*)(ptr), reg )
1306 #define stbir__simdi_store1( ptr, reg ) _mm_store_ss( (float*)(ptr), _mm_castsi128_ps(reg) )
1307 #define stbir__simdi_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), (reg) )
1309 #define stbir__prefetch( ptr ) _mm_prefetch((char*)(ptr), _MM_HINT_T0 )
1311 #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
1312 { \
1313 stbir__simdi zero = _mm_setzero_si128(); \
1314 out2 = _mm_unpacklo_epi8( ireg, zero ); \
1315 out3 = _mm_unpackhi_epi8( ireg, zero ); \
1316 out0 = _mm_unpacklo_epi16( out2, zero ); \
1317 out1 = _mm_unpackhi_epi16( out2, zero ); \
1318 out2 = _mm_unpacklo_epi16( out3, zero ); \
1319 out3 = _mm_unpackhi_epi16( out3, zero ); \
1320 }
1322#define stbir__simdi_expand_u8_to_1u32(out,ireg) \
1323 { \
1324 stbir__simdi zero = _mm_setzero_si128(); \
1325 out = _mm_unpacklo_epi8( ireg, zero ); \
1326 out = _mm_unpacklo_epi16( out, zero ); \
1327 }
1329 #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \
1330 { \
1331 stbir__simdi zero = _mm_setzero_si128(); \
1332 out0 = _mm_unpacklo_epi16( ireg, zero ); \
1333 out1 = _mm_unpackhi_epi16( ireg, zero ); \
1334 }
1336 #define stbir__simdf_convert_float_to_i32( i, f ) (i) = _mm_cvttps_epi32(f)
1337 #define stbir__simdf_convert_float_to_int( f ) _mm_cvtt_ss2si(f)
1338 #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),_mm_setzero_ps()))))
1339 #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps()))))
1341 #define stbir__simdi_to_int( i ) _mm_cvtsi128_si32(i)
1342 #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = _mm_cvtepi32_ps( ireg )
1343 #define stbir__simdf_add( out, reg0, reg1 ) (out) = _mm_add_ps( reg0, reg1 )
1344 #define stbir__simdf_mult( out, reg0, reg1 ) (out) = _mm_mul_ps( reg0, reg1 )
1345 #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = _mm_mul_ps( reg, _mm_loadu_ps( (float const*)(ptr) ) )
1346 #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = _mm_mul_ss( reg, _mm_load_ss( (float const*)(ptr) ) )
1347 #define stbir__simdf_add_mem( out, reg, ptr ) (out) = _mm_add_ps( reg, _mm_loadu_ps( (float const*)(ptr) ) )
1348 #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = _mm_add_ss( reg, _mm_load_ss( (float const*)(ptr) ) )
1350 #ifdef STBIR_USE_FMA // not on by default to maintain bit identical simd to non-simd
1351 #include <immintrin.h>
1352 #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = _mm_fmadd_ps( mul1, mul2, add )
1353 #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = _mm_fmadd_ss( mul1, mul2, add )
1354 #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = _mm_fmadd_ps( mul, _mm_loadu_ps( (float const*)(ptr) ), add )
1355 #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = _mm_fmadd_ss( mul, _mm_load_ss( (float const*)(ptr) ), add )
1356 #else
1357 #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = _mm_add_ps( add, _mm_mul_ps( mul1, mul2 ) )
1358 #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = _mm_add_ss( add, _mm_mul_ss( mul1, mul2 ) )
1359 #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = _mm_add_ps( add, _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ) )
1360 #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = _mm_add_ss( add, _mm_mul_ss( mul, _mm_load_ss( (float const*)(ptr) ) ) )
1361 #endif
1363 #define stbir__simdf_add1( out, reg0, reg1 ) (out) = _mm_add_ss( reg0, reg1 )
1364 #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = _mm_mul_ss( reg0, reg1 )
1366 #define stbir__simdf_and( out, reg0, reg1 ) (out) = _mm_and_ps( reg0, reg1 )
1367 #define stbir__simdf_or( out, reg0, reg1 ) (out) = _mm_or_ps( reg0, reg1 )
1369 #define stbir__simdf_min( out, reg0, reg1 ) (out) = _mm_min_ps( reg0, reg1 )
1370 #define stbir__simdf_max( out, reg0, reg1 ) (out) = _mm_max_ps( reg0, reg1 )
1371 #define stbir__simdf_min1( out, reg0, reg1 ) (out) = _mm_min_ss( reg0, reg1 )
1372 #define stbir__simdf_max1( out, reg0, reg1 ) (out) = _mm_max_ss( reg0, reg1 )
1374 #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_shuffle_ps( reg1,reg0, (0<<0) + (1<<2) + (2<<4) + (3<<6) )), (3<<0) + (0<<2) + (1<<4) + (2<<6) ) )
1375 #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_shuffle_ps( reg1,reg0, (0<<0) + (1<<2) + (2<<4) + (3<<6) )), (2<<0) + (3<<2) + (0<<4) + (1<<6) ) )
1377 static const stbir__simdf STBIR_zeroones = { 0.0f,1.0f,0.0f,1.0f };
1378 static const stbir__simdf STBIR_onezeros = { 1.0f,0.0f,1.0f,0.0f };
1379 #define stbir__simdf_aaa1( out, alp, ones ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_movehl_ps( ones, alp ) ), (1<<0) + (1<<2) + (1<<4) + (2<<6) ) )
1380 #define stbir__simdf_1aaa( out, alp, ones ) (out)=_mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( _mm_movelh_ps( ones, alp ) ), (0<<0) + (2<<2) + (2<<4) + (2<<6) ) )
1381 #define stbir__simdf_a1a1( out, alp, ones) (out) = _mm_or_ps( _mm_castsi128_ps( _mm_srli_epi64( _mm_castps_si128(alp), 32 ) ), STBIR_zeroones )
1382 #define stbir__simdf_1a1a( out, alp, ones) (out) = _mm_or_ps( _mm_castsi128_ps( _mm_slli_epi64( _mm_castps_si128(alp), 32 ) ), STBIR_onezeros )
1384 #define stbir__simdf_swiz( reg, one, two, three, four ) _mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( reg ), (one<<0) + (two<<2) + (three<<4) + (four<<6) ) )
1386 #define stbir__simdi_and( out, reg0, reg1 ) (out) = _mm_and_si128( reg0, reg1 )
1387 #define stbir__simdi_or( out, reg0, reg1 ) (out) = _mm_or_si128( reg0, reg1 )
1388 #define stbir__simdi_16madd( out, reg0, reg1 ) (out) = _mm_madd_epi16( reg0, reg1 )
1390 #define stbir__simdf_pack_to_8bytes(out,aa,bb) \
1391 { \
1392 stbir__simdf af,bf; \
1393 stbir__simdi a,b; \
1394 af = _mm_min_ps( aa, STBIR_max_uint8_as_float ); \
1395 bf = _mm_min_ps( bb, STBIR_max_uint8_as_float ); \
1396 af = _mm_max_ps( af, _mm_setzero_ps() ); \
1397 bf = _mm_max_ps( bf, _mm_setzero_ps() ); \
1398 a = _mm_cvttps_epi32( af ); \
1399 b = _mm_cvttps_epi32( bf ); \
1400 a = _mm_packs_epi32( a, b ); \
1401 out = _mm_packus_epi16( a, a ); \
1402 }
1404 #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \
1405 stbir__simdf_load( o0, (ptr) ); \
1406 stbir__simdf_load( o1, (ptr)+4 ); \
1407 stbir__simdf_load( o2, (ptr)+8 ); \
1408 stbir__simdf_load( o3, (ptr)+12 ); \
1409 { \
1410 __m128 tmp0, tmp1, tmp2, tmp3; \
1411 tmp0 = _mm_unpacklo_ps(o0, o1); \
1412 tmp2 = _mm_unpacklo_ps(o2, o3); \
1413 tmp1 = _mm_unpackhi_ps(o0, o1); \
1414 tmp3 = _mm_unpackhi_ps(o2, o3); \
1415 o0 = _mm_movelh_ps(tmp0, tmp2); \
1416 o1 = _mm_movehl_ps(tmp2, tmp0); \
1417 o2 = _mm_movelh_ps(tmp1, tmp3); \
1418 o3 = _mm_movehl_ps(tmp3, tmp1); \
1419 }
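  /* Illustrative only, not part of the library: a minimal scalar sketch of what
     stbir__simdf_load4_transposed computes, assuming ptr addresses 16 consecutive
     floats laid out as four rows of four. The name sketch_load4_transposed is
     hypothetical and exists only for this note.

```c
#include <assert.h>

// Hypothetical helper: scalar equivalent of loading a 4x4 float block and transposing it.
// ptr holds four rows of four floats; o0..o3 receive the four columns.
static void sketch_load4_transposed( float o0[4], float o1[4], float o2[4], float o3[4], const float * ptr )
{
  int i;
  for ( i = 0 ; i < 4 ; i++ )
  {
    o0[i] = ptr[ 4*i + 0 ];  // column 0 of row i
    o1[i] = ptr[ 4*i + 1 ];  // column 1 of row i
    o2[i] = ptr[ 4*i + 2 ];  // column 2 of row i
    o3[i] = ptr[ 4*i + 3 ];  // column 3 of row i
  }
}
```

     After the call, o0 holds { ptr[0], ptr[4], ptr[8], ptr[12] }, which is what the
     unpack/move sequence in the macro produces in its first output register. */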
  #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \
    r0 = _mm_packs_epi32( r0, r1 ); \
    r2 = _mm_packs_epi32( r2, r3 ); \
    r1 = _mm_unpacklo_epi16( r0, r2 ); \
    r3 = _mm_unpackhi_epi16( r0, r2 ); \
    r0 = _mm_unpacklo_epi16( r1, r3 ); \
    r2 = _mm_unpackhi_epi16( r1, r3 ); \
    r0 = _mm_packus_epi16( r0, r2 ); \
    stbir__simdi_store( ptr, r0 );
1431 #define stbir__simdi_32shr( out, reg, imm ) out = _mm_srli_epi32( reg, imm )
  #if defined(_MSC_VER) && !defined(__clang__)
    // MSVC brace-initializes the integer vector types from 8-bit values, so each 32-bit lane is emitted as four bytes (least significant first)
    #define STBIR__CONST_32_TO_8( v ) (char)(unsigned char)((v)&255),(char)(unsigned char)(((v)>>8)&255),(char)(unsigned char)(((v)>>16)&255),(char)(unsigned char)(((v)>>24)&255)
    #define STBIR__CONST_4_32i( v ) STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v ), STBIR__CONST_32_TO_8( v )
    #define STBIR__CONST_4d_32i( v0, v1, v2, v3 ) STBIR__CONST_32_TO_8( v0 ), STBIR__CONST_32_TO_8( v1 ), STBIR__CONST_32_TO_8( v2 ), STBIR__CONST_32_TO_8( v3 )
  #else
    // everything else brace-initializes from long longs, so two 32-bit lanes are packed into each 64-bit value (lower lane in the low half)
    #define STBIR__CONST_4_32i( v ) (long long)((((stbir_uint64)(stbir_uint32)(v))<<32)|((stbir_uint64)(stbir_uint32)(v))),(long long)((((stbir_uint64)(stbir_uint32)(v))<<32)|((stbir_uint64)(stbir_uint32)(v)))
    #define STBIR__CONST_4d_32i( v0, v1, v2, v3 ) (long long)((((stbir_uint64)(stbir_uint32)(v1))<<32)|((stbir_uint64)(stbir_uint32)(v0))),(long long)((((stbir_uint64)(stbir_uint32)(v3))<<32)|((stbir_uint64)(stbir_uint32)(v2)))
  #endif
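  /* Illustrative only, not part of the library: a minimal sketch of the two
     initializer encodings used by the STBIR__CONST_* macros, assuming a
     little-endian target (as x86 is). sketch_pack2_32 and sketch_byte_of are
     hypothetical names introduced for this note.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper mirroring the non-MSVC branch:
// two 32-bit lanes share one 64-bit initializer, with the lower lane in the low half.
static long long sketch_pack2_32( uint32_t lo, uint32_t hi )
{
  return (long long)( ( (uint64_t)hi << 32 ) | (uint64_t)lo );
}

// Hypothetical helper mirroring the MSVC branch:
// byte k of a 32-bit lane, least significant byte first.
static unsigned char sketch_byte_of( uint32_t v, int k )
{
  return (unsigned char)( ( v >> ( 8 * k ) ) & 255 );
}
```

     Both encodings describe the same bytes in memory on a little-endian machine,
     which is why one constant definition can serve both compiler families. */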
1444 #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = { x, x, x, x }
1445 #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { STBIR__CONST_4_32i(x) }
1446 #define STBIR__CONSTF(var) (var)
1447 #define STBIR__CONSTI(var) (var)
1449 #if defined(STBIR_AVX) || defined(__SSE4_1__)
1450 #include <smmintrin.h>
1451 #define stbir__simdf_pack_to_8words(out,reg0,reg1) out = _mm_packus_epi32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg0,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())), _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg1,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())))
1452 #else
1453 STBIR__SIMDI_CONST(stbir__s32_32768, 32768);
1454 STBIR__SIMDI_CONST(stbir__s16_32768, ((32768<<16)|32768));
1456 #define stbir__simdf_pack_to_8words(out,reg0,reg1) \
1457 { \
1458 stbir__simdi tmp0,tmp1; \
1459 tmp0 = _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg0,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())); \
1460 tmp1 = _mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(reg1,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())); \
1461 tmp0 = _mm_sub_epi32( tmp0, stbir__s32_32768 ); \
1462 tmp1 = _mm_sub_epi32( tmp1, stbir__s32_32768 ); \
1463 out = _mm_packs_epi32( tmp0, tmp1 ); \
1464 out = _mm_sub_epi16( out, stbir__s16_32768 ); \
1465 }
1467 #endif
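  /* Illustrative only, not part of the library: _mm_packus_epi32 requires SSE4.1,
     so the SSE2 fallback above biases each value by -32768, packs with signed
     saturation, then removes the bias with a wrapping 16-bit subtract. A minimal
     scalar sketch of that arithmetic for one lane follows; sketch_pack_u16 is a
     hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: one lane of the biased signed-pack trick.
// v is expected to be already clamped to 0..65535, as the macro does with min/max.
static uint16_t sketch_pack_u16( int32_t v )
{
  int32_t biased = v - 32768;                    // now within -32768..32767
  int16_t s;
  if ( biased < -32768 ) biased = -32768;        // signed saturation, as _mm_packs_epi32 would do
  if ( biased >  32767 ) biased =  32767;
  s = (int16_t)biased;
  return (uint16_t)( (uint16_t)s - 32768 );      // wrapping 16-bit subtract, as _mm_sub_epi16 would do
}
```

     Because the input is clamped first, the saturation never actually triggers;
     the bias only exists to make the signed pack usable for unsigned output. */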
1469 #define STBIR_SIMD
1471 // if we detect AVX, set the simd8 defines
1472 #ifdef STBIR_AVX
1473 #include <immintrin.h>
1474 #define STBIR_SIMD8
1475 #define stbir__simdf8 __m256
1476 #define stbir__simdi8 __m256i
1477 #define stbir__simdf8_load( out, ptr ) (out) = _mm256_loadu_ps( (float const *)(ptr) )
1478 #define stbir__simdi8_load( out, ptr ) (out) = _mm256_loadu_si256( (__m256i const *)(ptr) )
1479 #define stbir__simdf8_mult( out, a, b ) (out) = _mm256_mul_ps( (a), (b) )
1480 #define stbir__simdf8_store( ptr, out ) _mm256_storeu_ps( (float*)(ptr), out )
1481 #define stbir__simdi8_store( ptr, reg ) _mm256_storeu_si256( (__m256i*)(ptr), reg )
1482 #define stbir__simdf8_frep8( fval ) _mm256_set1_ps( fval )
1484 #define stbir__simdf8_min( out, reg0, reg1 ) (out) = _mm256_min_ps( reg0, reg1 )
1485 #define stbir__simdf8_max( out, reg0, reg1 ) (out) = _mm256_max_ps( reg0, reg1 )
1487 #define stbir__simdf8_add4halves( out, bot4, top8 ) (out) = _mm_add_ps( bot4, _mm256_extractf128_ps( top8, 1 ) )
1488 #define stbir__simdf8_mult_mem( out, reg, ptr ) (out) = _mm256_mul_ps( reg, _mm256_loadu_ps( (float const*)(ptr) ) )
1489 #define stbir__simdf8_add_mem( out, reg, ptr ) (out) = _mm256_add_ps( reg, _mm256_loadu_ps( (float const*)(ptr) ) )
1490 #define stbir__simdf8_add( out, a, b ) (out) = _mm256_add_ps( a, b )
1491 #define stbir__simdf8_load1b( out, ptr ) (out) = _mm256_broadcast_ss( ptr )
1492 #define stbir__simdf_load1rep4( out, ptr ) (out) = _mm_broadcast_ss( ptr ) // avx load instruction
1494 #define stbir__simdi8_convert_i32_to_float(out, ireg) (out) = _mm256_cvtepi32_ps( ireg )
1495 #define stbir__simdf8_convert_float_to_i32( i, f ) (i) = _mm256_cvttps_epi32(f)
1497 #define stbir__simdf8_bot4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (0<<0)+(2<<4) )
1498 #define stbir__simdf8_top4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (1<<0)+(3<<4) )
1500 #define stbir__simdf8_gettop4( reg ) _mm256_extractf128_ps(reg,1)
1502 #ifdef STBIR_AVX2
1504 #define stbir__simdi8_expand_u8_to_u32(out0,out1,ireg) \
1505 { \
1506 stbir__simdi8 a, zero =_mm256_setzero_si256();\
1507 a = _mm256_permute4x64_epi64( _mm256_unpacklo_epi8( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), zero ),(0<<0)+(2<<2)+(1<<4)+(3<<6)); \
1508 out0 = _mm256_unpacklo_epi16( a, zero ); \
1509 out1 = _mm256_unpackhi_epi16( a, zero ); \
1510 }
1512 #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \
1513 { \
1514 stbir__simdi8 t; \
1515 stbir__simdf8 af,bf; \
1516 stbir__simdi8 a,b; \
1517 af = _mm256_min_ps( aa, STBIR_max_uint8_as_floatX ); \
1518 bf = _mm256_min_ps( bb, STBIR_max_uint8_as_floatX ); \
1519 af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
1520 bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
1521 a = _mm256_cvttps_epi32( af ); \
1522 b = _mm256_cvttps_epi32( bf ); \
1523 t = _mm256_permute4x64_epi64( _mm256_packs_epi32( a, b ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ); \
1524 out = _mm256_castsi256_si128( _mm256_permute4x64_epi64( _mm256_packus_epi16( t, t ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ) ); \
1525 }
1527 #define stbir__simdi8_expand_u16_to_u32(out,ireg) out = _mm256_unpacklo_epi16( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), _mm256_setzero_si256() );
1529 #define stbir__simdf8_pack_to_16words(out,aa,bb) \
1530 { \
1531 stbir__simdf8 af,bf; \
1532 stbir__simdi8 a,b; \
1533 af = _mm256_min_ps( aa, STBIR_max_uint16_as_floatX ); \
1534 bf = _mm256_min_ps( bb, STBIR_max_uint16_as_floatX ); \
1535 af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
1536 bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
1537 a = _mm256_cvttps_epi32( af ); \
1538 b = _mm256_cvttps_epi32( bf ); \
1539 (out) = _mm256_permute4x64_epi64( _mm256_packus_epi32(a, b), (0<<0)+(2<<2)+(1<<4)+(3<<6) ); \
1540 }
1542 #else
1544 #define stbir__simdi8_expand_u8_to_u32(out0,out1,ireg) \
1545 { \
1546 stbir__simdi a,zero = _mm_setzero_si128(); \
1547 a = _mm_unpacklo_epi8( ireg, zero ); \
1548 out0 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \
1549 a = _mm_unpackhi_epi8( ireg, zero ); \
1550 out1 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \
1551 }
1553 #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \
1554 { \
1555 stbir__simdi t; \
1556 stbir__simdf8 af,bf; \
1557 stbir__simdi8 a,b; \
1558 af = _mm256_min_ps( aa, STBIR_max_uint8_as_floatX ); \
1559 bf = _mm256_min_ps( bb, STBIR_max_uint8_as_floatX ); \
1560 af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
1561 bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
1562 a = _mm256_cvttps_epi32( af ); \
1563 b = _mm256_cvttps_epi32( bf ); \
1564 out = _mm_packs_epi32( _mm256_castsi256_si128(a), _mm256_extractf128_si256( a, 1 ) ); \
1565 out = _mm_packus_epi16( out, out ); \
1566 t = _mm_packs_epi32( _mm256_castsi256_si128(b), _mm256_extractf128_si256( b, 1 ) ); \
1567 t = _mm_packus_epi16( t, t ); \
1568 out = _mm_castps_si128( _mm_shuffle_ps( _mm_castsi128_ps(out), _mm_castsi128_ps(t), (0<<0)+(1<<2)+(0<<4)+(1<<6) ) ); \
1569 }
1571 #define stbir__simdi8_expand_u16_to_u32(out,ireg) \
1572 { \
1573 stbir__simdi a,b,zero = _mm_setzero_si128(); \
1574 a = _mm_unpacklo_epi16( ireg, zero ); \
1575 b = _mm_unpackhi_epi16( ireg, zero ); \
1576 out = _mm256_insertf128_si256( _mm256_castsi128_si256( a ), b, 1 ); \
1577 }
1579 #define stbir__simdf8_pack_to_16words(out,aa,bb) \
1580 { \
1581 stbir__simdi t0,t1; \
1582 stbir__simdf8 af,bf; \
1583 stbir__simdi8 a,b; \
1584 af = _mm256_min_ps( aa, STBIR_max_uint16_as_floatX ); \
1585 bf = _mm256_min_ps( bb, STBIR_max_uint16_as_floatX ); \
1586 af = _mm256_max_ps( af, _mm256_setzero_ps() ); \
1587 bf = _mm256_max_ps( bf, _mm256_setzero_ps() ); \
1588 a = _mm256_cvttps_epi32( af ); \
1589 b = _mm256_cvttps_epi32( bf ); \
1590 t0 = _mm_packus_epi32( _mm256_castsi256_si128(a), _mm256_extractf128_si256( a, 1 ) ); \
1591 t1 = _mm_packus_epi32( _mm256_castsi256_si128(b), _mm256_extractf128_si256( b, 1 ) ); \
1592 out = _mm256_setr_m128i( t0, t1 ); \
1593 }
1595 #endif
1597 static __m256i stbir_00001111 = { STBIR__CONST_4d_32i( 0, 0, 0, 0 ), STBIR__CONST_4d_32i( 1, 1, 1, 1 ) };
1598 #define stbir__simdf8_0123to00001111( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_00001111 )
1600 static __m256i stbir_22223333 = { STBIR__CONST_4d_32i( 2, 2, 2, 2 ), STBIR__CONST_4d_32i( 3, 3, 3, 3 ) };
1601 #define stbir__simdf8_0123to22223333( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_22223333 )
1603 #define stbir__simdf8_0123to2222( out, in ) (out) = stbir__simdf_swiz(_mm256_castps256_ps128(in), 2,2,2,2 )
1605 #define stbir__simdf8_load4b( out, ptr ) (out) = _mm256_broadcast_ps( (__m128 const *)(ptr) )
1607 static __m256i stbir_00112233 = { STBIR__CONST_4d_32i( 0, 0, 1, 1 ), STBIR__CONST_4d_32i( 2, 2, 3, 3 ) };
1608 #define stbir__simdf8_0123to00112233( out, in ) (out) = _mm256_permutevar_ps ( in, stbir_00112233 )
1609 #define stbir__simdf8_add4( out, a8, b ) (out) = _mm256_add_ps( a8, _mm256_castps128_ps256( b ) )
1611 static __m256i stbir_load6 = { STBIR__CONST_4_32i( 0x80000000 ), STBIR__CONST_4d_32i( 0x80000000, 0x80000000, 0, 0 ) };
1612 #define stbir__simdf8_load6z( out, ptr ) (out) = _mm256_maskload_ps( ptr, stbir_load6 )
1614 #define stbir__simdf8_0123to00000000( out, in ) (out) = _mm256_shuffle_ps ( in, in, (0<<0)+(0<<2)+(0<<4)+(0<<6) )
1615 #define stbir__simdf8_0123to11111111( out, in ) (out) = _mm256_shuffle_ps ( in, in, (1<<0)+(1<<2)+(1<<4)+(1<<6) )
1616 #define stbir__simdf8_0123to22222222( out, in ) (out) = _mm256_shuffle_ps ( in, in, (2<<0)+(2<<2)+(2<<4)+(2<<6) )
1617 #define stbir__simdf8_0123to33333333( out, in ) (out) = _mm256_shuffle_ps ( in, in, (3<<0)+(3<<2)+(3<<4)+(3<<6) )
1618 #define stbir__simdf8_0123to21032103( out, in ) (out) = _mm256_shuffle_ps ( in, in, (2<<0)+(1<<2)+(0<<4)+(3<<6) )
1619 #define stbir__simdf8_0123to32103210( out, in ) (out) = _mm256_shuffle_ps ( in, in, (3<<0)+(2<<2)+(1<<4)+(0<<6) )
1620 #define stbir__simdf8_0123to12301230( out, in ) (out) = _mm256_shuffle_ps ( in, in, (1<<0)+(2<<2)+(3<<4)+(0<<6) )
1621 #define stbir__simdf8_0123to10321032( out, in ) (out) = _mm256_shuffle_ps ( in, in, (1<<0)+(0<<2)+(3<<4)+(2<<6) )
1622 #define stbir__simdf8_0123to30123012( out, in ) (out) = _mm256_shuffle_ps ( in, in, (3<<0)+(0<<2)+(1<<4)+(2<<6) )
1624 #define stbir__simdf8_0123to11331133( out, in ) (out) = _mm256_shuffle_ps ( in, in, (1<<0)+(1<<2)+(3<<4)+(3<<6) )
1625 #define stbir__simdf8_0123to00220022( out, in ) (out) = _mm256_shuffle_ps ( in, in, (0<<0)+(0<<2)+(2<<4)+(2<<6) )
1627 #define stbir__simdf8_aaa1( out, alp, ones ) (out) = _mm256_blend_ps( alp, ones, (1<<0)+(1<<1)+(1<<2)+(0<<3)+(1<<4)+(1<<5)+(1<<6)+(0<<7)); (out)=_mm256_shuffle_ps( out,out, (3<<0) + (3<<2) + (3<<4) + (0<<6) )
1628 #define stbir__simdf8_1aaa( out, alp, ones ) (out) = _mm256_blend_ps( alp, ones, (0<<0)+(1<<1)+(1<<2)+(1<<3)+(0<<4)+(1<<5)+(1<<6)+(1<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (0<<4) + (0<<6) )
1629 #define stbir__simdf8_a1a1( out, alp, ones) (out) = _mm256_blend_ps( alp, ones, (1<<0)+(0<<1)+(1<<2)+(0<<3)+(1<<4)+(0<<5)+(1<<6)+(0<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (3<<4) + (2<<6) )
1630 #define stbir__simdf8_1a1a( out, alp, ones) (out) = _mm256_blend_ps( alp, ones, (0<<0)+(1<<1)+(0<<2)+(1<<3)+(0<<4)+(1<<5)+(0<<6)+(1<<7)); (out)=_mm256_shuffle_ps( out,out, (1<<0) + (0<<2) + (3<<4) + (2<<6) )
1632 #define stbir__simdf8_zero( reg ) (reg) = _mm256_setzero_ps()
1634 #ifdef STBIR_USE_FMA // not on by default to maintain bit identical simd to non-simd
1635 #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_fmadd_ps( mul1, mul2, add )
1636 #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_fmadd_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ), add )
1637 #define stbir__simdf8_madd_mem4( out, add, mul, ptr )(out) = _mm256_fmadd_ps( _mm256_setr_m128( mul, _mm_setzero_ps() ), _mm256_setr_m128( _mm_loadu_ps( (float const*)(ptr) ), _mm_setzero_ps() ), add )
1638 #else
1639 #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul1, mul2 ) )
1640 #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ) ) )
1641 #define stbir__simdf8_madd_mem4( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_setr_m128( _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ), _mm_setzero_ps() ) )
1642 #endif
1643 #define stbir__if_simdf8_cast_to_simdf4( val ) _mm256_castps256_ps128( val )
1645 #endif
  #ifdef STBIR_FLOORF
  #undef STBIR_FLOORF
  #endif
  #define STBIR_FLOORF stbir_simd_floorf
  static stbir__inline float stbir_simd_floorf(float x)  // martins floorf
  {
    #if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41)
    // SSE4.1 and up: use the rounding instruction directly
    __m128 t = _mm_set_ss(x);
    return _mm_cvtss_f32( _mm_floor_ss(t, t) );
    #else
    // SSE2: truncate through int32, then subtract 1 if truncation rounded up (x < t)
    // only valid while x fits in a 32-bit int
    __m128 f = _mm_set_ss(x);
    __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f));
    __m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(f, t), _mm_set_ss(-1.0f)));
    return _mm_cvtss_f32(r);
    #endif
  }

  #ifdef STBIR_CEILF
  #undef STBIR_CEILF
  #endif
  #define STBIR_CEILF stbir_simd_ceilf
  static stbir__inline float stbir_simd_ceilf(float x)  // martins ceilf
  {
    #if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41)
    // SSE4.1 and up: use the rounding instruction directly
    __m128 t = _mm_set_ss(x);
    return _mm_cvtss_f32( _mm_ceil_ss(t, t) );
    #else
    // SSE2: truncate through int32, then add 1 if truncation rounded down (t < x)
    // only valid while x fits in a 32-bit int
    __m128 f = _mm_set_ss(x);
    __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f));
    __m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(t, f), _mm_set_ss(1.0f)));
    return _mm_cvtss_f32(r);
    #endif
  }
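  /* Illustrative only, not part of the library: a minimal scalar sketch of the
     truncate-then-adjust technique used by the SSE2 fallback branches of
     stbir_simd_floorf and stbir_simd_ceilf. The names sketch_floorf and
     sketch_ceilf are hypothetical, and the sketch assumes x fits in a 32-bit int.

```c
#include <assert.h>

// Hypothetical helper: floor by truncating toward zero, then stepping down
// by one when truncation moved the value up (negative non-integers).
static float sketch_floorf( float x )
{
  float t = (float)(int)x;
  return ( x < t ) ? t - 1.0f : t;
}

// Hypothetical helper: ceil by truncating toward zero, then stepping up
// by one when truncation moved the value down (positive non-integers).
static float sketch_ceilf( float x )
{
  float t = (float)(int)x;
  return ( t < x ) ? t + 1.0f : t;
}
```

     The SIMD versions do the same thing without a branch by ANDing the comparison
     mask with the constant -1.0f or 1.0f before the add. */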
1681#elif defined(STBIR_NEON)
1683 #include <arm_neon.h>
1685 #define stbir__simdf float32x4_t
1686 #define stbir__simdi uint32x4_t
1688 #define stbir_simdi_castf( reg ) vreinterpretq_u32_f32(reg)
1689 #define stbir_simdf_casti( reg ) vreinterpretq_f32_u32(reg)
1691 #define stbir__simdf_load( reg, ptr ) (reg) = vld1q_f32( (float const*)(ptr) )
1692 #define stbir__simdi_load( reg, ptr ) (reg) = vld1q_u32( (uint32_t const*)(ptr) )
1693 #define stbir__simdf_load1( out, ptr ) (out) = vld1q_dup_f32( (float const*)(ptr) ) // top values can be random (not denormal or nan for perf)
1694 #define stbir__simdi_load1( out, ptr ) (out) = vld1q_dup_u32( (uint32_t const*)(ptr) )
1695 #define stbir__simdf_load1z( out, ptr ) (out) = vld1q_lane_f32( (float const*)(ptr), vdupq_n_f32(0), 0 ) // top values must be zero
1696 #define stbir__simdf_frep4( fvar ) vdupq_n_f32( fvar )
1697 #define stbir__simdf_load1frep4( out, fvar ) (out) = vdupq_n_f32( fvar )
1698 #define stbir__simdf_load2( out, ptr ) (out) = vcombine_f32( vld1_f32( (float const*)(ptr) ), vcreate_f32(0) ) // top values can be random (not denormal or nan for perf)
1699 #define stbir__simdf_load2z( out, ptr ) (out) = vcombine_f32( vld1_f32( (float const*)(ptr) ), vcreate_f32(0) ) // top values must be zero
1700 #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = vcombine_f32( vget_low_f32(reg), vld1_f32( (float const*)(ptr) ) )
1702 #define stbir__simdf_zeroP() vdupq_n_f32(0)
1703 #define stbir__simdf_zero( reg ) (reg) = vdupq_n_f32(0)
1705 #define stbir__simdf_store( ptr, reg ) vst1q_f32( (float*)(ptr), reg )
1706 #define stbir__simdf_store1( ptr, reg ) vst1q_lane_f32( (float*)(ptr), reg, 0)
1707 #define stbir__simdf_store2( ptr, reg ) vst1_f32( (float*)(ptr), vget_low_f32(reg) )
1708 #define stbir__simdf_store2h( ptr, reg ) vst1_f32( (float*)(ptr), vget_high_f32(reg) )
1710 #define stbir__simdi_store( ptr, reg ) vst1q_u32( (uint32_t*)(ptr), reg )
1711 #define stbir__simdi_store1( ptr, reg ) vst1q_lane_u32( (uint32_t*)(ptr), reg, 0 )
1712 #define stbir__simdi_store2( ptr, reg ) vst1_u32( (uint32_t*)(ptr), vget_low_u32(reg) )
1714 #define stbir__prefetch( ptr )
1716 #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
1717 { \
1718 uint16x8_t l = vmovl_u8( vget_low_u8 ( vreinterpretq_u8_u32(ireg) ) ); \
1719 uint16x8_t h = vmovl_u8( vget_high_u8( vreinterpretq_u8_u32(ireg) ) ); \
1720 out0 = vmovl_u16( vget_low_u16 ( l ) ); \
1721 out1 = vmovl_u16( vget_high_u16( l ) ); \
1722 out2 = vmovl_u16( vget_low_u16 ( h ) ); \
1723 out3 = vmovl_u16( vget_high_u16( h ) ); \
1724 }
1726 #define stbir__simdi_expand_u8_to_1u32(out,ireg) \
1727 { \
1728 uint16x8_t tmp = vmovl_u8( vget_low_u8( vreinterpretq_u8_u32(ireg) ) ); \
1729 out = vmovl_u16( vget_low_u16( tmp ) ); \
1730 }
1732 #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \
1733 { \
1734 uint16x8_t tmp = vreinterpretq_u16_u32(ireg); \
1735 out0 = vmovl_u16( vget_low_u16 ( tmp ) ); \
1736 out1 = vmovl_u16( vget_high_u16( tmp ) ); \
1737 }
1739 #define stbir__simdf_convert_float_to_i32( i, f ) (i) = vreinterpretq_u32_s32( vcvtq_s32_f32(f) )
1740 #define stbir__simdf_convert_float_to_int( f ) vgetq_lane_s32(vcvtq_s32_f32(f), 0)
1741 #define stbir__simdi_to_int( i ) (int)vgetq_lane_u32(i, 0)
1742 #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),vdupq_n_f32(0))), 0))
1743 #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),vdupq_n_f32(0))), 0))
1744 #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = vcvtq_f32_s32( vreinterpretq_s32_u32(ireg) )
1745 #define stbir__simdf_add( out, reg0, reg1 ) (out) = vaddq_f32( reg0, reg1 )
1746 #define stbir__simdf_mult( out, reg0, reg1 ) (out) = vmulq_f32( reg0, reg1 )
1747 #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = vmulq_f32( reg, vld1q_f32( (float const*)(ptr) ) )
1748 #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = vmulq_f32( reg, vld1q_dup_f32( (float const*)(ptr) ) )
1749 #define stbir__simdf_add_mem( out, reg, ptr ) (out) = vaddq_f32( reg, vld1q_f32( (float const*)(ptr) ) )
1750 #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = vaddq_f32( reg, vld1q_dup_f32( (float const*)(ptr) ) )
  #ifdef STBIR_USE_FMA // not on by default, to keep simd results bit identical to non-simd (and x64 without madd identical to arm with madd)
    #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = vfmaq_f32( add, mul1, mul2 )
    #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = vfmaq_f32( add, mul1, mul2 )
    #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = vfmaq_f32( add, mul, vld1q_f32( (float const*)(ptr) ) )
    #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = vfmaq_f32( add, mul, vld1q_dup_f32( (float const*)(ptr) ) )
  #else
    #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = vaddq_f32( add, vmulq_f32( mul1, mul2 ) )
    #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = vaddq_f32( add, vmulq_f32( mul1, mul2 ) )
    #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = vaddq_f32( add, vmulq_f32( mul, vld1q_f32( (float const*)(ptr) ) ) )
    #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = vaddq_f32( add, vmulq_f32( mul, vld1q_dup_f32( (float const*)(ptr) ) ) )
  #endif
1764 #define stbir__simdf_add1( out, reg0, reg1 ) (out) = vaddq_f32( reg0, reg1 )
1765 #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = vmulq_f32( reg0, reg1 )
1767 #define stbir__simdf_and( out, reg0, reg1 ) (out) = vreinterpretq_f32_u32( vandq_u32( vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1) ) )
1768 #define stbir__simdf_or( out, reg0, reg1 ) (out) = vreinterpretq_f32_u32( vorrq_u32( vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1) ) )
1770 #define stbir__simdf_min( out, reg0, reg1 ) (out) = vminq_f32( reg0, reg1 )
1771 #define stbir__simdf_max( out, reg0, reg1 ) (out) = vmaxq_f32( reg0, reg1 )
1772 #define stbir__simdf_min1( out, reg0, reg1 ) (out) = vminq_f32( reg0, reg1 )
1773 #define stbir__simdf_max1( out, reg0, reg1 ) (out) = vmaxq_f32( reg0, reg1 )
1775 #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out) = vextq_f32( reg0, reg1, 3 )
1776 #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out) = vextq_f32( reg0, reg1, 2 )
1778 #define stbir__simdf_a1a1( out, alp, ones ) (out) = vzipq_f32(vuzpq_f32(alp, alp).val[1], ones).val[0]
1779 #define stbir__simdf_1a1a( out, alp, ones ) (out) = vzipq_f32(ones, vuzpq_f32(alp, alp).val[0]).val[0]
1781 #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ )
1783 #define stbir__simdf_aaa1( out, alp, ones ) (out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3, ones, 3)
1784 #define stbir__simdf_1aaa( out, alp, ones ) (out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0, ones, 0)
1786 #if defined( _MSC_VER ) && !defined(__clang__)
1787 #define stbir_make16(a,b,c,d) vcombine_u8( \
1788 vcreate_u8( (4*a+0) | ((4*a+1)<<8) | ((4*a+2)<<16) | ((4*a+3)<<24) | \
1789 ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56)), \
1790 vcreate_u8( (4*c+0) | ((4*c+1)<<8) | ((4*c+2)<<16) | ((4*c+3)<<24) | \
1791 ((stbir_uint64)(4*d+0)<<32) | ((stbir_uint64)(4*d+1)<<40) | ((stbir_uint64)(4*d+2)<<48) | ((stbir_uint64)(4*d+3)<<56) ) )
1793 static stbir__inline uint8x16x2_t stbir_make16x2(float32x4_t rega,float32x4_t regb)
1794 {
1795 uint8x16x2_t r = { vreinterpretq_u8_f32(rega), vreinterpretq_u8_f32(regb) };
1796 return r;
1797 }
1798 #else
1799 #define stbir_make16(a,b,c,d) (uint8x16_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3,4*c+0,4*c+1,4*c+2,4*c+3,4*d+0,4*d+1,4*d+2,4*d+3}
1800 #define stbir_make16x2(a,b) (uint8x16x2_t){{vreinterpretq_u8_f32(a),vreinterpretq_u8_f32(b)}}
1801 #endif
1803 #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vqtbl1q_u8( vreinterpretq_u8_f32(reg), stbir_make16(one, two, three, four) ) )
1804 #define stbir__simdf_swiz2( rega, regb, one, two, three, four ) vreinterpretq_f32_u8( vqtbl2q_u8( stbir_make16x2(rega,regb), stbir_make16(one, two, three, four) ) )
1806 #define stbir__simdi_16madd( out, reg0, reg1 ) \
1807 { \
1808 int16x8_t r0 = vreinterpretq_s16_u32(reg0); \
1809 int16x8_t r1 = vreinterpretq_s16_u32(reg1); \
1810 int32x4_t tmp0 = vmull_s16( vget_low_s16(r0), vget_low_s16(r1) ); \
1811 int32x4_t tmp1 = vmull_s16( vget_high_s16(r0), vget_high_s16(r1) ); \
1812 (out) = vreinterpretq_u32_s32( vpaddq_s32(tmp0, tmp1) ); \
1813 }
1815 #else
1817 #define stbir__simdf_aaa1( out, alp, ones ) (out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3)
1818 #define stbir__simdf_1aaa( out, alp, ones ) (out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0)
1820 #if defined( _MSC_VER ) && !defined(__clang__)
1821 static stbir__inline uint8x8x2_t stbir_make8x2(float32x4_t reg)
1822 {
1823 uint8x8x2_t r = { { vget_low_u8(vreinterpretq_u8_f32(reg)), vget_high_u8(vreinterpretq_u8_f32(reg)) } };
1824 return r;
1825 }
1826 #define stbir_make8(a,b) vcreate_u8( \
1827 (4*a+0) | ((4*a+1)<<8) | ((4*a+2)<<16) | ((4*a+3)<<24) | \
1828 ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56) )
1829 #else
1830 #define stbir_make8x2(reg) (uint8x8x2_t){ { vget_low_u8(vreinterpretq_u8_f32(reg)), vget_high_u8(vreinterpretq_u8_f32(reg)) } }
1831 #define stbir_make8(a,b) (uint8x8_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3}
1832 #endif
1834 #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vcombine_u8( \
1835 vtbl2_u8( stbir_make8x2( reg ), stbir_make8( one, two ) ), \
1836 vtbl2_u8( stbir_make8x2( reg ), stbir_make8( three, four ) ) ) )
1838 #define stbir__simdi_16madd( out, reg0, reg1 ) \
1839 { \
1840 int16x8_t r0 = vreinterpretq_s16_u32(reg0); \
1841 int16x8_t r1 = vreinterpretq_s16_u32(reg1); \
1842 int32x4_t tmp0 = vmull_s16( vget_low_s16(r0), vget_low_s16(r1) ); \
1843 int32x4_t tmp1 = vmull_s16( vget_high_s16(r0), vget_high_s16(r1) ); \
1844 int32x2_t out0 = vpadd_s32( vget_low_s32(tmp0), vget_high_s32(tmp0) ); \
1845 int32x2_t out1 = vpadd_s32( vget_low_s32(tmp1), vget_high_s32(tmp1) ); \
1846 (out) = vreinterpretq_u32_s32( vcombine_s32(out0, out1) ); \
1847 }
1849 #endif
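  /* Illustrative only, not part of the library: both NEON variants of
     stbir__simdi_16madd above reproduce what _mm_madd_epi16 does on SSE2, namely
     multiplying adjacent signed 16-bit pairs and summing each pair into one
     32-bit lane. A minimal scalar sketch follows; sketch_16madd is a hypothetical
     name, and the sketch ignores the single overflowing input combination
     (both pairs equal to -32768 * -32768).

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: scalar pairwise multiply-add of eight signed 16-bit values.
// out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1]
static void sketch_16madd( int32_t out[4], const int16_t a[8], const int16_t b[8] )
{
  int i;
  for ( i = 0 ; i < 4 ; i++ )
    out[i] = (int32_t)a[ 2*i ] * b[ 2*i ] + (int32_t)a[ 2*i + 1 ] * b[ 2*i + 1 ];
}
```

     The 64-bit ARM branch gets the pairwise sum from a single vpaddq_s32, while
     the 32-bit ARM branch builds it from two vpadd_s32 calls on the register halves. */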
1851 #define stbir__simdi_and( out, reg0, reg1 ) (out) = vandq_u32( reg0, reg1 )
1852 #define stbir__simdi_or( out, reg0, reg1 ) (out) = vorrq_u32( reg0, reg1 )
1854 #define stbir__simdf_pack_to_8bytes(out,aa,bb) \
1855 { \
1856 float32x4_t af = vmaxq_f32( vminq_f32(aa,STBIR__CONSTF(STBIR_max_uint8_as_float) ), vdupq_n_f32(0) ); \
1857 float32x4_t bf = vmaxq_f32( vminq_f32(bb,STBIR__CONSTF(STBIR_max_uint8_as_float) ), vdupq_n_f32(0) ); \
1858 int16x4_t ai = vqmovn_s32( vcvtq_s32_f32( af ) ); \
1859 int16x4_t bi = vqmovn_s32( vcvtq_s32_f32( bf ) ); \
1860 uint8x8_t out8 = vqmovun_s16( vcombine_s16(ai, bi) ); \
1861 out = vreinterpretq_u32_u8( vcombine_u8(out8, out8) ); \
1862 }
1864 #define stbir__simdf_pack_to_8words(out,aa,bb) \
1865 { \
1866 float32x4_t af = vmaxq_f32( vminq_f32(aa,STBIR__CONSTF(STBIR_max_uint16_as_float) ), vdupq_n_f32(0) ); \
1867 float32x4_t bf = vmaxq_f32( vminq_f32(bb,STBIR__CONSTF(STBIR_max_uint16_as_float) ), vdupq_n_f32(0) ); \
1868 int32x4_t ai = vcvtq_s32_f32( af ); \
1869 int32x4_t bi = vcvtq_s32_f32( bf ); \
1870 out = vreinterpretq_u32_u16( vcombine_u16(vqmovun_s32(ai), vqmovun_s32(bi)) ); \
1871 }
1873 #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \
1874 { \
1875 int16x4x2_t tmp0 = vzip_s16( vqmovn_s32(vreinterpretq_s32_u32(r0)), vqmovn_s32(vreinterpretq_s32_u32(r2)) ); \
1876 int16x4x2_t tmp1 = vzip_s16( vqmovn_s32(vreinterpretq_s32_u32(r1)), vqmovn_s32(vreinterpretq_s32_u32(r3)) ); \
1877 uint8x8x2_t out = \
1878 { { \
1879 vqmovun_s16( vcombine_s16(tmp0.val[0], tmp0.val[1]) ), \
1880 vqmovun_s16( vcombine_s16(tmp1.val[0], tmp1.val[1]) ), \
1881 } }; \
1882 vst2_u8(ptr, out); \
1883 }
1885 #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \
1886 { \
1887 float32x4x4_t tmp = vld4q_f32(ptr); \
1888 o0 = tmp.val[0]; \
1889 o1 = tmp.val[1]; \
1890 o2 = tmp.val[2]; \
1891 o3 = tmp.val[3]; \
1892 }
1894 #define stbir__simdi_32shr( out, reg, imm ) out = vshrq_n_u32( reg, imm )
1896 #if defined( _MSC_VER ) && !defined(__clang__)
1897 #define STBIR__SIMDF_CONST(var, x) __declspec(align(8)) float var[] = { x, x, x, x }
1898 #define STBIR__SIMDI_CONST(var, x) __declspec(align(8)) uint32_t var[] = { x, x, x, x }
1899 #define STBIR__CONSTF(var) (*(const float32x4_t*)var)
1900 #define STBIR__CONSTI(var) (*(const uint32x4_t*)var)
1901 #else
1902 #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = { x, x, x, x }
1903 #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { x, x, x, x }
1904 #define STBIR__CONSTF(var) (var)
1905 #define STBIR__CONSTI(var) (var)
1906 #endif
1908 #ifdef STBIR_FLOORF
1909 #undef STBIR_FLOORF
1910 #endif
1911 #define STBIR_FLOORF stbir_simd_floorf
1912 static stbir__inline float stbir_simd_floorf(float x)
1913 {
1914 #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ )
1915 return vget_lane_f32( vrndm_f32( vdup_n_f32(x) ), 0);
1916 #else
1917 float32x2_t f = vdup_n_f32(x);
1918 float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f));
1919 uint32x2_t a = vclt_f32(f, t);
1920 uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(-1.0f));
1921 float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b)));
1922 return vget_lane_f32(r, 0);
1923 #endif
1924 }
1926 #ifdef STBIR_CEILF
1927 #undef STBIR_CEILF
1928 #endif
1929 #define STBIR_CEILF stbir_simd_ceilf
1930 static stbir__inline float stbir_simd_ceilf(float x)
1931 {
1932 #if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ )
1933 return vget_lane_f32( vrndp_f32( vdup_n_f32(x) ), 0);
1934 #else
1935 float32x2_t f = vdup_n_f32(x);
1936 float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f));
1937 uint32x2_t a = vclt_f32(t, f);
1938 uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(1.0f));
1939 float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b)));
1940 return vget_lane_f32(r, 0);
1941 #endif
1942 }
1944 #define STBIR_SIMD
1946#elif defined(STBIR_WASM)
1948 #include <wasm_simd128.h>
1950 #define stbir__simdf v128_t
1951 #define stbir__simdi v128_t
1953 #define stbir_simdi_castf( reg ) (reg)
1954 #define stbir_simdf_casti( reg ) (reg)
1956 #define stbir__simdf_load( reg, ptr ) (reg) = wasm_v128_load( (void const*)(ptr) )
1957 #define stbir__simdi_load( reg, ptr ) (reg) = wasm_v128_load( (void const*)(ptr) )
1958 #define stbir__simdf_load1( out, ptr ) (out) = wasm_v128_load32_splat( (void const*)(ptr) ) // top values can be random (not denormal or nan for perf)
1959 #define stbir__simdi_load1( out, ptr ) (out) = wasm_v128_load32_splat( (void const*)(ptr) )
1960 #define stbir__simdf_load1z( out, ptr ) (out) = wasm_v128_load32_zero( (void const*)(ptr) ) // top values must be zero
1961 #define stbir__simdf_frep4( fvar ) wasm_f32x4_splat( fvar )
1962 #define stbir__simdf_load1frep4( out, fvar ) (out) = wasm_f32x4_splat( fvar )
1963 #define stbir__simdf_load2( out, ptr ) (out) = wasm_v128_load64_splat( (void const*)(ptr) ) // top values can be random (not denormal or nan for perf)
1964 #define stbir__simdf_load2z( out, ptr ) (out) = wasm_v128_load64_zero( (void const*)(ptr) ) // top values must be zero
1965 #define stbir__simdf_load2hmerge( out, reg, ptr ) (out) = wasm_v128_load64_lane( (void const*)(ptr), reg, 1 )
1967 #define stbir__simdf_zeroP() wasm_f32x4_const_splat(0)
1968 #define stbir__simdf_zero( reg ) (reg) = wasm_f32x4_const_splat(0)
1970 #define stbir__simdf_store( ptr, reg ) wasm_v128_store( (void*)(ptr), reg )
1971 #define stbir__simdf_store1( ptr, reg ) wasm_v128_store32_lane( (void*)(ptr), reg, 0 )
1972 #define stbir__simdf_store2( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 0 )
1973 #define stbir__simdf_store2h( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 1 )
1975 #define stbir__simdi_store( ptr, reg ) wasm_v128_store( (void*)(ptr), reg )
1976 #define stbir__simdi_store1( ptr, reg ) wasm_v128_store32_lane( (void*)(ptr), reg, 0 )
1977 #define stbir__simdi_store2( ptr, reg ) wasm_v128_store64_lane( (void*)(ptr), reg, 0 )
1979 #define stbir__prefetch( ptr )
1981 #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
1982 { \
1983 v128_t l = wasm_u16x8_extend_low_u8x16 ( ireg ); \
1984 v128_t h = wasm_u16x8_extend_high_u8x16( ireg ); \
1985 out0 = wasm_u32x4_extend_low_u16x8 ( l ); \
1986 out1 = wasm_u32x4_extend_high_u16x8( l ); \
1987 out2 = wasm_u32x4_extend_low_u16x8 ( h ); \
1988 out3 = wasm_u32x4_extend_high_u16x8( h ); \
1989 }
1991 #define stbir__simdi_expand_u8_to_1u32(out,ireg) \
1992 { \
1993 v128_t tmp = wasm_u16x8_extend_low_u8x16(ireg); \
1994 out = wasm_u32x4_extend_low_u16x8(tmp); \
1995 }
1997 #define stbir__simdi_expand_u16_to_u32(out0,out1,ireg) \
1998 { \
1999 out0 = wasm_u32x4_extend_low_u16x8 ( ireg ); \
2000 out1 = wasm_u32x4_extend_high_u16x8( ireg ); \
2001 }
2003 #define stbir__simdf_convert_float_to_i32( i, f ) (i) = wasm_i32x4_trunc_sat_f32x4(f)
2004 #define stbir__simdf_convert_float_to_int( f ) wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(f), 0)
2005 #define stbir__simdi_to_int( i ) wasm_i32x4_extract_lane(i, 0)
2006 #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint8_as_float),wasm_f32x4_const_splat(0))), 0))
2007 #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint16_as_float),wasm_f32x4_const_splat(0))), 0))
2008 #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = wasm_f32x4_convert_i32x4(ireg)
2009 #define stbir__simdf_add( out, reg0, reg1 ) (out) = wasm_f32x4_add( reg0, reg1 )
2010 #define stbir__simdf_mult( out, reg0, reg1 ) (out) = wasm_f32x4_mul( reg0, reg1 )
2011 #define stbir__simdf_mult_mem( out, reg, ptr ) (out) = wasm_f32x4_mul( reg, wasm_v128_load( (void const*)(ptr) ) )
2012 #define stbir__simdf_mult1_mem( out, reg, ptr ) (out) = wasm_f32x4_mul( reg, wasm_v128_load32_splat( (void const*)(ptr) ) )
2013 #define stbir__simdf_add_mem( out, reg, ptr ) (out) = wasm_f32x4_add( reg, wasm_v128_load( (void const*)(ptr) ) )
2014 #define stbir__simdf_add1_mem( out, reg, ptr ) (out) = wasm_f32x4_add( reg, wasm_v128_load32_splat( (void const*)(ptr) ) )
2016 #define stbir__simdf_madd( out, add, mul1, mul2 ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul1, mul2 ) )
2017 #define stbir__simdf_madd1( out, add, mul1, mul2 ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul1, mul2 ) )
2018 #define stbir__simdf_madd_mem( out, add, mul, ptr ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul, wasm_v128_load( (void const*)(ptr) ) ) )
2019 #define stbir__simdf_madd1_mem( out, add, mul, ptr ) (out) = wasm_f32x4_add( add, wasm_f32x4_mul( mul, wasm_v128_load32_splat( (void const*)(ptr) ) ) )
2021 #define stbir__simdf_add1( out, reg0, reg1 ) (out) = wasm_f32x4_add( reg0, reg1 )
2022 #define stbir__simdf_mult1( out, reg0, reg1 ) (out) = wasm_f32x4_mul( reg0, reg1 )
2024 #define stbir__simdf_and( out, reg0, reg1 ) (out) = wasm_v128_and( reg0, reg1 )
2025 #define stbir__simdf_or( out, reg0, reg1 ) (out) = wasm_v128_or( reg0, reg1 )
2027 #define stbir__simdf_min( out, reg0, reg1 ) (out) = wasm_f32x4_min( reg0, reg1 )
2028 #define stbir__simdf_max( out, reg0, reg1 ) (out) = wasm_f32x4_max( reg0, reg1 )
2029 #define stbir__simdf_min1( out, reg0, reg1 ) (out) = wasm_f32x4_min( reg0, reg1 )
2030 #define stbir__simdf_max1( out, reg0, reg1 ) (out) = wasm_f32x4_max( reg0, reg1 )
2032 #define stbir__simdf_0123ABCDto3ABx( out, reg0, reg1 ) (out) = wasm_i32x4_shuffle( reg0, reg1, 3, 4, 5, -1 )
2033 #define stbir__simdf_0123ABCDto23Ax( out, reg0, reg1 ) (out) = wasm_i32x4_shuffle( reg0, reg1, 2, 3, 4, -1 )
2035 #define stbir__simdf_aaa1(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 3, 3, 3, 4)
2036 #define stbir__simdf_1aaa(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 0, 0)
2037 #define stbir__simdf_a1a1(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 1, 4, 3, 4)
2038 #define stbir__simdf_1a1a(out,alp,ones) (out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 4, 2)
2040 #define stbir__simdf_swiz( reg, one, two, three, four ) wasm_i32x4_shuffle(reg, reg, one, two, three, four)
2042 #define stbir__simdi_and( out, reg0, reg1 ) (out) = wasm_v128_and( reg0, reg1 )
2043 #define stbir__simdi_or( out, reg0, reg1 ) (out) = wasm_v128_or( reg0, reg1 )
2044 #define stbir__simdi_16madd( out, reg0, reg1 ) (out) = wasm_i32x4_dot_i16x8( reg0, reg1 )
2046 #define stbir__simdf_pack_to_8bytes(out,aa,bb) \
2047 { \
2048 v128_t af = wasm_f32x4_max( wasm_f32x4_min(aa, STBIR_max_uint8_as_float), wasm_f32x4_const_splat(0) ); \
2049 v128_t bf = wasm_f32x4_max( wasm_f32x4_min(bb, STBIR_max_uint8_as_float), wasm_f32x4_const_splat(0) ); \
2050 v128_t ai = wasm_i32x4_trunc_sat_f32x4( af ); \
2051 v128_t bi = wasm_i32x4_trunc_sat_f32x4( bf ); \
2052 v128_t out16 = wasm_i16x8_narrow_i32x4( ai, bi ); \
2053 out = wasm_u8x16_narrow_i16x8( out16, out16 ); \
2054 }
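
/* Illustration only (not part of the library): a scalar sketch of what the
   clamp-then-truncate conversions above compute for one lane. It assumes
   STBIR_max_uint8_as_float holds 255.0f in every lane. The helper name
   example_float_to_uint8 is hypothetical, and NaN inputs are not modelled.

```c
#include <assert.h>

// Clamp to [0,255] first, then truncate toward zero, mirroring the
// min -> max -> trunc_sat sequence used by the macros (one lane at a time).
static unsigned char example_float_to_uint8(float f)
{
    assert(f == f);               // NaN is outside the scope of this sketch
    if (f > 255.0f) f = 255.0f;   // plays the role of wasm_f32x4_min(f, max)
    if (f < 0.0f)   f = 0.0f;     // plays the role of wasm_f32x4_max(..., 0)
    return (unsigned char)(int)f; // truncation, no rounding
}
```

   Because the conversion truncates, a caller that wants round-to-nearest
   would typically add 0.5f before converting.
*/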
2056 #define stbir__simdf_pack_to_8words(out,aa,bb) \
2057 { \
2058 v128_t af = wasm_f32x4_max( wasm_f32x4_min(aa, STBIR_max_uint16_as_float), wasm_f32x4_const_splat(0)); \
2059 v128_t bf = wasm_f32x4_max( wasm_f32x4_min(bb, STBIR_max_uint16_as_float), wasm_f32x4_const_splat(0)); \
2060 v128_t ai = wasm_i32x4_trunc_sat_f32x4( af ); \
2061 v128_t bi = wasm_i32x4_trunc_sat_f32x4( bf ); \
2062 out = wasm_u16x8_narrow_i32x4( ai, bi ); \
2063 }
2065 #define stbir__interleave_pack_and_store_16_u8( ptr, r0, r1, r2, r3 ) \
2066 { \
2067 v128_t tmp0 = wasm_i16x8_narrow_i32x4(r0, r1); \
2068 v128_t tmp1 = wasm_i16x8_narrow_i32x4(r2, r3); \
2069 v128_t tmp = wasm_u8x16_narrow_i16x8(tmp0, tmp1); \
2070 tmp = wasm_i8x16_shuffle(tmp, tmp, 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15); \
2071 wasm_v128_store( (void*)(ptr), tmp); \
2072 }
2074 #define stbir__simdf_load4_transposed( o0, o1, o2, o3, ptr ) \
2075 { \
2076 v128_t t0 = wasm_v128_load( ptr ); \
2077 v128_t t1 = wasm_v128_load( ptr+4 ); \
2078 v128_t t2 = wasm_v128_load( ptr+8 ); \
2079 v128_t t3 = wasm_v128_load( ptr+12 ); \
2080 v128_t s0 = wasm_i32x4_shuffle(t0, t1, 0, 4, 2, 6); \
2081 v128_t s1 = wasm_i32x4_shuffle(t0, t1, 1, 5, 3, 7); \
2082 v128_t s2 = wasm_i32x4_shuffle(t2, t3, 0, 4, 2, 6); \
2083 v128_t s3 = wasm_i32x4_shuffle(t2, t3, 1, 5, 3, 7); \
2084 o0 = wasm_i32x4_shuffle(s0, s2, 0, 1, 4, 5); \
2085 o1 = wasm_i32x4_shuffle(s1, s3, 0, 1, 4, 5); \
2086 o2 = wasm_i32x4_shuffle(s0, s2, 2, 3, 6, 7); \
2087 o3 = wasm_i32x4_shuffle(s1, s3, 2, 3, 6, 7); \
2088 }
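
/* Illustration only (not part of the library): the two-stage shuffle in
   stbir__simdf_load4_transposed can be checked with a scalar emulation.
   example_vec4 and example_shuffle are hypothetical stand-ins for v128_t
   and wasm_i32x4_shuffle; only the index patterns are taken from the macro.

```c
#include <assert.h>

typedef struct { float lane[4]; } example_vec4;

// Emulates a two-input 4-lane shuffle: indices 0..3 pick from a, 4..7 from b.
static example_vec4 example_shuffle(example_vec4 a, example_vec4 b,
                                    int i0, int i1, int i2, int i3)
{
    example_vec4 r;
    int idx[4];
    int k;
    idx[0] = i0; idx[1] = i1; idx[2] = i2; idx[3] = i3;
    for (k = 0; k < 4; k++) {
        assert(idx[k] >= 0 && idx[k] < 8);
        r.lane[k] = (idx[k] < 4) ? a.lane[idx[k]] : b.lane[idx[k] - 4];
    }
    return r;
}

// Reads 16 floats as 4 rows of 4 and returns the 4 columns in out[0..3].
static void example_load4_transposed(example_vec4 out[4], const float *ptr)
{
    example_vec4 t[4], s0, s1, s2, s3;
    int r, c;
    for (r = 0; r < 4; r++)
        for (c = 0; c < 4; c++)
            t[r].lane[c] = ptr[4 * r + c];
    // Stage 1: interleave even lanes and odd lanes of each row pair.
    s0 = example_shuffle(t[0], t[1], 0, 4, 2, 6);
    s1 = example_shuffle(t[0], t[1], 1, 5, 3, 7);
    s2 = example_shuffle(t[2], t[3], 0, 4, 2, 6);
    s3 = example_shuffle(t[2], t[3], 1, 5, 3, 7);
    // Stage 2: join the low halves and the high halves into columns.
    out[0] = example_shuffle(s0, s2, 0, 1, 4, 5);
    out[1] = example_shuffle(s1, s3, 0, 1, 4, 5);
    out[2] = example_shuffle(s0, s2, 2, 3, 6, 7);
    out[3] = example_shuffle(s1, s3, 2, 3, 6, 7);
}
```

   With ptr[i] = i, out[k].lane[j] comes back as 4*j + k, i.e. column k.
*/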
2090 #define stbir__simdi_32shr( out, reg, imm ) out = wasm_u32x4_shr( reg, imm )
2092 typedef float stbir__f32x4 __attribute__((__vector_size__(16), __aligned__(16)));
2093 #define STBIR__SIMDF_CONST(var, x) stbir__simdf var = (v128_t)(stbir__f32x4){ x, x, x, x }
2094 #define STBIR__SIMDI_CONST(var, x) stbir__simdi var = { x, x, x, x }
2095 #define STBIR__CONSTF(var) (var)
2096 #define STBIR__CONSTI(var) (var)
2098 #ifdef STBIR_FLOORF
2099 #undef STBIR_FLOORF
2100 #endif
2101 #define STBIR_FLOORF stbir_simd_floorf
2102 static stbir__inline float stbir_simd_floorf(float x)
2103 {
2104 return wasm_f32x4_extract_lane( wasm_f32x4_floor( wasm_f32x4_splat(x) ), 0);
2105 }
2107 #ifdef STBIR_CEILF
2108 #undef STBIR_CEILF
2109 #endif
2110 #define STBIR_CEILF stbir_simd_ceilf
2111 static stbir__inline float stbir_simd_ceilf(float x)
2112 {
2113 return wasm_f32x4_extract_lane( wasm_f32x4_ceil( wasm_f32x4_splat(x) ), 0);
2114 }
2116 #define STBIR_SIMD
2118#endif // SSE2/NEON/WASM
2120#endif // NO SIMD
2122#ifdef STBIR_SIMD8
2123 #define stbir__simdfX stbir__simdf8
2124 #define stbir__simdiX stbir__simdi8
2125 #define stbir__simdfX_load stbir__simdf8_load
2126 #define stbir__simdiX_load stbir__simdi8_load
2127 #define stbir__simdfX_mult stbir__simdf8_mult
2128 #define stbir__simdfX_add_mem stbir__simdf8_add_mem
2129 #define stbir__simdfX_madd_mem stbir__simdf8_madd_mem
2130 #define stbir__simdfX_store stbir__simdf8_store
2131 #define stbir__simdiX_store stbir__simdi8_store
2132 #define stbir__simdf_frepX stbir__simdf8_frep8
2133 #define stbir__simdfX_madd stbir__simdf8_madd
2134 #define stbir__simdfX_min stbir__simdf8_min
2135 #define stbir__simdfX_max stbir__simdf8_max
2136 #define stbir__simdfX_aaa1 stbir__simdf8_aaa1
2137 #define stbir__simdfX_1aaa stbir__simdf8_1aaa
2138 #define stbir__simdfX_a1a1 stbir__simdf8_a1a1
2139 #define stbir__simdfX_1a1a stbir__simdf8_1a1a
2140 #define stbir__simdfX_convert_float_to_i32 stbir__simdf8_convert_float_to_i32
2141 #define stbir__simdfX_pack_to_words stbir__simdf8_pack_to_16words
2142 #define stbir__simdfX_zero stbir__simdf8_zero
2143 #define STBIR_onesX STBIR_ones8
2144 #define STBIR_max_uint8_as_floatX STBIR_max_uint8_as_float8
2145 #define STBIR_max_uint16_as_floatX STBIR_max_uint16_as_float8
2146 #define STBIR_simd_point5X STBIR_simd_point58
2147 #define stbir__simdfX_float_count 8
2148 #define stbir__simdfX_0123to1230 stbir__simdf8_0123to12301230
2149 #define stbir__simdfX_0123to2103 stbir__simdf8_0123to21032103
2150 static const stbir__simdf8 STBIR_max_uint16_as_float_inverted8 = { stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted,stbir__max_uint16_as_float_inverted };
2151 static const stbir__simdf8 STBIR_max_uint8_as_float_inverted8 = { stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted,stbir__max_uint8_as_float_inverted };
2152 static const stbir__simdf8 STBIR_ones8 = { 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0 };
2153 static const stbir__simdf8 STBIR_simd_point58 = { 0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5 };
2154 static const stbir__simdf8 STBIR_max_uint8_as_float8 = { stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float, stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float,stbir__max_uint8_as_float };
2155 static const stbir__simdf8 STBIR_max_uint16_as_float8 = { stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float, stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float,stbir__max_uint16_as_float };
2156#else
2157 #define stbir__simdfX stbir__simdf
2158 #define stbir__simdiX stbir__simdi
2159 #define stbir__simdfX_load stbir__simdf_load
2160 #define stbir__simdiX_load stbir__simdi_load
2161 #define stbir__simdfX_mult stbir__simdf_mult
2162 #define stbir__simdfX_add_mem stbir__simdf_add_mem
2163 #define stbir__simdfX_madd_mem stbir__simdf_madd_mem
2164 #define stbir__simdfX_store stbir__simdf_store
2165 #define stbir__simdiX_store stbir__simdi_store
2166 #define stbir__simdf_frepX stbir__simdf_frep4
2167 #define stbir__simdfX_madd stbir__simdf_madd
2168 #define stbir__simdfX_min stbir__simdf_min
2169 #define stbir__simdfX_max stbir__simdf_max
2170 #define stbir__simdfX_aaa1 stbir__simdf_aaa1
2171 #define stbir__simdfX_1aaa stbir__simdf_1aaa
2172 #define stbir__simdfX_a1a1 stbir__simdf_a1a1
2173 #define stbir__simdfX_1a1a stbir__simdf_1a1a
2174 #define stbir__simdfX_convert_float_to_i32 stbir__simdf_convert_float_to_i32
2175 #define stbir__simdfX_pack_to_words stbir__simdf_pack_to_8words
2176 #define stbir__simdfX_zero stbir__simdf_zero
2177 #define STBIR_onesX STBIR__CONSTF(STBIR_ones)
2178 #define STBIR_simd_point5X STBIR__CONSTF(STBIR_simd_point5)
2179 #define STBIR_max_uint8_as_floatX STBIR__CONSTF(STBIR_max_uint8_as_float)
2180 #define STBIR_max_uint16_as_floatX STBIR__CONSTF(STBIR_max_uint16_as_float)
2181 #define stbir__simdfX_float_count 4
2182 #define stbir__if_simdf8_cast_to_simdf4( val ) ( val )
2183 #define stbir__simdfX_0123to1230 stbir__simdf_0123to1230
2184 #define stbir__simdfX_0123to2103 stbir__simdf_0123to2103
2185#endif
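
/* Illustration only (not part of the library): the stbir__simdfX aliases
   above let one loop body compile for either 4-wide or 8-wide vectors.
   The sketch below shows the same pattern with plain structs. Every
   example_ name, and the EXAMPLE_SIMD8 switch, is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// EXAMPLE_SIMD8 stands in for STBIR_SIMD8 in this sketch.
#ifdef EXAMPLE_SIMD8
  #define example_vecX_float_count 8
#else
  #define example_vecX_float_count 4
#endif

typedef struct { float lane[example_vecX_float_count]; } example_vecX;

static example_vecX example_vecX_load(const float *p)
{
    example_vecX v;
    int i;
    for (i = 0; i < example_vecX_float_count; i++) v.lane[i] = p[i];
    return v;
}

static void example_vecX_store(float *p, example_vecX v)
{
    int i;
    for (i = 0; i < example_vecX_float_count; i++) p[i] = v.lane[i];
}

// Returns add + mul * scale, lane by lane.
static example_vecX example_vecX_madd(example_vecX add, example_vecX mul, float scale)
{
    int i;
    for (i = 0; i < example_vecX_float_count; i++)
        add.lane[i] += mul.lane[i] * scale;
    return add;
}

// out[i] += in[i] * scale. The loop never mentions the width directly,
// so it is unchanged whichever branch of the #ifdef is active.
static void example_scale_accumulate(float *out, const float *in,
                                     float scale, size_t count)
{
    size_t i;
    assert(count % example_vecX_float_count == 0);
    for (i = 0; i < count; i += example_vecX_float_count) {
        example_vecX acc = example_vecX_load(out + i);
        example_vecX src = example_vecX_load(in + i);
        example_vecX_store(out + i, example_vecX_madd(acc, src, scale));
    }
}
```
*/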
2188#if defined(STBIR_NEON) && !defined(_M_ARM) && !defined(__arm__)
2190 #if defined( _MSC_VER ) && !defined(__clang__)
2191 typedef __int16 stbir__FP16;
2192 #else
2193 typedef float16_t stbir__FP16;
2194 #endif
#else // no NEON, or 32-bit ARM
2198 typedef union stbir__FP16
2199 {
2200 unsigned short u;
2201 } stbir__FP16;
2203#endif
2205#if (!defined(STBIR_NEON) && !defined(STBIR_FP16C)) || (defined(STBIR_NEON) && defined(_M_ARM)) || (defined(STBIR_NEON) && defined(__arm__))
2207 // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668
2209 static stbir__inline float stbir__half_to_float( stbir__FP16 h )
2210 {
2211 static const stbir__FP32 magic = { (254 - 15) << 23 };
2212 static const stbir__FP32 was_infnan = { (127 + 16) << 23 };
2213 stbir__FP32 o;
2215 o.u = (h.u & 0x7fff) << 13; // exponent/mantissa bits
2216 o.f *= magic.f; // exponent adjust
2217 if (o.f >= was_infnan.f) // make sure Inf/NaN survive
2218 o.u |= 255 << 23;
2219 o.u |= (h.u & 0x8000) << 16; // sign bit
2220 return o.f;
2221 }
2223 static stbir__inline stbir__FP16 stbir__float_to_half(float val)
2224 {
2225 stbir__FP32 f32infty = { 255 << 23 };
2226 stbir__FP32 f16max = { (127 + 16) << 23 };
2227 stbir__FP32 denorm_magic = { ((127 - 15) + (23 - 10) + 1) << 23 };
2228 unsigned int sign_mask = 0x80000000u;
2229 stbir__FP16 o = { 0 };
2230 stbir__FP32 f;
2231 unsigned int sign;
2233 f.f = val;
2234 sign = f.u & sign_mask;
2235 f.u ^= sign;
2237 if (f.u >= f16max.u) // result is Inf or NaN (all exponent bits set)
2238 o.u = (f.u > f32infty.u) ? 0x7e00 : 0x7c00; // NaN->qNaN and Inf->Inf
2239 else // (De)normalized number or zero
2240 {
2241 if (f.u < (113 << 23)) // resulting FP16 is subnormal or zero
2242 {
2243 // use a magic value to align our 10 mantissa bits at the bottom of
2244 // the float. as long as FP addition is round-to-nearest-even this
2245 // just works.
2246 f.f += denorm_magic.f;
      // and one integer subtract of the bias later, we have our final half float!
2248 o.u = (unsigned short) ( f.u - denorm_magic.u );
2249 }
2250 else
2251 {
2252 unsigned int mant_odd = (f.u >> 13) & 1; // resulting mantissa is odd
2253 // update exponent, rounding bias part 1
2254 f.u = f.u + ((15u - 127) << 23) + 0xfff;
2255 // rounding bias part 2
2256 f.u += mant_odd;
2257 // take the bits!
2258 o.u = (unsigned short) ( f.u >> 13 );
2259 }
2260 }
2262 o.u |= sign >> 16;
2263 return o;
2264 }
2266#endif
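
/* Illustration only (not part of the library): a standalone restatement of
   the scalar half <-> float routines above, using raw uint16_t bit patterns
   so a few known values can be checked. The example_ names are hypothetical,
   and the sketch assumes IEEE-754 binary32 floats with subnormals enabled.

```c
#include <assert.h>
#include <stdint.h>

typedef union { uint32_t u; float f; } example_fp32;

static float example_half_to_float(uint16_t h)
{
    example_fp32 magic, was_infnan, o;
    assert(sizeof(float) == sizeof(uint32_t));
    magic.u = (254u - 15u) << 23;
    was_infnan.u = (127u + 16u) << 23;
    o.u = (uint32_t)(h & 0x7fff) << 13;   // exponent/mantissa bits
    o.f *= magic.f;                       // exponent adjust
    if (o.f >= was_infnan.f)              // make sure Inf/NaN survive
        o.u |= 255u << 23;
    o.u |= (uint32_t)(h & 0x8000) << 16;  // sign bit
    return o.f;
}

static uint16_t example_float_to_half(float val)
{
    example_fp32 f, f32infty, f16max, denorm_magic;
    uint32_t sign;
    uint16_t o;
    assert(sizeof(float) == sizeof(uint32_t));
    f32infty.u = 255u << 23;
    f16max.u = (127u + 16u) << 23;
    denorm_magic.u = ((127u - 15u) + (23u - 10u) + 1u) << 23;
    f.f = val;
    sign = f.u & 0x80000000u;
    f.u ^= sign;
    if (f.u >= f16max.u) {                // Inf, NaN, or too large for half
        o = (f.u > f32infty.u) ? 0x7e00 : 0x7c00;
    } else if (f.u < (113u << 23)) {      // result is subnormal or zero
        f.f += denorm_magic.f;            // FP add does the rounding
        o = (uint16_t)(f.u - denorm_magic.u);
    } else {                              // result is a normal half
        uint32_t mant_odd = (f.u >> 13) & 1;
        f.u += ((15u - 127u) << 23) + 0xfff;  // rebias exponent, round part 1
        f.u += mant_odd;                      // round part 2 (ties to even)
        o = (uint16_t)(f.u >> 13);
    }
    return (uint16_t)(o | (sign >> 16));
}
```

   For example, 1.0f maps to the bit pattern 0x3c00 and 65504.0f (the largest
   finite half) to 0x7bff, while 1.0e6f overflows to infinity, 0x7c00.
*/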
2269#if defined(STBIR_FP16C)
2271 #include <immintrin.h>
2273 static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
2274 {
2275 _mm256_storeu_ps( (float*)output, _mm256_cvtph_ps( _mm_loadu_si128( (__m128i const* )input ) ) );
2276 }
2278 static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
2279 {
2280 _mm_storeu_si128( (__m128i*)output, _mm256_cvtps_ph( _mm256_loadu_ps( input ), 0 ) );
2281 }
2283 static stbir__inline float stbir__half_to_float( stbir__FP16 h )
2284 {
2285 return _mm_cvtss_f32( _mm_cvtph_ps( _mm_cvtsi32_si128( (int)h.u ) ) );
2286 }
2288 static stbir__inline stbir__FP16 stbir__float_to_half( float f )
2289 {
2290 stbir__FP16 h;
2291 h.u = (unsigned short) _mm_cvtsi128_si32( _mm_cvtps_ph( _mm_set_ss( f ), 0 ) );
2292 return h;
2293 }
2295#elif defined(STBIR_SSE2)
2297 // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668
2298 stbir__inline static void stbir__half_to_float_SIMD(float * output, void const * input)
2299 {
2300 static const STBIR__SIMDI_CONST(mask_nosign, 0x7fff);
2301 static const STBIR__SIMDI_CONST(smallest_normal, 0x0400);
2302 static const STBIR__SIMDI_CONST(infinity, 0x7c00);
2303 static const STBIR__SIMDI_CONST(expadjust_normal, (127 - 15) << 23);
2304 static const STBIR__SIMDI_CONST(magic_denorm, 113 << 23);
2306 __m128i i = _mm_loadu_si128 ( (__m128i const*)(input) );
2307 __m128i h = _mm_unpacklo_epi16 ( i, _mm_setzero_si128() );
2308 __m128i mnosign = STBIR__CONSTI(mask_nosign);
2309 __m128i eadjust = STBIR__CONSTI(expadjust_normal);
2310 __m128i smallest = STBIR__CONSTI(smallest_normal);
2311 __m128i infty = STBIR__CONSTI(infinity);
2312 __m128i expmant = _mm_and_si128(mnosign, h);
2313 __m128i justsign = _mm_xor_si128(h, expmant);
2314 __m128i b_notinfnan = _mm_cmpgt_epi32(infty, expmant);
2315 __m128i b_isdenorm = _mm_cmpgt_epi32(smallest, expmant);
2316 __m128i shifted = _mm_slli_epi32(expmant, 13);
2317 __m128i adj_infnan = _mm_andnot_si128(b_notinfnan, eadjust);
2318 __m128i adjusted = _mm_add_epi32(eadjust, shifted);
2319 __m128i den1 = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm));
2320 __m128i adjusted2 = _mm_add_epi32(adjusted, adj_infnan);
2321 __m128 den2 = _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm);
2322 __m128 adjusted3 = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm));
2323 __m128 adjusted4 = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm), _mm_castsi128_ps(adjusted2));
2324 __m128 adjusted5 = _mm_or_ps(adjusted3, adjusted4);
2325 __m128i sign = _mm_slli_epi32(justsign, 16);
2326 __m128 final = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign));
2327 stbir__simdf_store( output + 0, final );
2329 h = _mm_unpackhi_epi16 ( i, _mm_setzero_si128() );
2330 expmant = _mm_and_si128(mnosign, h);
2331 justsign = _mm_xor_si128(h, expmant);
2332 b_notinfnan = _mm_cmpgt_epi32(infty, expmant);
2333 b_isdenorm = _mm_cmpgt_epi32(smallest, expmant);
2334 shifted = _mm_slli_epi32(expmant, 13);
2335 adj_infnan = _mm_andnot_si128(b_notinfnan, eadjust);
2336 adjusted = _mm_add_epi32(eadjust, shifted);
2337 den1 = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm));
2338 adjusted2 = _mm_add_epi32(adjusted, adj_infnan);
2339 den2 = _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm);
2340 adjusted3 = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm));
2341 adjusted4 = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm), _mm_castsi128_ps(adjusted2));
2342 adjusted5 = _mm_or_ps(adjusted3, adjusted4);
2343 sign = _mm_slli_epi32(justsign, 16);
2344 final = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign));
2345 stbir__simdf_store( output + 4, final );
2347 // ~38 SSE2 ops for 8 values
2348 }
2350 // Fabian's round-to-nearest-even float to half
  // ~48 SSE2 ops for 8 outputs
2352 stbir__inline static void stbir__float_to_half_SIMD(void * output, float const * input)
2353 {
2354 static const STBIR__SIMDI_CONST(mask_sign, 0x80000000u);
2355 static const STBIR__SIMDI_CONST(c_f16max, (127 + 16) << 23); // all FP32 values >=this round to +inf
2356 static const STBIR__SIMDI_CONST(c_nanbit, 0x200);
2357 static const STBIR__SIMDI_CONST(c_infty_as_fp16, 0x7c00);
2358 static const STBIR__SIMDI_CONST(c_min_normal, (127 - 14) << 23); // smallest FP32 that yields a normalized FP16
2359 static const STBIR__SIMDI_CONST(c_subnorm_magic, ((127 - 15) + (23 - 10) + 1) << 23);
2360 static const STBIR__SIMDI_CONST(c_normal_bias, 0xfff - ((127 - 15) << 23)); // adjust exponent and add mantissa rounding
2362 __m128 f = _mm_loadu_ps(input);
2363 __m128 msign = _mm_castsi128_ps(STBIR__CONSTI(mask_sign));
2364 __m128 justsign = _mm_and_ps(msign, f);
2365 __m128 absf = _mm_xor_ps(f, justsign);
2366 __m128i absf_int = _mm_castps_si128(absf); // the cast is "free" (extra bypass latency, but no thruput hit)
2367 __m128i f16max = STBIR__CONSTI(c_f16max);
2368 __m128 b_isnan = _mm_cmpunord_ps(absf, absf); // is this a NaN?
2369 __m128i b_isregular = _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special?
2370 __m128i nanbit = _mm_and_si128(_mm_castps_si128(b_isnan), STBIR__CONSTI(c_nanbit));
2371 __m128i inf_or_nan = _mm_or_si128(nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials
2373 __m128i min_normal = STBIR__CONSTI(c_min_normal);
2374 __m128i b_issub = _mm_cmpgt_epi32(min_normal, absf_int);
2376 // "result is subnormal" path
2377 __m128 subnorm1 = _mm_add_ps(absf, _mm_castsi128_ps(STBIR__CONSTI(c_subnorm_magic))); // magic value to round output mantissa
2378 __m128i subnorm2 = _mm_sub_epi32(_mm_castps_si128(subnorm1), STBIR__CONSTI(c_subnorm_magic)); // subtract out bias
2380 // "result is normal" path
2381 __m128i mantoddbit = _mm_slli_epi32(absf_int, 31 - 13); // shift bit 13 (mantissa LSB) to sign
2382 __m128i mantodd = _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0
2384 __m128i round1 = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias));
2385 __m128i round2 = _mm_sub_epi32(round1, mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE)
2386 __m128i normal = _mm_srli_epi32(round2, 13); // rounded result
2388 // combine the two non-specials
2389 __m128i nonspecial = _mm_or_si128(_mm_and_si128(subnorm2, b_issub), _mm_andnot_si128(b_issub, normal));
2391 // merge in specials as well
2392 __m128i joined = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular), _mm_andnot_si128(b_isregular, inf_or_nan));
2394 __m128i sign_shift = _mm_srai_epi32(_mm_castps_si128(justsign), 16);
2395 __m128i final2, final= _mm_or_si128(joined, sign_shift);
2397 f = _mm_loadu_ps(input+4);
2398 justsign = _mm_and_ps(msign, f);
2399 absf = _mm_xor_ps(f, justsign);
2400 absf_int = _mm_castps_si128(absf); // the cast is "free" (extra bypass latency, but no thruput hit)
2401 b_isnan = _mm_cmpunord_ps(absf, absf); // is this a NaN?
2402 b_isregular = _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special?
    nanbit = _mm_and_si128(_mm_castps_si128(b_isnan), STBIR__CONSTI(c_nanbit));
2404 inf_or_nan = _mm_or_si128(nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials
2406 b_issub = _mm_cmpgt_epi32(min_normal, absf_int);
2408 // "result is subnormal" path
2409 subnorm1 = _mm_add_ps(absf, _mm_castsi128_ps(STBIR__CONSTI(c_subnorm_magic))); // magic value to round output mantissa
2410 subnorm2 = _mm_sub_epi32(_mm_castps_si128(subnorm1), STBIR__CONSTI(c_subnorm_magic)); // subtract out bias
2412 // "result is normal" path
2413 mantoddbit = _mm_slli_epi32(absf_int, 31 - 13); // shift bit 13 (mantissa LSB) to sign
2414 mantodd = _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0
2416 round1 = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias));
2417 round2 = _mm_sub_epi32(round1, mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE)
2418 normal = _mm_srli_epi32(round2, 13); // rounded result
2420 // combine the two non-specials
2421 nonspecial = _mm_or_si128(_mm_and_si128(subnorm2, b_issub), _mm_andnot_si128(b_issub, normal));
2423 // merge in specials as well
2424 joined = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular), _mm_andnot_si128(b_isregular, inf_or_nan));
2426 sign_shift = _mm_srai_epi32(_mm_castps_si128(justsign), 16);
2427 final2 = _mm_or_si128(joined, sign_shift);
2428 final = _mm_packs_epi32(final, final2);
2429 stbir__simdi_store( output,final );
2430 }
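
/* Illustration only (not part of the library): the SSE2 routine above avoids
   branches by computing every candidate result and merging them with
   and/andnot/or masks. A scalar sketch of that merge follows. The example_
   names are hypothetical, and unsigned compares replace _mm_cmpgt_epi32,
   which is equivalent here only because the sign bit has been cleared.

```c
#include <assert.h>
#include <stdint.h>

// All ones if x < y, else all zeros (the per-lane shape of a SIMD compare).
static uint32_t example_mask_less(uint32_t x, uint32_t y)
{
    return (uint32_t)0 - (uint32_t)(x < y);
}

// Takes bits of a where mask is set and bits of b elsewhere.
static uint32_t example_select(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}

// Picks one of three precomputed candidates from the magnitude bits of a
// float, in the same order as the "nonspecial" and "joined" merges.
static uint32_t example_merge(uint32_t absf_bits, uint32_t subnormal,
                              uint32_t normal, uint32_t special)
{
    const uint32_t f16max = (127u + 16u) << 23;      // too large for a finite half
    const uint32_t min_normal = (127u - 14u) << 23;  // smallest normal half
    uint32_t is_regular = example_mask_less(absf_bits, f16max);
    uint32_t is_sub = example_mask_less(absf_bits, min_normal);
    uint32_t nonspecial = example_select(is_sub, subnormal, normal);
    assert((absf_bits & 0x80000000u) == 0);          // sign must be stripped
    return example_select(is_regular, nonspecial, special);
}
```
*/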
2432#elif defined(STBIR_NEON) && defined(_MSC_VER) && defined(_M_ARM64) && !defined(__clang__) // 64-bit ARM on MSVC (not clang)
2434 static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
2435 {
2436 float16x4_t in0 = vld1_f16(input + 0);
2437 float16x4_t in1 = vld1_f16(input + 4);
2438 vst1q_f32(output + 0, vcvt_f32_f16(in0));
2439 vst1q_f32(output + 4, vcvt_f32_f16(in1));
2440 }
2442 static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
2443 {
2444 float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0));
2445 float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4));
2446 vst1_f16(output+0, out0);
2447 vst1_f16(output+4, out1);
2448 }
2450 static stbir__inline float stbir__half_to_float( stbir__FP16 h )
2451 {
2452 return vgetq_lane_f32(vcvt_f32_f16(vld1_dup_f16(&h)), 0);
2453 }
2455 static stbir__inline stbir__FP16 stbir__float_to_half( float f )
2456 {
2457 return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0).n16_u16[0];
2458 }
2460#elif defined(STBIR_NEON) && ( defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) ) // 64-bit ARM
2462 static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
2463 {
2464 float16x8_t in = vld1q_f16(input);
2465 vst1q_f32(output + 0, vcvt_f32_f16(vget_low_f16(in)));
2466 vst1q_f32(output + 4, vcvt_f32_f16(vget_high_f16(in)));
2467 }
2469 static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
2470 {
2471 float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0));
2472 float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4));
2473 vst1q_f16(output, vcombine_f16(out0, out1));
2474 }
2476 static stbir__inline float stbir__half_to_float( stbir__FP16 h )
2477 {
2478 return vgetq_lane_f32(vcvt_f32_f16(vdup_n_f16(h)), 0);
2479 }
2481 static stbir__inline stbir__FP16 stbir__float_to_half( float f )
2482 {
2483 return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0);
2484 }
2486#elif defined(STBIR_WASM) || (defined(STBIR_NEON) && (defined(_MSC_VER) || defined(_M_ARM) || defined(__arm__))) // WASM or 32-bit ARM on MSVC/clang
2488 static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
2489 {
2490 for (int i=0; i<8; i++)
2491 {
2492 output[i] = stbir__half_to_float(input[i]);
2493 }
2494 }
2495 static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
2496 {
2497 for (int i=0; i<8; i++)
2498 {
2499 output[i] = stbir__float_to_half(input[i]);
2500 }
2501 }
2503#endif
2506#ifdef STBIR_SIMD
2508#define stbir__simdf_0123to3333( out, reg ) (out) = stbir__simdf_swiz( reg, 3,3,3,3 )
2509#define stbir__simdf_0123to2222( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,2,2 )
2510#define stbir__simdf_0123to1111( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,1,1 )
2511#define stbir__simdf_0123to0000( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,0 )
2512#define stbir__simdf_0123to0003( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,3 )
2513#define stbir__simdf_0123to0001( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,0,1 )
2514#define stbir__simdf_0123to1122( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,2,2 )
2515#define stbir__simdf_0123to2333( out, reg ) (out) = stbir__simdf_swiz( reg, 2,3,3,3 )
2516#define stbir__simdf_0123to0023( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,3 )
2517#define stbir__simdf_0123to1230( out, reg ) (out) = stbir__simdf_swiz( reg, 1,2,3,0 )
2518#define stbir__simdf_0123to2103( out, reg ) (out) = stbir__simdf_swiz( reg, 2,1,0,3 )
2519#define stbir__simdf_0123to3210( out, reg ) (out) = stbir__simdf_swiz( reg, 3,2,1,0 )
2520#define stbir__simdf_0123to2301( out, reg ) (out) = stbir__simdf_swiz( reg, 2,3,0,1 )
2521#define stbir__simdf_0123to3012( out, reg ) (out) = stbir__simdf_swiz( reg, 3,0,1,2 )
2522#define stbir__simdf_0123to0011( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,1,1 )
2523#define stbir__simdf_0123to1100( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,0,0 )
2524#define stbir__simdf_0123to2233( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,3,3 )
2525#define stbir__simdf_0123to1133( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,3,3 )
2526#define stbir__simdf_0123to0022( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,2 )
2527#define stbir__simdf_0123to1032( out, reg ) (out) = stbir__simdf_swiz( reg, 1,0,3,2 )
2529typedef union stbir__simdi_u32
2530{
2531 stbir_uint32 m128i_u32[4];
2532 int m128i_i32[4];
2533 stbir__simdi m128i_i128;
2534} stbir__simdi_u32;
2536static const int STBIR_mask[9] = { 0,0,0,-1,-1,-1,0,0,0 };
2538static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float, stbir__max_uint8_as_float);
2539static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float, stbir__max_uint16_as_float);
2540static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float_inverted, stbir__max_uint8_as_float_inverted);
2541static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float_inverted, stbir__max_uint16_as_float_inverted);
2543static const STBIR__SIMDF_CONST(STBIR_simd_point5, 0.5f);
2544static const STBIR__SIMDF_CONST(STBIR_ones, 1.0f);
2545static const STBIR__SIMDI_CONST(STBIR_almost_zero, (127 - 13) << 23);
2546static const STBIR__SIMDI_CONST(STBIR_almost_one, 0x3f7fffff);
2547static const STBIR__SIMDI_CONST(STBIR_mastissa_mask, 0xff);
2548static const STBIR__SIMDI_CONST(STBIR_topscale, 0x02000000);
// In SIMD mode we unroll by the proper amount ourselves, and we don't want
// the non-SIMD remnant loops to be unrolled because they only run a few times.
// Adding this switch saves about 5K on clang, which is Captain Unroll the 3rd.
2553#define STBIR_SIMD_STREAMOUT_PTR( star ) STBIR_STREAMOUT_PTR( star )
2554#define STBIR_SIMD_NO_UNROLL(ptr) STBIR_NO_UNROLL(ptr)
2555#define STBIR_SIMD_NO_UNROLL_LOOP_START STBIR_NO_UNROLL_LOOP_START
2556#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START_INF_FOR
2558#ifdef STBIR_MEMCPY
2559#undef STBIR_MEMCPY
2560#endif
2561#define STBIR_MEMCPY stbir_simd_memcpy
2563// override normal use of memcpy with much simpler copy (faster and smaller with our sized copies)
2564static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes )
2565{
2566 char STBIR_SIMD_STREAMOUT_PTR (*) d = (char*) dest;
2567 char STBIR_SIMD_STREAMOUT_PTR( * ) d_end = ((char*) dest) + bytes;
2568 ptrdiff_t ofs_to_src = (char*)src - (char*)dest;
2570 // check overlaps
2571 STBIR_ASSERT( ( ( d >= ( (char*)src) + bytes ) ) || ( ( d + bytes ) <= (char*)src ) );
2573 if ( bytes < (16*stbir__simdfX_float_count) )
2574 {
2575 if ( bytes < 16 )
2576 {
2577 if ( bytes )
2578 {
2579 STBIR_SIMD_NO_UNROLL_LOOP_START
2580 do
2581 {
2582 STBIR_SIMD_NO_UNROLL(d);
2583 d[ 0 ] = d[ ofs_to_src ];
2584 ++d;
2585 } while ( d < d_end );
2586 }
2587 }
2588 else
2589 {
2590 stbir__simdf x;
2591 // do one unaligned to get us aligned for the stream out below
2592 stbir__simdf_load( x, ( d + ofs_to_src ) );
2593 stbir__simdf_store( d, x );
2594 d = (char*)( ( ( (size_t)d ) + 16 ) & ~15 );
2596 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
2597 for(;;)
2598 {
2599 STBIR_SIMD_NO_UNROLL(d);
2601 if ( d > ( d_end - 16 ) )
2602 {
2603 if ( d == d_end )
2604 return;
2605 d = d_end - 16;
2606 }
2608 stbir__simdf_load( x, ( d + ofs_to_src ) );
2609 stbir__simdf_store( d, x );
2610 d += 16;
2611 }
2612 }
2613 }
2614 else
2615 {
2616 stbir__simdfX x0,x1,x2,x3;
2618 // do one unaligned to get us aligned for the stream out below
2619 stbir__simdfX_load( x0, ( d + ofs_to_src ) + 0*stbir__simdfX_float_count );
2620 stbir__simdfX_load( x1, ( d + ofs_to_src ) + 4*stbir__simdfX_float_count );
2621 stbir__simdfX_load( x2, ( d + ofs_to_src ) + 8*stbir__simdfX_float_count );
2622 stbir__simdfX_load( x3, ( d + ofs_to_src ) + 12*stbir__simdfX_float_count );
2623 stbir__simdfX_store( d + 0*stbir__simdfX_float_count, x0 );
2624 stbir__simdfX_store( d + 4*stbir__simdfX_float_count, x1 );
2625 stbir__simdfX_store( d + 8*stbir__simdfX_float_count, x2 );
2626 stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 );
2627 d = (char*)( ( ( (size_t)d ) + (16*stbir__simdfX_float_count) ) & ~((16*stbir__simdfX_float_count)-1) );
2629 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
2630 for(;;)
2631 {
2632 STBIR_SIMD_NO_UNROLL(d);
2634 if ( d > ( d_end - (16*stbir__simdfX_float_count) ) )
2635 {
2636 if ( d == d_end )
2637 return;
2638 d = d_end - (16*stbir__simdfX_float_count);
2639 }
2641 stbir__simdfX_load( x0, ( d + ofs_to_src ) + 0*stbir__simdfX_float_count );
2642 stbir__simdfX_load( x1, ( d + ofs_to_src ) + 4*stbir__simdfX_float_count );
2643 stbir__simdfX_load( x2, ( d + ofs_to_src ) + 8*stbir__simdfX_float_count );
2644 stbir__simdfX_load( x3, ( d + ofs_to_src ) + 12*stbir__simdfX_float_count );
2645 stbir__simdfX_store( d + 0*stbir__simdfX_float_count, x0 );
2646 stbir__simdfX_store( d + 4*stbir__simdfX_float_count, x1 );
2647 stbir__simdfX_store( d + 8*stbir__simdfX_float_count, x2 );
2648 stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 );
2649 d += (16*stbir__simdfX_float_count);
2650 }
2651 }
2652}
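
/* Illustration only (not part of the library): the copy loops in
   stbir_simd_memcpy handle a ragged tail by backing the cursor up so the
   final block ends exactly at the end of the buffer, re-copying a few bytes
   instead of falling back to a byte loop. A portable sketch of that idea
   follows. example_block_copy is a hypothetical name, memcpy of 16 bytes
   stands in for one SIMD load/store, and the alignment step and the wider
   four-register path are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Copies bytes (at least 16) from src to dest in fixed 16-byte blocks.
// dest and src must not overlap.
static void example_block_copy(void *dest, const void *src, size_t bytes)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;
    size_t pos = 0;

    assert(bytes >= 16);
    for (;;) {
        if (pos > bytes - 16) {
            if (pos == bytes)
                return;              // finished exactly on a block boundary
            pos = bytes - 16;        // back up so the last block ends at the end
        }
        memcpy(d + pos, s + pos, 16);
        pos += 16;
    }
}
```

   For a 37-byte copy the blocks start at offsets 0, 16 and 21, so bytes
   21..31 are written twice with the same values.
*/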
// memcpy that is specifically and intentionally overlapping (src is lower in memory
// than dest, so this can be a normal forward copy; bytes is divisible by 4 and is
// greater than or equal to the distance between dest and src)
2657static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes )
2658{
2659 char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src;
2660 char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes;
2661 ptrdiff_t ofs_to_dest = (char*)dest - (char*)src;
  if ( ofs_to_dest >= 16 ) // are dest and src at least 16 bytes apart?
2664 {
2665 char STBIR_SIMD_STREAMOUT_PTR( * ) s_end16 = ((char*) src) + (bytes&~15);
2666 STBIR_SIMD_NO_UNROLL_LOOP_START
2667 do
2668 {
2669 stbir__simdf x;
2670 STBIR_SIMD_NO_UNROLL(sd);
2671 stbir__simdf_load( x, sd );
2672 stbir__simdf_store( ( sd + ofs_to_dest ), x );
2673 sd += 16;
2674 } while ( sd < s_end16 );
2676 if ( sd == s_end )
2677 return;
2678 }
2680 do
2681 {
2682 STBIR_SIMD_NO_UNROLL(sd);
2683 *(int*)( sd + ofs_to_dest ) = *(int*) sd;
2684 sd += 4;
2685 } while ( sd < s_end );
2686}
2688#else // no SSE2
2690// when in scalar mode, we let unrolling happen, so this macro just does the __restrict
2691#define STBIR_SIMD_STREAMOUT_PTR( star ) STBIR_STREAMOUT_PTR( star )
2692#define STBIR_SIMD_NO_UNROLL(ptr)
2693#define STBIR_SIMD_NO_UNROLL_LOOP_START
2694#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
2696#endif // SSE2
2699#ifdef STBIR_PROFILE
2701#ifndef STBIR_PROFILE_FUNC
2703#if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(__SSE2__) || defined(STBIR_SSE) || defined( _M_IX86_FP ) || defined(__i386) || defined( __i386__ ) || defined( _M_IX86 ) || defined( _X86_ )
2705#ifdef _MSC_VER
2707 STBIRDEF stbir_uint64 __rdtsc();
2708 #define STBIR_PROFILE_FUNC() __rdtsc()
2710#else // non msvc
2712 static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC()
2713 {
2714 stbir_uint32 lo, hi;
2715 asm volatile ("rdtsc" : "=a" (lo), "=d" (hi) );
2716 return ( ( (stbir_uint64) hi ) << 32 ) | ( (stbir_uint64) lo );
2717 }
2719#endif // msvc
2721#elif defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(__ARM_NEON__)
2723#if defined( _MSC_VER ) && !defined(__clang__)
2725 #define STBIR_PROFILE_FUNC() _ReadStatusReg(ARM64_CNTVCT)
2727#else
2729 static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC()
2730 {
2731 stbir_uint64 tsc;
2732 asm volatile("mrs %0, cntvct_el0" : "=r" (tsc));
2733 return tsc;
2734 }
2736#endif
2738#else // x64, arm
2740#error Unknown platform for profiling.
2742#endif // x64, arm
2744#endif // STBIR_PROFILE_FUNC
2746#define STBIR_ONLY_PROFILE_GET_SPLIT_INFO ,stbir__per_split_info * split_info
2747#define STBIR_ONLY_PROFILE_SET_SPLIT_INFO ,split_info
2749#define STBIR_ONLY_PROFILE_BUILD_GET_INFO ,stbir__info * profile_info
2750#define STBIR_ONLY_PROFILE_BUILD_SET_INFO ,profile_info
2752// super light-weight micro profiler
2753#define STBIR_PROFILE_START_ll( info, wh ) { stbir_uint64 wh##thiszonetime = STBIR_PROFILE_FUNC(); stbir_uint64 * wh##save_parent_excluded_ptr = info->current_zone_excluded_ptr; stbir_uint64 wh##current_zone_excluded = 0; info->current_zone_excluded_ptr = &wh##current_zone_excluded;
2754#define STBIR_PROFILE_END_ll( info, wh ) wh##thiszonetime = STBIR_PROFILE_FUNC() - wh##thiszonetime; info->profile.named.wh += wh##thiszonetime - wh##current_zone_excluded; *wh##save_parent_excluded_ptr += wh##thiszonetime; info->current_zone_excluded_ptr = wh##save_parent_excluded_ptr; }
2755#define STBIR_PROFILE_FIRST_START_ll( info, wh ) { int i; info->current_zone_excluded_ptr = &info->profile.named.total; for(i=0;i<STBIR__ARRAY_SIZE(info->profile.array);i++) info->profile.array[i]=0; } STBIR_PROFILE_START_ll( info, wh );
2756#define STBIR_PROFILE_CLEAR_EXTRAS_ll( info, num ) { int extra; for(extra=1;extra<(num);extra++) { int i; for(i=0;i<STBIR__ARRAY_SIZE((info)->profile.array);i++) (info)[extra].profile.array[i]=0; } }
2758// for thread data
2759#define STBIR_PROFILE_START( wh ) STBIR_PROFILE_START_ll( split_info, wh )
2760#define STBIR_PROFILE_END( wh ) STBIR_PROFILE_END_ll( split_info, wh )
2761#define STBIR_PROFILE_FIRST_START( wh ) STBIR_PROFILE_FIRST_START_ll( split_info, wh )
2762#define STBIR_PROFILE_CLEAR_EXTRAS() STBIR_PROFILE_CLEAR_EXTRAS_ll( split_info, split_count )
2764// for build data
2765#define STBIR_PROFILE_BUILD_START( wh ) STBIR_PROFILE_START_ll( profile_info, wh )
2766#define STBIR_PROFILE_BUILD_END( wh ) STBIR_PROFILE_END_ll( profile_info, wh )
2767#define STBIR_PROFILE_BUILD_FIRST_START( wh ) STBIR_PROFILE_FIRST_START_ll( profile_info, wh )
2768#define STBIR_PROFILE_BUILD_CLEAR( info ) { int i; for(i=0;i<STBIR__ARRAY_SIZE(info->profile.array);i++) info->profile.array[i]=0; }
2770#else // no profile
2772#define STBIR_ONLY_PROFILE_GET_SPLIT_INFO
2773#define STBIR_ONLY_PROFILE_SET_SPLIT_INFO
2775#define STBIR_ONLY_PROFILE_BUILD_GET_INFO
2776#define STBIR_ONLY_PROFILE_BUILD_SET_INFO
2778#define STBIR_PROFILE_START( wh )
2779#define STBIR_PROFILE_END( wh )
2780#define STBIR_PROFILE_FIRST_START( wh )
2781#define STBIR_PROFILE_CLEAR_EXTRAS( )
2783#define STBIR_PROFILE_BUILD_START( wh )
2784#define STBIR_PROFILE_BUILD_END( wh )
2785#define STBIR_PROFILE_BUILD_FIRST_START( wh )
2786#define STBIR_PROFILE_BUILD_CLEAR( info )
2788#endif // stbir_profile
2790#ifndef STBIR_CEILF
2791#include <math.h>
#if defined(_MSC_VER) && ( _MSC_VER <= 1200 ) // support VC6 for Sean (an undefined _MSC_VER would otherwise evaluate as 0 and take this branch)
2793#define STBIR_CEILF(x) ((float)ceil((float)(x)))
2794#define STBIR_FLOORF(x) ((float)floor((float)(x)))
2795#else
2796#define STBIR_CEILF(x) ceilf(x)
2797#define STBIR_FLOORF(x) floorf(x)
2798#endif
2799#endif
2801#ifndef STBIR_MEMCPY
2802// For memcpy
2803#include <string.h>
2804#define STBIR_MEMCPY( dest, src, len ) memcpy( dest, src, len )
2805#endif
2807#ifndef STBIR_SIMD
// memcpy that is specifically and intentionally overlapping (src is lower in memory than dest,
// so it can be a normal forward copy; bytes is divisible by 4, and bytes is greater than or
// equal to the distance between dest and src)
2812static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes )
2813{
2814 char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src;
2815 char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes;
2816 ptrdiff_t ofs_to_dest = (char*)dest - (char*)src;
  if ( ofs_to_dest >= 8 ) // are dest and src at least 8 bytes apart?
2819 {
2820 char STBIR_SIMD_STREAMOUT_PTR( * ) s_end8 = ((char*) src) + (bytes&~7);
2821 STBIR_NO_UNROLL_LOOP_START
2822 do
2823 {
2824 STBIR_NO_UNROLL(sd);
2825 *(stbir_uint64*)( sd + ofs_to_dest ) = *(stbir_uint64*) sd;
2826 sd += 8;
2827 } while ( sd < s_end8 );
2829 if ( sd == s_end )
2830 return;
2831 }
2833 STBIR_NO_UNROLL_LOOP_START
2834 do
2835 {
2836 STBIR_NO_UNROLL(sd);
2837 *(int*)( sd + ofs_to_dest ) = *(int*) sd;
2838 sd += 4;
2839 } while ( sd < s_end );
2840}
2842#endif
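/*
  Illustrative sketch, not part of the library: the overlapping copy above relies on a
  forward copy in which later reads pick up what earlier stores wrote, so a short
  pattern at the start of a buffer is repeated through the rest of it. The helper names
  below (forward_overlapping_copy, replicate_period) are hypothetical, and the sketch
  moves 4 bytes at a time rather than using the 8- or 16-byte fast paths.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not an stb API): copy `bytes` bytes forward, 4 at a time,
// from src to dest, where dest sits above src in the same buffer.
static void forward_overlapping_copy( void * dest, void const * src, size_t bytes )
{
  char * d = (char *) dest;
  char const * s = (char const *) src;
  size_t i;

  assert( ( bytes % 4 ) == 0 );        // same precondition as the library routine
  assert( d > s );                     // src must be lower in memory than dest
  assert( (size_t) ( d - s ) >= 4 );   // each 4-byte chunk must not overlap itself

  // later chunks may read what earlier iterations wrote; that is the intended replication
  for ( i = 0 ; i < bytes ; i += 4 )
    memcpy( d + i, s + i, 4 );
}

// Fill `count` floats by repeating the first `period` floats.
static void replicate_period( float * buf, size_t period, size_t count )
{
  assert( period > 0 );
  if ( count > period )
    forward_overlapping_copy( buf + period, buf, ( count - period ) * sizeof( float ) );
}
```

  With a buffer holding { 1, 2, 3, ?, ?, ?, ? }, replicate_period( buf, 3, 7 ) leaves
  { 1, 2, 3, 1, 2, 3, 1 }. A memmove would not do this, because it preserves the
  original source bytes instead of re-reading freshly written ones.
*/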
2844static float stbir__filter_trapezoid(float x, float scale, void * user_data)
2845{
2846 float halfscale = scale / 2;
2847 float t = 0.5f + halfscale;
2848 STBIR_ASSERT(scale <= 1);
2849 STBIR__UNUSED(user_data);
2851 if ( x < 0.0f ) x = -x;
2853 if (x >= t)
2854 return 0.0f;
2855 else
2856 {
2857 float r = 0.5f - halfscale;
2858 if (x <= r)
2859 return 1.0f;
2860 else
2861 return (t - x) / scale;
2862 }
2863}
2865static float stbir__support_trapezoid(float scale, void * user_data)
2866{
2867 STBIR__UNUSED(user_data);
2868 return 0.5f + scale / 2.0f;
2869}
2871static float stbir__filter_triangle(float x, float s, void * user_data)
2872{
2873 STBIR__UNUSED(s);
2874 STBIR__UNUSED(user_data);
2876 if ( x < 0.0f ) x = -x;
2878 if (x <= 1.0f)
2879 return 1.0f - x;
2880 else
2881 return 0.0f;
2882}
2884static float stbir__filter_point(float x, float s, void * user_data)
2885{
2886 STBIR__UNUSED(x);
2887 STBIR__UNUSED(s);
2888 STBIR__UNUSED(user_data);
2890 return 1.0f;
2891}
2893static float stbir__filter_cubic(float x, float s, void * user_data)
2894{
2895 STBIR__UNUSED(s);
2896 STBIR__UNUSED(user_data);
2898 if ( x < 0.0f ) x = -x;
2900 if (x < 1.0f)
2901 return (4.0f + x*x*(3.0f*x - 6.0f))/6.0f;
2902 else if (x < 2.0f)
2903 return (8.0f + x*(-12.0f + x*(6.0f - x)))/6.0f;
2905 return (0.0f);
2906}
2908static float stbir__filter_catmullrom(float x, float s, void * user_data)
2909{
2910 STBIR__UNUSED(s);
2911 STBIR__UNUSED(user_data);
2913 if ( x < 0.0f ) x = -x;
2915 if (x < 1.0f)
2916 return 1.0f - x*x*(2.5f - 1.5f*x);
2917 else if (x < 2.0f)
2918 return 2.0f - x*(4.0f + x*(0.5f*x - 2.5f));
2920 return (0.0f);
2921}
2923static float stbir__filter_mitchell(float x, float s, void * user_data)
2924{
2925 STBIR__UNUSED(s);
2926 STBIR__UNUSED(user_data);
2928 if ( x < 0.0f ) x = -x;
2930 if (x < 1.0f)
2931 return (16.0f + x*x*(21.0f * x - 36.0f))/18.0f;
2932 else if (x < 2.0f)
2933 return (32.0f + x*(-60.0f + x*(36.0f - 7.0f*x)))/18.0f;
2935 return (0.0f);
2936}
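/*
  Illustrative sketch, not part of the library: the polynomial above is the
  Mitchell-Netravali cubic with B = C = 1/3, and one property worth checking is that
  the four taps touched at any fractional offset sum to 1. The function names below are
  hypothetical standalone copies written only for this check.

```c
#include <assert.h>

// Standalone copy of the Mitchell kernel above, for illustration only.
static float mitchell_kernel( float x )
{
  if ( x < 0.0f ) x = -x;
  if ( x < 1.0f )
    return ( 16.0f + x*x*( 21.0f*x - 36.0f ) ) / 18.0f;
  else if ( x < 2.0f )
    return ( 32.0f + x*( -60.0f + x*( 36.0f - 7.0f*x ) ) ) / 18.0f;
  return 0.0f;
}

// Sum the four taps that a sample at fractional offset `phase` (0 <= phase < 1) touches.
static float mitchell_tap_sum( float phase )
{
  float sum = 0.0f;
  int k;
  assert( ( phase >= 0.0f ) && ( phase < 1.0f ) );
  for ( k = -1 ; k <= 2 ; k++ )
    sum += mitchell_kernel( phase - (float) k );
  return sum;
}
```

  For phase 0.25 the taps sit at distances 1.25, 0.25, 0.75 and 1.75, and their weights
  add up to 1 within float rounding. The library still renormalizes gathered weights
  later, since edge handling and trimming of tiny coefficients can disturb the sum.
*/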
2938static float stbir__support_zeropoint5(float s, void * user_data)
2939{
2940 STBIR__UNUSED(s);
2941 STBIR__UNUSED(user_data);
2942 return 0.5f;
2943}
2945static float stbir__support_one(float s, void * user_data)
2946{
2947 STBIR__UNUSED(s);
2948 STBIR__UNUSED(user_data);
2949 return 1;
2950}
2952static float stbir__support_two(float s, void * user_data)
2953{
2954 STBIR__UNUSED(s);
2955 STBIR__UNUSED(user_data);
2956 return 2;
2957}
2959// This is the maximum number of input samples that can affect an output sample
2960// with the given filter from the output pixel's perspective
2961static int stbir__get_filter_pixel_width(stbir__support_callback * support, float scale, void * user_data)
2962{
2963 STBIR_ASSERT(support != 0);
2965 if ( scale >= ( 1.0f-stbir__small_float ) ) // upscale
2966 return (int)STBIR_CEILF(support(1.0f/scale,user_data) * 2.0f);
2967 else
2968 return (int)STBIR_CEILF(support(scale,user_data) * 2.0f / scale);
2969}
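/*
  Illustrative sketch, not part of the library: the width computed above is twice the
  filter support when upscaling, and twice the support divided by the scale when
  downscaling. The sketch assumes a filter whose support does not depend on scale (as
  with stbir__support_two); EXAMPLE_SMALL_FLOAT and the helper names are hypothetical
  stand-ins, and the real stbir__small_float is defined elsewhere in the library.

```c
#include <assert.h>

// Assumed stand-in for stbir__small_float.
#define EXAMPLE_SMALL_FLOAT 1.0e-30f

// Ceiling for non-negative values, standing in for STBIR_CEILF so the sketch needs no libm.
static int example_ceil_positive( float v )
{
  int w = (int) v;
  assert( v >= 0.0f );
  return ( (float) w < v ) ? ( w + 1 ) : w;
}

// Same arithmetic as stbir__get_filter_pixel_width, for a filter with constant support.
static int example_filter_pixel_width( float support, float scale )
{
  assert( scale > 0.0f );
  if ( scale >= ( 1.0f - EXAMPLE_SMALL_FLOAT ) )   // upscale: footprint is 2*support input pixels
    return example_ceil_positive( support * 2.0f );
  else                                              // downscale: footprint widens by 1/scale
    return example_ceil_positive( support * 2.0f / scale );
}
```

  With a support of 2, a 2x upscale gives a width of 4 input pixels, while a 2x
  downscale (scale 0.5) gives 8 and a 4x downscale (scale 0.25) gives 16.
*/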
// this is how many coefficients there are per run of the filter (which is different
// from the filter_pixel_width depending on whether we are scattering or gathering)
2973static int stbir__get_coefficient_width(stbir__sampler * samp, int is_gather, void * user_data)
2974{
2975 float scale = samp->scale_info.scale;
2976 stbir__support_callback * support = samp->filter_support;
2978 switch( is_gather )
2979 {
2980 case 1:
2981 return (int)STBIR_CEILF(support(1.0f / scale, user_data) * 2.0f);
2982 case 2:
2983 return (int)STBIR_CEILF(support(scale, user_data) * 2.0f / scale);
2984 case 0:
2985 return (int)STBIR_CEILF(support(scale, user_data) * 2.0f);
2986 default:
2987 STBIR_ASSERT( (is_gather >= 0 ) && (is_gather <= 2 ) );
2988 return 0;
2989 }
2990}
2992static int stbir__get_contributors(stbir__sampler * samp, int is_gather)
2993{
2994 if (is_gather)
2995 return samp->scale_info.output_sub_size;
2996 else
2997 return (samp->scale_info.input_full_size + samp->filter_pixel_margin * 2);
2998}
3000static int stbir__edge_zero_full( int n, int max )
3001{
3002 STBIR__UNUSED(n);
3003 STBIR__UNUSED(max);
3004 return 0; // NOTREACHED
3005}
3007static int stbir__edge_clamp_full( int n, int max )
3008{
3009 if (n < 0)
3010 return 0;
3012 if (n >= max)
3013 return max - 1;
3015 return n; // NOTREACHED
3016}
3018static int stbir__edge_reflect_full( int n, int max )
3019{
3020 if (n < 0)
3021 {
3022 if (n > -max)
3023 return -n;
3024 else
3025 return max - 1;
3026 }
3028 if (n >= max)
3029 {
3030 int max2 = max * 2;
3031 if (n >= max2)
3032 return 0;
3033 else
3034 return max2 - n - 1;
3035 }
3037 return n; // NOTREACHED
3038}
3040static int stbir__edge_wrap_full( int n, int max )
3041{
3042 if (n >= 0)
3043 return (n % max);
3044 else
3045 {
3046 int m = (-n) % max;
3048 if (m != 0)
3049 m = max - m;
3051 return (m);
3052 }
3053}
3055typedef int stbir__edge_wrap_func( int n, int max );
3056static stbir__edge_wrap_func * stbir__edge_wrap_slow[] =
3057{
3058 stbir__edge_clamp_full, // STBIR_EDGE_CLAMP
3059 stbir__edge_reflect_full, // STBIR_EDGE_REFLECT
3060 stbir__edge_wrap_full, // STBIR_EDGE_WRAP
3061 stbir__edge_zero_full, // STBIR_EDGE_ZERO
3062};
3064stbir__inline static int stbir__edge_wrap(stbir_edge edge, int n, int max)
3065{
3066 // avoid per-pixel switch
3067 if (n >= 0 && n < max)
3068 return n;
3069 return stbir__edge_wrap_slow[edge]( n, max );
3070}
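/*
  Illustrative sketch, not part of the library: how the clamp, reflect and wrap edge
  functions above map an out-of-range index back into [0, max). The example_* names are
  hypothetical standalone copies of the same arithmetic.

```c
#include <assert.h>

// Standalone copies of the out-of-range index mappings above, for illustration only.
static int example_clamp( int n, int max )
{
  assert( max > 0 );
  if ( n < 0 ) return 0;
  if ( n >= max ) return max - 1;
  return n;
}

static int example_reflect( int n, int max )
{
  assert( max > 0 );
  if ( n < 0 )
    return ( n > -max ) ? -n : ( max - 1 );
  if ( n >= max )
    return ( n >= max * 2 ) ? 0 : ( max * 2 - n - 1 );
  return n;
}

static int example_wrap( int n, int max )
{
  int m;
  assert( max > 0 );
  if ( n >= 0 ) return n % max;
  m = ( -n ) % max;
  return ( m != 0 ) ? ( max - m ) : 0;
}
```

  For max = 4: clamp maps -3 to 0 and 9 to 3; reflect maps -1 to 1 and 5 to 2; wrap
  maps -1 to 3 and 5 to 1. Reflect saturates once the index is more than one full
  image width outside the range.
*/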
3072#define STBIR__MERGE_RUNS_PIXEL_THRESHOLD 16
3074// get information on the extents of a sampler
3075static void stbir__get_extents( stbir__sampler * samp, stbir__extents * scanline_extents )
3076{
3077 int j, stop;
3078 int left_margin, right_margin;
3079 int min_n = 0x7fffffff, max_n = -0x7fffffff;
3080 int min_left = 0x7fffffff, max_left = -0x7fffffff;
3081 int min_right = 0x7fffffff, max_right = -0x7fffffff;
3082 stbir_edge edge = samp->edge;
3083 stbir__contributors* contributors = samp->contributors;
3084 int output_sub_size = samp->scale_info.output_sub_size;
3085 int input_full_size = samp->scale_info.input_full_size;
3086 int filter_pixel_margin = samp->filter_pixel_margin;
3088 STBIR_ASSERT( samp->is_gather );
3090 stop = output_sub_size;
3091 for (j = 0; j < stop; j++ )
3092 {
3093 STBIR_ASSERT( contributors[j].n1 >= contributors[j].n0 );
3094 if ( contributors[j].n0 < min_n )
3095 {
3096 min_n = contributors[j].n0;
3097 stop = j + filter_pixel_margin; // if we find a new min, only scan another filter width
3098 if ( stop > output_sub_size ) stop = output_sub_size;
3099 }
3100 }
3102 stop = 0;
3103 for (j = output_sub_size - 1; j >= stop; j-- )
3104 {
3105 STBIR_ASSERT( contributors[j].n1 >= contributors[j].n0 );
3106 if ( contributors[j].n1 > max_n )
3107 {
3108 max_n = contributors[j].n1;
3109 stop = j - filter_pixel_margin; // if we find a new max, only scan another filter width
3110 if (stop<0) stop = 0;
3111 }
3112 }
3114 STBIR_ASSERT( scanline_extents->conservative.n0 <= min_n );
3115 STBIR_ASSERT( scanline_extents->conservative.n1 >= max_n );
3117 // now calculate how much into the margins we really read
3118 left_margin = 0;
3119 if ( min_n < 0 )
3120 {
3121 left_margin = -min_n;
3122 min_n = 0;
3123 }
3125 right_margin = 0;
3126 if ( max_n >= input_full_size )
3127 {
3128 right_margin = max_n - input_full_size + 1;
3129 max_n = input_full_size - 1;
3130 }
  // edge_sizes holds the margin pixel extents (how many pixels we hang over each edge)
3133 scanline_extents->edge_sizes[0] = left_margin;
3134 scanline_extents->edge_sizes[1] = right_margin;
  // spans[0] is the range of pixels read from the input
3137 scanline_extents->spans[0].n0 = min_n;
3138 scanline_extents->spans[0].n1 = max_n;
3139 scanline_extents->spans[0].pixel_offset_for_input = min_n;
3141 // default to no other input range
3142 scanline_extents->spans[1].n0 = 0;
3143 scanline_extents->spans[1].n1 = -1;
3144 scanline_extents->spans[1].pixel_offset_for_input = 0;
  // don't have to do the edge calc for the zero edge mode
3147 if ( edge == STBIR_EDGE_ZERO )
3148 return;
3150 // convert margin pixels to the pixels within the input (min and max)
3151 for( j = -left_margin ; j < 0 ; j++ )
3152 {
3153 int p = stbir__edge_wrap( edge, j, input_full_size );
3154 if ( p < min_left )
3155 min_left = p;
3156 if ( p > max_left )
3157 max_left = p;
3158 }
3160 for( j = input_full_size ; j < (input_full_size + right_margin) ; j++ )
3161 {
3162 int p = stbir__edge_wrap( edge, j, input_full_size );
3163 if ( p < min_right )
3164 min_right = p;
3165 if ( p > max_right )
3166 max_right = p;
3167 }
  // merge the left margin pixel region if it connects within STBIR__MERGE_RUNS_PIXEL_THRESHOLD pixels of the main pixel region
3170 if ( min_left != 0x7fffffff )
3171 {
3172 if ( ( ( min_left <= min_n ) && ( ( max_left + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= min_n ) ) ||
3173 ( ( min_n <= min_left ) && ( ( max_n + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= max_left ) ) )
3174 {
3175 scanline_extents->spans[0].n0 = min_n = stbir__min( min_n, min_left );
3176 scanline_extents->spans[0].n1 = max_n = stbir__max( max_n, max_left );
3177 scanline_extents->spans[0].pixel_offset_for_input = min_n;
3178 left_margin = 0;
3179 }
3180 }
  // merge the right margin pixel region if it connects within STBIR__MERGE_RUNS_PIXEL_THRESHOLD pixels of the main pixel region
3183 if ( min_right != 0x7fffffff )
3184 {
3185 if ( ( ( min_right <= min_n ) && ( ( max_right + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= min_n ) ) ||
3186 ( ( min_n <= min_right ) && ( ( max_n + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= max_right ) ) )
3187 {
3188 scanline_extents->spans[0].n0 = min_n = stbir__min( min_n, min_right );
3189 scanline_extents->spans[0].n1 = max_n = stbir__max( max_n, max_right );
3190 scanline_extents->spans[0].pixel_offset_for_input = min_n;
3191 right_margin = 0;
3192 }
3193 }
3195 STBIR_ASSERT( scanline_extents->conservative.n0 <= min_n );
3196 STBIR_ASSERT( scanline_extents->conservative.n1 >= max_n );
  // you get two ranges when you have the WRAP edge mode and you are doing just a piece of the resize,
  // so you need to get a second run of pixels from the opposite side of the scanline (which you
  // wouldn't need except for WRAP)
3203 // if we can't merge the min_left range, add it as a second range
3204 if ( ( left_margin ) && ( min_left != 0x7fffffff ) )
3205 {
3206 stbir__span * newspan = scanline_extents->spans + 1;
3207 STBIR_ASSERT( right_margin == 0 );
3208 if ( min_left < scanline_extents->spans[0].n0 )
3209 {
3210 scanline_extents->spans[1].pixel_offset_for_input = scanline_extents->spans[0].n0;
3211 scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0;
3212 scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1;
3213 --newspan;
3214 }
3215 newspan->pixel_offset_for_input = min_left;
3216 newspan->n0 = -left_margin;
3217 newspan->n1 = ( max_left - min_left ) - left_margin;
3218 scanline_extents->edge_sizes[0] = 0; // don't need to copy the left margin, since we are directly decoding into the margin
3219 return;
3220 }
  // if we can't merge the min_right range, add it as a second range
3223 if ( ( right_margin ) && ( min_right != 0x7fffffff ) )
3224 {
3225 stbir__span * newspan = scanline_extents->spans + 1;
3226 if ( min_right < scanline_extents->spans[0].n0 )
3227 {
3228 scanline_extents->spans[1].pixel_offset_for_input = scanline_extents->spans[0].n0;
3229 scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0;
3230 scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1;
3231 --newspan;
3232 }
3233 newspan->pixel_offset_for_input = min_right;
3234 newspan->n0 = scanline_extents->spans[1].n1 + 1;
3235 newspan->n1 = scanline_extents->spans[1].n1 + 1 + ( max_right - min_right );
3236 scanline_extents->edge_sizes[1] = 0; // don't need to copy the right margin, since we are directly decoding into the margin
3237 return;
3238 }
3239}
3241static void stbir__calculate_in_pixel_range( int * first_pixel, int * last_pixel, float out_pixel_center, float out_filter_radius, float inv_scale, float out_shift, int input_size, stbir_edge edge )
3242{
3243 int first, last;
3244 float out_pixel_influence_lowerbound = out_pixel_center - out_filter_radius;
3245 float out_pixel_influence_upperbound = out_pixel_center + out_filter_radius;
3247 float in_pixel_influence_lowerbound = (out_pixel_influence_lowerbound + out_shift) * inv_scale;
3248 float in_pixel_influence_upperbound = (out_pixel_influence_upperbound + out_shift) * inv_scale;
3250 first = (int)(STBIR_FLOORF(in_pixel_influence_lowerbound + 0.5f));
3251 last = (int)(STBIR_FLOORF(in_pixel_influence_upperbound - 0.5f));
3252 if ( last < first ) last = first; // point sample mode can span a value *right* at 0.5, and cause these to cross
3254 if ( edge == STBIR_EDGE_WRAP )
3255 {
3256 if ( first < -input_size )
3257 first = -input_size;
3258 if ( last >= (input_size*2))
3259 last = (input_size*2) - 1;
3260 }
3262 *first_pixel = first;
3263 *last_pixel = last;
3264}
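/*
  Illustrative sketch, not part of the library: the range calculation above projects an
  output pixel's filter footprint into input space and keeps the input pixels whose
  centers fall inside it. The sketch omits the WRAP clamp and uses hypothetical names;
  the usage note assumes the caller passes an output-space radius of support * scale
  for an upsample.

```c
#include <assert.h>

// Floor to int without libm (stands in for (int)STBIR_FLOORF(v)).
static int example_floor( float v )
{
  int i = (int) v;
  return ( (float) i > v ) ? ( i - 1 ) : i;
}

// Same arithmetic as stbir__calculate_in_pixel_range, without the WRAP clamp.
static void example_in_pixel_range( int * first, int * last, int out_pixel, float out_filter_radius, float inv_scale, float out_shift )
{
  float out_center = (float) out_pixel + 0.5f;
  float in_lower = ( ( out_center - out_filter_radius ) + out_shift ) * inv_scale;
  float in_upper = ( ( out_center + out_filter_radius ) + out_shift ) * inv_scale;

  assert( out_filter_radius >= 0.0f );
  *first = example_floor( in_lower + 0.5f );  // first input pixel whose center lies above in_lower
  *last  = example_floor( in_upper - 0.5f );  // last input pixel whose center is at or below in_upper
  if ( *last < *first ) *last = *first;       // the bounds can cross in point-sample mode
}
```

  For a 2x upscale (inv_scale 0.5) with an output radius of 4 and no shift, output pixel
  0 covers [-1.75, 2.25] in input space, which selects input pixels -2 through 1. The
  negative indices are later resolved by the edge mode.
*/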
3266static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_radius, stbir__kernel_callback * kernel, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float* coefficient_group, int coefficient_width, stbir_edge edge, void * user_data )
3267{
3268 int n, end;
3269 float inv_scale = scale_info->inv_scale;
3270 float out_shift = scale_info->pixel_shift;
3271 int input_size = scale_info->input_full_size;
3272 int numerator = scale_info->scale_numerator;
3273 int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < num_contributors ) );
3275 // Looping through out pixels
3276 end = num_contributors; if ( polyphase ) end = numerator;
3277 for (n = 0; n < end; n++)
3278 {
3279 int i;
3280 int last_non_zero;
3281 float out_pixel_center = (float)n + 0.5f;
3282 float in_center_of_out = (out_pixel_center + out_shift) * inv_scale;
3284 int in_first_pixel, in_last_pixel;
3286 stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, out_pixel_center, out_filter_radius, inv_scale, out_shift, input_size, edge );
3288 // make sure we never generate a range larger than our precalculated coeff width
3289 // this only happens in point sample mode, but it's a good safe thing to do anyway
3290 if ( ( in_last_pixel - in_first_pixel + 1 ) > coefficient_width )
3291 in_last_pixel = in_first_pixel + coefficient_width - 1;
3293 last_non_zero = -1;
3294 for (i = 0; i <= in_last_pixel - in_first_pixel; i++)
3295 {
3296 float in_pixel_center = (float)(i + in_first_pixel) + 0.5f;
3297 float coeff = kernel(in_center_of_out - in_pixel_center, inv_scale, user_data);
3299 // kill denormals
3300 if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) )
3301 {
3302 if ( i == 0 ) // if we're at the front, just eat zero contributors
3303 {
3304 STBIR_ASSERT ( ( in_last_pixel - in_first_pixel ) != 0 ); // there should be at least one contrib
3305 ++in_first_pixel;
3306 i--;
3307 continue;
3308 }
3309 coeff = 0; // make sure is fully zero (should keep denormals away)
3310 }
3311 else
3312 last_non_zero = i;
3314 coefficient_group[i] = coeff;
3315 }
3317 in_last_pixel = last_non_zero+in_first_pixel; // kills trailing zeros
3318 contributors->n0 = in_first_pixel;
3319 contributors->n1 = in_last_pixel;
3321 STBIR_ASSERT(contributors->n1 >= contributors->n0);
3323 ++contributors;
3324 coefficient_group += coefficient_width;
3325 }
3326}
3328static void stbir__insert_coeff( stbir__contributors * contribs, float * coeffs, int new_pixel, float new_coeff, int max_width )
3329{
3330 if ( new_pixel <= contribs->n1 ) // before the end
3331 {
3332 if ( new_pixel < contribs->n0 ) // before the front?
3333 {
3334 if ( ( contribs->n1 - new_pixel + 1 ) <= max_width )
3335 {
        int j, o = contribs->n0 - new_pixel;
        for ( j = contribs->n1 - contribs->n0 ; j >= 0 ; j-- ) // slide the existing coeffs up by o, back to front
          coeffs[ j + o ] = coeffs[ j ];
        for ( j = 1 ; j < o ; j++ ) // clear in-between coeffs if there are any
          coeffs[ j ] = 0;
3341 coeffs[ 0 ] = new_coeff;
3342 contribs->n0 = new_pixel;
3343 }
3344 }
3345 else
3346 {
3347 coeffs[ new_pixel - contribs->n0 ] += new_coeff;
3348 }
3349 }
3350 else
3351 {
3352 if ( ( new_pixel - contribs->n0 + 1 ) <= max_width )
3353 {
3354 int j, e = new_pixel - contribs->n0;
3355 for( j = ( contribs->n1 - contribs->n0 ) + 1 ; j < e ; j++ ) // clear in-betweens coeffs if there are any
3356 coeffs[j] = 0;
3358 coeffs[ e ] = new_coeff;
3359 contribs->n1 = new_pixel;
3360 }
3361 }
3362}
3364static void stbir__calculate_out_pixel_range( int * first_pixel, int * last_pixel, float in_pixel_center, float in_pixels_radius, float scale, float out_shift, int out_size )
3365{
3366 float in_pixel_influence_lowerbound = in_pixel_center - in_pixels_radius;
3367 float in_pixel_influence_upperbound = in_pixel_center + in_pixels_radius;
3368 float out_pixel_influence_lowerbound = in_pixel_influence_lowerbound * scale - out_shift;
3369 float out_pixel_influence_upperbound = in_pixel_influence_upperbound * scale - out_shift;
3370 int out_first_pixel = (int)(STBIR_FLOORF(out_pixel_influence_lowerbound + 0.5f));
3371 int out_last_pixel = (int)(STBIR_FLOORF(out_pixel_influence_upperbound - 0.5f));
3373 if ( out_first_pixel < 0 )
3374 out_first_pixel = 0;
3375 if ( out_last_pixel >= out_size )
3376 out_last_pixel = out_size - 1;
3377 *first_pixel = out_first_pixel;
3378 *last_pixel = out_last_pixel;
3379}
3381static void stbir__calculate_coefficients_for_gather_downsample( int start, int end, float in_pixels_radius, stbir__kernel_callback * kernel, stbir__scale_info * scale_info, int coefficient_width, int num_contributors, stbir__contributors * contributors, float * coefficient_group, void * user_data )
3382{
3383 int in_pixel;
3384 int i;
3385 int first_out_inited = -1;
3386 float scale = scale_info->scale;
3387 float out_shift = scale_info->pixel_shift;
3388 int out_size = scale_info->output_sub_size;
3389 int numerator = scale_info->scale_numerator;
3390 int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < out_size ) );
3392 STBIR__UNUSED(num_contributors);
3394 // Loop through the input pixels
3395 for (in_pixel = start; in_pixel < end; in_pixel++)
3396 {
3397 float in_pixel_center = (float)in_pixel + 0.5f;
3398 float out_center_of_in = in_pixel_center * scale - out_shift;
3399 int out_first_pixel, out_last_pixel;
3401 stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, in_pixel_center, in_pixels_radius, scale, out_shift, out_size );
3403 if ( out_first_pixel > out_last_pixel )
3404 continue;
3406 // clamp or exit if we are using polyphase filtering, and the limit is up
3407 if ( polyphase )
3408 {
3409 // when polyphase, you only have to do coeffs up to the numerator count
3410 if ( out_first_pixel == numerator )
3411 break;
3413 // don't do any extra work, clamp last pixel at numerator too
3414 if ( out_last_pixel >= numerator )
3415 out_last_pixel = numerator - 1;
3416 }
3418 for (i = 0; i <= out_last_pixel - out_first_pixel; i++)
3419 {
3420 float out_pixel_center = (float)(i + out_first_pixel) + 0.5f;
3421 float x = out_pixel_center - out_center_of_in;
3422 float coeff = kernel(x, scale, user_data) * scale;
3424 // kill the coeff if it's too small (avoid denormals)
3425 if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) )
3426 coeff = 0.0f;
3428 {
3429 int out = i + out_first_pixel;
3430 float * coeffs = coefficient_group + out * coefficient_width;
3431 stbir__contributors * contribs = contributors + out;
3433 // is this the first time this output pixel has been seen? Init it.
3434 if ( out > first_out_inited )
3435 {
3436 STBIR_ASSERT( out == ( first_out_inited + 1 ) ); // ensure we have only advanced one at time
3437 first_out_inited = out;
3438 contribs->n0 = in_pixel;
3439 contribs->n1 = in_pixel;
3440 coeffs[0] = coeff;
3441 }
3442 else
3443 {
3444 // insert on end (always in order)
          if ( coeffs[0] == 0.0f ) // if the first coefficient is zero, then drop it from this set of coeffs
3446 {
3447 STBIR_ASSERT( ( in_pixel - contribs->n0 ) == 1 ); // ensure that when we zap, we're at the 2nd pos
3448 contribs->n0 = in_pixel;
3449 }
3450 contribs->n1 = in_pixel;
3451 STBIR_ASSERT( ( in_pixel - contribs->n0 ) < coefficient_width );
3452 coeffs[in_pixel - contribs->n0] = coeff;
3453 }
3454 }
3455 }
3456 }
3457}
3459#ifdef STBIR_RENORMALIZE_IN_FLOAT
3460#define STBIR_RENORM_TYPE float
3461#else
3462#define STBIR_RENORM_TYPE double
3463#endif
3465static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter_extent_info* filter_info, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float * coefficient_group, int coefficient_width )
3466{
3467 int input_size = scale_info->input_full_size;
3468 int input_last_n1 = input_size - 1;
3469 int n, end;
3470 int lowest = 0x7fffffff;
3471 int highest = -0x7fffffff;
3472 int widest = -1;
3473 int numerator = scale_info->scale_numerator;
3474 int denominator = scale_info->scale_denominator;
3475 int polyphase = ( ( scale_info->scale_is_rational ) && ( numerator < num_contributors ) );
3476 float * coeffs;
3477 stbir__contributors * contribs;
3479 // weight all the coeffs for each sample
3480 coeffs = coefficient_group;
3481 contribs = contributors;
3482 end = num_contributors; if ( polyphase ) end = numerator;
3483 for (n = 0; n < end; n++)
3484 {
3485 int i;
3486 STBIR_RENORM_TYPE filter_scale, total_filter = 0;
3487 int e;
3489 // add all contribs
3490 e = contribs->n1 - contribs->n0;
3491 for( i = 0 ; i <= e ; i++ )
3492 {
3493 total_filter += (STBIR_RENORM_TYPE) coeffs[i];
3494 STBIR_ASSERT( ( coeffs[i] >= -2.0f ) && ( coeffs[i] <= 2.0f ) ); // check for wonky weights
3495 }
3497 // rescale
3498 if ( ( total_filter < stbir__small_float ) && ( total_filter > -stbir__small_float ) )
3499 {
3500 // all coeffs are extremely small, just zero it
3501 contribs->n1 = contribs->n0;
3502 coeffs[0] = 0.0f;
3503 }
3504 else
3505 {
3506 // if the total isn't 1.0, rescale everything
3507 if ( ( total_filter < (1.0f-stbir__small_float) ) || ( total_filter > (1.0f+stbir__small_float) ) )
3508 {
3509 filter_scale = ((STBIR_RENORM_TYPE)1.0) / total_filter;
3511 // scale them all
3512 for (i = 0; i <= e; i++)
3513 coeffs[i] = (float) ( coeffs[i] * filter_scale );
3514 }
3515 }
3516 ++contribs;
3517 coeffs += coefficient_width;
3518 }
3520 // if we have a rational for the scale, we can exploit the polyphaseness to not calculate
3521 // most of the coefficients, so we copy them here
3522 if ( polyphase )
3523 {
3524 stbir__contributors * prev_contribs = contributors;
3525 stbir__contributors * cur_contribs = contributors + numerator;
3527 for( n = numerator ; n < num_contributors ; n++ )
3528 {
3529 cur_contribs->n0 = prev_contribs->n0 + denominator;
3530 cur_contribs->n1 = prev_contribs->n1 + denominator;
3531 ++cur_contribs;
3532 ++prev_contribs;
3533 }
3534 stbir_overlapping_memcpy( coefficient_group + numerator * coefficient_width, coefficient_group, ( num_contributors - numerator ) * coefficient_width * sizeof( coeffs[ 0 ] ) );
3535 }
3537 coeffs = coefficient_group;
3538 contribs = contributors;
3540 for (n = 0; n < num_contributors; n++)
3541 {
3542 int i;
3544 // in zero edge mode, just remove out of bounds contribs completely (since their weights are accounted for now)
3545 if ( edge == STBIR_EDGE_ZERO )
3546 {
3547 // shrink the right side if necessary
3548 if ( contribs->n1 > input_last_n1 )
3549 contribs->n1 = input_last_n1;
3551 // shrink the left side
3552 if ( contribs->n0 < 0 )
3553 {
3554 int j, left, skips = 0;
3556 skips = -contribs->n0;
3557 contribs->n0 = 0;
3559 // now move down the weights
3560 left = contribs->n1 - contribs->n0 + 1;
3561 if ( left > 0 )
3562 {
3563 for( j = 0 ; j < left ; j++ )
3564 coeffs[ j ] = coeffs[ j + skips ];
3565 }
3566 }
3567 }
3568 else if ( ( edge == STBIR_EDGE_CLAMP ) || ( edge == STBIR_EDGE_REFLECT ) )
3569 {
3570 // for clamp and reflect, calculate the true inbounds position (based on edge type) and just add that to the existing weight
3572 // right hand side first
3573 if ( contribs->n1 > input_last_n1 )
3574 {
3575 int start = contribs->n0;
3576 int endi = contribs->n1;
3577 contribs->n1 = input_last_n1;
3578 for( i = input_size; i <= endi; i++ )
3579 stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), coeffs[i-start], coefficient_width );
3580 }
3582 // now check left hand edge
3583 if ( contribs->n0 < 0 )
3584 {
3585 int save_n0;
3586 float save_n0_coeff;
3587 float * c = coeffs - ( contribs->n0 + 1 );
3589 // reinsert the coeffs with it reflected or clamped (insert accumulates, if the coeffs exist)
3590 for( i = -1 ; i > contribs->n0 ; i-- )
3591 stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), *c--, coefficient_width );
3592 save_n0 = contribs->n0;
3593 save_n0_coeff = c[0]; // save it, since we didn't do the final one (i==n0), because there might be too many coeffs to hold (before we resize)!
3595 // now slide all the coeffs down (since we have accumulated them in the positive contribs) and reset the first contrib
3596 contribs->n0 = 0;
3597 for(i = 0 ; i <= contribs->n1 ; i++ )
3598 coeffs[i] = coeffs[i-save_n0];
3600 // now that we have shrunk down the contribs, we insert the first one safely
3601 stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( save_n0, input_size ), save_n0_coeff, coefficient_width );
3602 }
3603 }
3605 if ( contribs->n0 <= contribs->n1 )
3606 {
3607 int diff = contribs->n1 - contribs->n0 + 1;
3608 while ( diff && ( coeffs[ diff-1 ] == 0.0f ) )
3609 --diff;
3611 contribs->n1 = contribs->n0 + diff - 1;
3613 if ( contribs->n0 <= contribs->n1 )
3614 {
3615 if ( contribs->n0 < lowest )
3616 lowest = contribs->n0;
3617 if ( contribs->n1 > highest )
3618 highest = contribs->n1;
3619 if ( diff > widest )
3620 widest = diff;
3621 }
3623 // re-zero out unused coefficients (if any)
3624 for( i = diff ; i < coefficient_width ; i++ )
3625 coeffs[i] = 0.0f;
3626 }
3628 ++contribs;
3629 coeffs += coefficient_width;
3630 }
3631 filter_info->lowest = lowest;
3632 filter_info->highest = highest;
3633 filter_info->widest = widest;
3634}
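/*
   A minimal sketch of the clamp folding idea used in the cleanup above, under
   simplifying assumptions. It is not the library's routine: the real code folds
   in place through stbir__insert_coeff and keeps a compact n0..n1 range, while
   this sketch accumulates into a dense array with one weight per input pixel.
   The names sketch_clamp_index and sketch_fold_clamp are hypothetical, and only
   the clamp edge mode is shown (reflect would swap in a different index mapping).

```c
#include <assert.h>

// Hypothetical helper (not part of the library): clamp an index into [0, size-1].
static int sketch_clamp_index( int i, int size )
{
  if ( i < 0 ) return 0;
  if ( i >= size ) return size - 1;
  return i;
}

// Fold the filter taps covering input positions [n0, n1] (which may lie outside
// the image) into dense per-pixel weights. folded must hold input_size floats.
static void sketch_fold_clamp( int n0, int n1, float const * coeffs, int input_size, float * folded )
{
  int i;

  assert( input_size > 0 );

  for ( i = 0 ; i < input_size ; i++ )
    folded[i] = 0.0f;

  // an out-of-bounds tap adds its weight to the edge pixel it clamps to
  for ( i = n0 ; i <= n1 ; i++ )
    folded[ sketch_clamp_index( i, input_size ) ] += coeffs[ i - n0 ];
}
```

   For example, taps { 0.25, 0.5, 0.25 } over positions -1..1 fold to a weight
   of 0.75 on pixel 0 and 0.25 on pixel 1, so the total weight is preserved.
*/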
3636#undef STBIR_RENORM_TYPE
3638static int stbir__pack_coefficients( int num_contributors, stbir__contributors* contributors, float * coefficents, int coefficient_width, int widest, int row0, int row1 )
3639{
3640 #define STBIR_MOVE_1( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint32*)(dest))[0] = ((stbir_uint32*)(src))[0]; }
3641 #define STBIR_MOVE_2( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; }
3642 #ifdef STBIR_SIMD
3643 #define STBIR_MOVE_4( dest, src ) { stbir__simdf t; STBIR_NO_UNROLL(dest); stbir__simdf_load( t, src ); stbir__simdf_store( dest, t ); }
3644 #else
3645 #define STBIR_MOVE_4( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; ((stbir_uint64*)(dest))[1] = ((stbir_uint64*)(src))[1]; }
3646 #endif
3648 int row_end = row1 + 1;
3649 STBIR__UNUSED( row0 ); // only used in an assert
3651 if ( coefficient_width != widest )
3652 {
3653 float * pc = coefficents;
3654 float * coeffs = coefficents;
3655 float * pc_end = coefficents + num_contributors * widest;
3656 switch( widest )
3657 {
3658 case 1:
3659 STBIR_NO_UNROLL_LOOP_START
3660 do {
3661 STBIR_MOVE_1( pc, coeffs );
3662 ++pc;
3663 coeffs += coefficient_width;
3664 } while ( pc < pc_end );
3665 break;
3666 case 2:
3667 STBIR_NO_UNROLL_LOOP_START
3668 do {
3669 STBIR_MOVE_2( pc, coeffs );
3670 pc += 2;
3671 coeffs += coefficient_width;
3672 } while ( pc < pc_end );
3673 break;
3674 case 3:
3675 STBIR_NO_UNROLL_LOOP_START
3676 do {
3677 STBIR_MOVE_2( pc, coeffs );
3678 STBIR_MOVE_1( pc+2, coeffs+2 );
3679 pc += 3;
3680 coeffs += coefficient_width;
3681 } while ( pc < pc_end );
3682 break;
3683 case 4:
3684 STBIR_NO_UNROLL_LOOP_START
3685 do {
3686 STBIR_MOVE_4( pc, coeffs );
3687 pc += 4;
3688 coeffs += coefficient_width;
3689 } while ( pc < pc_end );
3690 break;
3691 case 5:
3692 STBIR_NO_UNROLL_LOOP_START
3693 do {
3694 STBIR_MOVE_4( pc, coeffs );
3695 STBIR_MOVE_1( pc+4, coeffs+4 );
3696 pc += 5;
3697 coeffs += coefficient_width;
3698 } while ( pc < pc_end );
3699 break;
3700 case 6:
3701 STBIR_NO_UNROLL_LOOP_START
3702 do {
3703 STBIR_MOVE_4( pc, coeffs );
3704 STBIR_MOVE_2( pc+4, coeffs+4 );
3705 pc += 6;
3706 coeffs += coefficient_width;
3707 } while ( pc < pc_end );
3708 break;
3709 case 7:
3710 STBIR_NO_UNROLL_LOOP_START
3711 do {
3712 STBIR_MOVE_4( pc, coeffs );
3713 STBIR_MOVE_2( pc+4, coeffs+4 );
3714 STBIR_MOVE_1( pc+6, coeffs+6 );
3715 pc += 7;
3716 coeffs += coefficient_width;
3717 } while ( pc < pc_end );
3718 break;
3719 case 8:
3720 STBIR_NO_UNROLL_LOOP_START
3721 do {
3722 STBIR_MOVE_4( pc, coeffs );
3723 STBIR_MOVE_4( pc+4, coeffs+4 );
3724 pc += 8;
3725 coeffs += coefficient_width;
3726 } while ( pc < pc_end );
3727 break;
3728 case 9:
3729 STBIR_NO_UNROLL_LOOP_START
3730 do {
3731 STBIR_MOVE_4( pc, coeffs );
3732 STBIR_MOVE_4( pc+4, coeffs+4 );
3733 STBIR_MOVE_1( pc+8, coeffs+8 );
3734 pc += 9;
3735 coeffs += coefficient_width;
3736 } while ( pc < pc_end );
3737 break;
3738 case 10:
3739 STBIR_NO_UNROLL_LOOP_START
3740 do {
3741 STBIR_MOVE_4( pc, coeffs );
3742 STBIR_MOVE_4( pc+4, coeffs+4 );
3743 STBIR_MOVE_2( pc+8, coeffs+8 );
3744 pc += 10;
3745 coeffs += coefficient_width;
3746 } while ( pc < pc_end );
3747 break;
3748 case 11:
3749 STBIR_NO_UNROLL_LOOP_START
3750 do {
3751 STBIR_MOVE_4( pc, coeffs );
3752 STBIR_MOVE_4( pc+4, coeffs+4 );
3753 STBIR_MOVE_2( pc+8, coeffs+8 );
3754 STBIR_MOVE_1( pc+10, coeffs+10 );
3755 pc += 11;
3756 coeffs += coefficient_width;
3757 } while ( pc < pc_end );
3758 break;
3759 case 12:
3760 STBIR_NO_UNROLL_LOOP_START
3761 do {
3762 STBIR_MOVE_4( pc, coeffs );
3763 STBIR_MOVE_4( pc+4, coeffs+4 );
3764 STBIR_MOVE_4( pc+8, coeffs+8 );
3765 pc += 12;
3766 coeffs += coefficient_width;
3767 } while ( pc < pc_end );
3768 break;
3769 default:
3770 STBIR_NO_UNROLL_LOOP_START
3771 do {
3772 float * copy_end = pc + widest - 4;
3773 float * c = coeffs;
3774 do {
3775 STBIR_NO_UNROLL( pc );
3776 STBIR_MOVE_4( pc, c );
3777 pc += 4;
3778 c += 4;
3779 } while ( pc <= copy_end );
3780 copy_end += 4;
3781 STBIR_NO_UNROLL_LOOP_START
3782 while ( pc < copy_end )
3783 {
3784 STBIR_MOVE_1( pc, c );
3785 ++pc; ++c;
3786 }
3787 coeffs += coefficient_width;
3788 } while ( pc < pc_end );
3789 break;
3790 }
3791 }
3793 // some horizontal routines read one float off the end (which is then masked off), so put in a sentinel so we don't read an snan or denormal
3794 coefficents[ widest * num_contributors ] = 8888.0f;
3796 // the minimum we might read for unrolled filter widths is 12. So, we need to
3797 // make sure we never read outside the decode buffer, by possibly moving
3798 // the sample area back into the scanline, and putting zero weights first.
3799 // we start on the right edge and check until we're well past the possible
3800 // clip area (2*widest).
3801 {
3802 stbir__contributors * contribs = contributors + num_contributors - 1;
3803 float * coeffs = coefficents + widest * ( num_contributors - 1 );
3805 // go until no chance of clipping (this is usually less than 8 loops)
3806 while ( ( contribs >= contributors ) && ( ( contribs->n0 + widest*2 ) >= row_end ) )
3807 {
3808 // might we clip??
3809 if ( ( contribs->n0 + widest ) > row_end )
3810 {
3811 int stop_range = widest;
3813 // if range is larger than 12, it will be handled by generic loops that can terminate on the exact length
3814 // of this contrib n1, instead of a fixed widest amount - so calculate this
3815 if ( widest > 12 )
3816 {
3817 int mod;
3819 // how far will be read in the n_coeff loop (which depends on the widest count mod4);
3820 mod = widest & 3;
3821 stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod;
3823 // the n_coeff loops do a minimum amount of coeffs, so factor that in!
3824 if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod;
3825 }
3827 // now see if we still clip with the refined range
3828 if ( ( contribs->n0 + stop_range ) > row_end )
3829 {
3830 int new_n0 = row_end - stop_range;
3831 int num = contribs->n1 - contribs->n0 + 1;
3832 int backup = contribs->n0 - new_n0;
3833 float * from_co = coeffs + num - 1;
3834 float * to_co = from_co + backup;
3836 STBIR_ASSERT( ( new_n0 >= row0 ) && ( new_n0 < contribs->n0 ) );
3838 // move the coeffs over
3839 while( num )
3840 {
3841 *to_co-- = *from_co--;
3842 --num;
3843 }
3844 // zero new positions
3845 while ( to_co >= coeffs )
3846 *to_co-- = 0;
3847 // set new start point
3848 contribs->n0 = new_n0;
3849 if ( widest > 12 )
3850 {
3851 int mod;
3853 // how far will be read in the n_coeff loop (which depends on the widest count mod4);
3854 mod = widest & 3;
3855 stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod;
3857 // the n_coeff loops do a minimum amount of coeffs, so factor that in!
3858 if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod;
3859 }
3860 }
3861 }
3862 --contribs;
3863 coeffs -= widest;
3864 }
3865 }
3867 return widest;
3868 #undef STBIR_MOVE_1
3869 #undef STBIR_MOVE_2
3870 #undef STBIR_MOVE_4
3871}
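/*
   A minimal sketch of the in-place packing idea at the top of
   stbir__pack_coefficients, assuming only plain scalar copies. The library
   version unrolls by width and uses SIMD moves; sketch_pack_rows is a
   hypothetical name used only for illustration. Rows are stored with a stride
   of `stride` floats, and only the first `widest` floats of each row are kept.

```c
#include <assert.h>

// Compact `rows` rows from a stride of `stride` floats down to `widest` floats,
// in place. Copying front to back is safe because the destination never runs
// ahead of the source.
static void sketch_pack_rows( float * coeffs, int rows, int stride, int widest )
{
  float * dst = coeffs;
  float const * src = coeffs;
  int r, k;

  assert( ( widest > 0 ) && ( widest <= stride ) );

  for ( r = 0 ; r < rows ; r++ )
  {
    for ( k = 0 ; k < widest ; k++ )
      dst[k] = src[k];
    dst += widest;
    src += stride;
  }
}
```

   After packing, row r starts at coeffs + r * widest, which matches how the
   clipping loop above steps backwards with coeffs -= widest.
*/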
3873static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * other_axis_for_pivot, void * user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO )
3874{
3875 int n;
3876 float scale = samp->scale_info.scale;
3877 stbir__kernel_callback * kernel = samp->filter_kernel;
3878 stbir__support_callback * support = samp->filter_support;
3879 float inv_scale = samp->scale_info.inv_scale;
3880 int input_full_size = samp->scale_info.input_full_size;
3881 int gather_num_contributors = samp->num_contributors;
3882 stbir__contributors* gather_contributors = samp->contributors;
3883 float * gather_coeffs = samp->coefficients;
3884 int gather_coefficient_width = samp->coefficient_width;
3886 switch ( samp->is_gather )
3887 {
3888 case 1: // gather upsample
3889 {
3890 float out_pixels_radius = support(inv_scale,user_data) * scale;
3892 stbir__calculate_coefficients_for_gather_upsample( out_pixels_radius, kernel, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width, samp->edge, user_data );
3894 STBIR_PROFILE_BUILD_START( cleanup );
3895 stbir__cleanup_gathered_coefficients( samp->edge, &samp->extent_info, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width );
3896 STBIR_PROFILE_BUILD_END( cleanup );
3897 }
3898 break;
3900 case 0: // scatter downsample (only on vertical)
3901 case 2: // gather downsample
3902 {
3903 float in_pixels_radius = support(scale,user_data) * inv_scale;
3904 int filter_pixel_margin = samp->filter_pixel_margin;
3905 int input_end = input_full_size + filter_pixel_margin;
3907 // if this is a scatter, we do a downsample gather to get the coeffs, and then pivot after
3908 if ( !samp->is_gather )
3909 {
3910 // check if we are using the same gather downsample on the horizontal as this vertical,
3911 // if so, then we don't have to generate them, we can just pivot from the horizontal.
3912 if ( other_axis_for_pivot )
3913 {
3914 gather_contributors = other_axis_for_pivot->contributors;
3915 gather_coeffs = other_axis_for_pivot->coefficients;
3916 gather_coefficient_width = other_axis_for_pivot->coefficient_width;
3917 gather_num_contributors = other_axis_for_pivot->num_contributors;
3918 samp->extent_info.lowest = other_axis_for_pivot->extent_info.lowest;
3919 samp->extent_info.highest = other_axis_for_pivot->extent_info.highest;
3920 samp->extent_info.widest = other_axis_for_pivot->extent_info.widest;
3921 goto jump_right_to_pivot;
3922 }
3924 gather_contributors = samp->gather_prescatter_contributors;
3925 gather_coeffs = samp->gather_prescatter_coefficients;
3926 gather_coefficient_width = samp->gather_prescatter_coefficient_width;
3927 gather_num_contributors = samp->gather_prescatter_num_contributors;
3928 }
3930 stbir__calculate_coefficients_for_gather_downsample( -filter_pixel_margin, input_end, in_pixels_radius, kernel, &samp->scale_info, gather_coefficient_width, gather_num_contributors, gather_contributors, gather_coeffs, user_data );
3932 STBIR_PROFILE_BUILD_START( cleanup );
3933 stbir__cleanup_gathered_coefficients( samp->edge, &samp->extent_info, &samp->scale_info, gather_num_contributors, gather_contributors, gather_coeffs, gather_coefficient_width );
3934 STBIR_PROFILE_BUILD_END( cleanup );
3936 if ( !samp->is_gather )
3937 {
3938 // if this is a scatter (vertical only), then we need to pivot the coeffs
3939 stbir__contributors * scatter_contributors;
3940 int highest_set;
3942 jump_right_to_pivot:
3944 STBIR_PROFILE_BUILD_START( pivot );
3946 highest_set = (-filter_pixel_margin) - 1;
3947 for (n = 0; n < gather_num_contributors; n++)
3948 {
3949 int k;
3950 int gn0 = gather_contributors->n0, gn1 = gather_contributors->n1;
3951 int scatter_coefficient_width = samp->coefficient_width;
3952 float * scatter_coeffs = samp->coefficients + ( gn0 + filter_pixel_margin ) * scatter_coefficient_width;
3953 float * g_coeffs = gather_coeffs;
3954 scatter_contributors = samp->contributors + ( gn0 + filter_pixel_margin );
3956 for (k = gn0 ; k <= gn1 ; k++ )
3957 {
3958 float gc = *g_coeffs++;
3960 // skip zero and denormals - must skip zeros to avoid adding coeffs beyond scatter_coefficient_width
3961 // (which happens when pivoting from horizontal, which might have dummy zeros)
3962 if ( ( ( gc >= stbir__small_float ) || ( gc <= -stbir__small_float ) ) )
3963 {
3964 if ( ( k > highest_set ) || ( scatter_contributors->n0 > scatter_contributors->n1 ) )
3965 {
3966 {
3967 // if we are skipping over several contributors, we need to clear the skipped ones
3968 stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1);
3969 while ( clear_contributors < scatter_contributors )
3970 {
3971 clear_contributors->n0 = 0;
3972 clear_contributors->n1 = -1;
3973 ++clear_contributors;
3974 }
3975 }
3976 scatter_contributors->n0 = n;
3977 scatter_contributors->n1 = n;
3978 scatter_coeffs[0] = gc;
3979 highest_set = k;
3980 }
3981 else
3982 {
3983 stbir__insert_coeff( scatter_contributors, scatter_coeffs, n, gc, scatter_coefficient_width );
3984 }
3985 STBIR_ASSERT( ( scatter_contributors->n1 - scatter_contributors->n0 + 1 ) <= scatter_coefficient_width );
3986 }
3987 ++scatter_contributors;
3988 scatter_coeffs += scatter_coefficient_width;
3989 }
3991 ++gather_contributors;
3992 gather_coeffs += gather_coefficient_width;
3993 }
3995 // now clear any unset contribs
3996 {
3997 stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1);
3998 stbir__contributors * end_contributors = samp->contributors + samp->num_contributors;
3999 while ( clear_contributors < end_contributors )
4000 {
4001 clear_contributors->n0 = 0;
4002 clear_contributors->n1 = -1;
4003 ++clear_contributors;
4004 }
4005 }
4007 STBIR_PROFILE_BUILD_END( pivot );
4008 }
4009 }
4010 break;
4011 }
4012}
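/*
   A minimal sketch of the gather-to-scatter pivot performed above, under
   simplifying assumptions: no filter_pixel_margin offset (input indices start
   at 0), exact zeros are skipped instead of comparing against
   stbir__small_float, and the scatter storage is cleared up front rather than
   lazily. sketch_range and sketch_pivot are hypothetical names. A gather list
   says "output n reads inputs n0..n1"; the pivoted scatter list says "input k
   writes to outputs n0..n1".

```c
#include <assert.h>

typedef struct { int n0, n1; } sketch_range; // inclusive range; n0 > n1 means empty

static void sketch_pivot( sketch_range const * gather, float const * gather_coeffs, int gather_width, int num_outputs,
                          sketch_range * scatter, float * scatter_coeffs, int scatter_width, int num_inputs )
{
  int n, k;

  // start with every input contributing to nothing
  for ( k = 0 ; k < num_inputs ; k++ )
  {
    int j;
    scatter[k].n0 = 0;
    scatter[k].n1 = -1;
    for ( j = 0 ; j < scatter_width ; j++ )
      scatter_coeffs[ k * scatter_width + j ] = 0.0f;
  }

  // outputs are visited in increasing order, so each input's range only grows to the right
  for ( n = 0 ; n < num_outputs ; n++ )
  {
    for ( k = gather[n].n0 ; k <= gather[n].n1 ; k++ )
    {
      float gc = gather_coeffs[ n * gather_width + ( k - gather[n].n0 ) ];
      if ( gc == 0.0f )
        continue;

      assert( ( k >= 0 ) && ( k < num_inputs ) );
      if ( scatter[k].n0 > scatter[k].n1 )
        scatter[k].n0 = n; // first output touched by this input

      assert( ( n - scatter[k].n0 ) < scatter_width );
      scatter_coeffs[ k * scatter_width + ( n - scatter[k].n0 ) ] = gc;
      scatter[k].n1 = n;
    }
  }
}
```

   With two outputs gathering inputs 0..2 and 2..4, input 2 ends up scattering
   to outputs 0..1, while the other inputs each scatter to a single output.
*/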
4015//========================================================================================================
4016// scanline decoders and encoders
4018#define stbir__coder_min_num 1
4019#define STB_IMAGE_RESIZE_DO_CODERS
4020#include STBIR__HEADER_FILENAME
4022#define stbir__decode_suffix BGRA
4023#define stbir__decode_swizzle
4024#define stbir__decode_order0 2
4025#define stbir__decode_order1 1
4026#define stbir__decode_order2 0
4027#define stbir__decode_order3 3
4028#define stbir__encode_order0 2
4029#define stbir__encode_order1 1
4030#define stbir__encode_order2 0
4031#define stbir__encode_order3 3
4032#define stbir__coder_min_num 4
4033#define STB_IMAGE_RESIZE_DO_CODERS
4034#include STBIR__HEADER_FILENAME
4036#define stbir__decode_suffix ARGB
4037#define stbir__decode_swizzle
4038#define stbir__decode_order0 1
4039#define stbir__decode_order1 2
4040#define stbir__decode_order2 3
4041#define stbir__decode_order3 0
4042#define stbir__encode_order0 3
4043#define stbir__encode_order1 0
4044#define stbir__encode_order2 1
4045#define stbir__encode_order3 2
4046#define stbir__coder_min_num 4
4047#define STB_IMAGE_RESIZE_DO_CODERS
4048#include STBIR__HEADER_FILENAME
4050#define stbir__decode_suffix ABGR
4051#define stbir__decode_swizzle
4052#define stbir__decode_order0 3
4053#define stbir__decode_order1 2
4054#define stbir__decode_order2 1
4055#define stbir__decode_order3 0
4056#define stbir__encode_order0 3
4057#define stbir__encode_order1 2
4058#define stbir__encode_order2 1
4059#define stbir__encode_order3 0
4060#define stbir__coder_min_num 4
4061#define STB_IMAGE_RESIZE_DO_CODERS
4062#include STBIR__HEADER_FILENAME
4064#define stbir__decode_suffix AR
4065#define stbir__decode_swizzle
4066#define stbir__decode_order0 1
4067#define stbir__decode_order1 0
4068#define stbir__decode_order2 3
4069#define stbir__decode_order3 2
4070#define stbir__encode_order0 1
4071#define stbir__encode_order1 0
4072#define stbir__encode_order2 3
4073#define stbir__encode_order3 2
4074#define stbir__coder_min_num 2
4075#define STB_IMAGE_RESIZE_DO_CODERS
4076#include STBIR__HEADER_FILENAME
4079 // fancy alpha means we expand to keep both premultiplied and non-premultiplied color channels
4080static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_channels )
4081{
4082 float STBIR_STREAMOUT_PTR(*) out = out_buffer;
4083 float const * end_decode = out_buffer + ( width_times_channels / 4 ) * 7; // decode buffer aligned to end of out_buffer
4084 float STBIR_STREAMOUT_PTR(*) decode = (float*)end_decode - width_times_channels;
4086 // fancy alpha is stored internally as R G B A Rpm Gpm Bpm
4088 #ifdef STBIR_SIMD
4090 #ifdef STBIR_SIMD8
4091 decode += 16;
4092 STBIR_NO_UNROLL_LOOP_START
4093 while ( decode <= end_decode )
4094 {
4095 stbir__simdf8 d0,d1,a0,a1,p0,p1;
4096 STBIR_NO_UNROLL(decode);
4097 stbir__simdf8_load( d0, decode-16 );
4098 stbir__simdf8_load( d1, decode-16+8 );
4099 stbir__simdf8_0123to33333333( a0, d0 );
4100 stbir__simdf8_0123to33333333( a1, d1 );
4101 stbir__simdf8_mult( p0, a0, d0 );
4102 stbir__simdf8_mult( p1, a1, d1 );
4103 stbir__simdf8_bot4s( a0, d0, p0 );
4104 stbir__simdf8_bot4s( a1, d1, p1 );
4105 stbir__simdf8_top4s( d0, d0, p0 );
4106 stbir__simdf8_top4s( d1, d1, p1 );
4107 stbir__simdf8_store ( out, a0 );
4108 stbir__simdf8_store ( out+7, d0 );
4109 stbir__simdf8_store ( out+14, a1 );
4110 stbir__simdf8_store ( out+21, d1 );
4111 decode += 16;
4112 out += 28;
4113 }
4114 decode -= 16;
4115 #else
4116 decode += 8;
4117 STBIR_NO_UNROLL_LOOP_START
4118 while ( decode <= end_decode )
4119 {
4120 stbir__simdf d0,a0,d1,a1,p0,p1;
4121 STBIR_NO_UNROLL(decode);
4122 stbir__simdf_load( d0, decode-8 );
4123 stbir__simdf_load( d1, decode-8+4 );
4124 stbir__simdf_0123to3333( a0, d0 );
4125 stbir__simdf_0123to3333( a1, d1 );
4126 stbir__simdf_mult( p0, a0, d0 );
4127 stbir__simdf_mult( p1, a1, d1 );
4128 stbir__simdf_store ( out, d0 );
4129 stbir__simdf_store ( out+4, p0 );
4130 stbir__simdf_store ( out+7, d1 );
4131 stbir__simdf_store ( out+7+4, p1 );
4132 decode += 8;
4133 out += 14;
4134 }
4135 decode -= 8;
4136 #endif
4138 // might be one last odd pixel
4139 #ifdef STBIR_SIMD8
4140 STBIR_NO_UNROLL_LOOP_START
4141 while ( decode < end_decode )
4142 #else
4143 if ( decode < end_decode )
4144 #endif
4145 {
4146 stbir__simdf d,a,p;
4147 STBIR_NO_UNROLL(decode);
4148 stbir__simdf_load( d, decode );
4149 stbir__simdf_0123to3333( a, d );
4150 stbir__simdf_mult( p, a, d );
4151 stbir__simdf_store ( out, d );
4152 stbir__simdf_store ( out+4, p );
4153 decode += 4;
4154 out += 7;
4155 }
4157 #else
4159 while( decode < end_decode )
4160 {
4161 float r = decode[0], g = decode[1], b = decode[2], alpha = decode[3];
4162 out[0] = r;
4163 out[1] = g;
4164 out[2] = b;
4165 out[3] = alpha;
4166 out[4] = r * alpha;
4167 out[5] = g * alpha;
4168 out[6] = b * alpha;
4169 out += 7;
4170 decode += 4;
4171 }
4173 #endif
4174}
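/*
   A minimal scalar sketch of why the in-place expansion above is safe, assuming
   a buffer of 7 floats per pixel with the 4-float RGBA input right-aligned at
   its end (the same layout as the non-SIMD path). sketch_fancy_weight_4ch is a
   hypothetical name, and it takes a pixel count rather than
   width_times_channels. Writing 7 floats per pixel from the front never
   overtakes the unread input, as long as each pixel is fully read before it is
   written.

```c
#include <assert.h>

// buffer holds width*7 floats; on entry its last width*4 floats are R G B A pixels.
// On exit it holds width pixels of R G B A Rpm Gpm Bpm.
static void sketch_fancy_weight_4ch( float * buffer, int width )
{
  float * out = buffer;
  float const * in = buffer + width * 7 - width * 4;
  int i;

  assert( width > 0 );

  for ( i = 0 ; i < width ; i++ )
  {
    // read the whole pixel first: the output slot can overlap the pixel being read
    float r = in[0], g = in[1], b = in[2], a = in[3];
    out[0] = r;
    out[1] = g;
    out[2] = b;
    out[3] = a;
    out[4] = r * a;
    out[5] = g * a;
    out[6] = b * a;
    out += 7;
    in += 4;
  }
}
```

   The write for pixel i ends at float 7*i+6 and the next read starts at float
   3*width+4*(i+1), which is always further along for i < width.
*/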
4176static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_channels )
4177{
4178 float STBIR_STREAMOUT_PTR(*) out = out_buffer;
4179 float const * end_decode = out_buffer + ( width_times_channels / 2 ) * 3;
4180 float STBIR_STREAMOUT_PTR(*) decode = (float*)end_decode - width_times_channels;
4182 // for fancy alpha, turns into: [X A Xpm][X A Xpm],etc
4184 #ifdef STBIR_SIMD
4186 decode += 8;
4187 if ( decode <= end_decode )
4188 {
4189 STBIR_NO_UNROLL_LOOP_START
4190 do {
4191 #ifdef STBIR_SIMD8
4192 stbir__simdf8 d0,a0,p0;
4193 STBIR_NO_UNROLL(decode);
4194 stbir__simdf8_load( d0, decode-8 );
4195 stbir__simdf8_0123to11331133( p0, d0 );
4196 stbir__simdf8_0123to00220022( a0, d0 );
4197 stbir__simdf8_mult( p0, p0, a0 );
4199 stbir__simdf_store2( out, stbir__if_simdf8_cast_to_simdf4( d0 ) );
4200 stbir__simdf_store( out+2, stbir__if_simdf8_cast_to_simdf4( p0 ) );
4201 stbir__simdf_store2h( out+3, stbir__if_simdf8_cast_to_simdf4( d0 ) );
4203 stbir__simdf_store2( out+6, stbir__simdf8_gettop4( d0 ) );
4204 stbir__simdf_store( out+8, stbir__simdf8_gettop4( p0 ) );
4205 stbir__simdf_store2h( out+9, stbir__simdf8_gettop4( d0 ) );
4206 #else
4207 stbir__simdf d0,a0,d1,a1,p0,p1;
4208 STBIR_NO_UNROLL(decode);
4209 stbir__simdf_load( d0, decode-8 );
4210 stbir__simdf_load( d1, decode-8+4 );
4211 stbir__simdf_0123to1133( p0, d0 );
4212 stbir__simdf_0123to1133( p1, d1 );
4213 stbir__simdf_0123to0022( a0, d0 );
4214 stbir__simdf_0123to0022( a1, d1 );
4215 stbir__simdf_mult( p0, p0, a0 );
4216 stbir__simdf_mult( p1, p1, a1 );
4218 stbir__simdf_store2( out, d0 );
4219 stbir__simdf_store( out+2, p0 );
4220 stbir__simdf_store2h( out+3, d0 );
4222 stbir__simdf_store2( out+6, d1 );
4223 stbir__simdf_store( out+8, p1 );
4224 stbir__simdf_store2h( out+9, d1 );
4225 #endif
4226 decode += 8;
4227 out += 12;
4228 } while ( decode <= end_decode );
4229 }
4230 decode -= 8;
4231 #endif
4233 STBIR_SIMD_NO_UNROLL_LOOP_START
4234 while( decode < end_decode )
4235 {
4236 float x = decode[0], y = decode[1];
4237 STBIR_SIMD_NO_UNROLL(decode);
4238 out[0] = x;
4239 out[1] = y;
4240 out[2] = x * y;
4241 out += 3;
4242 decode += 2;
4243 }
4244}
4246static void stbir__fancy_alpha_unweight_4ch( float * encode_buffer, int width_times_channels )
4247{
4248 float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
4249 float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer;
4250 float const * end_output = encode_buffer + width_times_channels;
4252 // fancy RGBA is stored internally as R G B A Rpm Gpm Bpm
4254 STBIR_SIMD_NO_UNROLL_LOOP_START
4255 do {
4256 float alpha = input[3];
4257#ifdef STBIR_SIMD
4258 stbir__simdf i,ia;
4259 STBIR_SIMD_NO_UNROLL(encode);
4260 if ( alpha < stbir__small_float )
4261 {
4262 stbir__simdf_load( i, input );
4263 stbir__simdf_store( encode, i );
4264 }
4265 else
4266 {
4267 stbir__simdf_load1frep4( ia, 1.0f / alpha );
4268 stbir__simdf_load( i, input+4 );
4269 stbir__simdf_mult( i, i, ia );
4270 stbir__simdf_store( encode, i );
4271 encode[3] = alpha;
4272 }
4273#else
4274 if ( alpha < stbir__small_float )
4275 {
4276 encode[0] = input[0];
4277 encode[1] = input[1];
4278 encode[2] = input[2];
4279 }
4280 else
4281 {
4282 float ialpha = 1.0f / alpha;
4283 encode[0] = input[4] * ialpha;
4284 encode[1] = input[5] * ialpha;
4285 encode[2] = input[6] * ialpha;
4286 }
4287 encode[3] = alpha;
4288#endif
4290 input += 7;
4291 encode += 4;
4292 } while ( encode < end_output );
4293}
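/*
   A minimal scalar sketch of the unweight step above, showing what the extra
   non-premultiplied channels are for: when the filtered alpha is effectively
   zero, dividing by it is meaningless, so the separately filtered plain color
   is used instead. SKETCH_SMALL_ALPHA is a hypothetical stand-in for
   stbir__small_float (its actual value is not assumed here), and
   sketch_fancy_unweight_4ch takes a pixel count rather than
   width_times_channels.

```c
#include <assert.h>

#define SKETCH_SMALL_ALPHA 1e-30f // hypothetical threshold, not the library's constant

// buffer holds width pixels of R G B A Rpm Gpm Bpm; on exit its first width*4
// floats hold R G B A with color recovered from the premultiplied channels.
static void sketch_fancy_unweight_4ch( float * buffer, int width )
{
  float * encode = buffer;
  float const * input = buffer;
  int i;

  assert( width > 0 );

  for ( i = 0 ; i < width ; i++ )
  {
    float alpha = input[3];
    float r, g, b;

    if ( alpha < SKETCH_SMALL_ALPHA )
    {
      // alpha too small to divide by: fall back to the non-premultiplied color
      r = input[0]; g = input[1]; b = input[2];
    }
    else
    {
      float ialpha = 1.0f / alpha;
      r = input[4] * ialpha; g = input[5] * ialpha; b = input[6] * ialpha;
    }

    encode[0] = r;
    encode[1] = g;
    encode[2] = b;
    encode[3] = alpha;

    input += 7;
    encode += 4;
  }
}
```

   The output pointer advances 4 floats per pixel while the input advances 7,
   so compacting front to back in the same buffer never overwrites unread data.
*/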
4295// format: [X A Xpm][X A Xpm] etc
4296static void stbir__fancy_alpha_unweight_2ch( float * encode_buffer, int width_times_channels )
4297{
4298 float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
4299 float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer;
4300 float const * end_output = encode_buffer + width_times_channels;
4302 do {
4303 float alpha = input[1];
4304 encode[0] = input[0];
4305 if ( alpha >= stbir__small_float )
4306 encode[0] = input[2] / alpha;
4307 encode[1] = alpha;
4309 input += 3;
4310 encode += 2;
4311 } while ( encode < end_output );
4312}
4314static void stbir__simple_alpha_weight_4ch( float * decode_buffer, int width_times_channels )
4315{
4316 float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
4317 float const * end_decode = decode_buffer + width_times_channels;
4319 #ifdef STBIR_SIMD
4320 {
4321 decode += 2 * stbir__simdfX_float_count;
4322 STBIR_NO_UNROLL_LOOP_START
4323 while ( decode <= end_decode )
4324 {
4325 stbir__simdfX d0,a0,d1,a1;
4326 STBIR_NO_UNROLL(decode);
4327 stbir__simdfX_load( d0, decode-2*stbir__simdfX_float_count );
4328 stbir__simdfX_load( d1, decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count );
4329 stbir__simdfX_aaa1( a0, d0, STBIR_onesX );
4330 stbir__simdfX_aaa1( a1, d1, STBIR_onesX );
4331 stbir__simdfX_mult( d0, d0, a0 );
4332 stbir__simdfX_mult( d1, d1, a1 );
4333 stbir__simdfX_store ( decode-2*stbir__simdfX_float_count, d0 );
4334 stbir__simdfX_store ( decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count, d1 );
4335 decode += 2 * stbir__simdfX_float_count;
4336 }
4337 decode -= 2 * stbir__simdfX_float_count;
4339 // last few remnant pixels
4340 #ifdef STBIR_SIMD8
4341 STBIR_NO_UNROLL_LOOP_START
4342 while ( decode < end_decode )
4343 #else
4344 if ( decode < end_decode )
4345 #endif
4346 {
4347 stbir__simdf d,a;
4348 stbir__simdf_load( d, decode );
4349 stbir__simdf_aaa1( a, d, STBIR__CONSTF(STBIR_ones) );
4350 stbir__simdf_mult( d, d, a );
4351 stbir__simdf_store ( decode, d );
4352 decode += 4;
4353 }
4354 }
4356 #else
4358 while( decode < end_decode )
4359 {
4360 float alpha = decode[3];
4361 decode[0] *= alpha;
4362 decode[1] *= alpha;
4363 decode[2] *= alpha;
4364 decode += 4;
4365 }
4367 #endif
4368}
4370static void stbir__simple_alpha_weight_2ch( float * decode_buffer, int width_times_channels )
4371{
4372 float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
4373 float const * end_decode = decode_buffer + width_times_channels;
4375 #ifdef STBIR_SIMD
4376 decode += 2 * stbir__simdfX_float_count;
4377 STBIR_NO_UNROLL_LOOP_START
4378 while ( decode <= end_decode )
4379 {
4380 stbir__simdfX d0,a0,d1,a1;
4381 STBIR_NO_UNROLL(decode);
4382 stbir__simdfX_load( d0, decode-2*stbir__simdfX_float_count );
4383 stbir__simdfX_load( d1, decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count );
4384 stbir__simdfX_a1a1( a0, d0, STBIR_onesX );
4385 stbir__simdfX_a1a1( a1, d1, STBIR_onesX );
4386 stbir__simdfX_mult( d0, d0, a0 );
4387 stbir__simdfX_mult( d1, d1, a1 );
4388 stbir__simdfX_store ( decode-2*stbir__simdfX_float_count, d0 );
4389 stbir__simdfX_store ( decode-2*stbir__simdfX_float_count+stbir__simdfX_float_count, d1 );
4390 decode += 2 * stbir__simdfX_float_count;
4391 }
4392 decode -= 2 * stbir__simdfX_float_count;
4393 #endif
4395 STBIR_SIMD_NO_UNROLL_LOOP_START
4396 while( decode < end_decode )
4397 {
4398 float alpha = decode[1];
4399 STBIR_SIMD_NO_UNROLL(decode);
4400 decode[0] *= alpha;
4401 decode += 2;
4402 }
4403}
4405static void stbir__simple_alpha_unweight_4ch( float * encode_buffer, int width_times_channels )
4406{
4407 float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
4408 float const * end_output = encode_buffer + width_times_channels;
4410 STBIR_SIMD_NO_UNROLL_LOOP_START
4411 do {
4412 float alpha = encode[3];
4414#ifdef STBIR_SIMD
4415 stbir__simdf i,ia;
4416 STBIR_SIMD_NO_UNROLL(encode);
4417 if ( alpha >= stbir__small_float )
4418 {
4419 stbir__simdf_load1frep4( ia, 1.0f / alpha );
4420 stbir__simdf_load( i, encode );
4421 stbir__simdf_mult( i, i, ia );
4422 stbir__simdf_store( encode, i );
4423 encode[3] = alpha;
4424 }
4425#else
4426 if ( alpha >= stbir__small_float )
4427 {
4428 float ialpha = 1.0f / alpha;
4429 encode[0] *= ialpha;
4430 encode[1] *= ialpha;
4431 encode[2] *= ialpha;
4432 }
4433#endif
4434 encode += 4;
4435 } while ( encode < end_output );
4436}
4438static void stbir__simple_alpha_unweight_2ch( float * encode_buffer, int width_times_channels )
4439{
4440 float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
4441 float const * end_output = encode_buffer + width_times_channels;
4443 do {
4444 float alpha = encode[1];
4445 if ( alpha >= stbir__small_float )
4446 encode[0] /= alpha;
4447 encode += 2;
4448 } while ( encode < end_output );
4449}
4452// only used in RGB->BGR or BGR->RGB
4453static void stbir__simple_flip_3ch( float * decode_buffer, int width_times_channels )
4454{
4455 float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
4456 float const * end_decode = decode_buffer + width_times_channels;
4458#ifdef STBIR_SIMD
4459 #ifdef stbir__simdf_swiz2 // do we have two argument swizzles?
4460 end_decode -= 12;
4461 STBIR_NO_UNROLL_LOOP_START
4462 while( decode <= end_decode )
4463 {
4464 // on arm64 8 instructions, no overlapping stores
4465 stbir__simdf a,b,c,na,nb;
4466 STBIR_SIMD_NO_UNROLL(decode);
4467 stbir__simdf_load( a, decode );
4468 stbir__simdf_load( b, decode+4 );
4469 stbir__simdf_load( c, decode+8 );
4471 na = stbir__simdf_swiz2( a, b, 2, 1, 0, 5 );
4472 b = stbir__simdf_swiz2( a, b, 4, 3, 6, 7 );
4473 nb = stbir__simdf_swiz2( b, c, 0, 1, 4, 3 );
4474 c = stbir__simdf_swiz2( b, c, 2, 7, 6, 5 );
4476 stbir__simdf_store( decode, na );
4477 stbir__simdf_store( decode+4, nb );
4478 stbir__simdf_store( decode+8, c );
4479 decode += 12;
4480 }
4481 end_decode += 12;
4482 #else
4483 end_decode -= 24;
4484 STBIR_NO_UNROLL_LOOP_START
4485 while( decode <= end_decode )
4486 {
4487 // 26 instructions on x64
4488 stbir__simdf a,b,c,d,e,f,g;
4489 float i21, i23;
4490 STBIR_SIMD_NO_UNROLL(decode);
4491 stbir__simdf_load( a, decode );
4492 stbir__simdf_load( b, decode+3 );
4493 stbir__simdf_load( c, decode+6 );
4494 stbir__simdf_load( d, decode+9 );
4495 stbir__simdf_load( e, decode+12 );
4496 stbir__simdf_load( f, decode+15 );
4497 stbir__simdf_load( g, decode+18 );
4499 a = stbir__simdf_swiz( a, 2, 1, 0, 3 );
4500 b = stbir__simdf_swiz( b, 2, 1, 0, 3 );
4501 c = stbir__simdf_swiz( c, 2, 1, 0, 3 );
4502 d = stbir__simdf_swiz( d, 2, 1, 0, 3 );
4503 e = stbir__simdf_swiz( e, 2, 1, 0, 3 );
4504 f = stbir__simdf_swiz( f, 2, 1, 0, 3 );
4505 g = stbir__simdf_swiz( g, 2, 1, 0, 3 );
4507 // stores overlap, so they need to be in order
4508 stbir__simdf_store( decode, a );
4509 i21 = decode[21];
4510 stbir__simdf_store( decode+3, b );
4511 i23 = decode[23];
4512 stbir__simdf_store( decode+6, c );
4513 stbir__simdf_store( decode+9, d );
4514 stbir__simdf_store( decode+12, e );
4515 stbir__simdf_store( decode+15, f );
4516 stbir__simdf_store( decode+18, g );
4517 decode[21] = i23;
4518 decode[23] = i21;
4519 decode += 24;
4520 }
4521 end_decode += 24;
4522 #endif
4523#else
4524 end_decode -= 12;
4525 STBIR_NO_UNROLL_LOOP_START
4526 while( decode <= end_decode )
4527 {
4528 // 16 instructions
4529 float t0,t1,t2,t3;
4530 STBIR_NO_UNROLL(decode);
4531 t0 = decode[0]; t1 = decode[3]; t2 = decode[6]; t3 = decode[9];
4532 decode[0] = decode[2]; decode[3] = decode[5]; decode[6] = decode[8]; decode[9] = decode[11];
4533 decode[2] = t0; decode[5] = t1; decode[8] = t2; decode[11] = t3;
4534 decode += 12;
4535 }
4536 end_decode += 12;
4537#endif
4539 STBIR_NO_UNROLL_LOOP_START
4540 while( decode < end_decode )
4541 {
4542 float t = decode[0];
4543 STBIR_NO_UNROLL(decode);
4544 decode[0] = decode[2];
4545 decode[2] = t;
4546 decode += 3;
4547 }
4548}
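/*
   For reference, the effect of stbir__simple_flip_3ch on the buffer can be
   sketched as a plain per-pixel swap of the first and third channel, which is
   what the final remainder loop does. sketch_flip_3ch is a hypothetical name;
   the unrolled and SIMD paths above are assumed to produce the same result,
   only faster.

```c
#include <assert.h>

// Swap channel 0 and channel 2 of every 3-channel pixel (RGB <-> BGR), in place.
static void sketch_flip_3ch( float * pixels, int width_times_channels )
{
  float * p = pixels;
  float const * end = pixels + width_times_channels;

  assert( ( width_times_channels % 3 ) == 0 );

  while ( p < end )
  {
    float t = p[0];
    p[0] = p[2];
    p[2] = t;
    p += 3;
  }
}
```

   Applying the flip twice restores the original order, which is why one
   routine serves both RGB->BGR and BGR->RGB.
*/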
4552static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float * output_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO )
4553{
4554 int channels = stbir_info->channels;
4555 int effective_channels = stbir_info->effective_channels;
4556 int input_sample_in_bytes = stbir__type_size[stbir_info->input_type] * channels;
4557 stbir_edge edge_horizontal = stbir_info->horizontal.edge;
4558 stbir_edge edge_vertical = stbir_info->vertical.edge;
4559 int row = stbir__edge_wrap(edge_vertical, n, stbir_info->vertical.scale_info.input_full_size);
4560 const void* input_plane_data = ( (char *) stbir_info->input_data ) + (size_t)row * (size_t) stbir_info->input_stride_bytes;
4561 stbir__span const * spans = stbir_info->scanline_extents.spans;
4562 float* full_decode_buffer = output_buffer - stbir_info->scanline_extents.conservative.n0 * effective_channels;
4564 // if we are on edge_zero, and we get in here with an out of bounds n, then stbir__calculate_filters has failed
4565 STBIR_ASSERT( !(edge_vertical == STBIR_EDGE_ZERO && (n < 0 || n >= stbir_info->vertical.scale_info.input_full_size)) );
4567 do
4568 {
4569 float * decode_buffer;
4570 void const * input_data;
4571 float * end_decode;
4572 int width_times_channels;
4573 int width;
4575 if ( spans->n1 < spans->n0 )
4576 break;
4578 width = spans->n1 + 1 - spans->n0;
4579 decode_buffer = full_decode_buffer + spans->n0 * effective_channels;
4580 end_decode = full_decode_buffer + ( spans->n1 + 1 ) * effective_channels;
4581 width_times_channels = width * channels;
4583 // read directly out of input plane by default
4584 input_data = ( (char*)input_plane_data ) + spans->pixel_offset_for_input * input_sample_in_bytes;
4586 // if we have an input callback, call it to get the input data
4587 if ( stbir_info->in_pixels_cb )
4588 {
4589 // call the callback with a temp buffer (that they can choose to use or not). the temp is just right aligned memory in the decode_buffer itself
4590 input_data = stbir_info->in_pixels_cb( ( (char*) end_decode ) - ( width * input_sample_in_bytes ), input_plane_data, width, spans->pixel_offset_for_input, row, stbir_info->user_data );
4591 }
4593 STBIR_PROFILE_START( decode );
4594 // convert the pixels into the float decode_buffer (we index from end_decode, so that when channels<effective_channels, we are right justified in the buffer)
4595 stbir_info->decode_pixels( (float*)end_decode - width_times_channels, width_times_channels, input_data );
4596 STBIR_PROFILE_END( decode );
4598 if (stbir_info->alpha_weight)
4599 {
4600 STBIR_PROFILE_START( alpha );
4601 stbir_info->alpha_weight( decode_buffer, width_times_channels );
4602 STBIR_PROFILE_END( alpha );
4603 }
4605 ++spans;
4606 } while ( spans <= ( &stbir_info->scanline_extents.spans[1] ) );
4608 // handle the edge_wrap filter (all other types are handled back out at the calculate_filter stage)
4609 // basically the idea here is that if we have the whole scanline in memory, we don't redecode the
4610 // wrapped edge pixels, and instead just memcpy them from the scanline into the edge positions
4611 if ( ( edge_horizontal == STBIR_EDGE_WRAP ) && ( stbir_info->scanline_extents.edge_sizes[0] | stbir_info->scanline_extents.edge_sizes[1] ) )
4612 {
4613 // this code only runs if we're in edge_wrap, and we're doing the entire scanline
4614 int e, start_x[2];
4615 int input_full_size = stbir_info->horizontal.scale_info.input_full_size;
4617 start_x[0] = -stbir_info->scanline_extents.edge_sizes[0]; // left edge start x
4618 start_x[1] = input_full_size; // right edge
4620 for( e = 0; e < 2 ; e++ )
4621 {
4622 // do each margin
4623 int margin = stbir_info->scanline_extents.edge_sizes[e];
4624 if ( margin )
4625 {
4626 int x = start_x[e];
4627 float * marg = full_decode_buffer + x * effective_channels;
4628 float const * src = full_decode_buffer + stbir__edge_wrap(edge_horizontal, x, input_full_size) * effective_channels;
4629 STBIR_MEMCPY( marg, src, margin * effective_channels * sizeof(float) );
4630 }
4631 }
4632 }
4633}
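
// A minimal standalone sketch of the wrap-margin copy above, assuming a scanline
// pointer with at least left_margin pixels of writable space before it and
// right_margin pixels after it. The names wrap_index and fill_wrap_margins are
// hypothetical and not part of this library; wrap_index only stands in for what
// stbir__edge_wrap is assumed to do in STBIR_EDGE_WRAP mode.

```c
#include <assert.h>
#include <string.h>

// hypothetical stand-in for stbir__edge_wrap in wrap mode: map x into [0, n)
static int wrap_index( int x, int n )
{
  int m = x % n;
  return ( m < 0 ) ? ( m + n ) : m;
}

// scanline points at pixel 0; the margins live directly before and after it
static void fill_wrap_margins( float * scanline, int width, int channels, int left_margin, int right_margin )
{
  int e, start_x[2], margin[2];

  assert( width > 0 && channels > 0 );
  assert( left_margin >= 0 && left_margin <= width );    // keeps source and destination from overlapping
  assert( right_margin >= 0 && right_margin <= width );

  start_x[0] = -left_margin;  margin[0] = left_margin;   // left edge starts before pixel 0
  start_x[1] = width;         margin[1] = right_margin;  // right edge starts just past the last pixel

  for ( e = 0 ; e < 2 ; e++ )
  {
    if ( margin[e] )
    {
      float * dst = scanline + start_x[e] * channels;
      float const * src = scanline + wrap_index( start_x[e], width ) * channels;
      memcpy( dst, src, (size_t) margin[e] * (size_t) channels * sizeof(float) );
    }
  }
}
```

// For a 4 pixel, 1 channel scanline { 1, 2, 3, 4 } with margins of 2 on each side,
// the left margin becomes { 3, 4 } and the right margin becomes { 1, 2 }.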
4636//=================
4637// Do 1 channel horizontal routines
4639#ifdef STBIR_SIMD
4641#define stbir__1_coeff_only() \
4642 stbir__simdf tot,c; \
4643 STBIR_SIMD_NO_UNROLL(decode); \
4644 stbir__simdf_load1( c, hc ); \
4645 stbir__simdf_mult1_mem( tot, c, decode );
4647#define stbir__2_coeff_only() \
4648 stbir__simdf tot,c,d; \
4649 STBIR_SIMD_NO_UNROLL(decode); \
4650 stbir__simdf_load2z( c, hc ); \
4651 stbir__simdf_load2( d, decode ); \
4652 stbir__simdf_mult( tot, c, d ); \
4653 stbir__simdf_0123to1230( c, tot ); \
4654 stbir__simdf_add1( tot, tot, c );
4656#define stbir__3_coeff_only() \
4657 stbir__simdf tot,c,t; \
4658 STBIR_SIMD_NO_UNROLL(decode); \
4659 stbir__simdf_load( c, hc ); \
4660 stbir__simdf_mult_mem( tot, c, decode ); \
4661 stbir__simdf_0123to1230( c, tot ); \
4662 stbir__simdf_0123to2301( t, tot ); \
4663 stbir__simdf_add1( tot, tot, c ); \
4664 stbir__simdf_add1( tot, tot, t );
4666#define stbir__store_output_tiny() \
4667 stbir__simdf_store1( output, tot ); \
4668 horizontal_coefficients += coefficient_width; \
4669 ++horizontal_contributors; \
4670 output += 1;
4672#define stbir__4_coeff_start() \
4673 stbir__simdf tot,c; \
4674 STBIR_SIMD_NO_UNROLL(decode); \
4675 stbir__simdf_load( c, hc ); \
    stbir__simdf_mult_mem( tot, c, decode );
4678#define stbir__4_coeff_continue_from_4( ofs ) \
4679 STBIR_SIMD_NO_UNROLL(decode); \
4680 stbir__simdf_load( c, hc + (ofs) ); \
4681 stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) );
4683#define stbir__1_coeff_remnant( ofs ) \
4684 { stbir__simdf d; \
4685 stbir__simdf_load1z( c, hc + (ofs) ); \
4686 stbir__simdf_load1( d, decode + (ofs) ); \
4687 stbir__simdf_madd( tot, tot, d, c ); }
4689#define stbir__2_coeff_remnant( ofs ) \
4690 { stbir__simdf d; \
4691 stbir__simdf_load2z( c, hc+(ofs) ); \
4692 stbir__simdf_load2( d, decode+(ofs) ); \
4693 stbir__simdf_madd( tot, tot, d, c ); }
4695#define stbir__3_coeff_setup() \
4696 stbir__simdf mask; \
4697 stbir__simdf_load( mask, STBIR_mask + 3 );
4699#define stbir__3_coeff_remnant( ofs ) \
4700 stbir__simdf_load( c, hc+(ofs) ); \
4701 stbir__simdf_and( c, c, mask ); \
4702 stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) );
4704#define stbir__store_output() \
4705 stbir__simdf_0123to2301( c, tot ); \
4706 stbir__simdf_add( tot, tot, c ); \
4707 stbir__simdf_0123to1230( c, tot ); \
4708 stbir__simdf_add1( tot, tot, c ); \
4709 stbir__simdf_store1( output, tot ); \
4710 horizontal_coefficients += coefficient_width; \
4711 ++horizontal_contributors; \
4712 output += 1;
4714#else
4716#define stbir__1_coeff_only() \
4717 float tot; \
4718 tot = decode[0]*hc[0];
4720#define stbir__2_coeff_only() \
4721 float tot; \
4722 tot = decode[0] * hc[0]; \
4723 tot += decode[1] * hc[1];
4725#define stbir__3_coeff_only() \
4726 float tot; \
4727 tot = decode[0] * hc[0]; \
4728 tot += decode[1] * hc[1]; \
4729 tot += decode[2] * hc[2];
4731#define stbir__store_output_tiny() \
4732 output[0] = tot; \
4733 horizontal_coefficients += coefficient_width; \
4734 ++horizontal_contributors; \
4735 output += 1;
4737#define stbir__4_coeff_start() \
4738 float tot0,tot1,tot2,tot3; \
4739 tot0 = decode[0] * hc[0]; \
4740 tot1 = decode[1] * hc[1]; \
4741 tot2 = decode[2] * hc[2]; \
4742 tot3 = decode[3] * hc[3];
4744#define stbir__4_coeff_continue_from_4( ofs ) \
4745 tot0 += decode[0+(ofs)] * hc[0+(ofs)]; \
4746 tot1 += decode[1+(ofs)] * hc[1+(ofs)]; \
4747 tot2 += decode[2+(ofs)] * hc[2+(ofs)]; \
4748 tot3 += decode[3+(ofs)] * hc[3+(ofs)];
4750#define stbir__1_coeff_remnant( ofs ) \
4751 tot0 += decode[0+(ofs)] * hc[0+(ofs)];
4753#define stbir__2_coeff_remnant( ofs ) \
4754 tot0 += decode[0+(ofs)] * hc[0+(ofs)]; \
    tot1 += decode[1+(ofs)] * hc[1+(ofs)];
4757#define stbir__3_coeff_remnant( ofs ) \
4758 tot0 += decode[0+(ofs)] * hc[0+(ofs)]; \
4759 tot1 += decode[1+(ofs)] * hc[1+(ofs)]; \
4760 tot2 += decode[2+(ofs)] * hc[2+(ofs)];
4762#define stbir__store_output() \
4763 output[0] = (tot0+tot2)+(tot1+tot3); \
4764 horizontal_coefficients += coefficient_width; \
4765 ++horizontal_contributors; \
4766 output += 1;
4768#endif
4770#define STBIR__horizontal_channels 1
4771#define STB_IMAGE_RESIZE_DO_HORIZONTALS
4772#include STBIR__HEADER_FILENAME
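
// The scalar (non-SIMD) 1 channel macros above keep four running totals and
// combine them as (tot0+tot2)+(tot1+tot3). A rough standalone sketch of that
// pattern for filters with four or more coefficients follows. The name
// dot_1ch_sketch is hypothetical, and the loop is an assumption standing in
// for the unrolled macro expansions, not the library's actual code path.

```c
#include <assert.h>

static float dot_1ch_sketch( float const * decode, float const * hc, int n )
{
  float tot[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
  int i;

  assert( n >= 4 );  // shorter filters use the stbir__N_coeff_only macros instead

  // coefficient i accumulates into total (i mod 4), as in the start/continue/remnant macros
  for ( i = 0 ; i < n ; i++ )
    tot[ i & 3 ] += decode[i] * hc[i];

  // same pairing as the scalar stbir__store_output
  return ( tot[0] + tot[2] ) + ( tot[1] + tot[3] );
}
```

// Keeping four independent totals shortens the dependency chain between adds;
// the fixed pairing at the end keeps the summation order well defined.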
4775//=================
4776// Do 2 channel horizontal routines
4778#ifdef STBIR_SIMD
4780#define stbir__1_coeff_only() \
4781 stbir__simdf tot,c,d; \
4782 STBIR_SIMD_NO_UNROLL(decode); \
4783 stbir__simdf_load1z( c, hc ); \
4784 stbir__simdf_0123to0011( c, c ); \
4785 stbir__simdf_load2( d, decode ); \
4786 stbir__simdf_mult( tot, d, c );
4788#define stbir__2_coeff_only() \
4789 stbir__simdf tot,c; \
4790 STBIR_SIMD_NO_UNROLL(decode); \
4791 stbir__simdf_load2( c, hc ); \
4792 stbir__simdf_0123to0011( c, c ); \
4793 stbir__simdf_mult_mem( tot, c, decode );
4795#define stbir__3_coeff_only() \
4796 stbir__simdf tot,c,cs,d; \
4797 STBIR_SIMD_NO_UNROLL(decode); \
4798 stbir__simdf_load( cs, hc ); \
4799 stbir__simdf_0123to0011( c, cs ); \
4800 stbir__simdf_mult_mem( tot, c, decode ); \
4801 stbir__simdf_0123to2222( c, cs ); \
4802 stbir__simdf_load2z( d, decode+4 ); \
4803 stbir__simdf_madd( tot, tot, d, c );
4805#define stbir__store_output_tiny() \
4806 stbir__simdf_0123to2301( c, tot ); \
4807 stbir__simdf_add( tot, tot, c ); \
4808 stbir__simdf_store2( output, tot ); \
4809 horizontal_coefficients += coefficient_width; \
4810 ++horizontal_contributors; \
4811 output += 2;
4813#ifdef STBIR_SIMD8
4815#define stbir__4_coeff_start() \
4816 stbir__simdf8 tot0,c,cs; \
4817 STBIR_SIMD_NO_UNROLL(decode); \
4818 stbir__simdf8_load4b( cs, hc ); \
4819 stbir__simdf8_0123to00112233( c, cs ); \
4820 stbir__simdf8_mult_mem( tot0, c, decode );
4822#define stbir__4_coeff_continue_from_4( ofs ) \
4823 STBIR_SIMD_NO_UNROLL(decode); \
4824 stbir__simdf8_load4b( cs, hc + (ofs) ); \
4825 stbir__simdf8_0123to00112233( c, cs ); \
4826 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*2 );
4828#define stbir__1_coeff_remnant( ofs ) \
4829 { stbir__simdf t,d; \
4830 stbir__simdf_load1z( t, hc + (ofs) ); \
4831 stbir__simdf_load2( d, decode + (ofs) * 2 ); \
4832 stbir__simdf_0123to0011( t, t ); \
4833 stbir__simdf_mult( t, t, d ); \
4834 stbir__simdf8_add4( tot0, tot0, t ); }
4836#define stbir__2_coeff_remnant( ofs ) \
4837 { stbir__simdf t; \
4838 stbir__simdf_load2( t, hc + (ofs) ); \
4839 stbir__simdf_0123to0011( t, t ); \
4840 stbir__simdf_mult_mem( t, t, decode+(ofs)*2 ); \
4841 stbir__simdf8_add4( tot0, tot0, t ); }
4843#define stbir__3_coeff_remnant( ofs ) \
4844 { stbir__simdf8 d; \
4845 stbir__simdf8_load4b( cs, hc + (ofs) ); \
4846 stbir__simdf8_0123to00112233( c, cs ); \
4847 stbir__simdf8_load6z( d, decode+(ofs)*2 ); \
4848 stbir__simdf8_madd( tot0, tot0, c, d ); }
4850#define stbir__store_output() \
4851 { stbir__simdf t,d; \
4852 stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 ); \
4853 stbir__simdf_0123to2301( d, t ); \
4854 stbir__simdf_add( t, t, d ); \
4855 stbir__simdf_store2( output, t ); \
4856 horizontal_coefficients += coefficient_width; \
4857 ++horizontal_contributors; \
4858 output += 2; }
4860#else
4862#define stbir__4_coeff_start() \
4863 stbir__simdf tot0,tot1,c,cs; \
4864 STBIR_SIMD_NO_UNROLL(decode); \
4865 stbir__simdf_load( cs, hc ); \
4866 stbir__simdf_0123to0011( c, cs ); \
4867 stbir__simdf_mult_mem( tot0, c, decode ); \
4868 stbir__simdf_0123to2233( c, cs ); \
4869 stbir__simdf_mult_mem( tot1, c, decode+4 );
4871#define stbir__4_coeff_continue_from_4( ofs ) \
4872 STBIR_SIMD_NO_UNROLL(decode); \
4873 stbir__simdf_load( cs, hc + (ofs) ); \
4874 stbir__simdf_0123to0011( c, cs ); \
4875 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); \
4876 stbir__simdf_0123to2233( c, cs ); \
4877 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*2+4 );
4879#define stbir__1_coeff_remnant( ofs ) \
4880 { stbir__simdf d; \
4881 stbir__simdf_load1z( cs, hc + (ofs) ); \
4882 stbir__simdf_0123to0011( c, cs ); \
4883 stbir__simdf_load2( d, decode + (ofs) * 2 ); \
4884 stbir__simdf_madd( tot0, tot0, d, c ); }
4886#define stbir__2_coeff_remnant( ofs ) \
4887 stbir__simdf_load2( cs, hc + (ofs) ); \
4888 stbir__simdf_0123to0011( c, cs ); \
4889 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 );
4891#define stbir__3_coeff_remnant( ofs ) \
4892 { stbir__simdf d; \
4893 stbir__simdf_load( cs, hc + (ofs) ); \
4894 stbir__simdf_0123to0011( c, cs ); \
4895 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); \
4896 stbir__simdf_0123to2222( c, cs ); \
4897 stbir__simdf_load2z( d, decode + (ofs) * 2 + 4 ); \
4898 stbir__simdf_madd( tot1, tot1, d, c ); }
4900#define stbir__store_output() \
4901 stbir__simdf_add( tot0, tot0, tot1 ); \
4902 stbir__simdf_0123to2301( c, tot0 ); \
4903 stbir__simdf_add( tot0, tot0, c ); \
4904 stbir__simdf_store2( output, tot0 ); \
4905 horizontal_coefficients += coefficient_width; \
4906 ++horizontal_contributors; \
4907 output += 2;
4909#endif
4911#else
4913#define stbir__1_coeff_only() \
4914 float tota,totb,c; \
4915 c = hc[0]; \
4916 tota = decode[0]*c; \
4917 totb = decode[1]*c;
4919#define stbir__2_coeff_only() \
4920 float tota,totb,c; \
4921 c = hc[0]; \
4922 tota = decode[0]*c; \
4923 totb = decode[1]*c; \
4924 c = hc[1]; \
4925 tota += decode[2]*c; \
4926 totb += decode[3]*c;
// this unusual add order (coefficient 0, then 2, then 1) matches the summation order of the simd path
4929#define stbir__3_coeff_only() \
4930 float tota,totb,c; \
4931 c = hc[0]; \
4932 tota = decode[0]*c; \
4933 totb = decode[1]*c; \
4934 c = hc[2]; \
4935 tota += decode[4]*c; \
4936 totb += decode[5]*c; \
4937 c = hc[1]; \
4938 tota += decode[2]*c; \
4939 totb += decode[3]*c;
4941#define stbir__store_output_tiny() \
4942 output[0] = tota; \
4943 output[1] = totb; \
4944 horizontal_coefficients += coefficient_width; \
4945 ++horizontal_contributors; \
4946 output += 2;
4948#define stbir__4_coeff_start() \
4949 float tota0,tota1,tota2,tota3,totb0,totb1,totb2,totb3,c; \
4950 c = hc[0]; \
4951 tota0 = decode[0]*c; \
4952 totb0 = decode[1]*c; \
4953 c = hc[1]; \
4954 tota1 = decode[2]*c; \
4955 totb1 = decode[3]*c; \
4956 c = hc[2]; \
4957 tota2 = decode[4]*c; \
4958 totb2 = decode[5]*c; \
4959 c = hc[3]; \
4960 tota3 = decode[6]*c; \
4961 totb3 = decode[7]*c;
4963#define stbir__4_coeff_continue_from_4( ofs ) \
4964 c = hc[0+(ofs)]; \
4965 tota0 += decode[0+(ofs)*2]*c; \
4966 totb0 += decode[1+(ofs)*2]*c; \
4967 c = hc[1+(ofs)]; \
4968 tota1 += decode[2+(ofs)*2]*c; \
4969 totb1 += decode[3+(ofs)*2]*c; \
4970 c = hc[2+(ofs)]; \
4971 tota2 += decode[4+(ofs)*2]*c; \
4972 totb2 += decode[5+(ofs)*2]*c; \
4973 c = hc[3+(ofs)]; \
4974 tota3 += decode[6+(ofs)*2]*c; \
4975 totb3 += decode[7+(ofs)*2]*c;
4977#define stbir__1_coeff_remnant( ofs ) \
4978 c = hc[0+(ofs)]; \
4979 tota0 += decode[0+(ofs)*2] * c; \
4980 totb0 += decode[1+(ofs)*2] * c;
4982#define stbir__2_coeff_remnant( ofs ) \
4983 c = hc[0+(ofs)]; \
4984 tota0 += decode[0+(ofs)*2] * c; \
4985 totb0 += decode[1+(ofs)*2] * c; \
4986 c = hc[1+(ofs)]; \
4987 tota1 += decode[2+(ofs)*2] * c; \
4988 totb1 += decode[3+(ofs)*2] * c;
4990#define stbir__3_coeff_remnant( ofs ) \
4991 c = hc[0+(ofs)]; \
4992 tota0 += decode[0+(ofs)*2] * c; \
4993 totb0 += decode[1+(ofs)*2] * c; \
4994 c = hc[1+(ofs)]; \
4995 tota1 += decode[2+(ofs)*2] * c; \
4996 totb1 += decode[3+(ofs)*2] * c; \
4997 c = hc[2+(ofs)]; \
4998 tota2 += decode[4+(ofs)*2] * c; \
4999 totb2 += decode[5+(ofs)*2] * c;
5001#define stbir__store_output() \
5002 output[0] = (tota0+tota2)+(tota1+tota3); \
5003 output[1] = (totb0+totb2)+(totb1+totb3); \
5004 horizontal_coefficients += coefficient_width; \
5005 ++horizontal_contributors; \
5006 output += 2;
5008#endif
5010#define STBIR__horizontal_channels 2
5011#define STB_IMAGE_RESIZE_DO_HORIZONTALS
5012#include STBIR__HEADER_FILENAME
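
// To illustrate why the scalar 2 channel stbir__3_coeff_only adds coefficient 2
// before coefficient 1, here is a small sketch that emulates the four SIMD lanes
// with a plain array and compares it with the scalar order. Both function names
// are hypothetical, and the lane emulation is an assumption about how the 4-wide
// macros behave (no fused multiply-add is modeled).

```c
#include <assert.h>

// scalar order used above: coefficient 0, then 2, then 1
static void three_coeff_2ch_scalar( float const * decode, float const * hc, float * out )
{
  float tota, totb, c;
  assert( decode && hc && out );
  c = hc[0];  tota  = decode[0]*c;  totb  = decode[1]*c;
  c = hc[2];  tota += decode[4]*c;  totb += decode[5]*c;
  c = hc[1];  tota += decode[2]*c;  totb += decode[3]*c;
  out[0] = tota;
  out[1] = totb;
}

// emulation of the 4-lane path: lanes 0,1 hold pixel 0 and lanes 2,3 hold pixel 1
static void three_coeff_2ch_lanes( float const * decode, float const * hc, float * out )
{
  float lane[4];
  assert( decode && hc && out );
  lane[0] = decode[0]*hc[0];   // coefficients expanded as 0,0,1,1
  lane[1] = decode[1]*hc[0];
  lane[2] = decode[2]*hc[1];
  lane[3] = decode[3]*hc[1];
  lane[0] += decode[4]*hc[2];  // third pixel lands in the low two lanes only
  lane[1] += decode[5]*hc[2];
  out[0] = lane[0] + lane[2];  // swap halves and add, then store two floats
  out[1] = lane[1] + lane[3];
}
```

// Because float addition is not associative, summing in 0, 1, 2 order could
// differ in the last bits; using the same order in both paths avoids that.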
5015//=================
5016// Do 3 channel horizontal routines
5018#ifdef STBIR_SIMD
5020#define stbir__1_coeff_only() \
5021 stbir__simdf tot,c,d; \
5022 STBIR_SIMD_NO_UNROLL(decode); \
5023 stbir__simdf_load1z( c, hc ); \
5024 stbir__simdf_0123to0001( c, c ); \
5025 stbir__simdf_load( d, decode ); \
5026 stbir__simdf_mult( tot, d, c );
5028#define stbir__2_coeff_only() \
5029 stbir__simdf tot,c,cs,d; \
5030 STBIR_SIMD_NO_UNROLL(decode); \
5031 stbir__simdf_load2( cs, hc ); \
5032 stbir__simdf_0123to0000( c, cs ); \
5033 stbir__simdf_load( d, decode ); \
5034 stbir__simdf_mult( tot, d, c ); \
5035 stbir__simdf_0123to1111( c, cs ); \
5036 stbir__simdf_load( d, decode+3 ); \
5037 stbir__simdf_madd( tot, tot, d, c );
5039#define stbir__3_coeff_only() \
5040 stbir__simdf tot,c,d,cs; \
5041 STBIR_SIMD_NO_UNROLL(decode); \
5042 stbir__simdf_load( cs, hc ); \
5043 stbir__simdf_0123to0000( c, cs ); \
5044 stbir__simdf_load( d, decode ); \
5045 stbir__simdf_mult( tot, d, c ); \
5046 stbir__simdf_0123to1111( c, cs ); \
5047 stbir__simdf_load( d, decode+3 ); \
5048 stbir__simdf_madd( tot, tot, d, c ); \
5049 stbir__simdf_0123to2222( c, cs ); \
5050 stbir__simdf_load( d, decode+6 ); \
5051 stbir__simdf_madd( tot, tot, d, c );
5053#define stbir__store_output_tiny() \
5054 stbir__simdf_store2( output, tot ); \
5055 stbir__simdf_0123to2301( tot, tot ); \
5056 stbir__simdf_store1( output+2, tot ); \
5057 horizontal_coefficients += coefficient_width; \
5058 ++horizontal_contributors; \
5059 output += 3;
5061#ifdef STBIR_SIMD8
// note: we load from decode offset by -1 so that the XXX and YYY pixels of the XXXYYY data land in different halves of the AVX register
5064#define stbir__4_coeff_start() \
5065 stbir__simdf8 tot0,tot1,c,cs; stbir__simdf t; \
5066 STBIR_SIMD_NO_UNROLL(decode); \
5067 stbir__simdf8_load4b( cs, hc ); \
5068 stbir__simdf8_0123to00001111( c, cs ); \
5069 stbir__simdf8_mult_mem( tot0, c, decode - 1 ); \
5070 stbir__simdf8_0123to22223333( c, cs ); \
5071 stbir__simdf8_mult_mem( tot1, c, decode+6 - 1 );
5073#define stbir__4_coeff_continue_from_4( ofs ) \
5074 STBIR_SIMD_NO_UNROLL(decode); \
5075 stbir__simdf8_load4b( cs, hc + (ofs) ); \
5076 stbir__simdf8_0123to00001111( c, cs ); \
5077 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \
5078 stbir__simdf8_0123to22223333( c, cs ); \
5079 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*3 + 6 - 1 );
5081#define stbir__1_coeff_remnant( ofs ) \
5082 STBIR_SIMD_NO_UNROLL(decode); \
5083 stbir__simdf_load1rep4( t, hc + (ofs) ); \
5084 stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*3 - 1 );
5086#define stbir__2_coeff_remnant( ofs ) \
5087 STBIR_SIMD_NO_UNROLL(decode); \
5088 stbir__simdf8_load4b( cs, hc + (ofs) - 2 ); \
5089 stbir__simdf8_0123to22223333( c, cs ); \
5090 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 );
#define stbir__3_coeff_remnant( ofs )                                   \
5093 STBIR_SIMD_NO_UNROLL(decode); \
5094 stbir__simdf8_load4b( cs, hc + (ofs) ); \
5095 stbir__simdf8_0123to00001111( c, cs ); \
5096 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \
5097 stbir__simdf8_0123to2222( t, cs ); \
5098 stbir__simdf8_madd_mem4( tot1, tot1, t, decode+(ofs)*3 + 6 - 1 );
5100#define stbir__store_output() \
5101 stbir__simdf8_add( tot0, tot0, tot1 ); \
5102 stbir__simdf_0123to1230( t, stbir__if_simdf8_cast_to_simdf4( tot0 ) ); \
5103 stbir__simdf8_add4halves( t, t, tot0 ); \
5104 horizontal_coefficients += coefficient_width; \
5105 ++horizontal_contributors; \
5106 output += 3; \
5107 if ( output < output_end ) \
5108 { \
5109 stbir__simdf_store( output-3, t ); \
5110 continue; \
5111 } \
5112 { stbir__simdf tt; stbir__simdf_0123to2301( tt, t ); \
5113 stbir__simdf_store2( output-3, t ); \
5114 stbir__simdf_store1( output+2-3, tt ); } \
5115 break;
5118#else
5120#define stbir__4_coeff_start() \
5121 stbir__simdf tot0,tot1,tot2,c,cs; \
5122 STBIR_SIMD_NO_UNROLL(decode); \
5123 stbir__simdf_load( cs, hc ); \
5124 stbir__simdf_0123to0001( c, cs ); \
5125 stbir__simdf_mult_mem( tot0, c, decode ); \
5126 stbir__simdf_0123to1122( c, cs ); \
5127 stbir__simdf_mult_mem( tot1, c, decode+4 ); \
5128 stbir__simdf_0123to2333( c, cs ); \
5129 stbir__simdf_mult_mem( tot2, c, decode+8 );
5131#define stbir__4_coeff_continue_from_4( ofs ) \
5132 STBIR_SIMD_NO_UNROLL(decode); \
5133 stbir__simdf_load( cs, hc + (ofs) ); \
5134 stbir__simdf_0123to0001( c, cs ); \
5135 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \
5136 stbir__simdf_0123to1122( c, cs ); \
5137 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \
5138 stbir__simdf_0123to2333( c, cs ); \
5139 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*3+8 );
5141#define stbir__1_coeff_remnant( ofs ) \
5142 STBIR_SIMD_NO_UNROLL(decode); \
5143 stbir__simdf_load1z( c, hc + (ofs) ); \
5144 stbir__simdf_0123to0001( c, c ); \
5145 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 );
5147#define stbir__2_coeff_remnant( ofs ) \
5148 { stbir__simdf d; \
5149 STBIR_SIMD_NO_UNROLL(decode); \
5150 stbir__simdf_load2z( cs, hc + (ofs) ); \
5151 stbir__simdf_0123to0001( c, cs ); \
5152 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \
5153 stbir__simdf_0123to1122( c, cs ); \
5154 stbir__simdf_load2z( d, decode+(ofs)*3+4 ); \
5155 stbir__simdf_madd( tot1, tot1, c, d ); }
5157#define stbir__3_coeff_remnant( ofs ) \
5158 { stbir__simdf d; \
5159 STBIR_SIMD_NO_UNROLL(decode); \
5160 stbir__simdf_load( cs, hc + (ofs) ); \
5161 stbir__simdf_0123to0001( c, cs ); \
5162 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \
5163 stbir__simdf_0123to1122( c, cs ); \
5164 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \
5165 stbir__simdf_0123to2222( c, cs ); \
5166 stbir__simdf_load1z( d, decode+(ofs)*3+8 ); \
5167 stbir__simdf_madd( tot2, tot2, c, d ); }
5169#define stbir__store_output() \
5170 stbir__simdf_0123ABCDto3ABx( c, tot0, tot1 ); \
5171 stbir__simdf_0123ABCDto23Ax( cs, tot1, tot2 ); \
5172 stbir__simdf_0123to1230( tot2, tot2 ); \
5173 stbir__simdf_add( tot0, tot0, cs ); \
5174 stbir__simdf_add( c, c, tot2 ); \
5175 stbir__simdf_add( tot0, tot0, c ); \
5176 horizontal_coefficients += coefficient_width; \
5177 ++horizontal_contributors; \
5178 output += 3; \
5179 if ( output < output_end ) \
5180 { \
5181 stbir__simdf_store( output-3, tot0 ); \
5182 continue; \
5183 } \
5184 stbir__simdf_0123to2301( tot1, tot0 ); \
5185 stbir__simdf_store2( output-3, tot0 ); \
5186 stbir__simdf_store1( output+2-3, tot1 ); \
5187 break;
5189#endif
5191#else
5193#define stbir__1_coeff_only() \
5194 float tot0, tot1, tot2, c; \
5195 c = hc[0]; \
5196 tot0 = decode[0]*c; \
5197 tot1 = decode[1]*c; \
5198 tot2 = decode[2]*c;
5200#define stbir__2_coeff_only() \
5201 float tot0, tot1, tot2, c; \
5202 c = hc[0]; \
5203 tot0 = decode[0]*c; \
5204 tot1 = decode[1]*c; \
5205 tot2 = decode[2]*c; \
5206 c = hc[1]; \
5207 tot0 += decode[3]*c; \
5208 tot1 += decode[4]*c; \
5209 tot2 += decode[5]*c;
5211#define stbir__3_coeff_only() \
5212 float tot0, tot1, tot2, c; \
5213 c = hc[0]; \
5214 tot0 = decode[0]*c; \
5215 tot1 = decode[1]*c; \
5216 tot2 = decode[2]*c; \
5217 c = hc[1]; \
5218 tot0 += decode[3]*c; \
5219 tot1 += decode[4]*c; \
5220 tot2 += decode[5]*c; \
5221 c = hc[2]; \
5222 tot0 += decode[6]*c; \
5223 tot1 += decode[7]*c; \
5224 tot2 += decode[8]*c;
5226#define stbir__store_output_tiny() \
5227 output[0] = tot0; \
5228 output[1] = tot1; \
5229 output[2] = tot2; \
5230 horizontal_coefficients += coefficient_width; \
5231 ++horizontal_contributors; \
5232 output += 3;
5234#define stbir__4_coeff_start() \
5235 float tota0,tota1,tota2,totb0,totb1,totb2,totc0,totc1,totc2,totd0,totd1,totd2,c; \
5236 c = hc[0]; \
5237 tota0 = decode[0]*c; \
5238 tota1 = decode[1]*c; \
5239 tota2 = decode[2]*c; \
5240 c = hc[1]; \
5241 totb0 = decode[3]*c; \
5242 totb1 = decode[4]*c; \
5243 totb2 = decode[5]*c; \
5244 c = hc[2]; \
5245 totc0 = decode[6]*c; \
5246 totc1 = decode[7]*c; \
5247 totc2 = decode[8]*c; \
5248 c = hc[3]; \
5249 totd0 = decode[9]*c; \
5250 totd1 = decode[10]*c; \
5251 totd2 = decode[11]*c;
5253#define stbir__4_coeff_continue_from_4( ofs ) \
5254 c = hc[0+(ofs)]; \
5255 tota0 += decode[0+(ofs)*3]*c; \
5256 tota1 += decode[1+(ofs)*3]*c; \
5257 tota2 += decode[2+(ofs)*3]*c; \
5258 c = hc[1+(ofs)]; \
5259 totb0 += decode[3+(ofs)*3]*c; \
5260 totb1 += decode[4+(ofs)*3]*c; \
5261 totb2 += decode[5+(ofs)*3]*c; \
5262 c = hc[2+(ofs)]; \
5263 totc0 += decode[6+(ofs)*3]*c; \
5264 totc1 += decode[7+(ofs)*3]*c; \
5265 totc2 += decode[8+(ofs)*3]*c; \
5266 c = hc[3+(ofs)]; \
5267 totd0 += decode[9+(ofs)*3]*c; \
5268 totd1 += decode[10+(ofs)*3]*c; \
5269 totd2 += decode[11+(ofs)*3]*c;
5271#define stbir__1_coeff_remnant( ofs ) \
5272 c = hc[0+(ofs)]; \
5273 tota0 += decode[0+(ofs)*3]*c; \
5274 tota1 += decode[1+(ofs)*3]*c; \
5275 tota2 += decode[2+(ofs)*3]*c;
5277#define stbir__2_coeff_remnant( ofs ) \
5278 c = hc[0+(ofs)]; \
5279 tota0 += decode[0+(ofs)*3]*c; \
5280 tota1 += decode[1+(ofs)*3]*c; \
5281 tota2 += decode[2+(ofs)*3]*c; \
5282 c = hc[1+(ofs)]; \
5283 totb0 += decode[3+(ofs)*3]*c; \
5284 totb1 += decode[4+(ofs)*3]*c; \
    totb2 += decode[5+(ofs)*3]*c;
5287#define stbir__3_coeff_remnant( ofs ) \
5288 c = hc[0+(ofs)]; \
5289 tota0 += decode[0+(ofs)*3]*c; \
5290 tota1 += decode[1+(ofs)*3]*c; \
5291 tota2 += decode[2+(ofs)*3]*c; \
5292 c = hc[1+(ofs)]; \
5293 totb0 += decode[3+(ofs)*3]*c; \
5294 totb1 += decode[4+(ofs)*3]*c; \
5295 totb2 += decode[5+(ofs)*3]*c; \
5296 c = hc[2+(ofs)]; \
5297 totc0 += decode[6+(ofs)*3]*c; \
5298 totc1 += decode[7+(ofs)*3]*c; \
5299 totc2 += decode[8+(ofs)*3]*c;
5301#define stbir__store_output() \
5302 output[0] = (tota0+totc0)+(totb0+totd0); \
5303 output[1] = (tota1+totc1)+(totb1+totd1); \
5304 output[2] = (tota2+totc2)+(totb2+totd2); \
5305 horizontal_coefficients += coefficient_width; \
5306 ++horizontal_contributors; \
5307 output += 3;
5309#endif
5311#define STBIR__horizontal_channels 3
5312#define STB_IMAGE_RESIZE_DO_HORIZONTALS
5313#include STBIR__HEADER_FILENAME
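
// The SIMD 3 channel stbir__store_output above writes a full 4-float vector for
// every pixel except the last, where it writes exactly 3 floats so that nothing
// past output_end is touched. A rough sketch of that store pattern follows,
// assuming the continue/break in the macro act on an enclosing per-pixel loop.
// The name store_3ch_sketch and the flat lanes array (4 floats per pixel, the
// 4th being a don't-care value) are hypothetical.

```c
#include <assert.h>
#include <string.h>

static void store_3ch_sketch( float * output, float const * output_end, float const * lanes )
{
  assert( output < output_end );
  assert( ( output_end - output ) % 3 == 0 );

  for(;;)
  {
    output += 3;
    if ( output < output_end )
    {
      // not the last pixel: store all 4 lanes; the 4th float spills into the
      //   next pixel's first slot and is overwritten on the next iteration
      memcpy( output - 3, lanes, 4 * sizeof(float) );
      lanes += 4;
      continue;
    }
    // last pixel: store only 3 floats so we never write past output_end
    memcpy( output - 3, lanes, 3 * sizeof(float) );
    break;
  }
}
```

// The wide store is cheaper than a 2+1 split store, which is why the split
// version is reserved for the final pixel of the row.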
5315//=================
5316// Do 4 channel horizontal routines
5318#ifdef STBIR_SIMD
5320#define stbir__1_coeff_only() \
5321 stbir__simdf tot,c; \
5322 STBIR_SIMD_NO_UNROLL(decode); \
5323 stbir__simdf_load1( c, hc ); \
5324 stbir__simdf_0123to0000( c, c ); \
5325 stbir__simdf_mult_mem( tot, c, decode );
5327#define stbir__2_coeff_only() \
5328 stbir__simdf tot,c,cs; \
5329 STBIR_SIMD_NO_UNROLL(decode); \
5330 stbir__simdf_load2( cs, hc ); \
5331 stbir__simdf_0123to0000( c, cs ); \
5332 stbir__simdf_mult_mem( tot, c, decode ); \
5333 stbir__simdf_0123to1111( c, cs ); \
5334 stbir__simdf_madd_mem( tot, tot, c, decode+4 );
5336#define stbir__3_coeff_only() \
5337 stbir__simdf tot,c,cs; \
5338 STBIR_SIMD_NO_UNROLL(decode); \
5339 stbir__simdf_load( cs, hc ); \
5340 stbir__simdf_0123to0000( c, cs ); \
5341 stbir__simdf_mult_mem( tot, c, decode ); \
5342 stbir__simdf_0123to1111( c, cs ); \
5343 stbir__simdf_madd_mem( tot, tot, c, decode+4 ); \
5344 stbir__simdf_0123to2222( c, cs ); \
5345 stbir__simdf_madd_mem( tot, tot, c, decode+8 );
5347#define stbir__store_output_tiny() \
5348 stbir__simdf_store( output, tot ); \
5349 horizontal_coefficients += coefficient_width; \
5350 ++horizontal_contributors; \
5351 output += 4;
5353#ifdef STBIR_SIMD8
5355#define stbir__4_coeff_start() \
5356 stbir__simdf8 tot0,c,cs; stbir__simdf t; \
5357 STBIR_SIMD_NO_UNROLL(decode); \
5358 stbir__simdf8_load4b( cs, hc ); \
5359 stbir__simdf8_0123to00001111( c, cs ); \
5360 stbir__simdf8_mult_mem( tot0, c, decode ); \
5361 stbir__simdf8_0123to22223333( c, cs ); \
5362 stbir__simdf8_madd_mem( tot0, tot0, c, decode+8 );
5364#define stbir__4_coeff_continue_from_4( ofs ) \
5365 STBIR_SIMD_NO_UNROLL(decode); \
5366 stbir__simdf8_load4b( cs, hc + (ofs) ); \
5367 stbir__simdf8_0123to00001111( c, cs ); \
5368 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \
5369 stbir__simdf8_0123to22223333( c, cs ); \
5370 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );
5372#define stbir__1_coeff_remnant( ofs ) \
5373 STBIR_SIMD_NO_UNROLL(decode); \
5374 stbir__simdf_load1rep4( t, hc + (ofs) ); \
5375 stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4 );
5377#define stbir__2_coeff_remnant( ofs ) \
5378 STBIR_SIMD_NO_UNROLL(decode); \
5379 stbir__simdf8_load4b( cs, hc + (ofs) - 2 ); \
5380 stbir__simdf8_0123to22223333( c, cs ); \
5381 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );
#define stbir__3_coeff_remnant( ofs )                                   \
5384 STBIR_SIMD_NO_UNROLL(decode); \
5385 stbir__simdf8_load4b( cs, hc + (ofs) ); \
5386 stbir__simdf8_0123to00001111( c, cs ); \
5387 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \
5388 stbir__simdf8_0123to2222( t, cs ); \
5389 stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4+8 );
5391#define stbir__store_output() \
5392 stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 ); \
5393 stbir__simdf_store( output, t ); \
5394 horizontal_coefficients += coefficient_width; \
5395 ++horizontal_contributors; \
5396 output += 4;
5398#else
5400#define stbir__4_coeff_start() \
5401 stbir__simdf tot0,tot1,c,cs; \
5402 STBIR_SIMD_NO_UNROLL(decode); \
5403 stbir__simdf_load( cs, hc ); \
5404 stbir__simdf_0123to0000( c, cs ); \
5405 stbir__simdf_mult_mem( tot0, c, decode ); \
5406 stbir__simdf_0123to1111( c, cs ); \
5407 stbir__simdf_mult_mem( tot1, c, decode+4 ); \
5408 stbir__simdf_0123to2222( c, cs ); \
5409 stbir__simdf_madd_mem( tot0, tot0, c, decode+8 ); \
5410 stbir__simdf_0123to3333( c, cs ); \
5411 stbir__simdf_madd_mem( tot1, tot1, c, decode+12 );
5413#define stbir__4_coeff_continue_from_4( ofs ) \
5414 STBIR_SIMD_NO_UNROLL(decode); \
5415 stbir__simdf_load( cs, hc + (ofs) ); \
5416 stbir__simdf_0123to0000( c, cs ); \
5417 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \
5418 stbir__simdf_0123to1111( c, cs ); \
5419 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 ); \
5420 stbir__simdf_0123to2222( c, cs ); \
5421 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 ); \
5422 stbir__simdf_0123to3333( c, cs ); \
5423 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+12 );
5425#define stbir__1_coeff_remnant( ofs ) \
5426 STBIR_SIMD_NO_UNROLL(decode); \
5427 stbir__simdf_load1( c, hc + (ofs) ); \
5428 stbir__simdf_0123to0000( c, c ); \
5429 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );
5431#define stbir__2_coeff_remnant( ofs ) \
5432 STBIR_SIMD_NO_UNROLL(decode); \
5433 stbir__simdf_load2( cs, hc + (ofs) ); \
5434 stbir__simdf_0123to0000( c, cs ); \
5435 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \
5436 stbir__simdf_0123to1111( c, cs ); \
5437 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 );
5439#define stbir__3_coeff_remnant( ofs ) \
5440 STBIR_SIMD_NO_UNROLL(decode); \
5441 stbir__simdf_load( cs, hc + (ofs) ); \
5442 stbir__simdf_0123to0000( c, cs ); \
5443 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \
5444 stbir__simdf_0123to1111( c, cs ); \
5445 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 ); \
5446 stbir__simdf_0123to2222( c, cs ); \
5447 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );
5449#define stbir__store_output() \
5450 stbir__simdf_add( tot0, tot0, tot1 ); \
5451 stbir__simdf_store( output, tot0 ); \
5452 horizontal_coefficients += coefficient_width; \
5453 ++horizontal_contributors; \
5454 output += 4;
5456#endif
5458#else
5460#define stbir__1_coeff_only() \
5461 float p0,p1,p2,p3,c; \
5462 STBIR_SIMD_NO_UNROLL(decode); \
5463 c = hc[0]; \
5464 p0 = decode[0] * c; \
5465 p1 = decode[1] * c; \
5466 p2 = decode[2] * c; \
5467 p3 = decode[3] * c;
5469#define stbir__2_coeff_only() \
5470 float p0,p1,p2,p3,c; \
5471 STBIR_SIMD_NO_UNROLL(decode); \
5472 c = hc[0]; \
5473 p0 = decode[0] * c; \
5474 p1 = decode[1] * c; \
5475 p2 = decode[2] * c; \
5476 p3 = decode[3] * c; \
5477 c = hc[1]; \
5478 p0 += decode[4] * c; \
5479 p1 += decode[5] * c; \
5480 p2 += decode[6] * c; \
5481 p3 += decode[7] * c;
5483#define stbir__3_coeff_only() \
5484 float p0,p1,p2,p3,c; \
5485 STBIR_SIMD_NO_UNROLL(decode); \
5486 c = hc[0]; \
5487 p0 = decode[0] * c; \
5488 p1 = decode[1] * c; \
5489 p2 = decode[2] * c; \
5490 p3 = decode[3] * c; \
5491 c = hc[1]; \
5492 p0 += decode[4] * c; \
5493 p1 += decode[5] * c; \
5494 p2 += decode[6] * c; \
5495 p3 += decode[7] * c; \
5496 c = hc[2]; \
5497 p0 += decode[8] * c; \
5498 p1 += decode[9] * c; \
5499 p2 += decode[10] * c; \
5500 p3 += decode[11] * c;
5502#define stbir__store_output_tiny() \
5503 output[0] = p0; \
5504 output[1] = p1; \
5505 output[2] = p2; \
5506 output[3] = p3; \
5507 horizontal_coefficients += coefficient_width; \
5508 ++horizontal_contributors; \
5509 output += 4;
5511#define stbir__4_coeff_start() \
5512 float x0,x1,x2,x3,y0,y1,y2,y3,c; \
5513 STBIR_SIMD_NO_UNROLL(decode); \
5514 c = hc[0]; \
5515 x0 = decode[0] * c; \
5516 x1 = decode[1] * c; \
5517 x2 = decode[2] * c; \
5518 x3 = decode[3] * c; \
5519 c = hc[1]; \
5520 y0 = decode[4] * c; \
5521 y1 = decode[5] * c; \
5522 y2 = decode[6] * c; \
5523 y3 = decode[7] * c; \
5524 c = hc[2]; \
5525 x0 += decode[8] * c; \
5526 x1 += decode[9] * c; \
5527 x2 += decode[10] * c; \
5528 x3 += decode[11] * c; \
5529 c = hc[3]; \
5530 y0 += decode[12] * c; \
5531 y1 += decode[13] * c; \
5532 y2 += decode[14] * c; \
5533 y3 += decode[15] * c;
5535#define stbir__4_coeff_continue_from_4( ofs ) \
5536 STBIR_SIMD_NO_UNROLL(decode); \
5537 c = hc[0+(ofs)]; \
5538 x0 += decode[0+(ofs)*4] * c; \
5539 x1 += decode[1+(ofs)*4] * c; \
5540 x2 += decode[2+(ofs)*4] * c; \
5541 x3 += decode[3+(ofs)*4] * c; \
5542 c = hc[1+(ofs)]; \
5543 y0 += decode[4+(ofs)*4] * c; \
5544 y1 += decode[5+(ofs)*4] * c; \
5545 y2 += decode[6+(ofs)*4] * c; \
5546 y3 += decode[7+(ofs)*4] * c; \
5547 c = hc[2+(ofs)]; \
5548 x0 += decode[8+(ofs)*4] * c; \
5549 x1 += decode[9+(ofs)*4] * c; \
5550 x2 += decode[10+(ofs)*4] * c; \
5551 x3 += decode[11+(ofs)*4] * c; \
5552 c = hc[3+(ofs)]; \
5553 y0 += decode[12+(ofs)*4] * c; \
5554 y1 += decode[13+(ofs)*4] * c; \
5555 y2 += decode[14+(ofs)*4] * c; \
5556 y3 += decode[15+(ofs)*4] * c;
5558#define stbir__1_coeff_remnant( ofs ) \
5559 STBIR_SIMD_NO_UNROLL(decode); \
5560 c = hc[0+(ofs)]; \
5561 x0 += decode[0+(ofs)*4] * c; \
5562 x1 += decode[1+(ofs)*4] * c; \
5563 x2 += decode[2+(ofs)*4] * c; \
5564 x3 += decode[3+(ofs)*4] * c;
5566#define stbir__2_coeff_remnant( ofs ) \
5567 STBIR_SIMD_NO_UNROLL(decode); \
5568 c = hc[0+(ofs)]; \
5569 x0 += decode[0+(ofs)*4] * c; \
5570 x1 += decode[1+(ofs)*4] * c; \
5571 x2 += decode[2+(ofs)*4] * c; \
5572 x3 += decode[3+(ofs)*4] * c; \
5573 c = hc[1+(ofs)]; \
5574 y0 += decode[4+(ofs)*4] * c; \
5575 y1 += decode[5+(ofs)*4] * c; \
5576 y2 += decode[6+(ofs)*4] * c; \
5577 y3 += decode[7+(ofs)*4] * c;
5579#define stbir__3_coeff_remnant( ofs ) \
5580 STBIR_SIMD_NO_UNROLL(decode); \
5581 c = hc[0+(ofs)]; \
5582 x0 += decode[0+(ofs)*4] * c; \
5583 x1 += decode[1+(ofs)*4] * c; \
5584 x2 += decode[2+(ofs)*4] * c; \
5585 x3 += decode[3+(ofs)*4] * c; \
5586 c = hc[1+(ofs)]; \
5587 y0 += decode[4+(ofs)*4] * c; \
5588 y1 += decode[5+(ofs)*4] * c; \
5589 y2 += decode[6+(ofs)*4] * c; \
5590 y3 += decode[7+(ofs)*4] * c; \
5591 c = hc[2+(ofs)]; \
5592 x0 += decode[8+(ofs)*4] * c; \
5593 x1 += decode[9+(ofs)*4] * c; \
5594 x2 += decode[10+(ofs)*4] * c; \
5595 x3 += decode[11+(ofs)*4] * c;
5597#define stbir__store_output() \
5598 output[0] = x0 + y0; \
5599 output[1] = x1 + y1; \
5600 output[2] = x2 + y2; \
5601 output[3] = x3 + y3; \
5602 horizontal_coefficients += coefficient_width; \
5603 ++horizontal_contributors; \
5604 output += 4;
5606#endif
5608#define STBIR__horizontal_channels 4
5609#define STB_IMAGE_RESIZE_DO_HORIZONTALS
5610#include STBIR__HEADER_FILENAME
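/*
  Illustration only, not part of the library. The scalar 4-coefficient macros
  above alternate between the x* and y* accumulators and only add the two
  together in stbir__store_output. The sketch below shows that pattern for a
  single channel. The name example_4_coeff is hypothetical, and the motivation
  (two shorter, independent add chains) is an assumption rather than something
  the source states.

```c
#include <assert.h>

// Four-tap weighted sum of one channel of interleaved pixel data.
// decode points at the channel in the first contributing pixel,
// hc points at the four filter coefficients.
static float example_4_coeff( float const * decode, float const * hc, int channels )
{
  float x, y;

  assert( channels > 0 );

  // even taps go to x, odd taps go to y
  x  = decode[ 0 * channels ] * hc[ 0 ];
  y  = decode[ 1 * channels ] * hc[ 1 ];
  x += decode[ 2 * channels ] * hc[ 2 ];
  y += decode[ 3 * channels ] * hc[ 3 ];

  // the two partial sums are combined once, at store time
  return x + y;
}
```

  Floating-point addition is not associative, so a two-accumulator sum can
  differ in the last bits from a strictly left-to-right sum.
*/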
5614//=================
5615// Do 7 channel horizontal routines
5617#ifdef STBIR_SIMD
5619#define stbir__1_coeff_only() \
5620 stbir__simdf tot0,tot1,c; \
5621 STBIR_SIMD_NO_UNROLL(decode); \
5622 stbir__simdf_load1( c, hc ); \
5623 stbir__simdf_0123to0000( c, c ); \
5624 stbir__simdf_mult_mem( tot0, c, decode ); \
5625 stbir__simdf_mult_mem( tot1, c, decode+3 );
5627#define stbir__2_coeff_only() \
5628 stbir__simdf tot0,tot1,c,cs; \
5629 STBIR_SIMD_NO_UNROLL(decode); \
5630 stbir__simdf_load2( cs, hc ); \
5631 stbir__simdf_0123to0000( c, cs ); \
5632 stbir__simdf_mult_mem( tot0, c, decode ); \
5633 stbir__simdf_mult_mem( tot1, c, decode+3 ); \
5634 stbir__simdf_0123to1111( c, cs ); \
5635 stbir__simdf_madd_mem( tot0, tot0, c, decode+7 ); \
5636 stbir__simdf_madd_mem( tot1, tot1, c,decode+10 );
5638#define stbir__3_coeff_only() \
5639 stbir__simdf tot0,tot1,c,cs; \
5640 STBIR_SIMD_NO_UNROLL(decode); \
5641 stbir__simdf_load( cs, hc ); \
5642 stbir__simdf_0123to0000( c, cs ); \
5643 stbir__simdf_mult_mem( tot0, c, decode ); \
5644 stbir__simdf_mult_mem( tot1, c, decode+3 ); \
5645 stbir__simdf_0123to1111( c, cs ); \
5646 stbir__simdf_madd_mem( tot0, tot0, c, decode+7 ); \
5647 stbir__simdf_madd_mem( tot1, tot1, c, decode+10 ); \
5648 stbir__simdf_0123to2222( c, cs ); \
5649 stbir__simdf_madd_mem( tot0, tot0, c, decode+14 ); \
5650 stbir__simdf_madd_mem( tot1, tot1, c, decode+17 );
5652#define stbir__store_output_tiny() \
5653 stbir__simdf_store( output+3, tot1 ); \
5654 stbir__simdf_store( output, tot0 ); \
5655 horizontal_coefficients += coefficient_width; \
5656 ++horizontal_contributors; \
5657 output += 7;
5659#ifdef STBIR_SIMD8
5661#define stbir__4_coeff_start() \
5662 stbir__simdf8 tot0,tot1,c,cs; \
5663 STBIR_SIMD_NO_UNROLL(decode); \
5664 stbir__simdf8_load4b( cs, hc ); \
5665 stbir__simdf8_0123to00000000( c, cs ); \
5666 stbir__simdf8_mult_mem( tot0, c, decode ); \
5667 stbir__simdf8_0123to11111111( c, cs ); \
5668 stbir__simdf8_mult_mem( tot1, c, decode+7 ); \
5669 stbir__simdf8_0123to22222222( c, cs ); \
5670 stbir__simdf8_madd_mem( tot0, tot0, c, decode+14 ); \
5671 stbir__simdf8_0123to33333333( c, cs ); \
5672 stbir__simdf8_madd_mem( tot1, tot1, c, decode+21 );
5674#define stbir__4_coeff_continue_from_4( ofs ) \
5675 STBIR_SIMD_NO_UNROLL(decode); \
5676 stbir__simdf8_load4b( cs, hc + (ofs) ); \
5677 stbir__simdf8_0123to00000000( c, cs ); \
5678 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \
5679 stbir__simdf8_0123to11111111( c, cs ); \
5680 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 ); \
5681 stbir__simdf8_0123to22222222( c, cs ); \
5682 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \
5683 stbir__simdf8_0123to33333333( c, cs ); \
5684 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+21 );
5686#define stbir__1_coeff_remnant( ofs ) \
5687 STBIR_SIMD_NO_UNROLL(decode); \
5688 stbir__simdf8_load1b( c, hc + (ofs) ); \
5689 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );
5691#define stbir__2_coeff_remnant( ofs ) \
5692 STBIR_SIMD_NO_UNROLL(decode); \
5693 stbir__simdf8_load1b( c, hc + (ofs) ); \
5694 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \
5695 stbir__simdf8_load1b( c, hc + (ofs)+1 ); \
5696 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );
5698#define stbir__3_coeff_remnant( ofs ) \
5699 STBIR_SIMD_NO_UNROLL(decode); \
5700 stbir__simdf8_load4b( cs, hc + (ofs) ); \
5701 stbir__simdf8_0123to00000000( c, cs ); \
5702 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \
5703 stbir__simdf8_0123to11111111( c, cs ); \
5704 stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 ); \
5705 stbir__simdf8_0123to22222222( c, cs ); \
5706 stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 );
5708#define stbir__store_output() \
5709 stbir__simdf8_add( tot0, tot0, tot1 ); \
5710 horizontal_coefficients += coefficient_width; \
5711 ++horizontal_contributors; \
5712 output += 7; \
5713 if ( output < output_end ) \
5714 { \
5715 stbir__simdf8_store( output-7, tot0 ); \
5716 continue; \
5717 } \
5718 stbir__simdf_store( output-7+3, stbir__simdf_swiz(stbir__simdf8_gettop4(tot0),0,0,1,2) ); \
5719 stbir__simdf_store( output-7, stbir__if_simdf8_cast_to_simdf4(tot0) ); \
5720 break;
5722#else
5724#define stbir__4_coeff_start() \
5725 stbir__simdf tot0,tot1,tot2,tot3,c,cs; \
5726 STBIR_SIMD_NO_UNROLL(decode); \
5727 stbir__simdf_load( cs, hc ); \
5728 stbir__simdf_0123to0000( c, cs ); \
5729 stbir__simdf_mult_mem( tot0, c, decode ); \
5730 stbir__simdf_mult_mem( tot1, c, decode+3 ); \
5731 stbir__simdf_0123to1111( c, cs ); \
5732 stbir__simdf_mult_mem( tot2, c, decode+7 ); \
5733 stbir__simdf_mult_mem( tot3, c, decode+10 ); \
5734 stbir__simdf_0123to2222( c, cs ); \
5735 stbir__simdf_madd_mem( tot0, tot0, c, decode+14 ); \
5736 stbir__simdf_madd_mem( tot1, tot1, c, decode+17 ); \
5737 stbir__simdf_0123to3333( c, cs ); \
5738 stbir__simdf_madd_mem( tot2, tot2, c, decode+21 ); \
5739 stbir__simdf_madd_mem( tot3, tot3, c, decode+24 );
5741#define stbir__4_coeff_continue_from_4( ofs ) \
5742 STBIR_SIMD_NO_UNROLL(decode); \
5743 stbir__simdf_load( cs, hc + (ofs) ); \
5744 stbir__simdf_0123to0000( c, cs ); \
5745 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \
5746 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 ); \
5747 stbir__simdf_0123to1111( c, cs ); \
5748 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 ); \
5749 stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 ); \
5750 stbir__simdf_0123to2222( c, cs ); \
5751 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \
5752 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 ); \
5753 stbir__simdf_0123to3333( c, cs ); \
5754 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+21 ); \
5755 stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+24 );
#define stbir__1_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode); \
    stbir__simdf_load1( c, hc + (ofs) ); \
    stbir__simdf_0123to0000( c, c ); \
    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \
    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );
5764#define stbir__2_coeff_remnant( ofs ) \
5765 STBIR_SIMD_NO_UNROLL(decode); \
5766 stbir__simdf_load2( cs, hc + (ofs) ); \
5767 stbir__simdf_0123to0000( c, cs ); \
5768 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \
5769 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 ); \
5770 stbir__simdf_0123to1111( c, cs ); \
5771 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 ); \
5772 stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );
5774#define stbir__3_coeff_remnant( ofs ) \
5775 STBIR_SIMD_NO_UNROLL(decode); \
5776 stbir__simdf_load( cs, hc + (ofs) ); \
5777 stbir__simdf_0123to0000( c, cs ); \
5778 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \
5779 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 ); \
5780 stbir__simdf_0123to1111( c, cs ); \
5781 stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 ); \
5782 stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 ); \
5783 stbir__simdf_0123to2222( c, cs ); \
5784 stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \
5785 stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 );
5787#define stbir__store_output() \
5788 stbir__simdf_add( tot0, tot0, tot2 ); \
5789 stbir__simdf_add( tot1, tot1, tot3 ); \
5790 stbir__simdf_store( output+3, tot1 ); \
5791 stbir__simdf_store( output, tot0 ); \
5792 horizontal_coefficients += coefficient_width; \
5793 ++horizontal_contributors; \
5794 output += 7;
5796#endif
5798#else
5800#define stbir__1_coeff_only() \
5801 float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
5802 c = hc[0]; \
5803 tot0 = decode[0]*c; \
5804 tot1 = decode[1]*c; \
5805 tot2 = decode[2]*c; \
5806 tot3 = decode[3]*c; \
5807 tot4 = decode[4]*c; \
5808 tot5 = decode[5]*c; \
5809 tot6 = decode[6]*c;
#define stbir__2_coeff_only() \
    float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
    c = hc[0]; \
    tot0 = decode[0]*c; \
    tot1 = decode[1]*c; \
    tot2 = decode[2]*c; \
    tot3 = decode[3]*c; \
    tot4 = decode[4]*c; \
    tot5 = decode[5]*c; \
    tot6 = decode[6]*c; \
    c = hc[1]; \
    tot0 += decode[7]*c; \
    tot1 += decode[8]*c; \
    tot2 += decode[9]*c; \
    tot3 += decode[10]*c; \
    tot4 += decode[11]*c; \
    tot5 += decode[12]*c; \
    tot6 += decode[13]*c;
#define stbir__3_coeff_only() \
    float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
    c = hc[0]; \
    tot0 = decode[0]*c; \
    tot1 = decode[1]*c; \
    tot2 = decode[2]*c; \
    tot3 = decode[3]*c; \
    tot4 = decode[4]*c; \
    tot5 = decode[5]*c; \
    tot6 = decode[6]*c; \
    c = hc[1]; \
    tot0 += decode[7]*c; \
    tot1 += decode[8]*c; \
    tot2 += decode[9]*c; \
    tot3 += decode[10]*c; \
    tot4 += decode[11]*c; \
    tot5 += decode[12]*c; \
    tot6 += decode[13]*c; \
    c = hc[2]; \
    tot0 += decode[14]*c; \
    tot1 += decode[15]*c; \
    tot2 += decode[16]*c; \
    tot3 += decode[17]*c; \
    tot4 += decode[18]*c; \
    tot5 += decode[19]*c; \
    tot6 += decode[20]*c;
5857#define stbir__store_output_tiny() \
5858 output[0] = tot0; \
5859 output[1] = tot1; \
5860 output[2] = tot2; \
5861 output[3] = tot3; \
5862 output[4] = tot4; \
5863 output[5] = tot5; \
5864 output[6] = tot6; \
5865 horizontal_coefficients += coefficient_width; \
5866 ++horizontal_contributors; \
5867 output += 7;
5869#define stbir__4_coeff_start() \
5870 float x0,x1,x2,x3,x4,x5,x6,y0,y1,y2,y3,y4,y5,y6,c; \
5871 STBIR_SIMD_NO_UNROLL(decode); \
5872 c = hc[0]; \
5873 x0 = decode[0] * c; \
5874 x1 = decode[1] * c; \
5875 x2 = decode[2] * c; \
5876 x3 = decode[3] * c; \
5877 x4 = decode[4] * c; \
5878 x5 = decode[5] * c; \
5879 x6 = decode[6] * c; \
5880 c = hc[1]; \
5881 y0 = decode[7] * c; \
5882 y1 = decode[8] * c; \
5883 y2 = decode[9] * c; \
5884 y3 = decode[10] * c; \
5885 y4 = decode[11] * c; \
5886 y5 = decode[12] * c; \
5887 y6 = decode[13] * c; \
5888 c = hc[2]; \
5889 x0 += decode[14] * c; \
5890 x1 += decode[15] * c; \
5891 x2 += decode[16] * c; \
5892 x3 += decode[17] * c; \
5893 x4 += decode[18] * c; \
5894 x5 += decode[19] * c; \
5895 x6 += decode[20] * c; \
5896 c = hc[3]; \
5897 y0 += decode[21] * c; \
5898 y1 += decode[22] * c; \
5899 y2 += decode[23] * c; \
5900 y3 += decode[24] * c; \
5901 y4 += decode[25] * c; \
5902 y5 += decode[26] * c; \
5903 y6 += decode[27] * c;
5905#define stbir__4_coeff_continue_from_4( ofs ) \
5906 STBIR_SIMD_NO_UNROLL(decode); \
5907 c = hc[0+(ofs)]; \
5908 x0 += decode[0+(ofs)*7] * c; \
5909 x1 += decode[1+(ofs)*7] * c; \
5910 x2 += decode[2+(ofs)*7] * c; \
5911 x3 += decode[3+(ofs)*7] * c; \
5912 x4 += decode[4+(ofs)*7] * c; \
5913 x5 += decode[5+(ofs)*7] * c; \
5914 x6 += decode[6+(ofs)*7] * c; \
5915 c = hc[1+(ofs)]; \
5916 y0 += decode[7+(ofs)*7] * c; \
5917 y1 += decode[8+(ofs)*7] * c; \
5918 y2 += decode[9+(ofs)*7] * c; \
5919 y3 += decode[10+(ofs)*7] * c; \
5920 y4 += decode[11+(ofs)*7] * c; \
5921 y5 += decode[12+(ofs)*7] * c; \
5922 y6 += decode[13+(ofs)*7] * c; \
5923 c = hc[2+(ofs)]; \
5924 x0 += decode[14+(ofs)*7] * c; \
5925 x1 += decode[15+(ofs)*7] * c; \
5926 x2 += decode[16+(ofs)*7] * c; \
5927 x3 += decode[17+(ofs)*7] * c; \
5928 x4 += decode[18+(ofs)*7] * c; \
5929 x5 += decode[19+(ofs)*7] * c; \
5930 x6 += decode[20+(ofs)*7] * c; \
5931 c = hc[3+(ofs)]; \
5932 y0 += decode[21+(ofs)*7] * c; \
5933 y1 += decode[22+(ofs)*7] * c; \
5934 y2 += decode[23+(ofs)*7] * c; \
5935 y3 += decode[24+(ofs)*7] * c; \
5936 y4 += decode[25+(ofs)*7] * c; \
5937 y5 += decode[26+(ofs)*7] * c; \
5938 y6 += decode[27+(ofs)*7] * c;
#define stbir__1_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode); \
    c = hc[0+(ofs)]; \
    x0 += decode[0+(ofs)*7] * c; \
    x1 += decode[1+(ofs)*7] * c; \
    x2 += decode[2+(ofs)*7] * c; \
    x3 += decode[3+(ofs)*7] * c; \
    x4 += decode[4+(ofs)*7] * c; \
    x5 += decode[5+(ofs)*7] * c; \
    x6 += decode[6+(ofs)*7] * c;
#define stbir__2_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode); \
    c = hc[0+(ofs)]; \
    x0 += decode[0+(ofs)*7] * c; \
    x1 += decode[1+(ofs)*7] * c; \
    x2 += decode[2+(ofs)*7] * c; \
    x3 += decode[3+(ofs)*7] * c; \
    x4 += decode[4+(ofs)*7] * c; \
    x5 += decode[5+(ofs)*7] * c; \
    x6 += decode[6+(ofs)*7] * c; \
    c = hc[1+(ofs)]; \
    y0 += decode[7+(ofs)*7] * c; \
    y1 += decode[8+(ofs)*7] * c; \
    y2 += decode[9+(ofs)*7] * c; \
    y3 += decode[10+(ofs)*7] * c; \
    y4 += decode[11+(ofs)*7] * c; \
    y5 += decode[12+(ofs)*7] * c; \
    y6 += decode[13+(ofs)*7] * c;
#define stbir__3_coeff_remnant( ofs ) \
    STBIR_SIMD_NO_UNROLL(decode); \
    c = hc[0+(ofs)]; \
    x0 += decode[0+(ofs)*7] * c; \
    x1 += decode[1+(ofs)*7] * c; \
    x2 += decode[2+(ofs)*7] * c; \
    x3 += decode[3+(ofs)*7] * c; \
    x4 += decode[4+(ofs)*7] * c; \
    x5 += decode[5+(ofs)*7] * c; \
    x6 += decode[6+(ofs)*7] * c; \
    c = hc[1+(ofs)]; \
    y0 += decode[7+(ofs)*7] * c; \
    y1 += decode[8+(ofs)*7] * c; \
    y2 += decode[9+(ofs)*7] * c; \
    y3 += decode[10+(ofs)*7] * c; \
    y4 += decode[11+(ofs)*7] * c; \
    y5 += decode[12+(ofs)*7] * c; \
    y6 += decode[13+(ofs)*7] * c; \
    c = hc[2+(ofs)]; \
    x0 += decode[14+(ofs)*7] * c; \
    x1 += decode[15+(ofs)*7] * c; \
    x2 += decode[16+(ofs)*7] * c; \
    x3 += decode[17+(ofs)*7] * c; \
    x4 += decode[18+(ofs)*7] * c; \
    x5 += decode[19+(ofs)*7] * c; \
    x6 += decode[20+(ofs)*7] * c;
5997#define stbir__store_output() \
5998 output[0] = x0 + y0; \
5999 output[1] = x1 + y1; \
6000 output[2] = x2 + y2; \
6001 output[3] = x3 + y3; \
6002 output[4] = x4 + y4; \
6003 output[5] = x5 + y5; \
6004 output[6] = x6 + y6; \
6005 horizontal_coefficients += coefficient_width; \
6006 ++horizontal_contributors; \
6007 output += 7;
6009#endif
6011#define STBIR__horizontal_channels 7
6012#define STB_IMAGE_RESIZE_DO_HORIZONTALS
6013#include STBIR__HEADER_FILENAME
6016// include all of the vertical resamplers (both scatter and gather versions)
6018#define STBIR__vertical_channels 1
6019#define STB_IMAGE_RESIZE_DO_VERTICALS
6020#include STBIR__HEADER_FILENAME
6022#define STBIR__vertical_channels 1
6023#define STB_IMAGE_RESIZE_DO_VERTICALS
6024#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
6025#include STBIR__HEADER_FILENAME
6027#define STBIR__vertical_channels 2
6028#define STB_IMAGE_RESIZE_DO_VERTICALS
6029#include STBIR__HEADER_FILENAME
6031#define STBIR__vertical_channels 2
6032#define STB_IMAGE_RESIZE_DO_VERTICALS
6033#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
6034#include STBIR__HEADER_FILENAME
6036#define STBIR__vertical_channels 3
6037#define STB_IMAGE_RESIZE_DO_VERTICALS
6038#include STBIR__HEADER_FILENAME
6040#define STBIR__vertical_channels 3
6041#define STB_IMAGE_RESIZE_DO_VERTICALS
6042#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
6043#include STBIR__HEADER_FILENAME
6045#define STBIR__vertical_channels 4
6046#define STB_IMAGE_RESIZE_DO_VERTICALS
6047#include STBIR__HEADER_FILENAME
6049#define STBIR__vertical_channels 4
6050#define STB_IMAGE_RESIZE_DO_VERTICALS
6051#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
6052#include STBIR__HEADER_FILENAME
6054#define STBIR__vertical_channels 5
6055#define STB_IMAGE_RESIZE_DO_VERTICALS
6056#include STBIR__HEADER_FILENAME
6058#define STBIR__vertical_channels 5
6059#define STB_IMAGE_RESIZE_DO_VERTICALS
6060#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
6061#include STBIR__HEADER_FILENAME
6063#define STBIR__vertical_channels 6
6064#define STB_IMAGE_RESIZE_DO_VERTICALS
6065#include STBIR__HEADER_FILENAME
6067#define STBIR__vertical_channels 6
6068#define STB_IMAGE_RESIZE_DO_VERTICALS
6069#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
6070#include STBIR__HEADER_FILENAME
6072#define STBIR__vertical_channels 7
6073#define STB_IMAGE_RESIZE_DO_VERTICALS
6074#include STBIR__HEADER_FILENAME
6076#define STBIR__vertical_channels 7
6077#define STB_IMAGE_RESIZE_DO_VERTICALS
6078#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
6079#include STBIR__HEADER_FILENAME
6081#define STBIR__vertical_channels 8
6082#define STB_IMAGE_RESIZE_DO_VERTICALS
6083#include STBIR__HEADER_FILENAME
6085#define STBIR__vertical_channels 8
6086#define STB_IMAGE_RESIZE_DO_VERTICALS
6087#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
6088#include STBIR__HEADER_FILENAME
6090typedef void STBIR_VERTICAL_GATHERFUNC( float * output, float const * coeffs, float const ** inputs, float const * input0_end );
6092static STBIR_VERTICAL_GATHERFUNC * stbir__vertical_gathers[ 8 ] =
6093{
6094 stbir__vertical_gather_with_1_coeffs,stbir__vertical_gather_with_2_coeffs,stbir__vertical_gather_with_3_coeffs,stbir__vertical_gather_with_4_coeffs,stbir__vertical_gather_with_5_coeffs,stbir__vertical_gather_with_6_coeffs,stbir__vertical_gather_with_7_coeffs,stbir__vertical_gather_with_8_coeffs
6095};
6097static STBIR_VERTICAL_GATHERFUNC * stbir__vertical_gathers_continues[ 8 ] =
6098{
6099 stbir__vertical_gather_with_1_coeffs_cont,stbir__vertical_gather_with_2_coeffs_cont,stbir__vertical_gather_with_3_coeffs_cont,stbir__vertical_gather_with_4_coeffs_cont,stbir__vertical_gather_with_5_coeffs_cont,stbir__vertical_gather_with_6_coeffs_cont,stbir__vertical_gather_with_7_coeffs_cont,stbir__vertical_gather_with_8_coeffs_cont
6100};
6102typedef void STBIR_VERTICAL_SCATTERFUNC( float ** outputs, float const * coeffs, float const * input, float const * input_end );
6104static STBIR_VERTICAL_SCATTERFUNC * stbir__vertical_scatter_sets[ 8 ] =
6105{
6106 stbir__vertical_scatter_with_1_coeffs,stbir__vertical_scatter_with_2_coeffs,stbir__vertical_scatter_with_3_coeffs,stbir__vertical_scatter_with_4_coeffs,stbir__vertical_scatter_with_5_coeffs,stbir__vertical_scatter_with_6_coeffs,stbir__vertical_scatter_with_7_coeffs,stbir__vertical_scatter_with_8_coeffs
6107};
6109static STBIR_VERTICAL_SCATTERFUNC * stbir__vertical_scatter_blends[ 8 ] =
6110{
6111 stbir__vertical_scatter_with_1_coeffs_cont,stbir__vertical_scatter_with_2_coeffs_cont,stbir__vertical_scatter_with_3_coeffs_cont,stbir__vertical_scatter_with_4_coeffs_cont,stbir__vertical_scatter_with_5_coeffs_cont,stbir__vertical_scatter_with_6_coeffs_cont,stbir__vertical_scatter_with_7_coeffs_cont,stbir__vertical_scatter_with_8_coeffs_cont
6112};
static void stbir__encode_scanline( stbir__info const * stbir_info, void *output_buffer_data, float * encode_buffer, int row STBIR_ONLY_PROFILE_GET_SPLIT_INFO )
{
  int num_pixels = stbir_info->horizontal.scale_info.output_sub_size;
  int channels = stbir_info->channels;
  int width_times_channels = num_pixels * channels;
  void * output_buffer;

  // un-alpha weight if we need to
  if ( stbir_info->alpha_unweight )
  {
    STBIR_PROFILE_START( unalpha );
    stbir_info->alpha_unweight( encode_buffer, width_times_channels );
    STBIR_PROFILE_END( unalpha );
  }

  // write directly into output by default
  output_buffer = output_buffer_data;

  // if we have an output callback, we first convert the encode buffer in place (and then hand that to the callback)
  if ( stbir_info->out_pixels_cb )
    output_buffer = encode_buffer;

  STBIR_PROFILE_START( encode );
  // convert into the output buffer
  stbir_info->encode_pixels( output_buffer, width_times_channels, encode_buffer );
  STBIR_PROFILE_END( encode );

  // if we have an output callback, call it to send the data
  if ( stbir_info->out_pixels_cb )
    stbir_info->out_pixels_cb( output_buffer, num_pixels, row, stbir_info->user_data );
}
// Get the ring buffer pointer for an index
static float* stbir__get_ring_buffer_entry(stbir__info const * stbir_info, stbir__per_split_info const * split_info, int index )
{
  STBIR_ASSERT( index < stbir_info->ring_buffer_num_entries );

  #ifdef STBIR__SEPARATE_ALLOCATIONS
    return split_info->ring_buffers[ index ];
  #else
    return (float*) ( ( (char*) split_info->ring_buffer ) + ( index * stbir_info->ring_buffer_length_bytes ) );
  #endif
}

// Get the specified scan line from the ring buffer
static float* stbir__get_ring_buffer_scanline(stbir__info const * stbir_info, stbir__per_split_info const * split_info, int get_scanline)
{
  int ring_buffer_index = (split_info->ring_buffer_begin_index + (get_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries;
  return stbir__get_ring_buffer_entry( stbir_info, split_info, ring_buffer_index );
}
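/*
  Illustration only, not part of the library. The scanline lookup above maps
  a scanline number to a slot by offsetting from the oldest scanline held and
  wrapping with a modulo. The sketch below isolates that arithmetic. The
  struct example_ring and the function example_ring_slot are hypothetical
  stand-ins for the fields read from stbir__info and stbir__per_split_info.

```c
#include <assert.h>

// Hypothetical stand-in for the three values the lookup reads.
typedef struct
{
  int begin_index;     // slot that holds the oldest scanline
  int first_scanline;  // scanline number stored in that slot
  int num_entries;     // total number of slots in the ring
} example_ring;

// Map a scanline number to its slot in the ring.
// The scanline is assumed to be at or after first_scanline.
static int example_ring_slot( example_ring const * r, int scanline )
{
  assert( r->num_entries > 0 );
  assert( scanline >= r->first_scanline );
  return ( r->begin_index + ( scanline - r->first_scanline ) ) % r->num_entries;
}
```

  With begin_index 2, first_scanline 10 and 4 entries, scanlines 10, 11, 12
  and 13 land in slots 2, 3, 0 and 1.
*/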
static void stbir__resample_horizontal_gather(stbir__info const * stbir_info, float* output_buffer, float const * input_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO )
{
  float const * decode_buffer = input_buffer - ( stbir_info->scanline_extents.conservative.n0 * stbir_info->effective_channels );

  STBIR_PROFILE_START( horizontal );
  if ( ( stbir_info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( stbir_info->horizontal.scale_info.scale == 1.0f ) )
    STBIR_MEMCPY( output_buffer, input_buffer, stbir_info->horizontal.scale_info.output_sub_size * sizeof( float ) * stbir_info->effective_channels );
  else
    stbir_info->horizontal_gather_channels( output_buffer, stbir_info->horizontal.scale_info.output_sub_size, decode_buffer, stbir_info->horizontal.contributors, stbir_info->horizontal.coefficients, stbir_info->horizontal.coefficient_width );
  STBIR_PROFILE_END( horizontal );
}
static void stbir__resample_vertical_gather(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n, int contrib_n0, int contrib_n1, float const * vertical_coefficients )
{
  float* encode_buffer = split_info->vertical_buffer;
  float* decode_buffer = split_info->decode_buffer;
  int vertical_first = stbir_info->vertical_first;
  int width = (vertical_first) ? ( stbir_info->scanline_extents.conservative.n1-stbir_info->scanline_extents.conservative.n0+1 ) : stbir_info->horizontal.scale_info.output_sub_size;
  int width_times_channels = stbir_info->effective_channels * width;

  STBIR_ASSERT( stbir_info->vertical.is_gather );

  // loop over the contributing scanlines and scale into the buffer
  STBIR_PROFILE_START( vertical );
  {
    int k = 0, total = contrib_n1 - contrib_n0 + 1;
    STBIR_ASSERT( total > 0 );
    do {
      float const * inputs[8];
      int i, cnt = total; if ( cnt > 8 ) cnt = 8;
      for( i = 0 ; i < cnt ; i++ )
        inputs[ i ] = stbir__get_ring_buffer_scanline(stbir_info, split_info, k+i+contrib_n0 );

      // call the N scanlines at a time function (up to 8 scanlines of blending at once)
      ((k==0)?stbir__vertical_gathers:stbir__vertical_gathers_continues)[cnt-1]( (vertical_first) ? decode_buffer : encode_buffer, vertical_coefficients + k, inputs, inputs[0] + width_times_channels );
      k += cnt;
      total -= cnt;
    } while ( total );
  }
  STBIR_PROFILE_END( vertical );

  if ( vertical_first )
  {
    // Now resample the gathered vertical data in the horizontal axis into the encode buffer
    stbir__resample_horizontal_gather(stbir_info, encode_buffer, decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
  }

  stbir__encode_scanline( stbir_info, ( (char *) stbir_info->output_data ) + ((size_t)n * (size_t)stbir_info->output_stride_bytes),
                          encode_buffer, n STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
}
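/*
  Illustration only, not part of the library. The gather above consumes the
  contributing scanlines in passes of at most 8: the first pass (k==0) uses
  the plain gather function and later passes use the "_cont" variant. The
  sketch below is a plain-C reading of that control flow, assuming the first
  pass overwrites the output and later passes add to it. The function
  example_gather_rows is hypothetical and does not use the library's SIMD
  kernels.

```c
#include <assert.h>

// Weighted sum of `total` rows into `out`, at most 8 rows per pass.
// Returns the number of passes that were needed.
static int example_gather_rows( float * out, int width, float const ** rows, float const * coeffs, int total )
{
  int k = 0, passes = 0;

  assert( total > 0 );

  do {
    int i, x, cnt = total; if ( cnt > 8 ) cnt = 8;

    for( x = 0 ; x < width ; x++ )
    {
      // the first pass starts from zero, later passes continue from out[x]
      float sum = ( k == 0 ) ? 0.0f : out[ x ];
      for( i = 0 ; i < cnt ; i++ )
        sum += rows[ k + i ][ x ] * coeffs[ k + i ];
      out[ x ] = sum;
    }

    k += cnt;
    total -= cnt;
    ++passes;
  } while ( total );

  return passes;
}
```

  Ten contributing rows therefore take two passes: one of 8 rows, then one
  of 2 rows that continues the running sum.
*/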
static void stbir__decode_and_resample_for_vertical_gather_loop(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n)
{
  int ring_buffer_index;
  float* ring_buffer;

  // Decode the nth scanline from the source image into the decode buffer.
  stbir__decode_scanline( stbir_info, n, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO );

  // update new end scanline
  split_info->ring_buffer_last_scanline = n;

  // get ring buffer
  ring_buffer_index = (split_info->ring_buffer_begin_index + (split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries;
  ring_buffer = stbir__get_ring_buffer_entry(stbir_info, split_info, ring_buffer_index);

  // Now resample it into the ring buffer.
  stbir__resample_horizontal_gather( stbir_info, ring_buffer, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO );

  // Now it's sitting in the ring buffer ready to be used as source for the vertical sampling.
}
static void stbir__vertical_gather_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count )
{
  int y, start_output_y, end_output_y;
  stbir__contributors* vertical_contributors = stbir_info->vertical.contributors;
  float const * vertical_coefficients = stbir_info->vertical.coefficients;

  STBIR_ASSERT( stbir_info->vertical.is_gather );

  start_output_y = split_info->start_output_y;
  end_output_y = split_info[split_count-1].end_output_y;

  vertical_contributors += start_output_y;
  vertical_coefficients += start_output_y * stbir_info->vertical.coefficient_width;

  // initialize the ring buffer for gathering
  split_info->ring_buffer_begin_index = 0;
  split_info->ring_buffer_first_scanline = vertical_contributors->n0;
  split_info->ring_buffer_last_scanline = split_info->ring_buffer_first_scanline - 1; // means "empty"

  for (y = start_output_y; y < end_output_y; y++)
  {
    int in_first_scanline, in_last_scanline;

    in_first_scanline = vertical_contributors->n0;
    in_last_scanline = vertical_contributors->n1;

    // make sure the indexing hasn't broken
    STBIR_ASSERT( in_first_scanline >= split_info->ring_buffer_first_scanline );

    // Load in new scanlines
    while (in_last_scanline > split_info->ring_buffer_last_scanline)
    {
      STBIR_ASSERT( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) <= stbir_info->ring_buffer_num_entries );

      // if the ring buffer is full, drop the oldest scanline to make room for the new one
      if ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries )
      {
        split_info->ring_buffer_first_scanline++;
        split_info->ring_buffer_begin_index++;
      }

      if ( stbir_info->vertical_first )
      {
        float * ring_buffer = stbir__get_ring_buffer_scanline( stbir_info, split_info, ++split_info->ring_buffer_last_scanline );
        // Decode the next scanline from the source image directly into the ring buffer.
        stbir__decode_scanline( stbir_info, split_info->ring_buffer_last_scanline, ring_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
      }
      else
      {
        stbir__decode_and_resample_for_vertical_gather_loop(stbir_info, split_info, split_info->ring_buffer_last_scanline + 1);
      }
    }

    // Now all buffers should be ready to write a row of vertical sampling, so do it.
    stbir__resample_vertical_gather(stbir_info, split_info, y, in_first_scanline, in_last_scanline, vertical_coefficients );

    ++vertical_contributors;
    vertical_coefficients += stbir_info->vertical.coefficient_width;
  }
}
#define STBIR__FLOAT_EMPTY_MARKER 3.0e+38F
#define STBIR__FLOAT_BUFFER_IS_EMPTY(ptr) ((ptr)[0]==STBIR__FLOAT_EMPTY_MARKER)
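/*
  Illustration only, not part of the library. The two macros above flag a
  ring buffer row as empty by storing a sentinel in its first float, which
  lets the scatter path pick between a "set" and a "blend" function. The
  sketch below shows that choice for one row. EXAMPLE_EMPTY_MARKER and
  example_scatter_row are hypothetical names, and treating "set" as an
  overwrite and "blend" as an accumulate is an assumption based on the
  function table names.

```c
#include <assert.h>

#define EXAMPLE_EMPTY_MARKER 3.0e+38F

// Scatter input*coeff into one output row.
// An empty row is overwritten, a row that already holds data is added to.
// Returns 1 if the row was empty on entry, 0 otherwise.
static int example_scatter_row( float * row, float const * input, int width, float coeff )
{
  int x, was_empty;

  assert( width > 0 );

  // the sentinel lives in the first float of the row
  was_empty = ( row[ 0 ] == EXAMPLE_EMPTY_MARKER );

  for( x = 0 ; x < width ; x++ )
    row[ x ] = ( was_empty ? 0.0f : row[ x ] ) + input[ x ] * coeff;

  return was_empty;
}
```

  The first write to a row replaces whatever stale data it held, so rows do
  not need to be zeroed when they are recycled.
*/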
static void stbir__encode_first_scanline_from_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info)
{
  // evict a scanline out into the output buffer
  float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index );

  // dump the scanline out
  stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), ring_buffer_entry, split_info->ring_buffer_first_scanline STBIR_ONLY_PROFILE_SET_SPLIT_INFO );

  // mark it as empty
  ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER;

  // advance the first scanline
  split_info->ring_buffer_first_scanline++;
  if ( ++split_info->ring_buffer_begin_index == stbir_info->ring_buffer_num_entries )
    split_info->ring_buffer_begin_index = 0;
}

static void stbir__horizontal_resample_and_encode_first_scanline_from_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info)
{
  // evict a scanline out into the output buffer

  float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index );

  // Now resample it into the buffer.
  stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, ring_buffer_entry STBIR_ONLY_PROFILE_SET_SPLIT_INFO );

  // dump the scanline out
  stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), split_info->vertical_buffer, split_info->ring_buffer_first_scanline STBIR_ONLY_PROFILE_SET_SPLIT_INFO );

  // mark it as empty
  ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER;

  // advance the first scanline
  split_info->ring_buffer_first_scanline++;
  if ( ++split_info->ring_buffer_begin_index == stbir_info->ring_buffer_num_entries )
    split_info->ring_buffer_begin_index = 0;
}
static void stbir__resample_vertical_scatter(stbir__info const * stbir_info, stbir__per_split_info* split_info, int n0, int n1, float const * vertical_coefficients, float const * vertical_buffer, float const * vertical_buffer_end )
{
  STBIR_ASSERT( !stbir_info->vertical.is_gather );

  STBIR_PROFILE_START( vertical );
  {
    int k = 0, total = n1 - n0 + 1;
    STBIR_ASSERT( total > 0 );
    do {
      float * outputs[8];
      int i, n = total; if ( n > 8 ) n = 8;
      for( i = 0 ; i < n ; i++ )
      {
        outputs[ i ] = stbir__get_ring_buffer_scanline(stbir_info, split_info, k+i+n0 );
        if ( ( i ) && ( STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[i] ) != STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[0] ) ) ) // make sure runs are of the same type
        {
          n = i;
          break;
        }
      }
      // call the scatter to N scanlines at a time function (up to 8 scanlines of scattering at once)
      ((STBIR__FLOAT_BUFFER_IS_EMPTY( outputs[0] ))?stbir__vertical_scatter_sets:stbir__vertical_scatter_blends)[n-1]( outputs, vertical_coefficients + k, vertical_buffer, vertical_buffer_end );
      k += n;
      total -= n;
    } while ( total );
  }

  STBIR_PROFILE_END( vertical );
}
typedef void stbir__handle_scanline_for_scatter_func(stbir__info const * stbir_info, stbir__per_split_info* split_info);

static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count )
{
  int y, start_output_y, end_output_y, start_input_y, end_input_y;
  stbir__contributors* vertical_contributors = stbir_info->vertical.contributors;
  float const * vertical_coefficients = stbir_info->vertical.coefficients;
  stbir__handle_scanline_for_scatter_func * handle_scanline_for_scatter;
  void * scanline_scatter_buffer;
  void * scanline_scatter_buffer_end;
  int on_first_input_y, last_input_y;

  STBIR_ASSERT( !stbir_info->vertical.is_gather );

  start_output_y = split_info->start_output_y;
  end_output_y = split_info[split_count-1].end_output_y; // may do multiple split counts

  start_input_y = split_info->start_input_y;
  end_input_y = split_info[split_count-1].end_input_y;

  // adjust for starting offset start_input_y
  y = start_input_y + stbir_info->vertical.filter_pixel_margin;
  vertical_contributors += y ;
  vertical_coefficients += stbir_info->vertical.coefficient_width * y;

  if ( stbir_info->vertical_first )
  {
    handle_scanline_for_scatter = stbir__horizontal_resample_and_encode_first_scanline_from_scatter;
    scanline_scatter_buffer = split_info->decode_buffer;
    scanline_scatter_buffer_end = ( (char*) scanline_scatter_buffer ) + sizeof( float ) * stbir_info->effective_channels * (stbir_info->scanline_extents.conservative.n1-stbir_info->scanline_extents.conservative.n0+1);
  }
  else
  {
    handle_scanline_for_scatter = stbir__encode_first_scanline_from_scatter;
    scanline_scatter_buffer = split_info->vertical_buffer;
    scanline_scatter_buffer_end = ( (char*) scanline_scatter_buffer ) + sizeof( float ) * stbir_info->effective_channels * stbir_info->horizontal.scale_info.output_sub_size;
  }

  // initialize the ring buffer for scattering
  split_info->ring_buffer_first_scanline = start_output_y;
  split_info->ring_buffer_last_scanline = -1;
  split_info->ring_buffer_begin_index = -1;

  // mark all the buffers as empty to start
  for( y = 0 ; y < stbir_info->ring_buffer_num_entries ; y++ )
    stbir__get_ring_buffer_entry( stbir_info, split_info, y )[0] = STBIR__FLOAT_EMPTY_MARKER; // only used on scatter

  // do the loop in input space
  on_first_input_y = 1; last_input_y = start_input_y;
  for (y = start_input_y ; y < end_input_y; y++)
  {
    int out_first_scanline, out_last_scanline;

    out_first_scanline = vertical_contributors->n0;
    out_last_scanline = vertical_contributors->n1;

    STBIR_ASSERT(out_last_scanline - out_first_scanline + 1 <= stbir_info->ring_buffer_num_entries);

    if ( ( out_last_scanline >= out_first_scanline ) && ( ( ( out_first_scanline >= start_output_y ) && ( out_first_scanline < end_output_y ) ) || ( ( out_last_scanline >= start_output_y ) && ( out_last_scanline < end_output_y ) ) ) )
6430 {
6431 float const * vc = vertical_coefficients;
6433 // keep track of the range actually seen for the next resize
6434 last_input_y = y;
6435 if ( ( on_first_input_y ) && ( y > start_input_y ) )
6436 split_info->start_input_y = y;
6437 on_first_input_y = 0;
6439 // clip the region
6440 if ( out_first_scanline < start_output_y )
6441 {
6442 vc += start_output_y - out_first_scanline;
6443 out_first_scanline = start_output_y;
6444 }
6446 if ( out_last_scanline >= end_output_y )
6447 out_last_scanline = end_output_y - 1;
6449 // if very first scanline, init the index
6450 if (split_info->ring_buffer_begin_index < 0)
6451 split_info->ring_buffer_begin_index = out_first_scanline - start_output_y;
6453 STBIR_ASSERT( split_info->ring_buffer_begin_index <= out_first_scanline );
6455 // Decode the nth scanline from the source image into the decode buffer.
6456 stbir__decode_scanline( stbir_info, y, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
6458 // When horizontal first, we resample horizontally into the vertical buffer before we scatter it out
6459 if ( !stbir_info->vertical_first )
6460 stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
6462 // Now it's sitting in the buffer ready to be distributed into the ring buffers.
      // evict from the ring buffer if it is full
6465 if ( ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries ) &&
6466 ( out_last_scanline > split_info->ring_buffer_last_scanline ) )
6467 handle_scanline_for_scatter( stbir_info, split_info );
6469 // Now the horizontal buffer is ready to write to all ring buffer rows, so do it.
6470 stbir__resample_vertical_scatter(stbir_info, split_info, out_first_scanline, out_last_scanline, vc, (float*)scanline_scatter_buffer, (float*)scanline_scatter_buffer_end );
6472 // update the end of the buffer
6473 if ( out_last_scanline > split_info->ring_buffer_last_scanline )
6474 split_info->ring_buffer_last_scanline = out_last_scanline;
6475 }
6476 ++vertical_contributors;
6477 vertical_coefficients += stbir_info->vertical.coefficient_width;
6478 }
6480 // now evict the scanlines that are left over in the ring buffer
6481 while ( split_info->ring_buffer_first_scanline < end_output_y )
6482 handle_scanline_for_scatter(stbir_info, split_info);
6484 // update the end_input_y if we do multiple resizes with the same data
6485 ++last_input_y;
6486 for( y = 0 ; y < split_count; y++ )
6487 if ( split_info[y].end_input_y > last_input_y )
6488 split_info[y].end_input_y = last_input_y;
6489}
6492static stbir__kernel_callback * stbir__builtin_kernels[] = { 0, stbir__filter_trapezoid, stbir__filter_triangle, stbir__filter_cubic, stbir__filter_catmullrom, stbir__filter_mitchell, stbir__filter_point };
6493static stbir__support_callback * stbir__builtin_supports[] = { 0, stbir__support_trapezoid, stbir__support_one, stbir__support_two, stbir__support_two, stbir__support_two, stbir__support_zeropoint5 };
6495static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir__kernel_callback * kernel, stbir__support_callback * support, stbir_edge edge, stbir__scale_info * scale_info, int always_gather, void * user_data )
6496{
6497 // set filter
6498 if (filter == 0)
6499 {
6500 filter = STBIR_DEFAULT_FILTER_DOWNSAMPLE; // default to downsample
6501 if (scale_info->scale >= ( 1.0f - stbir__small_float ) )
6502 {
6503 if ( (scale_info->scale <= ( 1.0f + stbir__small_float ) ) && ( STBIR_CEILF(scale_info->pixel_shift) == scale_info->pixel_shift ) )
6504 filter = STBIR_FILTER_POINT_SAMPLE;
6505 else
6506 filter = STBIR_DEFAULT_FILTER_UPSAMPLE;
6507 }
6508 }
6509 samp->filter_enum = filter;
6511 STBIR_ASSERT(samp->filter_enum != 0);
6512 STBIR_ASSERT((unsigned)samp->filter_enum < STBIR_FILTER_OTHER);
6513 samp->filter_kernel = stbir__builtin_kernels[ filter ];
6514 samp->filter_support = stbir__builtin_supports[ filter ];
6516 if ( kernel && support )
6517 {
6518 samp->filter_kernel = kernel;
6519 samp->filter_support = support;
6520 samp->filter_enum = STBIR_FILTER_OTHER;
6521 }
6523 samp->edge = edge;
6524 samp->filter_pixel_width = stbir__get_filter_pixel_width (samp->filter_support, scale_info->scale, user_data );
  // Gather is always better, but in extreme downsamples, you have to keep most or all of the data in memory
6526 // For horizontal, we always have all the pixels, so we always use gather here (always_gather==1).
6527 // For vertical, we use gather if scaling up (which means we will have samp->filter_pixel_width
6528 // scanlines in memory at once).
6529 samp->is_gather = 0;
6530 if ( scale_info->scale >= ( 1.0f - stbir__small_float ) )
6531 samp->is_gather = 1;
6532 else if ( ( always_gather ) || ( samp->filter_pixel_width <= STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT ) )
6533 samp->is_gather = 2;
6535 // pre calculate stuff based on the above
6536 samp->coefficient_width = stbir__get_coefficient_width(samp, samp->is_gather, user_data);
6538 // filter_pixel_width is the conservative size in pixels of input that affect an output pixel.
6539 // In rare cases (only with 2 pix to 1 pix with the default filters), it's possible that the
6540 // filter will extend before or after the scanline beyond just one extra entire copy of the
6541 // scanline (we would hit the edge twice). We don't let you do that, so we clamp the total
  //   width to 3x the total input pixels (once for the scanline, once for the left side
  //   overhang, and once for the right side). We only do this for wrap mode, since the other
  //   modes can just re-edge clamp back in again.
6545 if ( edge == STBIR_EDGE_WRAP )
6546 if ( samp->filter_pixel_width > ( scale_info->input_full_size * 3 ) )
6547 samp->filter_pixel_width = scale_info->input_full_size * 3;
6549 // This is how much to expand buffers to account for filters seeking outside
6550 // the image boundaries.
6551 samp->filter_pixel_margin = samp->filter_pixel_width / 2;
6553 // filter_pixel_margin is the amount that this filter can overhang on just one side of either
6554 // end of the scanline (left or the right). Since we only allow you to overhang 1 scanline's
6555 // worth of pixels, we clamp this one side of overhang to the input scanline size. Again,
6556 // this clamping only happens in rare cases with the default filters (2 pix to 1 pix).
6557 if ( edge == STBIR_EDGE_WRAP )
6558 if ( samp->filter_pixel_margin > scale_info->input_full_size )
6559 samp->filter_pixel_margin = scale_info->input_full_size;
6561 samp->num_contributors = stbir__get_contributors(samp, samp->is_gather);
6563 samp->contributors_size = samp->num_contributors * sizeof(stbir__contributors);
6564 samp->coefficients_size = samp->num_contributors * samp->coefficient_width * sizeof(float) + sizeof(float); // extra sizeof(float) is padding
6566 samp->gather_prescatter_contributors = 0;
6567 samp->gather_prescatter_coefficients = 0;
6568 if ( samp->is_gather == 0 )
6569 {
6570 samp->gather_prescatter_coefficient_width = samp->filter_pixel_width;
6571 samp->gather_prescatter_num_contributors = stbir__get_contributors(samp, 2);
6572 samp->gather_prescatter_contributors_size = samp->gather_prescatter_num_contributors * sizeof(stbir__contributors);
6573 samp->gather_prescatter_coefficients_size = samp->gather_prescatter_num_contributors * samp->gather_prescatter_coefficient_width * sizeof(float);
6574 }
6575}
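/*
  Illustrative sketch (not part of the library): the is_gather choice made in
  stbir__set_sampler reduces to a three-way decision. The function name is hypothetical,
  and small_float and force_gather_amount are passed in as parameters because this sketch
  does not assume the actual values of stbir__small_float or
  STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT.

```c
// Hypothetical stand-in for the is_gather decision.
// Returns 1 = upsample gather, 2 = downsample gather, 0 = scatter.
static int sketch_pick_gather_mode( float scale, int always_gather, int filter_pixel_width, float small_float, int force_gather_amount )
{
  if ( scale >= ( 1.0f - small_float ) )
    return 1;   // scaling up (or 1:1): always gather
  if ( ( always_gather ) || ( filter_pixel_width <= force_gather_amount ) )
    return 2;   // scaling down, but few enough scanlines to hold in memory
  return 0;     // scaling down with a wide filter: scatter
}
```

  Under these assumptions, horizontal passes (always_gather == 1) never return 0, so only
  the vertical sampler can end up in scatter mode.
*/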
6577static void stbir__get_conservative_extents( stbir__sampler * samp, stbir__contributors * range, void * user_data )
6578{
6579 float scale = samp->scale_info.scale;
6580 float out_shift = samp->scale_info.pixel_shift;
6581 stbir__support_callback * support = samp->filter_support;
6582 int input_full_size = samp->scale_info.input_full_size;
6583 stbir_edge edge = samp->edge;
6584 float inv_scale = samp->scale_info.inv_scale;
6586 STBIR_ASSERT( samp->is_gather != 0 );
6588 if ( samp->is_gather == 1 )
6589 {
6590 int in_first_pixel, in_last_pixel;
6591 float out_filter_radius = support(inv_scale, user_data) * scale;
    stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, 0.5f, out_filter_radius, inv_scale, out_shift, input_full_size, edge );
6594 range->n0 = in_first_pixel;
6595 stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, ( (float)(samp->scale_info.output_sub_size-1) ) + 0.5f, out_filter_radius, inv_scale, out_shift, input_full_size, edge );
6596 range->n1 = in_last_pixel;
6597 }
6598 else if ( samp->is_gather == 2 ) // downsample gather, refine
6599 {
6600 float in_pixels_radius = support(scale, user_data) * inv_scale;
6601 int filter_pixel_margin = samp->filter_pixel_margin;
6602 int output_sub_size = samp->scale_info.output_sub_size;
6603 int input_end;
6604 int n;
6605 int in_first_pixel, in_last_pixel;
6607 // get a conservative area of the input range
6608 stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, 0, 0, inv_scale, out_shift, input_full_size, edge );
6609 range->n0 = in_first_pixel;
6610 stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, (float)output_sub_size, 0, inv_scale, out_shift, input_full_size, edge );
6611 range->n1 = in_last_pixel;
6613 // now go through the margin to the start of area to find bottom
6614 n = range->n0 + 1;
6615 input_end = -filter_pixel_margin;
6616 while( n >= input_end )
6617 {
6618 int out_first_pixel, out_last_pixel;
6619 stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, ((float)n)+0.5f, in_pixels_radius, scale, out_shift, output_sub_size );
6620 if ( out_first_pixel > out_last_pixel )
6621 break;
6623 if ( ( out_first_pixel < output_sub_size ) || ( out_last_pixel >= 0 ) )
6624 range->n0 = n;
6625 --n;
6626 }
6628 // now go through the end of the area through the margin to find top
6629 n = range->n1 - 1;
6630 input_end = n + 1 + filter_pixel_margin;
6631 while( n <= input_end )
6632 {
6633 int out_first_pixel, out_last_pixel;
6634 stbir__calculate_out_pixel_range( &out_first_pixel, &out_last_pixel, ((float)n)+0.5f, in_pixels_radius, scale, out_shift, output_sub_size );
6635 if ( out_first_pixel > out_last_pixel )
6636 break;
6637 if ( ( out_first_pixel < output_sub_size ) || ( out_last_pixel >= 0 ) )
6638 range->n1 = n;
6639 ++n;
6640 }
6641 }
6643 if ( samp->edge == STBIR_EDGE_WRAP )
6644 {
6645 // if we are wrapping, and we are very close to the image size (so the edges might merge), just use the scanline up to the edge
6646 if ( ( range->n0 > 0 ) && ( range->n1 >= input_full_size ) )
6647 {
6648 int marg = range->n1 - input_full_size + 1;
6649 if ( ( marg + STBIR__MERGE_RUNS_PIXEL_THRESHOLD ) >= range->n0 )
6650 range->n0 = 0;
6651 }
6652 if ( ( range->n0 < 0 ) && ( range->n1 < (input_full_size-1) ) )
6653 {
6654 int marg = -range->n0;
6655 if ( ( input_full_size - marg - STBIR__MERGE_RUNS_PIXEL_THRESHOLD - 1 ) <= range->n1 )
6656 range->n1 = input_full_size - 1;
6657 }
6658 }
6659 else
6660 {
6661 // for non-edge-wrap modes, we never read over the edge, so clamp
6662 if ( range->n0 < 0 )
6663 range->n0 = 0;
6664 if ( range->n1 >= input_full_size )
6665 range->n1 = input_full_size - 1;
6666 }
6667}
6669static void stbir__get_split_info( stbir__per_split_info* split_info, int splits, int output_height, int vertical_pixel_margin, int input_full_height )
6670{
6671 int i, cur;
6672 int left = output_height;
6674 cur = 0;
6675 for( i = 0 ; i < splits ; i++ )
6676 {
6677 int each;
6678 split_info[i].start_output_y = cur;
6679 each = left / ( splits - i );
6680 split_info[i].end_output_y = cur + each;
6681 cur += each;
6682 left -= each;
6684 // scatter range (updated to minimum as you run it)
6685 split_info[i].start_input_y = -vertical_pixel_margin;
6686 split_info[i].end_input_y = input_full_height + vertical_pixel_margin;
6687 }
6688}
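/*
  Illustrative sketch (not part of the library): stbir__get_split_info divides the output
  rows by repeatedly giving the next split "remaining rows / remaining splits", so any
  remainder lands in the later splits. The helper below is hypothetical and only reproduces
  that arithmetic with plain arrays instead of stbir__per_split_info.

```c
// Hypothetical helper: fills starts[i]/ends[i] with half-open output row ranges.
static void sketch_split_rows( int * starts, int * ends, int splits, int output_height )
{
  int i, cur = 0, left = output_height;
  for( i = 0 ; i < splits ; i++ )
  {
    int each = left / ( splits - i );   // remaining rows / remaining splits
    starts[i] = cur;
    ends[i] = cur + each;
    cur += each;
    left -= each;
  }
}
```

  For 10 output rows and 3 splits this gives [0,3), [3,6) and [6,10), so the largest
  split has 4 rows.
*/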
6690static void stbir__free_internal_mem( stbir__info *info )
6691{
6692 #define STBIR__FREE_AND_CLEAR( ptr ) { if ( ptr ) { void * p = (ptr); (ptr) = 0; STBIR_FREE( p, info->user_data); } }
6694 if ( info )
6695 {
6696 #ifndef STBIR__SEPARATE_ALLOCATIONS
6697 STBIR__FREE_AND_CLEAR( info->alloced_mem );
6698 #else
6699 int i,j;
6701 if ( ( info->vertical.gather_prescatter_contributors ) && ( (void*)info->vertical.gather_prescatter_contributors != (void*)info->split_info[0].decode_buffer ) )
6702 {
6703 STBIR__FREE_AND_CLEAR( info->vertical.gather_prescatter_coefficients );
6704 STBIR__FREE_AND_CLEAR( info->vertical.gather_prescatter_contributors );
6705 }
6706 for( i = 0 ; i < info->splits ; i++ )
6707 {
6708 for( j = 0 ; j < info->alloc_ring_buffer_num_entries ; j++ )
6709 {
6710 #ifdef STBIR_SIMD8
6711 if ( info->effective_channels == 3 )
6712 --info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer
6713 #endif
6714 STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers[j] );
6715 }
6717 #ifdef STBIR_SIMD8
6718 if ( info->effective_channels == 3 )
6719 --info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer
6720 #endif
6721 STBIR__FREE_AND_CLEAR( info->split_info[i].decode_buffer );
6722 STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers );
6723 STBIR__FREE_AND_CLEAR( info->split_info[i].vertical_buffer );
6724 }
6725 STBIR__FREE_AND_CLEAR( info->split_info );
6726 if ( info->vertical.coefficients != info->horizontal.coefficients )
6727 {
6728 STBIR__FREE_AND_CLEAR( info->vertical.coefficients );
6729 STBIR__FREE_AND_CLEAR( info->vertical.contributors );
6730 }
6731 STBIR__FREE_AND_CLEAR( info->horizontal.coefficients );
6732 STBIR__FREE_AND_CLEAR( info->horizontal.contributors );
6733 STBIR__FREE_AND_CLEAR( info->alloced_mem );
6734 STBIR_FREE( info, info->user_data );
6735 #endif
6736 }
6738 #undef STBIR__FREE_AND_CLEAR
6739}
6741static int stbir__get_max_split( int splits, int height )
6742{
6743 int i;
6744 int max = 0;
6746 for( i = 0 ; i < splits ; i++ )
6747 {
6748 int each = height / ( splits - i );
6749 if ( each > max )
6750 max = each;
6751 height -= each;
6752 }
6753 return max;
6754}
6756static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_n_coeffs_funcs[8] =
6757{
6758 0, stbir__horizontal_gather_1_channels_with_n_coeffs_funcs, stbir__horizontal_gather_2_channels_with_n_coeffs_funcs, stbir__horizontal_gather_3_channels_with_n_coeffs_funcs, stbir__horizontal_gather_4_channels_with_n_coeffs_funcs, 0,0, stbir__horizontal_gather_7_channels_with_n_coeffs_funcs
6759};
6761static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_channels_funcs[8] =
6762{
6763 0, stbir__horizontal_gather_1_channels_funcs, stbir__horizontal_gather_2_channels_funcs, stbir__horizontal_gather_3_channels_funcs, stbir__horizontal_gather_4_channels_funcs, 0,0, stbir__horizontal_gather_7_channels_funcs
6764};
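// there are eight resize classifications (indices into the weights tables): 0 == vertical scatter, 1 == vertical gather <= 1x scale, 2 == vertical gather 1x-2x scale, 3 == vertical gather 2x-3x scale, 4 == not selected by stbir__should_do_vertical_first, 5 == vertical gather 3x-4x scale, 6 == vertical gather > 4x scale, or <=4 pixel output with height < width, 7 == any other <=4 pixel output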
6767#define STBIR_RESIZE_CLASSIFICATIONS 8
6769static float stbir__compute_weights[5][STBIR_RESIZE_CLASSIFICATIONS][4]= // 5 = 0=1chan, 1=2chan, 2=3chan, 3=4chan, 4=7chan
6770{
6771 {
6772 { 1.00000f, 1.00000f, 0.31250f, 1.00000f },
6773 { 0.56250f, 0.59375f, 0.00000f, 0.96875f },
6774 { 1.00000f, 0.06250f, 0.00000f, 1.00000f },
6775 { 0.00000f, 0.09375f, 1.00000f, 1.00000f },
6776 { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6777 { 0.03125f, 0.12500f, 1.00000f, 1.00000f },
6778 { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
6779 { 0.00000f, 1.00000f, 0.00000f, 0.03125f },
6780 }, {
6781 { 0.00000f, 0.84375f, 0.00000f, 0.03125f },
6782 { 0.09375f, 0.93750f, 0.00000f, 0.78125f },
6783 { 0.87500f, 0.21875f, 0.00000f, 0.96875f },
6784 { 0.09375f, 0.09375f, 1.00000f, 1.00000f },
6785 { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6786 { 0.03125f, 0.12500f, 1.00000f, 1.00000f },
6787 { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
6788 { 0.00000f, 1.00000f, 0.00000f, 0.53125f },
6789 }, {
6790 { 0.00000f, 0.53125f, 0.00000f, 0.03125f },
6791 { 0.06250f, 0.96875f, 0.00000f, 0.53125f },
6792 { 0.87500f, 0.18750f, 0.00000f, 0.93750f },
6793 { 0.00000f, 0.09375f, 1.00000f, 1.00000f },
6794 { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6795 { 0.03125f, 0.12500f, 1.00000f, 1.00000f },
6796 { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
6797 { 0.00000f, 1.00000f, 0.00000f, 0.56250f },
6798 }, {
6799 { 0.00000f, 0.50000f, 0.00000f, 0.71875f },
6800 { 0.06250f, 0.84375f, 0.00000f, 0.87500f },
6801 { 1.00000f, 0.50000f, 0.50000f, 0.96875f },
6802 { 1.00000f, 0.09375f, 0.31250f, 0.50000f },
6803 { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6804 { 1.00000f, 0.03125f, 0.03125f, 0.53125f },
6805 { 0.18750f, 0.12500f, 0.00000f, 1.00000f },
6806 { 0.00000f, 1.00000f, 0.03125f, 0.18750f },
6807 }, {
6808 { 0.00000f, 0.59375f, 0.00000f, 0.96875f },
6809 { 0.06250f, 0.81250f, 0.06250f, 0.59375f },
6810 { 0.75000f, 0.43750f, 0.12500f, 0.96875f },
6811 { 0.87500f, 0.06250f, 0.18750f, 0.43750f },
6812 { 1.00000f, 1.00000f, 1.00000f, 1.00000f },
6813 { 0.15625f, 0.12500f, 1.00000f, 1.00000f },
6814 { 0.06250f, 0.12500f, 0.00000f, 1.00000f },
6815 { 0.00000f, 1.00000f, 0.03125f, 0.34375f },
6816 }
6817};
// structure that allows us to query and override info for training the costs
6820typedef struct STBIR__V_FIRST_INFO
6821{
6822 double v_cost, h_cost;
6823 int control_v_first; // 0 = no control, 1 = force hori, 2 = force vert
6824 int v_first;
6825 int v_resize_classification;
6826 int is_gather;
6827} STBIR__V_FIRST_INFO;
6829#ifdef STBIR__V_FIRST_INFO_BUFFER
6830static STBIR__V_FIRST_INFO STBIR__V_FIRST_INFO_BUFFER = {0};
6831#define STBIR__V_FIRST_INFO_POINTER &STBIR__V_FIRST_INFO_BUFFER
6832#else
6833#define STBIR__V_FIRST_INFO_POINTER 0
6834#endif
// Figure out whether to scale along the horizontal or vertical first.
//   This is only *super* important when you are scaling by a massively
//   different amount in the vertical vs the horizontal (for example, if
//   you are scaling by 2x in the width, and 0.5x in the height, then you
//   want to do the vertical scale first, because it's around 3x faster
//   in that order).
//
//   In more normal circumstances, this makes a 20-40% difference, so
//   it's good to get right, but not critical. The normal way that you
//   decide which direction goes first is just figuring out which
//   direction does more multiplies. But modern CPUs, with their
//   fancy caches and SIMD and high IPC abilities, mean there's a lot
//   more that goes into it.
6849//
6850// My handwavy sort of solution is to have an app that does a whole
6851// bunch of timing for both vertical and horizontal first modes,
6852// and then another app that can read lots of these timing files
6853// and try to search for the best weights to use. Dotimings.c
6854// is the app that does a bunch of timings, and vf_train.c is the
6855// app that solves for the best weights (and shows how well it
6856// does currently).
6858static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLASSIFICATIONS][4], int horizontal_filter_pixel_width, float horizontal_scale, int horizontal_output_size, int vertical_filter_pixel_width, float vertical_scale, int vertical_output_size, int is_gather, STBIR__V_FIRST_INFO * info )
6859{
6860 double v_cost, h_cost;
6861 float * weights;
6862 int vertical_first;
6863 int v_classification;
6865 // categorize the resize into buckets
6866 if ( ( vertical_output_size <= 4 ) || ( horizontal_output_size <= 4 ) )
6867 v_classification = ( vertical_output_size < horizontal_output_size ) ? 6 : 7;
6868 else if ( vertical_scale <= 1.0f )
6869 v_classification = ( is_gather ) ? 1 : 0;
6870 else if ( vertical_scale <= 2.0f)
6871 v_classification = 2;
6872 else if ( vertical_scale <= 3.0f)
6873 v_classification = 3;
6874 else if ( vertical_scale <= 4.0f)
6875 v_classification = 5;
6876 else
6877 v_classification = 6;
6879 // use the right weights
6880 weights = weights_table[ v_classification ];
  // these are the costs when you don't take into account modern CPUs with high ipc and simd and caches - wish we had a better estimate
6883 h_cost = (float)horizontal_filter_pixel_width * weights[0] + horizontal_scale * (float)vertical_filter_pixel_width * weights[1];
6884 v_cost = (float)vertical_filter_pixel_width * weights[2] + vertical_scale * (float)horizontal_filter_pixel_width * weights[3];
6886 // use computation estimate to decide vertical first or not
6887 vertical_first = ( v_cost <= h_cost ) ? 1 : 0;
6889 // save these, if requested
6890 if ( info )
6891 {
6892 info->h_cost = h_cost;
6893 info->v_cost = v_cost;
6894 info->v_resize_classification = v_classification;
6895 info->v_first = vertical_first;
6896 info->is_gather = is_gather;
6897 }
6899 // and this allows us to override everything for testing (see dotiming.c)
6900 if ( ( info ) && ( info->control_v_first ) )
6901 vertical_first = ( info->control_v_first == 2 ) ? 1 : 0;
6903 return vertical_first;
6904}
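/*
  Illustrative sketch (not part of the library): the decision in
  stbir__should_do_vertical_first has two steps, picking a classification bucket and then
  comparing two weighted cost estimates. The function names are hypothetical; the bucket
  numbers and cost formulas are copied from the function above, and the weights are
  supplied by the caller rather than taken from stbir__compute_weights.

```c
// Hypothetical mirror of the bucketing step.
static int sketch_classify( float vertical_scale, int vertical_output_size, int horizontal_output_size, int is_gather )
{
  if ( ( vertical_output_size <= 4 ) || ( horizontal_output_size <= 4 ) )
    return ( vertical_output_size < horizontal_output_size ) ? 6 : 7;
  if ( vertical_scale <= 1.0f ) return is_gather ? 1 : 0;
  if ( vertical_scale <= 2.0f ) return 2;
  if ( vertical_scale <= 3.0f ) return 3;
  if ( vertical_scale <= 4.0f ) return 5;
  return 6;
}

// Hypothetical mirror of the cost comparison: returns 1 when the weighted
// vertical-first cost is no larger than the horizontal-first cost.
static int sketch_vertical_first( float const * weights, int h_filter_width, float h_scale, int v_filter_width, float v_scale )
{
  double h_cost = (float)h_filter_width * weights[0] + h_scale * (float)v_filter_width * weights[1];
  double v_cost = (float)v_filter_width * weights[2] + v_scale * (float)h_filter_width * weights[3];
  return ( v_cost <= h_cost ) ? 1 : 0;
}
```

  With all four weights set to 1 (an assumption for illustration, not the trained values),
  a 2x horizontal / 0.5x vertical resize with filter widths 4 and 8 gives h_cost = 20 and
  v_cost = 10, so the vertical pass would go first.
*/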
6906// layout lookups - must match stbir_internal_pixel_layout
6907static unsigned char stbir__pixel_channels[] = {
6908 1,2,3,3,4, // 1ch, 2ch, rgb, bgr, 4ch
6909 4,4,4,4,2,2, // RGBA,BGRA,ARGB,ABGR,RA,AR
6910 4,4,4,4,2,2, // RGBA_PM,BGRA_PM,ARGB_PM,ABGR_PM,RA_PM,AR_PM
6911};
6913// the internal pixel layout enums are in a different order, so we can easily do range comparisons of types
6914// the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible
6915static stbir_internal_pixel_layout stbir__pixel_layout_convert_public_to_internal[] = {
6916 STBIRI_BGR, STBIRI_1CHANNEL, STBIRI_2CHANNEL, STBIRI_RGB, STBIRI_RGBA,
6917 STBIRI_4CHANNEL, STBIRI_BGRA, STBIRI_ARGB, STBIRI_ABGR, STBIRI_RA, STBIRI_AR,
6918 STBIRI_RGBA_PM, STBIRI_BGRA_PM, STBIRI_ARGB_PM, STBIRI_ABGR_PM, STBIRI_RA_PM, STBIRI_AR_PM,
6919};
6921static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sampler * horizontal, stbir__sampler * vertical, stbir__contributors * conservative, stbir_pixel_layout input_pixel_layout_public, stbir_pixel_layout output_pixel_layout_public, int splits, int new_x, int new_y, int fast_alpha, void * user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO )
6922{
6923 static char stbir_channel_count_index[8]={ 9,0,1,2, 3,9,9,4 };
6925 stbir__info * info = 0;
6926 void * alloced = 0;
6927 size_t alloced_total = 0;
6928 int vertical_first;
6929 int decode_buffer_size, ring_buffer_length_bytes, ring_buffer_size, vertical_buffer_size, alloc_ring_buffer_num_entries;
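  int alpha_weighting_type = 0; // 0=none, 1=simple weight on input only (output stays premultiplied), 2=fancy weight/unweight, 3=simple unweight on output only (input is premultiplied), 4=simple (fast) weight/unweight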
6932 int conservative_split_output_size = stbir__get_max_split( splits, vertical->scale_info.output_sub_size );
6933 stbir_internal_pixel_layout input_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ input_pixel_layout_public ];
6934 stbir_internal_pixel_layout output_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ output_pixel_layout_public ];
6935 int channels = stbir__pixel_channels[ input_pixel_layout ];
6936 int effective_channels = channels;
6938 // first figure out what type of alpha weighting to use (if any)
6939 if ( ( horizontal->filter_enum != STBIR_FILTER_POINT_SAMPLE ) || ( vertical->filter_enum != STBIR_FILTER_POINT_SAMPLE ) ) // no alpha weighting on point sampling
6940 {
6941 if ( ( input_pixel_layout >= STBIRI_RGBA ) && ( input_pixel_layout <= STBIRI_AR ) && ( output_pixel_layout >= STBIRI_RGBA ) && ( output_pixel_layout <= STBIRI_AR ) )
6942 {
6943 if ( fast_alpha )
6944 {
6945 alpha_weighting_type = 4;
6946 }
6947 else
6948 {
6949 static int fancy_alpha_effective_cnts[6] = { 7, 7, 7, 7, 3, 3 };
6950 alpha_weighting_type = 2;
6951 effective_channels = fancy_alpha_effective_cnts[ input_pixel_layout - STBIRI_RGBA ];
6952 }
6953 }
6954 else if ( ( input_pixel_layout >= STBIRI_RGBA_PM ) && ( input_pixel_layout <= STBIRI_AR_PM ) && ( output_pixel_layout >= STBIRI_RGBA ) && ( output_pixel_layout <= STBIRI_AR ) )
6955 {
6956 // input premult, output non-premult
6957 alpha_weighting_type = 3;
6958 }
6959 else if ( ( input_pixel_layout >= STBIRI_RGBA ) && ( input_pixel_layout <= STBIRI_AR ) && ( output_pixel_layout >= STBIRI_RGBA_PM ) && ( output_pixel_layout <= STBIRI_AR_PM ) )
6960 {
6961 // input non-premult, output premult
6962 alpha_weighting_type = 1;
6963 }
6964 }
6966 // channel in and out count must match currently
6967 if ( channels != stbir__pixel_channels[ output_pixel_layout ] )
6968 return 0;
6970 // get vertical first
6971 vertical_first = stbir__should_do_vertical_first( stbir__compute_weights[ (int)stbir_channel_count_index[ effective_channels ] ], horizontal->filter_pixel_width, horizontal->scale_info.scale, horizontal->scale_info.output_sub_size, vertical->filter_pixel_width, vertical->scale_info.scale, vertical->scale_info.output_sub_size, vertical->is_gather, STBIR__V_FIRST_INFO_POINTER );
  // some of the unrolled loops sometimes read one float past the end (with a zero coefficient weight, so it has no effect)
6974 decode_buffer_size = ( conservative->n1 - conservative->n0 + 1 ) * effective_channels * sizeof(float) + sizeof(float); // extra float for padding
6976#if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8)
6977 if ( effective_channels == 3 )
6978 decode_buffer_size += sizeof(float); // avx in 3 channel mode needs one float at the start of the buffer (only with separate allocations)
6979#endif
6981 ring_buffer_length_bytes = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float); // extra float for padding
6983 // if we do vertical first, the ring buffer holds a whole decoded line
6984 if ( vertical_first )
6985 ring_buffer_length_bytes = ( decode_buffer_size + 15 ) & ~15;
6987 if ( ( ring_buffer_length_bytes & 4095 ) == 0 ) ring_buffer_length_bytes += 64*3; // avoid 4k alias
6989 // One extra entry because floating point precision problems sometimes cause an extra to be necessary.
6990 alloc_ring_buffer_num_entries = vertical->filter_pixel_width + 1;
6992 // we never need more ring buffer entries than the scanlines we're outputting when in scatter mode
6993 if ( ( !vertical->is_gather ) && ( alloc_ring_buffer_num_entries > conservative_split_output_size ) )
6994 alloc_ring_buffer_num_entries = conservative_split_output_size;
6996 ring_buffer_size = alloc_ring_buffer_num_entries * ring_buffer_length_bytes;
6998 // The vertical buffer is used differently, depending on whether we are scattering
6999 // the vertical scanlines, or gathering them.
  //   If scattering, it's used as the temp buffer to accumulate each output.
7001 // If gathering, it's just the output buffer.
7002 vertical_buffer_size = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float); // extra float for padding
7004 // we make two passes through this loop, 1st to add everything up, 2nd to allocate and init
7005 for(;;)
7006 {
7007 int i;
7008 void * advance_mem = alloced;
7009 int copy_horizontal = 0;
7010 stbir__sampler * possibly_use_horizontal_for_pivot = 0;
7012#ifdef STBIR__SEPARATE_ALLOCATIONS
7013 #define STBIR__NEXT_PTR( ptr, size, ntype ) if ( alloced ) { void * p = STBIR_MALLOC( size, user_data); if ( p == 0 ) { stbir__free_internal_mem( info ); return 0; } (ptr) = (ntype*)p; }
7014#else
7015 #define STBIR__NEXT_PTR( ptr, size, ntype ) advance_mem = (void*) ( ( ((size_t)advance_mem) + 15 ) & ~15 ); if ( alloced ) ptr = (ntype*)advance_mem; advance_mem = ((char*)advance_mem) + (size);
7016#endif
7018 STBIR__NEXT_PTR( info, sizeof( stbir__info ), stbir__info );
7020 STBIR__NEXT_PTR( info->split_info, sizeof( stbir__per_split_info ) * splits, stbir__per_split_info );
7022 if ( info )
7023 {
      static stbir__alpha_weight_func * fancy_alpha_weights[6] = { stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_2ch, stbir__fancy_alpha_weight_2ch };
      static stbir__alpha_unweight_func * fancy_alpha_unweights[6] = { stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_4ch, stbir__fancy_alpha_unweight_2ch, stbir__fancy_alpha_unweight_2ch };
      static stbir__alpha_weight_func * simple_alpha_weights[6] = { stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_2ch, stbir__simple_alpha_weight_2ch };
      static stbir__alpha_unweight_func * simple_alpha_unweights[6] = { stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_4ch, stbir__simple_alpha_unweight_2ch, stbir__simple_alpha_unweight_2ch };

      // initialize info fields
      info->alloced_mem = alloced;
      info->alloced_total = alloced_total;

      info->channels = channels;
      info->effective_channels = effective_channels;

      info->offset_x = new_x;
      info->offset_y = new_y;
      info->alloc_ring_buffer_num_entries = alloc_ring_buffer_num_entries;
      info->ring_buffer_num_entries = 0;
      info->ring_buffer_length_bytes = ring_buffer_length_bytes;
      info->splits = splits;
      info->vertical_first = vertical_first;

      info->input_pixel_layout_internal = input_pixel_layout;
      info->output_pixel_layout_internal = output_pixel_layout;

      // setup alpha weight functions
      info->alpha_weight = 0;
      info->alpha_unweight = 0;

      // handle alpha weighting functions and overrides
      if ( alpha_weighting_type == 2 )
      {
        // high quality alpha multiplying on the way in, dividing on the way out
        info->alpha_weight = fancy_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
        info->alpha_unweight = fancy_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
      }
      else if ( alpha_weighting_type == 4 )
      {
        // fast alpha multiplying on the way in, dividing on the way out
        info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
        info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
      }
      else if ( alpha_weighting_type == 1 )
      {
        // fast alpha on the way in, leave in premultiplied form on way out
        info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
      }
      else if ( alpha_weighting_type == 3 )
      {
        // incoming is premultiplied, fast alpha dividing on the way out - non-premultiplied output
        info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
      }

      // handle 3-chan color flipping, using the alpha weight path
      if ( ( ( input_pixel_layout == STBIRI_RGB ) && ( output_pixel_layout == STBIRI_BGR ) ) ||
           ( ( input_pixel_layout == STBIRI_BGR ) && ( output_pixel_layout == STBIRI_RGB ) ) )
      {
        // do the flipping on the smaller of the two ends
        if ( horizontal->scale_info.scale < 1.0f )
          info->alpha_unweight = stbir__simple_flip_3ch;
        else
          info->alpha_weight = stbir__simple_flip_3ch;
      }

    }
    // get all the per-split buffers
    for( i = 0 ; i < splits ; i++ )
    {
      STBIR__NEXT_PTR( info->split_info[i].decode_buffer, decode_buffer_size, float );

#ifdef STBIR__SEPARATE_ALLOCATIONS

      #ifdef STBIR_SIMD8
      if ( ( info ) && ( effective_channels == 3 ) )
        ++info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer
      #endif

      STBIR__NEXT_PTR( info->split_info[i].ring_buffers, alloc_ring_buffer_num_entries * sizeof(float*), float* );
      {
        int j;
        for( j = 0 ; j < alloc_ring_buffer_num_entries ; j++ )
        {
          STBIR__NEXT_PTR( info->split_info[i].ring_buffers[j], ring_buffer_length_bytes, float );
          #ifdef STBIR_SIMD8
          if ( ( info ) && ( effective_channels == 3 ) )
            ++info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer
          #endif
        }
      }
#else
      STBIR__NEXT_PTR( info->split_info[i].ring_buffer, ring_buffer_size, float );
#endif
      STBIR__NEXT_PTR( info->split_info[i].vertical_buffer, vertical_buffer_size, float );
    }
    // alloc memory for to-be-pivoted coeffs (if necessary)
    if ( vertical->is_gather == 0 )
    {
      int both;
      int temp_mem_amt;

      // when in vertical scatter mode, we first build the coefficients in gather mode, and then pivot after,
      //   that means we need two buffers, so we try to use the decode buffer and ring buffer for this. if that
      //   is too small, we just allocate extra memory to use as this temp.

      both = vertical->gather_prescatter_contributors_size + vertical->gather_prescatter_coefficients_size;

#ifdef STBIR__SEPARATE_ALLOCATIONS
      temp_mem_amt = decode_buffer_size;

      #ifdef STBIR_SIMD8
      if ( effective_channels == 3 )
        --temp_mem_amt; // avx in 3 channel mode needs one float at the start of the buffer
      #endif
#else
      temp_mem_amt = ( decode_buffer_size + ring_buffer_size + vertical_buffer_size ) * splits;
#endif
      if ( temp_mem_amt >= both )
      {
        if ( info )
        {
          vertical->gather_prescatter_contributors = (stbir__contributors*)info->split_info[0].decode_buffer;
          vertical->gather_prescatter_coefficients = (float*) ( ( (char*)info->split_info[0].decode_buffer ) + vertical->gather_prescatter_contributors_size );
        }
      }
      else
      {
        // ring+decode memory is too small, so allocate temp memory
        STBIR__NEXT_PTR( vertical->gather_prescatter_contributors, vertical->gather_prescatter_contributors_size, stbir__contributors );
        STBIR__NEXT_PTR( vertical->gather_prescatter_coefficients, vertical->gather_prescatter_coefficients_size, float );
      }
    }

    STBIR__NEXT_PTR( horizontal->contributors, horizontal->contributors_size, stbir__contributors );
    STBIR__NEXT_PTR( horizontal->coefficients, horizontal->coefficients_size, float );

    // are the two filters identical?? (happens a lot with mipmap generation)
    if ( ( horizontal->filter_kernel == vertical->filter_kernel ) && ( horizontal->filter_support == vertical->filter_support ) && ( horizontal->edge == vertical->edge ) && ( horizontal->scale_info.output_sub_size == vertical->scale_info.output_sub_size ) )
    {
      float diff_scale = horizontal->scale_info.scale - vertical->scale_info.scale;
      float diff_shift = horizontal->scale_info.pixel_shift - vertical->scale_info.pixel_shift;
      if ( diff_scale < 0.0f ) diff_scale = -diff_scale;
      if ( diff_shift < 0.0f ) diff_shift = -diff_shift;
      if ( ( diff_scale <= stbir__small_float ) && ( diff_shift <= stbir__small_float ) )
      {
        if ( horizontal->is_gather == vertical->is_gather )
        {
          copy_horizontal = 1;
          goto no_vert_alloc;
        }
        // everything matches, but vertical is scatter, horizontal is gather, use horizontal coeffs for vertical pivot coeffs
        possibly_use_horizontal_for_pivot = horizontal;
      }
    }

    STBIR__NEXT_PTR( vertical->contributors, vertical->contributors_size, stbir__contributors );
    STBIR__NEXT_PTR( vertical->coefficients, vertical->coefficients_size, float );

   no_vert_alloc:

    if ( info )
    {
      STBIR_PROFILE_BUILD_START( horizontal );

      stbir__calculate_filters( horizontal, 0, user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );

      // setup the horizontal gather functions
      // start with defaulting to the n_coeffs functions (specialized on channels and remnant leftover)
      info->horizontal_gather_channels = stbir__horizontal_gather_n_coeffs_funcs[ effective_channels ][ horizontal->extent_info.widest & 3 ];
      // but if the number of coeffs <= 12, use another set of special cases. <=12 coeffs is any enlarging resize, or shrinking resize down to about 1/3 size
      if ( horizontal->extent_info.widest <= 12 )
        info->horizontal_gather_channels = stbir__horizontal_gather_channels_funcs[ effective_channels ][ horizontal->extent_info.widest - 1 ];

      info->scanline_extents.conservative.n0 = conservative->n0;
      info->scanline_extents.conservative.n1 = conservative->n1;

      // get exact extents
      stbir__get_extents( horizontal, &info->scanline_extents );

      // pack the horizontal coeffs
      horizontal->coefficient_width = stbir__pack_coefficients(horizontal->num_contributors, horizontal->contributors, horizontal->coefficients, horizontal->coefficient_width, horizontal->extent_info.widest, info->scanline_extents.conservative.n0, info->scanline_extents.conservative.n1 );

      STBIR_MEMCPY( &info->horizontal, horizontal, sizeof( stbir__sampler ) );

      STBIR_PROFILE_BUILD_END( horizontal );

      if ( copy_horizontal )
      {
        STBIR_MEMCPY( &info->vertical, horizontal, sizeof( stbir__sampler ) );
      }
      else
      {
        STBIR_PROFILE_BUILD_START( vertical );

        stbir__calculate_filters( vertical, possibly_use_horizontal_for_pivot, user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );
        STBIR_MEMCPY( &info->vertical, vertical, sizeof( stbir__sampler ) );

        STBIR_PROFILE_BUILD_END( vertical );
      }

      // setup the vertical split ranges
      stbir__get_split_info( info->split_info, info->splits, info->vertical.scale_info.output_sub_size, info->vertical.filter_pixel_margin, info->vertical.scale_info.input_full_size );

      // now we know precisely how many entries we need
      info->ring_buffer_num_entries = info->vertical.extent_info.widest;

      // we never need more ring buffer entries than the scanlines we're outputting
      if ( ( !info->vertical.is_gather ) && ( info->ring_buffer_num_entries > conservative_split_output_size ) )
        info->ring_buffer_num_entries = conservative_split_output_size;
      STBIR_ASSERT( info->ring_buffer_num_entries <= info->alloc_ring_buffer_num_entries );

      // a few of the horizontal gather functions read past the end of the decode (but mask it out),
      //   so put in normal values so no snans or denormals accidentally sneak in (also, in the ring
      //   buffer for vertical first)
      for( i = 0 ; i < splits ; i++ )
      {
        int t, ofs, start;

        ofs = decode_buffer_size / 4;

        #if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8)
        if ( effective_channels == 3 )
          --ofs; // avx in 3 channel mode needs one float at the start of the buffer, so we snap back for clearing
        #endif

        start = ofs - 4;
        if ( start < 0 ) start = 0;

        for( t = start ; t < ofs; t++ )
          info->split_info[i].decode_buffer[ t ] = 9999.0f;

        if ( vertical_first )
        {
          int j;
          for( j = 0; j < info->ring_buffer_num_entries ; j++ )
          {
            for( t = start ; t < ofs; t++ )
              stbir__get_ring_buffer_entry( info, info->split_info + i, j )[ t ] = 9999.0f;
          }
        }
      }
    }

    #undef STBIR__NEXT_PTR


    // is this the first time through loop?
    if ( info == 0 )
    {
      alloced_total = ( 15 + (size_t)advance_mem );
      alloced = STBIR_MALLOC( alloced_total, user_data );
      if ( alloced == 0 )
        return 0;
    }
    else
      return info;  // success
  }
}
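
/*
  EXAMPLE (illustrative sketch, not part of the library)

  The builder above runs its layout code twice: a first pass with info == 0 that only
  advances a running position to measure how much memory is needed, then, after a single
  STBIR_MALLOC of that size plus 15 spare bytes, a second pass that hands out real
  pointers. The standalone sketch below shows the same measure-then-carve idea. Every
  name in it (demo_layout, demo_next_ofs, demo_build) is hypothetical, and the 16-byte
  rounding of each sub-buffer is an assumption made for the sketch, not a statement
  about what STBIR__NEXT_PTR does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct demo_layout
{
  float * decode;      // first sub-buffer
  float * ring;        // second sub-buffer
  size_t  total_bytes; // size of the single backing allocation
  void  * block;       // the allocation itself (pass this to free)
} demo_layout;

// round the running offset up to 16 bytes, return it, and advance past `bytes`
static size_t demo_next_ofs( size_t * cursor, size_t bytes )
{
  size_t ofs = ( *cursor + 15 ) & ~(size_t)15;
  *cursor = ofs + bytes;
  return ofs;
}

// pass 0 measures and allocates, pass 1 carves; returns 1 on success, 0 on failure
static int demo_build( demo_layout * out, size_t decode_floats, size_t ring_floats )
{
  char * base = 0;
  int pass;

  assert( out != 0 );

  for( pass = 0 ; pass < 2 ; pass++ )
  {
    size_t cursor = 0;
    size_t decode_ofs = demo_next_ofs( &cursor, decode_floats * sizeof(float) );
    size_t ring_ofs   = demo_next_ofs( &cursor, ring_floats * sizeof(float) );

    if ( pass == 0 )
    {
      // 15 spare bytes so the base address can be aligned by hand
      out->total_bytes = cursor + 15;
      out->block = malloc( out->total_bytes );
      if ( out->block == 0 )
        return 0;
      base = (char*)( ( (uintptr_t)out->block + 15 ) & ~(uintptr_t)15 );
    }
    else
    {
      out->decode = (float*)( base + decode_ofs );
      out->ring   = (float*)( base + ring_ofs );
    }
  }
  return 1;
}
```

  Because both passes execute the same sequence of demo_next_ofs calls, the measured
  size and the carved layout cannot drift apart, which is the main reason to structure
  the code this way. The caller releases everything with one free( layout.block ).
*/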

static int stbir__perform_resize( stbir__info const * info, int split_start, int split_count )
{
  stbir__per_split_info * split_info = info->split_info + split_start;

  STBIR_PROFILE_CLEAR_EXTRAS();

  STBIR_PROFILE_FIRST_START( looping );
  if (info->vertical.is_gather)
    stbir__vertical_gather_loop( info, split_info, split_count );
  else
    stbir__vertical_scatter_loop( info, split_info, split_count );
  STBIR_PROFILE_END( looping );

  return 1;
}

static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * resize )
{
  static stbir__decode_pixels_func * decode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
  {
    /* 1ch-4ch */ stbir__decode_uint8_srgb, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear,
  };

  static stbir__decode_pixels_func * decode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
  {
    { /* RGBA */ stbir__decode_uint8_srgb4_linearalpha, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear },
    { /* BGRA */ stbir__decode_uint8_srgb4_linearalpha_BGRA, stbir__decode_uint8_srgb_BGRA, 0, stbir__decode_float_linear_BGRA, stbir__decode_half_float_linear_BGRA },
    { /* ARGB */ stbir__decode_uint8_srgb4_linearalpha_ARGB, stbir__decode_uint8_srgb_ARGB, 0, stbir__decode_float_linear_ARGB, stbir__decode_half_float_linear_ARGB },
    { /* ABGR */ stbir__decode_uint8_srgb4_linearalpha_ABGR, stbir__decode_uint8_srgb_ABGR, 0, stbir__decode_float_linear_ABGR, stbir__decode_half_float_linear_ABGR },
    { /* RA */ stbir__decode_uint8_srgb2_linearalpha, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear },
    { /* AR */ stbir__decode_uint8_srgb2_linearalpha_AR, stbir__decode_uint8_srgb_AR, 0, stbir__decode_float_linear_AR, stbir__decode_half_float_linear_AR },
  };

  static stbir__decode_pixels_func * decode_simple_scaled_or_not[2][2]=
  {
    { stbir__decode_uint8_linear_scaled, stbir__decode_uint8_linear }, { stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear },
  };

  static stbir__decode_pixels_func * decode_alphas_scaled_or_not[STBIRI_AR-STBIRI_RGBA+1][2][2]=
  {
    { /* RGBA */ { stbir__decode_uint8_linear_scaled, stbir__decode_uint8_linear }, { stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear } },
    { /* BGRA */ { stbir__decode_uint8_linear_scaled_BGRA, stbir__decode_uint8_linear_BGRA }, { stbir__decode_uint16_linear_scaled_BGRA, stbir__decode_uint16_linear_BGRA } },
    { /* ARGB */ { stbir__decode_uint8_linear_scaled_ARGB, stbir__decode_uint8_linear_ARGB }, { stbir__decode_uint16_linear_scaled_ARGB, stbir__decode_uint16_linear_ARGB } },
    { /* ABGR */ { stbir__decode_uint8_linear_scaled_ABGR, stbir__decode_uint8_linear_ABGR }, { stbir__decode_uint16_linear_scaled_ABGR, stbir__decode_uint16_linear_ABGR } },
    { /* RA */ { stbir__decode_uint8_linear_scaled, stbir__decode_uint8_linear }, { stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear } },
    { /* AR */ { stbir__decode_uint8_linear_scaled_AR, stbir__decode_uint8_linear_AR }, { stbir__decode_uint16_linear_scaled_AR, stbir__decode_uint16_linear_AR } }
  };

  static stbir__encode_pixels_func * encode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
  {
    /* 1ch-4ch */ stbir__encode_uint8_srgb, stbir__encode_uint8_srgb, 0, stbir__encode_float_linear, stbir__encode_half_float_linear,
  };

  static stbir__encode_pixels_func * encode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
  {
    { /* RGBA */ stbir__encode_uint8_srgb4_linearalpha, stbir__encode_uint8_srgb, 0, stbir__encode_float_linear, stbir__encode_half_float_linear },
    { /* BGRA */ stbir__encode_uint8_srgb4_linearalpha_BGRA, stbir__encode_uint8_srgb_BGRA, 0, stbir__encode_float_linear_BGRA, stbir__encode_half_float_linear_BGRA },
    { /* ARGB */ stbir__encode_uint8_srgb4_linearalpha_ARGB, stbir__encode_uint8_srgb_ARGB, 0, stbir__encode_float_linear_ARGB, stbir__encode_half_float_linear_ARGB },
    { /* ABGR */ stbir__encode_uint8_srgb4_linearalpha_ABGR, stbir__encode_uint8_srgb_ABGR, 0, stbir__encode_float_linear_ABGR, stbir__encode_half_float_linear_ABGR },
    { /* RA */ stbir__encode_uint8_srgb2_linearalpha, stbir__encode_uint8_srgb, 0, stbir__encode_float_linear, stbir__encode_half_float_linear },
    { /* AR */ stbir__encode_uint8_srgb2_linearalpha_AR, stbir__encode_uint8_srgb_AR, 0, stbir__encode_float_linear_AR, stbir__encode_half_float_linear_AR }
  };

  static stbir__encode_pixels_func * encode_simple_scaled_or_not[2][2]=
  {
    { stbir__encode_uint8_linear_scaled, stbir__encode_uint8_linear }, { stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear },
  };

  static stbir__encode_pixels_func * encode_alphas_scaled_or_not[STBIRI_AR-STBIRI_RGBA+1][2][2]=
  {
    { /* RGBA */ { stbir__encode_uint8_linear_scaled, stbir__encode_uint8_linear }, { stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear } },
    { /* BGRA */ { stbir__encode_uint8_linear_scaled_BGRA, stbir__encode_uint8_linear_BGRA }, { stbir__encode_uint16_linear_scaled_BGRA, stbir__encode_uint16_linear_BGRA } },
    { /* ARGB */ { stbir__encode_uint8_linear_scaled_ARGB, stbir__encode_uint8_linear_ARGB }, { stbir__encode_uint16_linear_scaled_ARGB, stbir__encode_uint16_linear_ARGB } },
    { /* ABGR */ { stbir__encode_uint8_linear_scaled_ABGR, stbir__encode_uint8_linear_ABGR }, { stbir__encode_uint16_linear_scaled_ABGR, stbir__encode_uint16_linear_ABGR } },
    { /* RA */ { stbir__encode_uint8_linear_scaled, stbir__encode_uint8_linear }, { stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear } },
    { /* AR */ { stbir__encode_uint8_linear_scaled_AR, stbir__encode_uint8_linear_AR }, { stbir__encode_uint16_linear_scaled_AR, stbir__encode_uint16_linear_AR } }
  };

  stbir__decode_pixels_func * decode_pixels = 0;
  stbir__encode_pixels_func * encode_pixels = 0;
  stbir_datatype input_type, output_type;

  input_type = resize->input_data_type;
  output_type = resize->output_data_type;
  info->input_data = resize->input_pixels;
  info->input_stride_bytes = resize->input_stride_in_bytes;
  info->output_stride_bytes = resize->output_stride_in_bytes;

  // if we're completely point sampling, then we can turn off SRGB
  if ( ( info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( info->vertical.filter_enum == STBIR_FILTER_POINT_SAMPLE ) )
  {
    if ( ( ( input_type == STBIR_TYPE_UINT8_SRGB ) || ( input_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) &&
         ( ( output_type == STBIR_TYPE_UINT8_SRGB ) || ( output_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) )
    {
      input_type = STBIR_TYPE_UINT8;
      output_type = STBIR_TYPE_UINT8;
    }
  }

  // recalc the output and input strides
  if ( info->input_stride_bytes == 0 )
    info->input_stride_bytes = info->channels * info->horizontal.scale_info.input_full_size * stbir__type_size[input_type];

  if ( info->output_stride_bytes == 0 )
    info->output_stride_bytes = info->channels * info->horizontal.scale_info.output_sub_size * stbir__type_size[output_type];

  // calc offset
  info->output_data = ( (char*) resize->output_pixels ) + ( (size_t) info->offset_y * (size_t) resize->output_stride_in_bytes ) + ( info->offset_x * info->channels * stbir__type_size[output_type] );

  info->in_pixels_cb = resize->input_cb;
  info->user_data = resize->user_data;
  info->out_pixels_cb = resize->output_cb;

  // setup the input format converters
  if ( ( input_type == STBIR_TYPE_UINT8 ) || ( input_type == STBIR_TYPE_UINT16 ) )
  {
    int non_scaled = 0;

    // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16)
    if ( ( !info->alpha_weight ) && ( !info->alpha_unweight ) ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual)
      if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) )
        non_scaled = 1;

    if ( info->input_pixel_layout_internal <= STBIRI_4CHANNEL )
      decode_pixels = decode_simple_scaled_or_not[ input_type == STBIR_TYPE_UINT16 ][ non_scaled ];
    else
      decode_pixels = decode_alphas_scaled_or_not[ ( info->input_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ input_type == STBIR_TYPE_UINT16 ][ non_scaled ];
  }
  else
  {
    if ( info->input_pixel_layout_internal <= STBIRI_4CHANNEL )
      decode_pixels = decode_simple[ input_type - STBIR_TYPE_UINT8_SRGB ];
    else
      decode_pixels = decode_alphas[ ( info->input_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ input_type - STBIR_TYPE_UINT8_SRGB ];
  }

  // setup the output format converters
  if ( ( output_type == STBIR_TYPE_UINT8 ) || ( output_type == STBIR_TYPE_UINT16 ) )
  {
    int non_scaled = 0;

    // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16)
    if ( ( !info->alpha_weight ) && ( !info->alpha_unweight ) ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual)
      if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) )
        non_scaled = 1;

    if ( info->output_pixel_layout_internal <= STBIRI_4CHANNEL )
      encode_pixels = encode_simple_scaled_or_not[ output_type == STBIR_TYPE_UINT16 ][ non_scaled ];
    else
      encode_pixels = encode_alphas_scaled_or_not[ ( info->output_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ output_type == STBIR_TYPE_UINT16 ][ non_scaled ];
  }
  else
  {
    if ( info->output_pixel_layout_internal <= STBIRI_4CHANNEL )
      encode_pixels = encode_simple[ output_type - STBIR_TYPE_UINT8_SRGB ];
    else
      encode_pixels = encode_alphas[ ( info->output_pixel_layout_internal - STBIRI_RGBA ) % ( STBIRI_AR-STBIRI_RGBA+1 ) ][ output_type - STBIR_TYPE_UINT8_SRGB ];
  }

  info->input_type = input_type;
  info->output_type = output_type;
  info->decode_pixels = decode_pixels;
  info->encode_pixels = encode_pixels;
}
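
/*
  EXAMPLE (illustrative sketch, not part of the library)

  The function above treats a stride of 0 as "rows are packed": it substitutes
  channels * width * bytes-per-channel, using the full input width for the input stride
  and the output sub-rect width for the output stride. The helper below restates that
  rule on its own. The name demo_effective_stride and its parameter list are
  hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

// a stride of 0 means "packed", so compute the natural row size; otherwise keep it
static size_t demo_effective_stride( size_t stride_in_bytes, int channels, int width, size_t bytes_per_channel )
{
  assert( ( channels > 0 ) && ( width > 0 ) );
  if ( stride_in_bytes == 0 )
    return (size_t)channels * (size_t)width * bytes_per_channel;
  return stride_in_bytes;
}
```

  For a packed 640-pixel-wide RGBA row of 8-bit channels this gives 2560 bytes; an
  explicit stride such as 4096 is passed through unchanged.
*/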

static void stbir__clip( int * outx, int * outsubw, int outw, double * u0, double * u1 )
{
  double per, adj;
  int over;

  // do left/top edge
  if ( *outx < 0 )
  {
    per = ( (double)*outx ) / ( (double)*outsubw ); // is negative
    adj = per * ( *u1 - *u0 );
    *u0 -= adj; // increases u0
    *outx = 0;
  }

  // do right/bot edge
  over = outw - ( *outx + *outsubw );
  if ( over < 0 )
  {
    per = ( (double)over ) / ( (double)*outsubw ); // is negative
    adj = per * ( *u1 - *u0 );
    *u1 += adj; // decrease u1
    *outsubw = outw - *outx;
  }
}
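
/*
  EXAMPLE (illustrative sketch, not part of the library)

  A worked case for the right/bottom branch of the clip above: when the output sub-rect
  hangs past the far edge, the overhang is expressed as a fraction of the sub-rect and
  the same fraction is trimmed from the end of the normalized input range. The function
  name demo_clip_far_edge is hypothetical, and the sketch deliberately covers only the
  far-edge branch.

```c
#include <assert.h>

// trims [*u0,*u1] and *outsubw so that the sub-rect ends at outw
static void demo_clip_far_edge( int outx, int * outsubw, int outw, double u0, double * u1 )
{
  int over = outw - ( outx + *outsubw ); // negative when the sub-rect overhangs

  assert( *outsubw > 0 );
  if ( over < 0 )
  {
    double per = (double)over / (double)*outsubw; // overhang as a (negative) fraction
    *u1 += per * ( *u1 - u0 );                    // pull the end of the input range in
    *outsubw = outw - outx;                       // keep only the visible pixels
  }
}
```

  With outw = 100, outx = 60 and a sub-rect width of 80, the overhang is 40 pixels, or
  half the sub-rect, so an input range of 0..1 shrinks to 0..0.5 and the width becomes 40.
*/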

// converts a double to a rational that has less than one float bit of error (returns 0 if unable to do so)
static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32 *numer, stbir_uint32 *denom, int limit_denom ) // limit_denom (1) or limit numer (0)
{
  double err;
  stbir_uint64 top, bot;
  stbir_uint64 numer_last = 0;
  stbir_uint64 denom_last = 1;
  stbir_uint64 numer_estimate = 1;
  stbir_uint64 denom_estimate = 0;

  // scale to past float error range
  top = (stbir_uint64)( f * (double)(1 << 25) );
  bot = 1 << 25;

  // keep refining, but usually stops in a few loops - usually 5 for bad cases
  for(;;)
  {
    stbir_uint64 est, temp;

    // hit limit, break out and do best full range estimate
    if ( ( ( limit_denom ) ? denom_estimate : numer_estimate ) >= limit )
      break;

    // is the current error less than 1 bit of a float? if so, we're done
    if ( denom_estimate )
    {
      err = ( (double)numer_estimate / (double)denom_estimate ) - f;
      if ( err < 0.0 ) err = -err;
      if ( err < ( 1.0 / (double)(1<<24) ) )
      {
        // yup, found it
        *numer = (stbir_uint32) numer_estimate;
        *denom = (stbir_uint32) denom_estimate;
        return 1;
      }
    }

    // no more refinement bits left? break out and do full range estimate
    if ( bot == 0 )
      break;

    // gcd the estimate bits
    est = top / bot;
    temp = top % bot;
    top = bot;
    bot = temp;

    // update the denominator estimate
    temp = est * denom_estimate + denom_last;
    denom_last = denom_estimate;
    denom_estimate = temp;

    // update the numerator estimate
    temp = est * numer_estimate + numer_last;
    numer_last = numer_estimate;
    numer_estimate = temp;
  }

  // we didn't find anything good enough for float, use a full range estimate
  if ( limit_denom )
  {
    numer_estimate= (stbir_uint64)( f * (double)limit + 0.5 );
    denom_estimate = limit;
  }
  else
  {
    numer_estimate = limit;
    denom_estimate = (stbir_uint64)( ( (double)limit / f ) + 0.5 );
  }

  *numer = (stbir_uint32) numer_estimate;
  *denom = (stbir_uint32) denom_estimate;

  err = ( denom_estimate ) ? ( ( (double)(stbir_uint32)numer_estimate / (double)(stbir_uint32)denom_estimate ) - f ) : 1.0;
  if ( err < 0.0 ) err = -err;
  return ( err < ( 1.0 / (double)(1<<24) ) ) ? 1 : 0;
}
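
/*
  EXAMPLE (illustrative sketch, not part of the library)

  The loop above is the classic continued-fraction expansion: each Euclidean step on
  top/bot yields one term, and the two recurrences turn the terms into successively
  better numerator/denominator pairs. The reduced sketch below keeps only that core so
  the arithmetic can be followed by hand. It is a simplification under stated
  assumptions: f is positive and small enough that f * 2^25 fits in 64 bits, only the
  denominator is limited, and there is no full-range fallback. The name
  demo_to_rational and its tol parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

// returns 1 and writes numer/denom once a convergent is within tol of f,
// returns 0 if the denominator would exceed max_denom first
static int demo_to_rational( double f, uint32_t max_denom, double tol, uint32_t * numer, uint32_t * denom )
{
  uint64_t top = (uint64_t)( f * (double)( 1 << 25 ) ); // f in 25-bit fixed point
  uint64_t bot = (uint64_t)1 << 25;
  uint64_t n_prev = 0, d_prev = 1; // convergent before the current one
  uint64_t n_cur = 1, d_cur = 0;   // current convergent (seeded with 1/0)

  assert( f > 0.0 );

  while ( bot != 0 )
  {
    uint64_t q = top / bot; // next continued-fraction term
    uint64_t r = top % bot;
    uint64_t n_next = q * n_cur + n_prev;
    uint64_t d_next = q * d_cur + d_prev;
    double err;

    if ( d_next > max_denom )
      break;

    top = bot; bot = r;
    n_prev = n_cur; d_prev = d_cur;
    n_cur = n_next; d_cur = d_next;

    err = (double)n_cur / (double)d_cur - f;
    if ( err < 0.0 ) err = -err;
    if ( err < tol )
    {
      *numer = (uint32_t)n_cur;
      *denom = (uint32_t)d_cur;
      return 1;
    }
  }
  return 0;
}
```

  For f = 0.75 the convergents are 0/1, 1/1 and 3/4, so the sketch stops at 3/4 on the
  third step; with max_denom = 3 it gives up instead, because 4 exceeds the limit.
*/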

static int stbir__calculate_region_transform( stbir__scale_info * scale_info, int output_full_range, int * output_offset, int output_sub_range, int input_full_range, double input_s0, double input_s1 )
{
  double output_range, input_range, output_s, input_s, ratio, scale;

  input_s = input_s1 - input_s0;

  // null area
  if ( ( output_full_range == 0 ) || ( input_full_range == 0 ) ||
       ( output_sub_range == 0 ) || ( input_s <= stbir__small_float ) )
    return 0;

  // are either of the ranges completely out of bounds?
  if ( ( *output_offset >= output_full_range ) || ( ( *output_offset + output_sub_range ) <= 0 ) || ( input_s0 >= (1.0f-stbir__small_float) ) || ( input_s1 <= stbir__small_float ) )
    return 0;

  output_range = (double)output_full_range;
  input_range = (double)input_full_range;

  output_s = ( (double)output_sub_range) / output_range;

  // figure out the scaling to use
  ratio = output_s / input_s;

  // save scale before clipping
  scale = ( output_range / input_range ) * ratio;
  scale_info->scale = (float)scale;
  scale_info->inv_scale = (float)( 1.0 / scale );

  // clip output area to left/right output edges (and adjust input area)
  stbir__clip( output_offset, &output_sub_range, output_full_range, &input_s0, &input_s1 );

  // recalc input area
  input_s = input_s1 - input_s0;

  // after clipping do we have zero input area?
  if ( input_s <= stbir__small_float )
    return 0;

  // calculate and store the starting source offsets in output pixel space
  scale_info->pixel_shift = (float) ( input_s0 * ratio * output_range );

  scale_info->scale_is_rational = stbir__double_to_rational( scale, ( scale <= 1.0 ) ? output_full_range : input_full_range, &scale_info->scale_numerator, &scale_info->scale_denominator, ( scale >= 1.0 ) );

  scale_info->input_full_size = input_full_range;
  scale_info->output_sub_size = output_sub_range;

  return 1;
}
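
/*
  EXAMPLE (illustrative sketch, not part of the library)

  The scale and pixel shift computed above combine two ratios: the full-size ratio
  (output pixels over input pixels) and the sub-region ratio (fraction of the output
  covered over fraction of the input sampled). The sketch below repeats just that
  arithmetic, with no clipping and no rational conversion, so the numbers can be checked
  by hand. The names demo_axis and demo_axis_transform are hypothetical.

```c
#include <assert.h>

typedef struct demo_axis
{
  double scale;       // output pixels per input pixel
  double pixel_shift; // start of the input window, measured in output pixels
} demo_axis;

// s0..s1 is the normalized (0..1) input window; output_sub is in output pixels
static demo_axis demo_axis_transform( int output_full, int output_sub, int input_full, double s0, double s1 )
{
  demo_axis a;
  double output_s, ratio;

  assert( ( output_full > 0 ) && ( input_full > 0 ) && ( s1 > s0 ) );

  output_s = (double)output_sub / (double)output_full; // fraction of the output covered
  ratio = output_s / ( s1 - s0 );                      // output fraction per input fraction

  a.scale = ( (double)output_full / (double)input_full ) * ratio;
  a.pixel_shift = s0 * ratio * (double)output_full;
  return a;
}
```

  Resizing a 200-pixel axis to 100 pixels with the whole input selected gives a scale of
  0.5 and no shift. Selecting only the middle half of the input (s0 = 0.25, s1 = 0.75)
  maps 100 input pixels onto 100 output pixels, so the scale becomes 1.0 and the window
  starts 50 output-space pixels in.
*/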


static void stbir__init_and_set_layout( STBIR_RESIZE * resize, stbir_pixel_layout pixel_layout, stbir_datatype data_type )
{
  resize->input_cb = 0;
  resize->output_cb = 0;
  resize->user_data = resize;
  resize->samplers = 0;
  resize->called_alloc = 0;
  resize->horizontal_filter = STBIR_FILTER_DEFAULT;
  resize->horizontal_filter_kernel = 0; resize->horizontal_filter_support = 0;
  resize->vertical_filter = STBIR_FILTER_DEFAULT;
  resize->vertical_filter_kernel = 0; resize->vertical_filter_support = 0;
  resize->horizontal_edge = STBIR_EDGE_CLAMP;
  resize->vertical_edge = STBIR_EDGE_CLAMP;
  resize->input_s0 = 0; resize->input_t0 = 0; resize->input_s1 = 1; resize->input_t1 = 1;
  resize->output_subx = 0; resize->output_suby = 0; resize->output_subw = resize->output_w; resize->output_subh = resize->output_h;
  resize->input_data_type = data_type;
  resize->output_data_type = data_type;
  resize->input_pixel_layout_public = pixel_layout;
  resize->output_pixel_layout_public = pixel_layout;
  resize->needs_rebuild = 1;
}

STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize,
                                 const void *input_pixels, int input_w, int input_h, int input_stride_in_bytes, // stride can be zero
                                 void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero
                                 stbir_pixel_layout pixel_layout, stbir_datatype data_type )
{
  resize->input_pixels = input_pixels;
  resize->input_w = input_w;
  resize->input_h = input_h;
  resize->input_stride_in_bytes = input_stride_in_bytes;
  resize->output_pixels = output_pixels;
  resize->output_w = output_w;
  resize->output_h = output_h;
  resize->output_stride_in_bytes = output_stride_in_bytes;
  resize->fast_alpha = 0;

  stbir__init_and_set_layout( resize, pixel_layout, data_type );
}

// You can update parameters any time after resize_init
STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type ) // by default, datatype from resize_init
{
  resize->input_data_type = input_type;
  resize->output_data_type = output_type;
  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
    stbir__update_info_from_resize( resize->samplers, resize );
}

STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb ) // no callbacks by default
{
  resize->input_cb = input_cb;
  resize->output_cb = output_cb;

  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
  {
    resize->samplers->in_pixels_cb = input_cb;
    resize->samplers->out_pixels_cb = output_cb;
  }
}

STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data ) // pass back STBIR_RESIZE* by default
{
  resize->user_data = user_data;
  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
    resize->samplers->user_data = user_data;
}

STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes )
{
  resize->input_pixels = input_pixels;
  resize->input_stride_in_bytes = input_stride_in_bytes;
  resize->output_pixels = output_pixels;
  resize->output_stride_in_bytes = output_stride_in_bytes;
  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
    stbir__update_info_from_resize( resize->samplers, resize );
}


STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge ) // CLAMP by default
{
  resize->horizontal_edge = horizontal_edge;
  resize->vertical_edge = vertical_edge;
  resize->needs_rebuild = 1;
  return 1;
}

STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ) // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default
{
  resize->horizontal_filter = horizontal_filter;
  resize->vertical_filter = vertical_filter;
  resize->needs_rebuild = 1;
  return 1;
}

STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support )
{
  resize->horizontal_filter_kernel = horizontal_filter; resize->horizontal_filter_support = horizontal_support;
  resize->vertical_filter_kernel = vertical_filter; resize->vertical_filter_support = vertical_support;
  resize->needs_rebuild = 1;
  return 1;
}

STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout input_pixel_layout, stbir_pixel_layout output_pixel_layout ) // sets new pixel layouts
{
  resize->input_pixel_layout_public = input_pixel_layout;
  resize->output_pixel_layout_public = output_pixel_layout;
  resize->needs_rebuild = 1;
  return 1;
}


STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, int non_pma_alpha_speed_over_quality ) // sets alpha speed
{
  resize->fast_alpha = non_pma_alpha_speed_over_quality;
  resize->needs_rebuild = 1;
  return 1;
}
7719STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 ) // sets input region (full region by default)
7720{
7721 resize->input_s0 = s0;
7722 resize->input_t0 = t0;
7723 resize->input_s1 = s1;
7724 resize->input_t1 = t1;
7725 resize->needs_rebuild = 1;
7727 // are we inbounds?
7728 if ( ( s1 < stbir__small_float ) || ( (s1-s0) < stbir__small_float ) ||
7729 ( t1 < stbir__small_float ) || ( (t1-t0) < stbir__small_float ) ||
7730 ( s0 > (1.0f-stbir__small_float) ) ||
7731 ( t0 > (1.0f-stbir__small_float) ) )
7732 return 0;
7734 return 1;
7735}
7737STBIRDEF int stbir_set_output_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ) // sets input region (full region by default)
7738{
7739 resize->output_subx = subx;
7740 resize->output_suby = suby;
7741 resize->output_subw = subw;
7742 resize->output_subh = subh;
7743 resize->needs_rebuild = 1;
7745 // are we inbounds?
7746 if ( ( subx >= resize->output_w ) || ( ( subx + subw ) <= 0 ) || ( suby >= resize->output_h ) || ( ( suby + subh ) <= 0 ) || ( subw == 0 ) || ( subh == 0 ) )
7747 return 0;
7749 return 1;
7750}
STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ) // sets both regions (full regions by default)
{
  double s0, t0, s1, t1;

  s0 = ( (double)subx ) / ( (double)resize->output_w );
  t0 = ( (double)suby ) / ( (double)resize->output_h );
  s1 = ( (double)(subx+subw) ) / ( (double)resize->output_w );
  t1 = ( (double)(suby+subh) ) / ( (double)resize->output_h );

  resize->input_s0 = s0;
  resize->input_t0 = t0;
  resize->input_s1 = s1;
  resize->input_t1 = t1;
  resize->output_subx = subx;
  resize->output_suby = suby;
  resize->output_subw = subw;
  resize->output_subh = subh;
  resize->needs_rebuild = 1;

  // are we inbounds?
  if ( ( subx >= resize->output_w ) || ( ( subx + subw ) <= 0 ) || ( suby >= resize->output_h ) || ( ( suby + subh ) <= 0 ) || ( subw == 0 ) || ( subh == 0 ) )
    return 0;

  return 1;
}
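/*
  Illustration only, not part of the library: a minimal standalone sketch of the
  arithmetic that stbir_set_pixel_subrect performs above, where an output-pixel
  rectangle is divided by the output size to obtain normalized s/t coordinates.
  The names sketch_subrect and sketch_pixel_subrect_to_st are hypothetical, and
  the return value simply mirrors the "are we inbounds?" test.

```c
#include <assert.h>

typedef struct { double s0, t0, s1, t1; } sketch_subrect;

// Fills r with the normalized rectangle and returns 1 if the pixel rectangle
// overlaps the output image, 0 otherwise (r is filled in either case).
static int sketch_pixel_subrect_to_st( sketch_subrect * r, int output_w, int output_h,
                                       int subx, int suby, int subw, int subh )
{
  assert( ( r != 0 ) && ( output_w > 0 ) && ( output_h > 0 ) );

  r->s0 = ( (double)subx ) / ( (double)output_w );
  r->t0 = ( (double)suby ) / ( (double)output_h );
  r->s1 = ( (double)( subx + subw ) ) / ( (double)output_w );
  r->t1 = ( (double)( suby + subh ) ) / ( (double)output_h );

  // same rejection conditions as the setter: fully outside, or empty
  if ( ( subx >= output_w ) || ( ( subx + subw ) <= 0 ) ||
       ( suby >= output_h ) || ( ( suby + subh ) <= 0 ) ||
       ( subw == 0 ) || ( subh == 0 ) )
    return 0;

  return 1;
}
```

  For a 100x40 output, the rectangle at (25,10) with size 50x20 maps to
  s0=0.25, t0=0.25, s1=0.75, t1=0.75.
*/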
static int stbir__perform_build( STBIR_RESIZE * resize, int splits )
{
  stbir__contributors conservative = { 0, 0 };
  stbir__sampler horizontal, vertical;
  int new_output_subx, new_output_suby;
  stbir__info * out_info;
  #ifdef STBIR_PROFILE
  stbir__info profile_infod;  // used to contain building profile info before everything is allocated
  stbir__info * profile_info = &profile_infod;
  #endif

  // have we already built the samplers?
  if ( resize->samplers )
    return 0;

  #define STBIR_RETURN_ERROR_AND_ASSERT( exp )  STBIR_ASSERT( !(exp) ); if (exp) return 0;
  STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->horizontal_filter >= STBIR_FILTER_OTHER)
  STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->vertical_filter >= STBIR_FILTER_OTHER)
  #undef STBIR_RETURN_ERROR_AND_ASSERT

  if ( splits <= 0 )
    return 0;

  STBIR_PROFILE_BUILD_FIRST_START( build );

  new_output_subx = resize->output_subx;
  new_output_suby = resize->output_suby;

  // do horizontal clip and scale calcs
  if ( !stbir__calculate_region_transform( &horizontal.scale_info, resize->output_w, &new_output_subx, resize->output_subw, resize->input_w, resize->input_s0, resize->input_s1 ) )
    return 0;

  // do vertical clip and scale calcs
  if ( !stbir__calculate_region_transform( &vertical.scale_info, resize->output_h, &new_output_suby, resize->output_subh, resize->input_h, resize->input_t0, resize->input_t1 ) )
    return 0;

  // if nothing to do, just return
  if ( ( horizontal.scale_info.output_sub_size == 0 ) || ( vertical.scale_info.output_sub_size == 0 ) )
    return 0;

  stbir__set_sampler(&horizontal, resize->horizontal_filter, resize->horizontal_filter_kernel, resize->horizontal_filter_support, resize->horizontal_edge, &horizontal.scale_info, 1, resize->user_data );
  stbir__get_conservative_extents( &horizontal, &conservative, resize->user_data );
  // the vertical sampler takes the vertical kernel callback, matching its vertical support callback
  stbir__set_sampler(&vertical, resize->vertical_filter, resize->vertical_filter_kernel, resize->vertical_filter_support, resize->vertical_edge, &vertical.scale_info, 0, resize->user_data );

  if ( ( vertical.scale_info.output_sub_size / splits ) < STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS ) // each split should be a minimum of 4 scanlines (handwavey choice)
  {
    splits = vertical.scale_info.output_sub_size / STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS;
    if ( splits == 0 ) splits = 1;
  }

  STBIR_PROFILE_BUILD_START( alloc );
  out_info = stbir__alloc_internal_mem_and_build_samplers( &horizontal, &vertical, &conservative, resize->input_pixel_layout_public, resize->output_pixel_layout_public, splits, new_output_subx, new_output_suby, resize->fast_alpha, resize->user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );
  STBIR_PROFILE_BUILD_END( alloc );
  STBIR_PROFILE_BUILD_END( build );

  if ( out_info )
  {
    resize->splits = splits;
    resize->samplers = out_info;
    resize->needs_rebuild = 0;
    #ifdef STBIR_PROFILE
      STBIR_MEMCPY( &out_info->profile, &profile_infod.profile, sizeof( out_info->profile ) );
    #endif

    // update anything that can be changed without recalcing samplers
    stbir__update_info_from_resize( out_info, resize );

    return splits;
  }

  return 0;
}
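/*
  Illustration only, not part of the library: a minimal standalone sketch of the
  split-count clamping done in stbir__perform_build above. The function name
  sketch_clamp_splits is hypothetical, and the minimum row count is passed as a
  parameter here as an assumption (the code above takes it from
  STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS, which its comment describes as 4).

```c
#include <assert.h>

// Reduce a requested split count so that each split covers at least
// min_rows_per_split output rows. Returns 0 for an invalid request.
static int sketch_clamp_splits( int output_rows, int requested_splits, int min_rows_per_split )
{
  int splits = requested_splits;

  assert( min_rows_per_split > 0 );

  if ( splits <= 0 )
    return 0;

  if ( ( output_rows / splits ) < min_rows_per_split )
  {
    splits = output_rows / min_rows_per_split;
    if ( splits == 0 ) splits = 1;   // always keep at least one split
  }

  return splits;
}
```

  With a minimum of 4 rows, a request for 8 splits is kept for 100 rows,
  reduced to 2 for 10 rows, and reduced to 1 for 3 rows.
*/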
void stbir_free_samplers( STBIR_RESIZE * resize )
{
  if ( resize->samplers )
  {
    stbir__free_internal_mem( resize->samplers );
    resize->samplers = 0;
    resize->called_alloc = 0;
  }
}

STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int splits )
{
  if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
  {
    if ( resize->samplers )
      stbir_free_samplers( resize );

    resize->called_alloc = 1;
    return stbir__perform_build( resize, splits );
  }

  STBIR_PROFILE_BUILD_CLEAR( resize->samplers );

  return 1;
}

STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize )
{
  return stbir_build_samplers_with_splits( resize, 1 );
}
STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize )
{
  int result;

  if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
  {
    int alloc_state = resize->called_alloc;  // remember allocated state

    if ( resize->samplers )
    {
      stbir__free_internal_mem( resize->samplers );
      resize->samplers = 0;
    }

    if ( !stbir_build_samplers( resize ) )
      return 0;

    resize->called_alloc = alloc_state;

    // if build_samplers succeeded (above), but there are no samplers set, then
    //   the area to stretch into was zero pixels, so don't do anything and return
    //   success
    if ( resize->samplers == 0 )
      return 1;
  }
  else
  {
    // didn't build anything - clear it
    STBIR_PROFILE_BUILD_CLEAR( resize->samplers );
  }

  // do resize
  result = stbir__perform_resize( resize->samplers, 0, resize->splits );

  // if the caller never built the samplers explicitly, free the ones built here
  if ( !resize->called_alloc )
  {
    stbir_free_samplers( resize );
    resize->samplers = 0;
  }

  return result;
}
STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count )
{
  STBIR_ASSERT( resize->samplers );

  // if we're just doing the whole thing, call full
  if ( ( split_start == -1 ) || ( ( split_start == 0 ) && ( split_count == resize->splits ) ) )
    return stbir_resize_extended( resize );

  // you **must** build samplers first when using split resize
  if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
    return 0;

  if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) )
    return 0;

  // do resize
  return stbir__perform_resize( resize->samplers, split_start, split_count );
}
7945static int stbir__check_output_stuff( void ** ret_ptr, int * ret_pitch, void * output_pixels, int type_size, int output_w, int output_h, int output_stride_in_bytes, stbir_internal_pixel_layout pixel_layout )
7946{
7947 size_t size;
7948 int pitch;
7949 void * ptr;
7951 pitch = output_w * type_size * stbir__pixel_channels[ pixel_layout ];
7952 if ( pitch == 0 )
7953 return 0;
7955 if ( output_stride_in_bytes == 0 )
7956 output_stride_in_bytes = pitch;
7958 if ( output_stride_in_bytes < pitch )
7959 return 0;
7961 size = (size_t)output_stride_in_bytes * (size_t)output_h;
7962 if ( size == 0 )
7963 return 0;
7965 *ret_ptr = 0;
7966 *ret_pitch = output_stride_in_bytes;
7968 if ( output_pixels == 0 )
7969 {
7970 ptr = STBIR_MALLOC( size, 0 );
7971 if ( ptr == 0 )
7972 return 0;
7974 *ret_ptr = ptr;
7975 *ret_pitch = pitch;
7976 }
7978 return 1;
7979}
7982STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
7983 unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
7984 stbir_pixel_layout pixel_layout )
7985{
7986 STBIR_RESIZE resize;
7987 unsigned char * optr;
7988 int opitch;
7990 if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
7991 return 0;
7993 stbir_resize_init( &resize,
7994 input_pixels, input_w, input_h, input_stride_in_bytes,
7995 (optr) ? optr : output_pixels, output_w, output_h, opitch,
7996 pixel_layout, STBIR_TYPE_UINT8 );
7998 if ( !stbir_resize_extended( &resize ) )
7999 {
8000 if ( optr )
8001 STBIR_FREE( optr, 0 );
8002 return 0;
8003 }
8005 return (optr) ? optr : output_pixels;
8006}
8008STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
8009 unsigned char *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
8010 stbir_pixel_layout pixel_layout )
8011{
8012 STBIR_RESIZE resize;
8013 unsigned char * optr;
8014 int opitch;
8016 if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
8017 return 0;
8019 stbir_resize_init( &resize,
8020 input_pixels, input_w, input_h, input_stride_in_bytes,
8021 (optr) ? optr : output_pixels, output_w, output_h, opitch,
8022 pixel_layout, STBIR_TYPE_UINT8_SRGB );
8024 if ( !stbir_resize_extended( &resize ) )
8025 {
8026 if ( optr )
8027 STBIR_FREE( optr, 0 );
8028 return 0;
8029 }
8031 return (optr) ? optr : output_pixels;
8032}
8035STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
8036 float *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
8037 stbir_pixel_layout pixel_layout )
8038{
8039 STBIR_RESIZE resize;
8040 float * optr;
8041 int opitch;
8043 if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( float ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
8044 return 0;
8046 stbir_resize_init( &resize,
8047 input_pixels, input_w, input_h, input_stride_in_bytes,
8048 (optr) ? optr : output_pixels, output_w, output_h, opitch,
8049 pixel_layout, STBIR_TYPE_FLOAT );
8051 if ( !stbir_resize_extended( &resize ) )
8052 {
8053 if ( optr )
8054 STBIR_FREE( optr, 0 );
8055 return 0;
8056 }
8058 return (optr) ? optr : output_pixels;
8059}
8062STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
8063 void *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
8064 stbir_pixel_layout pixel_layout, stbir_datatype data_type,
8065 stbir_edge edge, stbir_filter filter )
8066{
8067 STBIR_RESIZE resize;
8068 float * optr;
8069 int opitch;
8071 if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, stbir__type_size[data_type], output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
8072 return 0;
8074 stbir_resize_init( &resize,
8075 input_pixels, input_w, input_h, input_stride_in_bytes,
8076 (optr) ? optr : output_pixels, output_w, output_h, output_stride_in_bytes,
8077 pixel_layout, data_type );
8079 resize.horizontal_edge = edge;
8080 resize.vertical_edge = edge;
8081 resize.horizontal_filter = filter;
8082 resize.vertical_filter = filter;
8084 if ( !stbir_resize_extended( &resize ) )
8085 {
8086 if ( optr )
8087 STBIR_FREE( optr, 0 );
8088 return 0;
8089 }
8091 return (optr) ? optr : output_pixels;
8092}
8094#ifdef STBIR_PROFILE
8096STBIRDEF void stbir_resize_build_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize )
8097{
  static char const * bdescriptions[6] = { "Building", "Allocating", "Horizontal sampler", "Vertical sampler", "Coefficient cleanup", "Coefficient pivot" } ;
8099 stbir__info* samp = resize->samplers;
8100 int i;
8102 typedef int testa[ (STBIR__ARRAY_SIZE( bdescriptions ) == (STBIR__ARRAY_SIZE( samp->profile.array )-1) )?1:-1];
8103 typedef int testb[ (sizeof( samp->profile.array ) == (sizeof(samp->profile.named)) )?1:-1];
8104 typedef int testc[ (sizeof( info->clocks ) >= (sizeof(samp->profile.named)) )?1:-1];
8106 for( i = 0 ; i < STBIR__ARRAY_SIZE( bdescriptions ) ; i++)
8107 info->clocks[i] = samp->profile.array[i+1];
8109 info->total_clocks = samp->profile.named.total;
8110 info->descriptions = bdescriptions;
8111 info->count = STBIR__ARRAY_SIZE( bdescriptions );
8112}
8114STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize, int split_start, int split_count )
8115{
8116 static char const * descriptions[7] = { "Looping", "Vertical sampling", "Horizontal sampling", "Scanline input", "Scanline output", "Alpha weighting", "Alpha unweighting" };
8117 stbir__per_split_info * split_info;
8118 int s, i;
8120 typedef int testa[ (STBIR__ARRAY_SIZE( descriptions ) == (STBIR__ARRAY_SIZE( split_info->profile.array )-1) )?1:-1];
8121 typedef int testb[ (sizeof( split_info->profile.array ) == (sizeof(split_info->profile.named)) )?1:-1];
8122 typedef int testc[ (sizeof( info->clocks ) >= (sizeof(split_info->profile.named)) )?1:-1];
8124 if ( split_start == -1 )
8125 {
8126 split_start = 0;
8127 split_count = resize->samplers->splits;
8128 }
8130 if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) )
8131 {
8132 info->total_clocks = 0;
8133 info->descriptions = 0;
8134 info->count = 0;
8135 return;
8136 }
8138 split_info = resize->samplers->split_info + split_start;
8140 // sum up the profile from all the splits
8141 for( i = 0 ; i < STBIR__ARRAY_SIZE( descriptions ) ; i++ )
8142 {
8143 stbir_uint64 sum = 0;
8144 for( s = 0 ; s < split_count ; s++ )
8145 sum += split_info[s].profile.array[i+1];
8146 info->clocks[i] = sum;
8147 }
8149 info->total_clocks = split_info->profile.named.total;
8150 info->descriptions = descriptions;
8151 info->count = STBIR__ARRAY_SIZE( descriptions );
8152}
8154STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * info, STBIR_RESIZE const * resize )
8155{
8156 stbir_resize_split_profile_info( info, resize, -1, 0 );
8157}
8159#endif // STBIR_PROFILE
8161#undef STBIR_BGR
8162#undef STBIR_1CHANNEL
8163#undef STBIR_2CHANNEL
8164#undef STBIR_RGB
8165#undef STBIR_RGBA
8166#undef STBIR_4CHANNEL
8167#undef STBIR_BGRA
8168#undef STBIR_ARGB
8169#undef STBIR_ABGR
8170#undef STBIR_RA
8171#undef STBIR_AR
8172#undef STBIR_RGBA_PM
8173#undef STBIR_BGRA_PM
8174#undef STBIR_ARGB_PM
8175#undef STBIR_ABGR_PM
8176#undef STBIR_RA_PM
8177#undef STBIR_AR_PM
8179#endif // STB_IMAGE_RESIZE_IMPLEMENTATION
8181#else // STB_IMAGE_RESIZE_HORIZONTALS&STB_IMAGE_RESIZE_DO_VERTICALS
8183// we reinclude the header file to define all the horizontal functions
8184// specializing each function for the number of coeffs is 20-40% faster *OVERALL*
8186// by including the header file again this way, we can still debug the functions
8188#define STBIR_strs_join2( start, mid, end ) start##mid##end
8189#define STBIR_strs_join1( start, mid, end ) STBIR_strs_join2( start, mid, end )
8191#define STBIR_strs_join24( start, mid1, mid2, end ) start##mid1##mid2##end
8192#define STBIR_strs_join14( start, mid1, mid2, end ) STBIR_strs_join24( start, mid1, mid2, end )
8194#ifdef STB_IMAGE_RESIZE_DO_CODERS
8196#ifdef stbir__decode_suffix
8197#define STBIR__CODER_NAME( name ) STBIR_strs_join1( name, _, stbir__decode_suffix )
8198#else
8199#define STBIR__CODER_NAME( name ) name
8200#endif
8202#ifdef stbir__decode_swizzle
8203#define stbir__decode_simdf8_flip(reg) STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( stbir__simdf8_0123to,stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3),stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3)(reg, reg)
8204#define stbir__decode_simdf4_flip(reg) STBIR_strs_join1( STBIR_strs_join1( stbir__simdf_0123to,stbir__decode_order0,stbir__decode_order1),stbir__decode_order2,stbir__decode_order3)(reg, reg)
8205#define stbir__encode_simdf8_unflip(reg) STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( STBIR_strs_join1( stbir__simdf8_0123to,stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3),stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3)(reg, reg)
8206#define stbir__encode_simdf4_unflip(reg) STBIR_strs_join1( STBIR_strs_join1( stbir__simdf_0123to,stbir__encode_order0,stbir__encode_order1),stbir__encode_order2,stbir__encode_order3)(reg, reg)
8207#else
8208#define stbir__decode_order0 0
8209#define stbir__decode_order1 1
8210#define stbir__decode_order2 2
8211#define stbir__decode_order3 3
8212#define stbir__encode_order0 0
8213#define stbir__encode_order1 1
8214#define stbir__encode_order2 2
8215#define stbir__encode_order3 3
8216#define stbir__decode_simdf8_flip(reg)
8217#define stbir__decode_simdf4_flip(reg)
8218#define stbir__encode_simdf8_unflip(reg)
8219#define stbir__encode_simdf4_unflip(reg)
8220#endif
8222#ifdef STBIR_SIMD8
8223#define stbir__encode_simdfX_unflip stbir__encode_simdf8_unflip
8224#else
8225#define stbir__encode_simdfX_unflip stbir__encode_simdf4_unflip
8226#endif
8228static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * decodep, int width_times_channels, void const * inputp )
8229{
8230 float STBIR_STREAMOUT_PTR( * ) decode = decodep;
8231 float * decode_end = (float*) decode + width_times_channels;
8232 unsigned char const * input = (unsigned char const*)inputp;
8234 #ifdef STBIR_SIMD
8235 unsigned char const * end_input_m16 = input + width_times_channels - 16;
8236 if ( width_times_channels >= 16 )
8237 {
8238 decode_end -= 16;
8239 STBIR_NO_UNROLL_LOOP_START_INF_FOR
8240 for(;;)
8241 {
8242 #ifdef STBIR_SIMD8
8243 stbir__simdi i; stbir__simdi8 o0,o1;
8244 stbir__simdf8 of0, of1;
8245 STBIR_NO_UNROLL(decode);
8246 stbir__simdi_load( i, input );
8247 stbir__simdi8_expand_u8_to_u32( o0, o1, i );
8248 stbir__simdi8_convert_i32_to_float( of0, o0 );
8249 stbir__simdi8_convert_i32_to_float( of1, o1 );
8250 stbir__simdf8_mult( of0, of0, STBIR_max_uint8_as_float_inverted8);
8251 stbir__simdf8_mult( of1, of1, STBIR_max_uint8_as_float_inverted8);
8252 stbir__decode_simdf8_flip( of0 );
8253 stbir__decode_simdf8_flip( of1 );
8254 stbir__simdf8_store( decode + 0, of0 );
8255 stbir__simdf8_store( decode + 8, of1 );
8256 #else
8257 stbir__simdi i, o0, o1, o2, o3;
8258 stbir__simdf of0, of1, of2, of3;
8259 STBIR_NO_UNROLL(decode);
8260 stbir__simdi_load( i, input );
8261 stbir__simdi_expand_u8_to_u32( o0,o1,o2,o3,i);
8262 stbir__simdi_convert_i32_to_float( of0, o0 );
8263 stbir__simdi_convert_i32_to_float( of1, o1 );
8264 stbir__simdi_convert_i32_to_float( of2, o2 );
8265 stbir__simdi_convert_i32_to_float( of3, o3 );
8266 stbir__simdf_mult( of0, of0, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
8267 stbir__simdf_mult( of1, of1, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
8268 stbir__simdf_mult( of2, of2, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
8269 stbir__simdf_mult( of3, of3, STBIR__CONSTF(STBIR_max_uint8_as_float_inverted) );
8270 stbir__decode_simdf4_flip( of0 );
8271 stbir__decode_simdf4_flip( of1 );
8272 stbir__decode_simdf4_flip( of2 );
8273 stbir__decode_simdf4_flip( of3 );
8274 stbir__simdf_store( decode + 0, of0 );
8275 stbir__simdf_store( decode + 4, of1 );
8276 stbir__simdf_store( decode + 8, of2 );
8277 stbir__simdf_store( decode + 12, of3 );
8278 #endif
8279 decode += 16;
8280 input += 16;
8281 if ( decode <= decode_end )
8282 continue;
8283 if ( decode == ( decode_end + 16 ) )
8284 break;
8285 decode = decode_end; // backup and do last couple
8286 input = end_input_m16;
8287 }
8288 return;
8289 }
8290 #endif
8292 // try to do blocks of 4 when you can
8293 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8294 decode += 4;
8295 STBIR_SIMD_NO_UNROLL_LOOP_START
8296 while( decode <= decode_end )
8297 {
8298 STBIR_SIMD_NO_UNROLL(decode);
8299 decode[0-4] = ((float)(input[stbir__decode_order0])) * stbir__max_uint8_as_float_inverted;
8300 decode[1-4] = ((float)(input[stbir__decode_order1])) * stbir__max_uint8_as_float_inverted;
8301 decode[2-4] = ((float)(input[stbir__decode_order2])) * stbir__max_uint8_as_float_inverted;
8302 decode[3-4] = ((float)(input[stbir__decode_order3])) * stbir__max_uint8_as_float_inverted;
8303 decode += 4;
8304 input += 4;
8305 }
8306 decode -= 4;
8307 #endif
8309 // do the remnants
8310 #if stbir__coder_min_num < 4
8311 STBIR_NO_UNROLL_LOOP_START
8312 while( decode < decode_end )
8313 {
8314 STBIR_NO_UNROLL(decode);
8315 decode[0] = ((float)(input[stbir__decode_order0])) * stbir__max_uint8_as_float_inverted;
8316 #if stbir__coder_min_num >= 2
8317 decode[1] = ((float)(input[stbir__decode_order1])) * stbir__max_uint8_as_float_inverted;
8318 #endif
8319 #if stbir__coder_min_num >= 3
8320 decode[2] = ((float)(input[stbir__decode_order2])) * stbir__max_uint8_as_float_inverted;
8321 #endif
8322 decode += stbir__coder_min_num;
8323 input += stbir__coder_min_num;
8324 }
8325 #endif
8326}
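/*
  Illustration only, not part of the library: a minimal scalar sketch of the
  "backup and do last couple" loop used by the SIMD path above. Data is handled
  in fixed blocks of 16, and when fewer than 16 values remain the cursor is
  moved back so the final block ends exactly at the end of the row, recomputing
  a few values instead of running a separate tail loop. The names
  sketch_decode_block and sketch_decode_u8_overlapped are hypothetical, and the
  sketch assumes the input and output buffers do not alias, so that recomputing
  an element gives the same result.

```c
#include <assert.h>

enum { SKETCH_BLOCK = 16 };

// Stand-in for one SIMD iteration: decode exactly SKETCH_BLOCK values.
static void sketch_decode_block( float * out, unsigned char const * in )
{
  int i;
  for ( i = 0 ; i < SKETCH_BLOCK ; i++ )
    out[i] = ( (float) in[i] ) * ( 1.0f / 255.0f );
}

// Decodes count values and returns the number of blocks processed,
// or 0 if count is too small for the block path (a scalar loop would be used).
static int sketch_decode_u8_overlapped( float * out, unsigned char const * in, int count )
{
  int pos = 0, blocks = 0;

  assert( ( out != 0 ) && ( in != 0 ) );

  if ( count < SKETCH_BLOCK )
    return 0;

  for(;;)
  {
    sketch_decode_block( out + pos, in + pos );
    ++blocks;
    pos += SKETCH_BLOCK;
    if ( pos == count )                 // finished exactly on a block boundary
      break;
    if ( ( pos + SKETCH_BLOCK ) > count )
      pos = count - SKETCH_BLOCK;       // back up so the last block ends at count
  }

  return blocks;
}
```

  For 20 values this runs two blocks, covering [0,16) and then [4,20), so
  values 4..15 are computed twice.
*/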
8328static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outputp, int width_times_channels, float const * encode )
8329{
8330 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char *) outputp;
8331 unsigned char * end_output = ( (unsigned char *) output ) + width_times_channels;
8333 #ifdef STBIR_SIMD
8334 if ( width_times_channels >= stbir__simdfX_float_count*2 )
8335 {
8336 float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
8337 end_output -= stbir__simdfX_float_count*2;
8338 STBIR_NO_UNROLL_LOOP_START_INF_FOR
8339 for(;;)
8340 {
8341 stbir__simdfX e0, e1;
8342 stbir__simdi i;
8343 STBIR_SIMD_NO_UNROLL(encode);
8344 stbir__simdfX_madd_mem( e0, STBIR_simd_point5X, STBIR_max_uint8_as_floatX, encode );
8345 stbir__simdfX_madd_mem( e1, STBIR_simd_point5X, STBIR_max_uint8_as_floatX, encode+stbir__simdfX_float_count );
8346 stbir__encode_simdfX_unflip( e0 );
8347 stbir__encode_simdfX_unflip( e1 );
8348 #ifdef STBIR_SIMD8
8349 stbir__simdf8_pack_to_16bytes( i, e0, e1 );
8350 stbir__simdi_store( output, i );
8351 #else
8352 stbir__simdf_pack_to_8bytes( i, e0, e1 );
8353 stbir__simdi_store2( output, i );
8354 #endif
8355 encode += stbir__simdfX_float_count*2;
8356 output += stbir__simdfX_float_count*2;
8357 if ( output <= end_output )
8358 continue;
8359 if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
8360 break;
8361 output = end_output; // backup and do last couple
8362 encode = end_encode_m8;
8363 }
8364 return;
8365 }
8367 // try to do blocks of 4 when you can
8368 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8369 output += 4;
8370 STBIR_NO_UNROLL_LOOP_START
8371 while( output <= end_output )
8372 {
8373 stbir__simdf e0;
8374 stbir__simdi i0;
8375 STBIR_NO_UNROLL(encode);
8376 stbir__simdf_load( e0, encode );
8377 stbir__simdf_madd( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), e0 );
8378 stbir__encode_simdf4_unflip( e0 );
8379 stbir__simdf_pack_to_8bytes( i0, e0, e0 ); // only use first 4
8380 *(int*)(output-4) = stbir__simdi_to_int( i0 );
8381 output += 4;
8382 encode += 4;
8383 }
8384 output -= 4;
8385 #endif
8387 // do the remnants
8388 #if stbir__coder_min_num < 4
8389 STBIR_NO_UNROLL_LOOP_START
8390 while( output < end_output )
8391 {
8392 stbir__simdf e0;
8393 STBIR_NO_UNROLL(encode);
8394 stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_uint8( e0 );
8395 #if stbir__coder_min_num >= 2
8396 stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order1 ); output[1] = stbir__simdf_convert_float_to_uint8( e0 );
8397 #endif
8398 #if stbir__coder_min_num >= 3
8399 stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order2 ); output[2] = stbir__simdf_convert_float_to_uint8( e0 );
8400 #endif
8401 output += stbir__coder_min_num;
8402 encode += stbir__coder_min_num;
8403 }
8404 #endif
8406 #else
8408 // try to do blocks of 4 when you can
8409 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8410 output += 4;
8411 while( output <= end_output )
8412 {
8413 float f;
8414 f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[0-4] = (unsigned char)f;
8415 f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[1-4] = (unsigned char)f;
8416 f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[2-4] = (unsigned char)f;
8417 f = encode[stbir__encode_order3] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[3-4] = (unsigned char)f;
8418 output += 4;
8419 encode += 4;
8420 }
8421 output -= 4;
8422 #endif
8424 // do the remnants
8425 #if stbir__coder_min_num < 4
8426 STBIR_NO_UNROLL_LOOP_START
8427 while( output < end_output )
8428 {
8429 float f;
8430 STBIR_NO_UNROLL(encode);
8431 f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[0] = (unsigned char)f;
8432 #if stbir__coder_min_num >= 2
8433 f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[1] = (unsigned char)f;
8434 #endif
8435 #if stbir__coder_min_num >= 3
8436 f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f; STBIR_CLAMP(f, 0, 255); output[2] = (unsigned char)f;
8437 #endif
8438 output += stbir__coder_min_num;
8439 encode += stbir__coder_min_num;
8440 }
8441 #endif
8442 #endif
8443}
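/*
  Illustration only, not part of the library: a minimal sketch of the per-value
  arithmetic in the non-SIMD path of the scaled uint8 encoder above, which
  multiplies by 255, adds 0.5 so that truncation rounds to nearest, and clamps
  to 0..255. The function name sketch_encode_u8_scaled is hypothetical, and NaN
  input is not handled by this sketch.

```c
#include <assert.h>

// Convert a float (nominally in 0..1) to an 8-bit value with round-to-nearest
// and clamping of out-of-range input.
static unsigned char sketch_encode_u8_scaled( float v )
{
  float f;

  assert( v == v );   // reject NaN, which the comparisons below would let through

  f = v * 255.0f + 0.5f;
  if ( f < 0.0f )   f = 0.0f;
  if ( f > 255.0f ) f = 255.0f;
  return (unsigned char) f;
}
```

  Under these assumptions 0.5 encodes to 128, and values below 0 or above 1
  encode to 0 and 255 respectively.
*/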
8445static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int width_times_channels, void const * inputp )
8446{
8447 float STBIR_STREAMOUT_PTR( * ) decode = decodep;
8448 float * decode_end = (float*) decode + width_times_channels;
8449 unsigned char const * input = (unsigned char const*)inputp;
8451 #ifdef STBIR_SIMD
8452 unsigned char const * end_input_m16 = input + width_times_channels - 16;
8453 if ( width_times_channels >= 16 )
8454 {
8455 decode_end -= 16;
8456 STBIR_NO_UNROLL_LOOP_START_INF_FOR
8457 for(;;)
8458 {
8459 #ifdef STBIR_SIMD8
8460 stbir__simdi i; stbir__simdi8 o0,o1;
8461 stbir__simdf8 of0, of1;
8462 STBIR_NO_UNROLL(decode);
8463 stbir__simdi_load( i, input );
8464 stbir__simdi8_expand_u8_to_u32( o0, o1, i );
8465 stbir__simdi8_convert_i32_to_float( of0, o0 );
8466 stbir__simdi8_convert_i32_to_float( of1, o1 );
8467 stbir__decode_simdf8_flip( of0 );
8468 stbir__decode_simdf8_flip( of1 );
8469 stbir__simdf8_store( decode + 0, of0 );
8470 stbir__simdf8_store( decode + 8, of1 );
8471 #else
8472 stbir__simdi i, o0, o1, o2, o3;
8473 stbir__simdf of0, of1, of2, of3;
8474 STBIR_NO_UNROLL(decode);
8475 stbir__simdi_load( i, input );
8476 stbir__simdi_expand_u8_to_u32( o0,o1,o2,o3,i);
8477 stbir__simdi_convert_i32_to_float( of0, o0 );
8478 stbir__simdi_convert_i32_to_float( of1, o1 );
8479 stbir__simdi_convert_i32_to_float( of2, o2 );
8480 stbir__simdi_convert_i32_to_float( of3, o3 );
8481 stbir__decode_simdf4_flip( of0 );
8482 stbir__decode_simdf4_flip( of1 );
8483 stbir__decode_simdf4_flip( of2 );
8484 stbir__decode_simdf4_flip( of3 );
8485 stbir__simdf_store( decode + 0, of0 );
8486 stbir__simdf_store( decode + 4, of1 );
8487 stbir__simdf_store( decode + 8, of2 );
8488 stbir__simdf_store( decode + 12, of3 );
8489#endif
8490 decode += 16;
8491 input += 16;
8492 if ( decode <= decode_end )
8493 continue;
8494 if ( decode == ( decode_end + 16 ) )
8495 break;
8496 decode = decode_end; // backup and do last couple
8497 input = end_input_m16;
8498 }
8499 return;
8500 }
8501 #endif
8503 // try to do blocks of 4 when you can
8504 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8505 decode += 4;
8506 STBIR_SIMD_NO_UNROLL_LOOP_START
8507 while( decode <= decode_end )
8508 {
8509 STBIR_SIMD_NO_UNROLL(decode);
8510 decode[0-4] = ((float)(input[stbir__decode_order0]));
8511 decode[1-4] = ((float)(input[stbir__decode_order1]));
8512 decode[2-4] = ((float)(input[stbir__decode_order2]));
8513 decode[3-4] = ((float)(input[stbir__decode_order3]));
8514 decode += 4;
8515 input += 4;
8516 }
8517 decode -= 4;
8518 #endif
8520 // do the remnants
8521 #if stbir__coder_min_num < 4
8522 STBIR_NO_UNROLL_LOOP_START
8523 while( decode < decode_end )
8524 {
8525 STBIR_NO_UNROLL(decode);
8526 decode[0] = ((float)(input[stbir__decode_order0]));
8527 #if stbir__coder_min_num >= 2
8528 decode[1] = ((float)(input[stbir__decode_order1]));
8529 #endif
8530 #if stbir__coder_min_num >= 3
8531 decode[2] = ((float)(input[stbir__decode_order2]));
8532 #endif
8533 decode += stbir__coder_min_num;
8534 input += stbir__coder_min_num;
8535 }
8536 #endif
8537}
8539static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int width_times_channels, float const * encode )
8540{
8541 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char *) outputp;
8542 unsigned char * end_output = ( (unsigned char *) output ) + width_times_channels;
8544 #ifdef STBIR_SIMD
8545 if ( width_times_channels >= stbir__simdfX_float_count*2 )
8546 {
8547 float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
8548 end_output -= stbir__simdfX_float_count*2;
8549 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
8550 for(;;)
8551 {
8552 stbir__simdfX e0, e1;
8553 stbir__simdi i;
8554 STBIR_SIMD_NO_UNROLL(encode);
8555 stbir__simdfX_add_mem( e0, STBIR_simd_point5X, encode );
8556 stbir__simdfX_add_mem( e1, STBIR_simd_point5X, encode+stbir__simdfX_float_count );
8557 stbir__encode_simdfX_unflip( e0 );
8558 stbir__encode_simdfX_unflip( e1 );
8559 #ifdef STBIR_SIMD8
8560 stbir__simdf8_pack_to_16bytes( i, e0, e1 );
8561 stbir__simdi_store( output, i );
8562 #else
8563 stbir__simdf_pack_to_8bytes( i, e0, e1 );
8564 stbir__simdi_store2( output, i );
8565 #endif
8566 encode += stbir__simdfX_float_count*2;
8567 output += stbir__simdfX_float_count*2;
8568 if ( output <= end_output )
8569 continue;
8570 if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
8571 break;
8572 output = end_output; // backup and do last couple
8573 encode = end_encode_m8;
8574 }
8575 return;
8576 }
8578 // try to do blocks of 4 when you can
8579 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8580 output += 4;
8581 STBIR_NO_UNROLL_LOOP_START
8582 while( output <= end_output )
8583 {
8584 stbir__simdf e0;
8585 stbir__simdi i0;
8586 STBIR_NO_UNROLL(encode);
8587 stbir__simdf_load( e0, encode );
8588 stbir__simdf_add( e0, STBIR__CONSTF(STBIR_simd_point5), e0 );
8589 stbir__encode_simdf4_unflip( e0 );
8590 stbir__simdf_pack_to_8bytes( i0, e0, e0 ); // only use first 4
8591 *(int*)(output-4) = stbir__simdi_to_int( i0 );
8592 output += 4;
8593 encode += 4;
8594 }
8595 output -= 4;
8596 #endif
8598 #else
8600 // try to do blocks of 4 when you can
8601 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8602 output += 4;
8603 while( output <= end_output )
8604 {
8605 float f;
8606 f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 255); output[0-4] = (unsigned char)f;
8607 f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 255); output[1-4] = (unsigned char)f;
8608 f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 255); output[2-4] = (unsigned char)f;
8609 f = encode[stbir__encode_order3] + 0.5f; STBIR_CLAMP(f, 0, 255); output[3-4] = (unsigned char)f;
8610 output += 4;
8611 encode += 4;
8612 }
8613 output -= 4;
8614 #endif
8616 #endif
8618 // do the remnants
8619 #if stbir__coder_min_num < 4
8620 STBIR_NO_UNROLL_LOOP_START
8621 while( output < end_output )
8622 {
8623 float f;
8624 STBIR_NO_UNROLL(encode);
8625 f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 255); output[0] = (unsigned char)f;
8626 #if stbir__coder_min_num >= 2
8627 f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 255); output[1] = (unsigned char)f;
8628 #endif
8629 #if stbir__coder_min_num >= 3
8630 f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 255); output[2] = (unsigned char)f;
8631 #endif
8632 output += stbir__coder_min_num;
8633 encode += stbir__coder_min_num;
8634 }
8635 #endif
8636}
8638static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int width_times_channels, void const * inputp )
8639{
8640 float STBIR_STREAMOUT_PTR( * ) decode = decodep;
8641 float const * decode_end = (float*) decode + width_times_channels;
8642 unsigned char const * input = (unsigned char const *)inputp;
8644 // try to do blocks of 4 when you can
8645 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8646 decode += 4;
8647 while( decode <= decode_end )
8648 {
8649 decode[0-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order0 ] ];
8650 decode[1-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order1 ] ];
8651 decode[2-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order2 ] ];
8652 decode[3-4] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order3 ] ];
8653 decode += 4;
8654 input += 4;
8655 }
8656 decode -= 4;
8657 #endif
8659 // do the remnants
8660 #if stbir__coder_min_num < 4
8661 STBIR_NO_UNROLL_LOOP_START
8662 while( decode < decode_end )
8663 {
8664 STBIR_NO_UNROLL(decode);
8665 decode[0] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order0 ] ];
8666 #if stbir__coder_min_num >= 2
8667 decode[1] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order1 ] ];
8668 #endif
8669 #if stbir__coder_min_num >= 3
8670 decode[2] = stbir__srgb_uchar_to_linear_float[ input[ stbir__decode_order2 ] ];
8671 #endif
8672 decode += stbir__coder_min_num;
8673 input += stbir__coder_min_num;
8674 }
8675 #endif
8676}
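/*
  EXAMPLE: BLOCKS OF FOUR, THEN REMNANTS (illustrative sketch, not part of the library)
    The decoder above advances its pointer by 4 before the loop so that the
    test "decode <= decode_end" means "four more values fit", then removes the
    bias and finishes any leftover pixels. The sketch below follows the same
    control flow with a biased integer index instead of a biased pointer. The
    function names are hypothetical, the conversion is a stand-in for the table
    lookup, and channels is a runtime argument here rather than the
    compile-time stbir__coder_min_num.

```c
#include <assert.h>

// Hypothetical stand-in for the per-element conversion (the real code uses a lookup table).
static float example_convert( unsigned char v )
{
  return (float) v;
}

// n is width times channels; channels plays the role of stbir__coder_min_num.
static void example_decode_blocks( float * decode, unsigned char const * input, int n, int channels )
{
  int i = 0;
  assert( channels >= 1 && channels <= 4 && ( n % channels ) == 0 );

  if ( channels != 3 )  // 3-channel pixels do not tile blocks of 4
  {
    for ( i = 4; i <= n; i += 4 )  // i is biased by 4, so i <= n means a full block fits
    {
      decode[i-4] = example_convert( input[i-4] );
      decode[i-3] = example_convert( input[i-3] );
      decode[i-2] = example_convert( input[i-2] );
      decode[i-1] = example_convert( input[i-1] );
    }
    i -= 4;  // remove the bias: i is now the first unconverted element
  }

  while ( i < n )  // leftovers, one pixel at a time
  {
    int c;
    for ( c = 0; c < channels; c++ )
      decode[i+c] = example_convert( input[i+c] );
    i += channels;
  }
}
```

    With 4 channels the remnant loop never runs, which matches the
    "#if stbir__coder_min_num < 4" guard around it in the real code.
*/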
// Clamp f to [STBIR_almost_zero, STBIR_almost_one], then shift its bit pattern right by 20.
// The bits that remain (exponent plus the top three mantissa bits) index the sRGB table.
#define stbir__min_max_shift20( i, f ) \
    stbir__simdf_max( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_zero )) ); \
    stbir__simdf_min( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_one )) ); \
    stbir__simdi_32shr( i, stbir_simdi_castf( f ), 20 );

// Linear (non-sRGB) channel: f = f * 255 + 0.5, clamped to 0..255, then converted to int32.
#define stbir__scale_and_convert( i, f ) \
    stbir__simdf_madd( f, STBIR__CONSTF( STBIR_simd_point5 ), STBIR__CONSTF( STBIR_max_uint8_as_float ), f ); \
    stbir__simdf_max( f, f, stbir__simdf_zeroP() ); \
    stbir__simdf_min( f, f, STBIR__CONSTF( STBIR_max_uint8_as_float ) ); \
    stbir__simdf_convert_float_to_i32( i, f );

// On entry, i holds the table entries fetched for f. Lower mantissa bits of f (shifted down
// by 12 and masked) are combined with each entry by a 16-bit multiply-add, and the sum is
// shifted right by 16 to leave the final sRGB value in each lane.
#define stbir__linear_to_srgb_finish( i, f ) \
{ \
    stbir__simdi temp; \
    stbir__simdi_32shr( temp, stbir_simdi_castf( f ), 12 ) ; \
    stbir__simdi_and( temp, temp, STBIR__CONSTI(STBIR_mastissa_mask) ); \
    stbir__simdi_or( temp, temp, STBIR__CONSTI(STBIR_topscale) ); \
    stbir__simdi_16madd( i, i, temp ); \
    stbir__simdi_32shr( i, i, 16 ); \
}
8699#define stbir__simdi_table_lookup2( v0,v1, table ) \
8700{ \
8701 stbir__simdi_u32 temp0,temp1; \
8702 temp0.m128i_i128 = v0; \
8703 temp1.m128i_i128 = v1; \
8704 temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \
8705 temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
8706 v0 = temp0.m128i_i128; \
8707 v1 = temp1.m128i_i128; \
8708}
8710#define stbir__simdi_table_lookup3( v0,v1,v2, table ) \
8711{ \
8712 stbir__simdi_u32 temp0,temp1,temp2; \
8713 temp0.m128i_i128 = v0; \
8714 temp1.m128i_i128 = v1; \
8715 temp2.m128i_i128 = v2; \
8716 temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \
8717 temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
8718 temp2.m128i_u32[0] = table[temp2.m128i_i32[0]]; temp2.m128i_u32[1] = table[temp2.m128i_i32[1]]; temp2.m128i_u32[2] = table[temp2.m128i_i32[2]]; temp2.m128i_u32[3] = table[temp2.m128i_i32[3]]; \
8719 v0 = temp0.m128i_i128; \
8720 v1 = temp1.m128i_i128; \
8721 v2 = temp2.m128i_i128; \
8722}
8724#define stbir__simdi_table_lookup4( v0,v1,v2,v3, table ) \
8725{ \
8726 stbir__simdi_u32 temp0,temp1,temp2,temp3; \
8727 temp0.m128i_i128 = v0; \
8728 temp1.m128i_i128 = v1; \
8729 temp2.m128i_i128 = v2; \
8730 temp3.m128i_i128 = v3; \
8731 temp0.m128i_u32[0] = table[temp0.m128i_i32[0]]; temp0.m128i_u32[1] = table[temp0.m128i_i32[1]]; temp0.m128i_u32[2] = table[temp0.m128i_i32[2]]; temp0.m128i_u32[3] = table[temp0.m128i_i32[3]]; \
8732 temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
8733 temp2.m128i_u32[0] = table[temp2.m128i_i32[0]]; temp2.m128i_u32[1] = table[temp2.m128i_i32[1]]; temp2.m128i_u32[2] = table[temp2.m128i_i32[2]]; temp2.m128i_u32[3] = table[temp2.m128i_i32[3]]; \
8734 temp3.m128i_u32[0] = table[temp3.m128i_i32[0]]; temp3.m128i_u32[1] = table[temp3.m128i_i32[1]]; temp3.m128i_u32[2] = table[temp3.m128i_i32[2]]; temp3.m128i_u32[3] = table[temp3.m128i_i32[3]]; \
8735 v0 = temp0.m128i_i128; \
8736 v1 = temp1.m128i_i128; \
8737 v2 = temp2.m128i_i128; \
8738 v3 = temp3.m128i_i128; \
8739}
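/*
  EXAMPLE: PER-LANE TABLE LOOKUP (illustrative sketch, not part of the library)
    The lookup macros above copy each SIMD register into a union, replace each
    of its four 32-bit lanes with table[lane], and copy the result back. The
    callers pass a table pointer that is pre-offset by (127-13)*8 so the
    shifted float bits can be used as indices directly. The sketch below does
    the same gather on a plain array and subtracts the base from the index
    instead of offsetting the pointer. The function name, the table_count
    bounds check and the base argument are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

// Replace each of the four lane values with table[lane - base].
static void example_table_lookup4( uint32_t lanes[4], uint32_t const * table, uint32_t table_count, uint32_t base )
{
  int i;
  for ( i = 0; i < 4; i++ )
  {
    uint32_t idx = lanes[i] - base;
    assert( idx < table_count );  // also catches lanes below base (unsigned wrap)
    lanes[i] = table[idx];
  }
}
```

    Rebasing the index keeps every pointer inside the table, at the cost of
    one subtraction per lane.
*/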
8741static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int width_times_channels, float const * encode )
8742{
8743 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp;
8744 unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
8746 #ifdef STBIR_SIMD
8748 if ( width_times_channels >= 16 )
8749 {
8750 float const * end_encode_m16 = encode + width_times_channels - 16;
8751 end_output -= 16;
8752 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
8753 for(;;)
8754 {
8755 stbir__simdf f0, f1, f2, f3;
8756 stbir__simdi i0, i1, i2, i3;
8757 STBIR_SIMD_NO_UNROLL(encode);
8759 stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
8761 stbir__min_max_shift20( i0, f0 );
8762 stbir__min_max_shift20( i1, f1 );
8763 stbir__min_max_shift20( i2, f2 );
8764 stbir__min_max_shift20( i3, f3 );
8766 stbir__simdi_table_lookup4( i0, i1, i2, i3, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
8768 stbir__linear_to_srgb_finish( i0, f0 );
8769 stbir__linear_to_srgb_finish( i1, f1 );
8770 stbir__linear_to_srgb_finish( i2, f2 );
8771 stbir__linear_to_srgb_finish( i3, f3 );
8773 stbir__interleave_pack_and_store_16_u8( output, STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) );
8775 encode += 16;
8776 output += 16;
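      if ( output <= end_output )
        continue;                        // another full block of 16 still fits
      if ( output == ( end_output + 16 ) )
        break;                           // stopped exactly at the end of the buffer
      output = end_output;               // back up so the last block ends at the end of the buffer,
      encode = end_encode_m16;           //   re-encoding a few values that were already written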
8783 }
8784 return;
8785 }
8786 #endif
8788 // try to do blocks of 4 when you can
8789 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
8790 output += 4;
8791 STBIR_SIMD_NO_UNROLL_LOOP_START
8792 while ( output <= end_output )
8793 {
8794 STBIR_SIMD_NO_UNROLL(encode);
8796 output[0-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] );
8797 output[1-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order1] );
8798 output[2-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order2] );
8799 output[3-4] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order3] );
8801 output += 4;
8802 encode += 4;
8803 }
8804 output -= 4;
8805 #endif
8807 // do the remnants
8808 #if stbir__coder_min_num < 4
8809 STBIR_NO_UNROLL_LOOP_START
8810 while( output < end_output )
8811 {
8812 STBIR_NO_UNROLL(encode);
8813 output[0] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] );
8814 #if stbir__coder_min_num >= 2
8815 output[1] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order1] );
8816 #endif
8817 #if stbir__coder_min_num >= 3
8818 output[2] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order2] );
8819 #endif
8820 output += stbir__coder_min_num;
8821 encode += stbir__coder_min_num;
8822 }
8823 #endif
8824}
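/*
  EXAMPLE: OVERLAPPED FINAL BLOCK (illustrative sketch, not part of the library)
    The SIMD loop above never runs a scalar tail. When the count is not a
    multiple of the block size, it backs both pointers up so that one last
    full block ends exactly at the end of the buffer, and some values are
    simply converted twice. The sketch below reproduces that loop with a scalar
    stand-in for the 16-wide SIMD step. All names are hypothetical, and the
    sketch assumes the output does not alias the input, so that redoing
    elements is harmless.

```c
#include <assert.h>

enum { EXAMPLE_BLOCK = 16 };

// Hypothetical stand-in for one SIMD iteration: converts exactly EXAMPLE_BLOCK values.
static void example_encode_block( unsigned char * out, float const * in )
{
  int i;
  for ( i = 0; i < EXAMPLE_BLOCK; i++ )
  {
    float f = in[i] + 0.5f;
    if ( f < 0.0f ) f = 0.0f;
    if ( f > 255.0f ) f = 255.0f;
    out[i] = (unsigned char) f;
  }
}

// Requires n >= EXAMPLE_BLOCK and that out does not alias in.
static void example_encode_overlapped( unsigned char * out, float const * in, int n )
{
  unsigned char * last_out;
  float const * last_in;
  assert( n >= EXAMPLE_BLOCK );
  last_out = out + n - EXAMPLE_BLOCK;  // where the final block has to start
  last_in = in + n - EXAMPLE_BLOCK;
  for(;;)
  {
    example_encode_block( out, in );
    out += EXAMPLE_BLOCK;
    in += EXAMPLE_BLOCK;
    if ( out <= last_out )
      continue;                        // another full block fits
    if ( out == last_out + EXAMPLE_BLOCK )
      break;                           // finished exactly at the end
    out = last_out;                    // back up: the final block overlaps the previous one
    in = last_in;
  }
}
```

    For n = 20 the first pass converts elements 0..15 and the second pass
    converts 4..19, so elements 4..15 are written twice with the same values.
*/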
8826#if ( stbir__coder_min_num == 4 ) || ( ( stbir__coder_min_num == 1 ) && ( !defined(stbir__decode_swizzle) ) )
8828static void STBIR__CODER_NAME(stbir__decode_uint8_srgb4_linearalpha)( float * decodep, int width_times_channels, void const * inputp )
8829{
8830 float STBIR_STREAMOUT_PTR( * ) decode = decodep;
8831 float const * decode_end = (float*) decode + width_times_channels;
8832 unsigned char const * input = (unsigned char const *)inputp;
8833 do {
8834 decode[0] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ];
8835 decode[1] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order1] ];
8836 decode[2] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order2] ];
8837 decode[3] = ( (float) input[stbir__decode_order3] ) * stbir__max_uint8_as_float_inverted;
8838 input += 4;
8839 decode += 4;
8840 } while( decode < decode_end );
8841}
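/*
  EXAMPLE: LINEAR ALPHA ROUND TRIP (illustrative sketch, not part of the library)
    In the "linearalpha" coders only the color channels go through the sRGB
    table. Alpha is scaled by stbir__max_uint8_as_float_inverted on decode and
    by stbir__max_uint8_as_float on encode. The sketch below shows that alpha
    path on its own and checks that every byte value survives a decode and
    encode round trip. The constants and function names are hypothetical
    stand-ins, assuming the library constants are 255 and 1/255.

```c
#include <assert.h>

// Hypothetical constants standing in for the library's uint8 scale factors.
static float const example_u8_max = 255.0f;
static float const example_u8_inv = 1.0f / 255.0f;

// Byte alpha to a float in 0..1.
static float example_decode_alpha( unsigned char a )
{
  return ( (float) a ) * example_u8_inv;
}

// Float alpha back to a byte: scale, round by adding 0.5, clamp, truncate.
static unsigned char example_encode_alpha( float f )
{
  f = f * example_u8_max + 0.5f;
  if ( f < 0.0f ) f = 0.0f;
  if ( f > 255.0f ) f = 255.0f;
  return (unsigned char) f;
}

// Every byte value should come back unchanged.
static void example_check_alpha_round_trip( void )
{
  int a;
  for ( a = 0; a <= 255; a++ )
    assert( example_encode_alpha( example_decode_alpha( (unsigned char) a ) ) == a );
}
```

    The added 0.5 absorbs the small floating-point error of multiplying by
    1/255 and then by 255, which is why the round trip is exact.
*/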
8844static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * outputp, int width_times_channels, float const * encode )
8845{
8846 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp;
8847 unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
8849 #ifdef STBIR_SIMD
8851 if ( width_times_channels >= 16 )
8852 {
8853 float const * end_encode_m16 = encode + width_times_channels - 16;
8854 end_output -= 16;
8855 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
8856 for(;;)
8857 {
8858 stbir__simdf f0, f1, f2, f3;
8859 stbir__simdi i0, i1, i2, i3;
8861 STBIR_SIMD_NO_UNROLL(encode);
8862 stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
8864 stbir__min_max_shift20( i0, f0 );
8865 stbir__min_max_shift20( i1, f1 );
8866 stbir__min_max_shift20( i2, f2 );
8867 stbir__scale_and_convert( i3, f3 );
8869 stbir__simdi_table_lookup3( i0, i1, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
8871 stbir__linear_to_srgb_finish( i0, f0 );
8872 stbir__linear_to_srgb_finish( i1, f1 );
8873 stbir__linear_to_srgb_finish( i2, f2 );
8875 stbir__interleave_pack_and_store_16_u8( output, STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) );
8877 output += 16;
8878 encode += 16;
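      if ( output <= end_output )
        continue;                        // another full block of 16 still fits
      if ( output == ( end_output + 16 ) )
        break;                           // stopped exactly at the end of the buffer
      output = end_output;               // back up so the last block ends at the end of the buffer,
      encode = end_encode_m16;           //   re-encoding a few values that were already written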
8886 }
8887 return;
8888 }
8889 #endif
8891 STBIR_SIMD_NO_UNROLL_LOOP_START
8892 do {
8893 float f;
8894 STBIR_SIMD_NO_UNROLL(encode);
8896 output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] );
8897 output[stbir__decode_order1] = stbir__linear_to_srgb_uchar( encode[1] );
8898 output[stbir__decode_order2] = stbir__linear_to_srgb_uchar( encode[2] );
8900 f = encode[3] * stbir__max_uint8_as_float + 0.5f;
8901 STBIR_CLAMP(f, 0, 255);
8902 output[stbir__decode_order3] = (unsigned char) f;
8904 output += 4;
8905 encode += 4;
8906 } while( output < end_output );
8907}
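/*
  EXAMPLE: DECODE ORDER USED ON THE OUTPUT SIDE (illustrative sketch, not part of the library)
    The scalar loop above writes output[stbir__decode_order0] = encode[0],
    while the other encoders read encode[stbir__encode_order0] into output[0].
    Both give the same result as long as the encode order is the inverse
    permutation of the decode order. The sketch below checks this for a
    hypothetical ARGB layout. The EX_DEC and EX_ENC values are assumptions
    chosen for illustration, not copied from the library.

```c
#include <assert.h>

// Hypothetical channel orders for an ARGB buffer decoded to RGBA working order.
enum { EX_DEC0 = 1, EX_DEC1 = 2, EX_DEC2 = 3, EX_DEC3 = 0 };  // decode[i] = input[EX_DECi]
enum { EX_ENC0 = 3, EX_ENC1 = 0, EX_ENC2 = 1, EX_ENC3 = 2 };  // output[i] = encode[EX_ENCi]

static void example_decode( float * decode, unsigned char const * input )
{
  assert( decode && input );
  decode[0] = (float) input[EX_DEC0];
  decode[1] = (float) input[EX_DEC1];
  decode[2] = (float) input[EX_DEC2];
  decode[3] = (float) input[EX_DEC3];
}

// Scatter: working channel i goes to the slot it was decoded from.
static void example_encode_scatter( unsigned char * output, float const * encode )
{
  output[EX_DEC0] = (unsigned char) encode[0];
  output[EX_DEC1] = (unsigned char) encode[1];
  output[EX_DEC2] = (unsigned char) encode[2];
  output[EX_DEC3] = (unsigned char) encode[3];
}

// Gather: output slot i pulls the working channel named by the inverse order.
static void example_encode_gather( unsigned char * output, float const * encode )
{
  output[0] = (unsigned char) encode[EX_ENC0];
  output[1] = (unsigned char) encode[EX_ENC1];
  output[2] = (unsigned char) encode[EX_ENC2];
  output[3] = (unsigned char) encode[EX_ENC3];
}
```

    Either form restores the original byte order, so which one a loop uses is
    a matter of convenience.
*/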
8909#endif
8911#if ( stbir__coder_min_num == 2 ) || ( ( stbir__coder_min_num == 1 ) && ( !defined(stbir__decode_swizzle) ) )
static void STBIR__CODER_NAME(stbir__decode_uint8_srgb2_linearalpha)( float * decodep, int width_times_channels, void const * inputp )
{
  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
  float const * decode_end = (float*) decode + width_times_channels;
  unsigned char const * input = (unsigned char const *)inputp;

  // two pixels per iteration, each an sRGB value followed by a linear alpha
  decode += 4;
  while( decode <= decode_end )
  {
    decode[0-4] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ];
    decode[1-4] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted;
    decode[2-4] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0+2] ];
    decode[3-4] = ( (float) input[stbir__decode_order1+2] ) * stbir__max_uint8_as_float_inverted;
    input += 4;
    decode += 4;
  }
  decode -= 4;

  // at most one pixel is left over
  if( decode < decode_end )
  {
    decode[0] = stbir__srgb_uchar_to_linear_float[ input[stbir__decode_order0] ];  // index the table with the input byte, not the order constant
    decode[1] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted;
  }
}
8936static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * outputp, int width_times_channels, float const * encode )
8937{
8938 unsigned char STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned char*) outputp;
8939 unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
8941 #ifdef STBIR_SIMD
8943 if ( width_times_channels >= 16 )
8944 {
8945 float const * end_encode_m16 = encode + width_times_channels - 16;
8946 end_output -= 16;
8947 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
8948 for(;;)
8949 {
8950 stbir__simdf f0, f1, f2, f3;
8951 stbir__simdi i0, i1, i2, i3;
8953 STBIR_SIMD_NO_UNROLL(encode);
8954 stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
8956 stbir__min_max_shift20( i0, f0 );
8957 stbir__scale_and_convert( i1, f1 );
8958 stbir__min_max_shift20( i2, f2 );
8959 stbir__scale_and_convert( i3, f3 );
8961 stbir__simdi_table_lookup2( i0, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
8963 stbir__linear_to_srgb_finish( i0, f0 );
8964 stbir__linear_to_srgb_finish( i2, f2 );
8966 stbir__interleave_pack_and_store_16_u8( output, STBIR_strs_join1(i, ,stbir__encode_order0), STBIR_strs_join1(i, ,stbir__encode_order1), STBIR_strs_join1(i, ,stbir__encode_order2), STBIR_strs_join1(i, ,stbir__encode_order3) );
8968 output += 16;
8969 encode += 16;
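      if ( output <= end_output )
        continue;                        // another full block of 16 still fits
      if ( output == ( end_output + 16 ) )
        break;                           // stopped exactly at the end of the buffer
      output = end_output;               // back up so the last block ends at the end of the buffer,
      encode = end_encode_m16;           //   re-encoding a few values that were already written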
8976 }
8977 return;
8978 }
8979 #endif
8981 STBIR_SIMD_NO_UNROLL_LOOP_START
8982 do {
8983 float f;
8984 STBIR_SIMD_NO_UNROLL(encode);
8986 output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] );
8988 f = encode[1] * stbir__max_uint8_as_float + 0.5f;
8989 STBIR_CLAMP(f, 0, 255);
8990 output[stbir__decode_order1] = (unsigned char) f;
8992 output += 2;
8993 encode += 2;
8994 } while( output < end_output );
8995}
8997#endif
8999static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decodep, int width_times_channels, void const * inputp )
9000{
9001 float STBIR_STREAMOUT_PTR( * ) decode = decodep;
9002 float * decode_end = (float*) decode + width_times_channels;
9003 unsigned short const * input = (unsigned short const *)inputp;
9005 #ifdef STBIR_SIMD
9006 unsigned short const * end_input_m8 = input + width_times_channels - 8;
9007 if ( width_times_channels >= 8 )
9008 {
9009 decode_end -= 8;
9010 STBIR_NO_UNROLL_LOOP_START_INF_FOR
9011 for(;;)
9012 {
9013 #ifdef STBIR_SIMD8
9014 stbir__simdi i; stbir__simdi8 o;
9015 stbir__simdf8 of;
9016 STBIR_NO_UNROLL(decode);
9017 stbir__simdi_load( i, input );
9018 stbir__simdi8_expand_u16_to_u32( o, i );
9019 stbir__simdi8_convert_i32_to_float( of, o );
9020 stbir__simdf8_mult( of, of, STBIR_max_uint16_as_float_inverted8);
9021 stbir__decode_simdf8_flip( of );
9022 stbir__simdf8_store( decode + 0, of );
9023 #else
9024 stbir__simdi i, o0, o1;
9025 stbir__simdf of0, of1;
9026 STBIR_NO_UNROLL(decode);
9027 stbir__simdi_load( i, input );
9028 stbir__simdi_expand_u16_to_u32( o0,o1,i );
9029 stbir__simdi_convert_i32_to_float( of0, o0 );
9030 stbir__simdi_convert_i32_to_float( of1, o1 );
9031 stbir__simdf_mult( of0, of0, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted) );
9032 stbir__simdf_mult( of1, of1, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted));
9033 stbir__decode_simdf4_flip( of0 );
9034 stbir__decode_simdf4_flip( of1 );
9035 stbir__simdf_store( decode + 0, of0 );
9036 stbir__simdf_store( decode + 4, of1 );
9037 #endif
9038 decode += 8;
9039 input += 8;
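      if ( decode <= decode_end )
        continue;                        // another full block of 8 still fits
      if ( decode == ( decode_end + 8 ) )
        break;                           // stopped exactly at the end of the buffer
      decode = decode_end;               // back up so the last block ends at the end of the buffer,
      input = end_input_m8;              //   re-decoding a few values that were already written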
9046 }
9047 return;
9048 }
9049 #endif
9051 // try to do blocks of 4 when you can
9052 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
9053 decode += 4;
9054 STBIR_SIMD_NO_UNROLL_LOOP_START
9055 while( decode <= decode_end )
9056 {
9057 STBIR_SIMD_NO_UNROLL(decode);
9058 decode[0-4] = ((float)(input[stbir__decode_order0])) * stbir__max_uint16_as_float_inverted;
9059 decode[1-4] = ((float)(input[stbir__decode_order1])) * stbir__max_uint16_as_float_inverted;
9060 decode[2-4] = ((float)(input[stbir__decode_order2])) * stbir__max_uint16_as_float_inverted;
9061 decode[3-4] = ((float)(input[stbir__decode_order3])) * stbir__max_uint16_as_float_inverted;
9062 decode += 4;
9063 input += 4;
9064 }
9065 decode -= 4;
9066 #endif
9068 // do the remnants
9069 #if stbir__coder_min_num < 4
9070 STBIR_NO_UNROLL_LOOP_START
9071 while( decode < decode_end )
9072 {
9073 STBIR_NO_UNROLL(decode);
9074 decode[0] = ((float)(input[stbir__decode_order0])) * stbir__max_uint16_as_float_inverted;
9075 #if stbir__coder_min_num >= 2
9076 decode[1] = ((float)(input[stbir__decode_order1])) * stbir__max_uint16_as_float_inverted;
9077 #endif
9078 #if stbir__coder_min_num >= 3
9079 decode[2] = ((float)(input[stbir__decode_order2])) * stbir__max_uint16_as_float_inverted;
9080 #endif
9081 decode += stbir__coder_min_num;
9082 input += stbir__coder_min_num;
9083 }
9084 #endif
9085}
9088static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * outputp, int width_times_channels, float const * encode )
9089{
9090 unsigned short STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned short*) outputp;
9091 unsigned short * end_output = ( (unsigned short*) output ) + width_times_channels;
9093 #ifdef STBIR_SIMD
9094 {
9095 if ( width_times_channels >= stbir__simdfX_float_count*2 )
9096 {
9097 float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
9098 end_output -= stbir__simdfX_float_count*2;
9099 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
9100 for(;;)
9101 {
9102 stbir__simdfX e0, e1;
9103 stbir__simdiX i;
9104 STBIR_SIMD_NO_UNROLL(encode);
9105 stbir__simdfX_madd_mem( e0, STBIR_simd_point5X, STBIR_max_uint16_as_floatX, encode );
9106 stbir__simdfX_madd_mem( e1, STBIR_simd_point5X, STBIR_max_uint16_as_floatX, encode+stbir__simdfX_float_count );
9107 stbir__encode_simdfX_unflip( e0 );
9108 stbir__encode_simdfX_unflip( e1 );
9109 stbir__simdfX_pack_to_words( i, e0, e1 );
9110 stbir__simdiX_store( output, i );
9111 encode += stbir__simdfX_float_count*2;
9112 output += stbir__simdfX_float_count*2;
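        if ( output <= end_output )
          continue;                      // another full block still fits
        if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
          break;                         // stopped exactly at the end of the buffer
        output = end_output;             // back up so the last block ends at the end of the buffer,
        encode = end_encode_m8;          //   re-encoding a few values that were already written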
9119 }
9120 return;
9121 }
9122 }
9124 // try to do blocks of 4 when you can
9125 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
9126 output += 4;
9127 STBIR_NO_UNROLL_LOOP_START
9128 while( output <= end_output )
9129 {
9130 stbir__simdf e;
9131 stbir__simdi i;
9132 STBIR_NO_UNROLL(encode);
9133 stbir__simdf_load( e, encode );
9134 stbir__simdf_madd( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), e );
9135 stbir__encode_simdf4_unflip( e );
9136 stbir__simdf_pack_to_8words( i, e, e ); // only use first 4
9137 stbir__simdi_store2( output-4, i );
9138 output += 4;
9139 encode += 4;
9140 }
9141 output -= 4;
9142 #endif
9144 // do the remnants
9145 #if stbir__coder_min_num < 4
9146 STBIR_NO_UNROLL_LOOP_START
9147 while( output < end_output )
9148 {
9149 stbir__simdf e;
9150 STBIR_NO_UNROLL(encode);
9151 stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_short( e );
9152 #if stbir__coder_min_num >= 2
9153 stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order1 ); output[1] = stbir__simdf_convert_float_to_short( e );
9154 #endif
9155 #if stbir__coder_min_num >= 3
9156 stbir__simdf_madd1_mem( e, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint16_as_float), encode+stbir__encode_order2 ); output[2] = stbir__simdf_convert_float_to_short( e );
9157 #endif
9158 output += stbir__coder_min_num;
9159 encode += stbir__coder_min_num;
9160 }
9161 #endif
9163 #else
9165 // try to do blocks of 4 when you can
9166 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
9167 output += 4;
9168 STBIR_SIMD_NO_UNROLL_LOOP_START
9169 while( output <= end_output )
9170 {
9171 float f;
9172 STBIR_SIMD_NO_UNROLL(encode);
9173 f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0-4] = (unsigned short)f;
9174 f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1-4] = (unsigned short)f;
9175 f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2-4] = (unsigned short)f;
9176 f = encode[stbir__encode_order3] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[3-4] = (unsigned short)f;
9177 output += 4;
9178 encode += 4;
9179 }
9180 output -= 4;
9181 #endif
9183 // do the remnants
9184 #if stbir__coder_min_num < 4
9185 STBIR_NO_UNROLL_LOOP_START
9186 while( output < end_output )
9187 {
9188 float f;
9189 STBIR_NO_UNROLL(encode);
9190 f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0] = (unsigned short)f;
9191 #if stbir__coder_min_num >= 2
9192 f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1] = (unsigned short)f;
9193 #endif
9194 #if stbir__coder_min_num >= 3
9195 f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2] = (unsigned short)f;
9196 #endif
9197 output += stbir__coder_min_num;
9198 encode += stbir__coder_min_num;
9199 }
9200 #endif
9201 #endif
9202}
9204static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int width_times_channels, void const * inputp )
9205{
9206 float STBIR_STREAMOUT_PTR( * ) decode = decodep;
9207 float * decode_end = (float*) decode + width_times_channels;
9208 unsigned short const * input = (unsigned short const *)inputp;
9210 #ifdef STBIR_SIMD
9211 unsigned short const * end_input_m8 = input + width_times_channels - 8;
9212 if ( width_times_channels >= 8 )
9213 {
9214 decode_end -= 8;
9215 STBIR_NO_UNROLL_LOOP_START_INF_FOR
9216 for(;;)
9217 {
9218 #ifdef STBIR_SIMD8
9219 stbir__simdi i; stbir__simdi8 o;
9220 stbir__simdf8 of;
9221 STBIR_NO_UNROLL(decode);
9222 stbir__simdi_load( i, input );
9223 stbir__simdi8_expand_u16_to_u32( o, i );
9224 stbir__simdi8_convert_i32_to_float( of, o );
9225 stbir__decode_simdf8_flip( of );
9226 stbir__simdf8_store( decode + 0, of );
9227 #else
9228 stbir__simdi i, o0, o1;
9229 stbir__simdf of0, of1;
9230 STBIR_NO_UNROLL(decode);
9231 stbir__simdi_load( i, input );
9232 stbir__simdi_expand_u16_to_u32( o0, o1, i );
9233 stbir__simdi_convert_i32_to_float( of0, o0 );
9234 stbir__simdi_convert_i32_to_float( of1, o1 );
9235 stbir__decode_simdf4_flip( of0 );
9236 stbir__decode_simdf4_flip( of1 );
9237 stbir__simdf_store( decode + 0, of0 );
9238 stbir__simdf_store( decode + 4, of1 );
9239 #endif
9240 decode += 8;
9241 input += 8;
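      if ( decode <= decode_end )
        continue;                        // another full block of 8 still fits
      if ( decode == ( decode_end + 8 ) )
        break;                           // stopped exactly at the end of the buffer
      decode = decode_end;               // back up so the last block ends at the end of the buffer,
      input = end_input_m8;              //   re-decoding a few values that were already written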
9248 }
9249 return;
9250 }
9251 #endif
9253 // try to do blocks of 4 when you can
9254 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
9255 decode += 4;
9256 STBIR_SIMD_NO_UNROLL_LOOP_START
9257 while( decode <= decode_end )
9258 {
9259 STBIR_SIMD_NO_UNROLL(decode);
9260 decode[0-4] = ((float)(input[stbir__decode_order0]));
9261 decode[1-4] = ((float)(input[stbir__decode_order1]));
9262 decode[2-4] = ((float)(input[stbir__decode_order2]));
9263 decode[3-4] = ((float)(input[stbir__decode_order3]));
9264 decode += 4;
9265 input += 4;
9266 }
9267 decode -= 4;
9268 #endif
9270 // do the remnants
9271 #if stbir__coder_min_num < 4
9272 STBIR_NO_UNROLL_LOOP_START
9273 while( decode < decode_end )
9274 {
9275 STBIR_NO_UNROLL(decode);
9276 decode[0] = ((float)(input[stbir__decode_order0]));
9277 #if stbir__coder_min_num >= 2
9278 decode[1] = ((float)(input[stbir__decode_order1]));
9279 #endif
9280 #if stbir__coder_min_num >= 3
9281 decode[2] = ((float)(input[stbir__decode_order2]));
9282 #endif
9283 decode += stbir__coder_min_num;
9284 input += stbir__coder_min_num;
9285 }
9286 #endif
9287}
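/*
  EXAMPLE: SCALED VERSUS RAW UINT16 (illustrative sketch, not part of the library)
    The two uint16 decoders above differ only in range. The "linear_scaled"
    version multiplies by stbir__max_uint16_as_float_inverted so values land
    in 0..1, while the "linear" version keeps the raw 0..65535 value as a
    float. The matching encoders multiply by 65535 or by nothing before
    rounding. The sketch below places both conventions side by side. The
    function names and the scale parameter are hypothetical.

```c
#include <assert.h>

// Scaled convention: 0..65535 maps to 0..1.
static float example_decode_u16_scaled( unsigned short v )
{
  return ( (float) v ) * ( 1.0f / 65535.0f );
}

// Raw convention: the integer value is kept as is.
static float example_decode_u16_raw( unsigned short v )
{
  return (float) v;
}

// Pass scale = 65535 for the scaled convention, or 1 for the raw one.
static unsigned short example_encode_u16( float f, float scale )
{
  assert( scale > 0.0f );
  f = f * scale + 0.5f;
  if ( f < 0.0f ) f = 0.0f;
  if ( f > 65535.0f ) f = 65535.0f;
  return (unsigned short) f;
}
```

    A value must be encoded with the same convention it was decoded with.
    Mixing them would scale the data by 65535 in one direction or the other.
*/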
9289static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int width_times_channels, float const * encode )
9290{
9291 unsigned short STBIR_SIMD_STREAMOUT_PTR( * ) output = (unsigned short*) outputp;
9292 unsigned short * end_output = ( (unsigned short*) output ) + width_times_channels;
9294 #ifdef STBIR_SIMD
9295 {
9296 if ( width_times_channels >= stbir__simdfX_float_count*2 )
9297 {
9298 float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
9299 end_output -= stbir__simdfX_float_count*2;
9300 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
9301 for(;;)
9302 {
9303 stbir__simdfX e0, e1;
9304 stbir__simdiX i;
9305 STBIR_SIMD_NO_UNROLL(encode);
9306 stbir__simdfX_add_mem( e0, STBIR_simd_point5X, encode );
9307 stbir__simdfX_add_mem( e1, STBIR_simd_point5X, encode+stbir__simdfX_float_count );
9308 stbir__encode_simdfX_unflip( e0 );
9309 stbir__encode_simdfX_unflip( e1 );
9310 stbir__simdfX_pack_to_words( i, e0, e1 );
9311 stbir__simdiX_store( output, i );
9312 encode += stbir__simdfX_float_count*2;
9313 output += stbir__simdfX_float_count*2;
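        if ( output <= end_output )
          continue;                      // another full block still fits
        if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
          break;                         // stopped exactly at the end of the buffer
        output = end_output;             // back up so the last block ends at the end of the buffer,
        encode = end_encode_m8;          //   re-encoding a few values that were already written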
9320 }
9321 return;
9322 }
9323 }
9325 // try to do blocks of 4 when you can
9326 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
9327 output += 4;
9328 STBIR_NO_UNROLL_LOOP_START
9329 while( output <= end_output )
9330 {
9331 stbir__simdf e;
9332 stbir__simdi i;
9333 STBIR_NO_UNROLL(encode);
9334 stbir__simdf_load( e, encode );
9335 stbir__simdf_add( e, STBIR__CONSTF(STBIR_simd_point5), e );
9336 stbir__encode_simdf4_unflip( e );
9337 stbir__simdf_pack_to_8words( i, e, e ); // only use first 4
9338 stbir__simdi_store2( output-4, i );
9339 output += 4;
9340 encode += 4;
9341 }
9342 output -= 4;
9343 #endif
9345 #else
9347 // try to do blocks of 4 when you can
9348 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
9349 output += 4;
9350 STBIR_SIMD_NO_UNROLL_LOOP_START
9351 while( output <= end_output )
9352 {
9353 float f;
9354 STBIR_SIMD_NO_UNROLL(encode);
9355 f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0-4] = (unsigned short)f;
9356 f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1-4] = (unsigned short)f;
9357 f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2-4] = (unsigned short)f;
9358 f = encode[stbir__encode_order3] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[3-4] = (unsigned short)f;
9359 output += 4;
9360 encode += 4;
9361 }
9362 output -= 4;
9363 #endif
9365 #endif
9367 // do the remnants
9368 #if stbir__coder_min_num < 4
9369 STBIR_NO_UNROLL_LOOP_START
9370 while( output < end_output )
9371 {
9372 float f;
9373 STBIR_NO_UNROLL(encode);
9374 f = encode[stbir__encode_order0] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[0] = (unsigned short)f;
9375 #if stbir__coder_min_num >= 2
9376 f = encode[stbir__encode_order1] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[1] = (unsigned short)f;
9377 #endif
9378 #if stbir__coder_min_num >= 3
9379 f = encode[stbir__encode_order2] + 0.5f; STBIR_CLAMP(f, 0, 65535); output[2] = (unsigned short)f;
9380 #endif
9381 output += stbir__coder_min_num;
9382 encode += stbir__coder_min_num;
9383 }
9384 #endif
9385}
9387static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep, int width_times_channels, void const * inputp )
9388{
9389 float STBIR_STREAMOUT_PTR( * ) decode = decodep;
9390 float * decode_end = (float*) decode + width_times_channels;
9391 stbir__FP16 const * input = (stbir__FP16 const *)inputp;
9393 #ifdef STBIR_SIMD
9394 if ( width_times_channels >= 8 )
9395 {
9396 stbir__FP16 const * end_input_m8 = input + width_times_channels - 8;
9397 decode_end -= 8;
9398 STBIR_NO_UNROLL_LOOP_START_INF_FOR
9399 for(;;)
9400 {
9401 STBIR_NO_UNROLL(decode);
9403 stbir__half_to_float_SIMD( decode, input );
9404 #ifdef stbir__decode_swizzle
9405 #ifdef STBIR_SIMD8
9406 {
9407 stbir__simdf8 of;
9408 stbir__simdf8_load( of, decode );
9409 stbir__decode_simdf8_flip( of );
9410 stbir__simdf8_store( decode, of );
9411 }
9412 #else
9413 {
9414 stbir__simdf of0,of1;
9415 stbir__simdf_load( of0, decode );
9416 stbir__simdf_load( of1, decode+4 );
9417 stbir__decode_simdf4_flip( of0 );
9418 stbir__decode_simdf4_flip( of1 );
9419 stbir__simdf_store( decode, of0 );
9420 stbir__simdf_store( decode+4, of1 );
9421 }
9422 #endif
9423 #endif
9424 decode += 8;
9425 input += 8;
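      if ( decode <= decode_end )
        continue;                        // another full block of 8 still fits
      if ( decode == ( decode_end + 8 ) )
        break;                           // stopped exactly at the end of the buffer
      decode = decode_end;               // back up so the last block ends at the end of the buffer,
      input = end_input_m8;              //   re-decoding a few values that were already written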
9432 }
9433 return;
9434 }
9435 #endif
9437 // try to do blocks of 4 when you can
9438 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
9439 decode += 4;
9440 STBIR_SIMD_NO_UNROLL_LOOP_START
9441 while( decode <= decode_end )
9442 {
9443 STBIR_SIMD_NO_UNROLL(decode);
9444 decode[0-4] = stbir__half_to_float(input[stbir__decode_order0]);
9445 decode[1-4] = stbir__half_to_float(input[stbir__decode_order1]);
9446 decode[2-4] = stbir__half_to_float(input[stbir__decode_order2]);
9447 decode[3-4] = stbir__half_to_float(input[stbir__decode_order3]);
9448 decode += 4;
9449 input += 4;
9450 }
9451 decode -= 4;
9452 #endif
9454 // do the remnants
9455 #if stbir__coder_min_num < 4
9456 STBIR_NO_UNROLL_LOOP_START
9457 while( decode < decode_end )
9458 {
9459 STBIR_NO_UNROLL(decode);
9460 decode[0] = stbir__half_to_float(input[stbir__decode_order0]);
9461 #if stbir__coder_min_num >= 2
9462 decode[1] = stbir__half_to_float(input[stbir__decode_order1]);
9463 #endif
9464 #if stbir__coder_min_num >= 3
9465 decode[2] = stbir__half_to_float(input[stbir__decode_order2]);
9466 #endif
9467 decode += stbir__coder_min_num;
9468 input += stbir__coder_min_num;
9469 }
9470 #endif
9471}
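/*
  EXAMPLE: HALF TO FLOAT (illustrative sketch, not part of the library)
    The decoder above relies on stbir__half_to_float, which is defined
    elsewhere in the file. For readers unfamiliar with the 16-bit format
    (1 sign bit, 5 exponent bits, 10 mantissa bits), the sketch below is an
    independent bit-manipulation conversion. It is not the library's
    implementation, and example_half_to_float is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static float example_half_to_float( uint16_t h )
{
  uint32_t sign = (uint32_t)( h & 0x8000u ) << 16;
  uint32_t exp = ( h >> 10 ) & 0x1fu;
  uint32_t man = h & 0x3ffu;
  uint32_t bits;
  float f;

  assert( sizeof( float ) == sizeof( uint32_t ) );

  if ( exp == 0 )
  {
    if ( man == 0 )
      bits = sign;                               // signed zero
    else
    {
      int shift = 0;                             // subnormal half: renormalize
      while ( ( man & 0x400u ) == 0 ) { man <<= 1; shift++; }
      man &= 0x3ffu;                             // drop the now-implicit leading one
      bits = sign | ( (uint32_t)( 127 - 15 + 1 - shift ) << 23 ) | ( man << 13 );
    }
  }
  else if ( exp == 31 )
    bits = sign | 0x7f800000u | ( man << 13 );   // infinity or NaN
  else
    bits = sign | ( ( exp + ( 127 - 15 ) ) << 23 ) | ( man << 13 );  // rebias the exponent

  memcpy( &f, &bits, sizeof f );                 // reinterpret the bits without aliasing issues
  return f;
}
```

    Every half value is exactly representable as a float, so this direction
    needs no rounding.
*/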
9473static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp, int width_times_channels, float const * encode )
9474{
9475 stbir__FP16 STBIR_SIMD_STREAMOUT_PTR( * ) output = (stbir__FP16*) outputp;
9476 stbir__FP16 * end_output = ( (stbir__FP16*) output ) + width_times_channels;
9478 #ifdef STBIR_SIMD
9479 if ( width_times_channels >= 8 )
9480 {
9481 float const * end_encode_m8 = encode + width_times_channels - 8;
9482 end_output -= 8;
9483 STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
9484 for(;;)
9485 {
9486 STBIR_SIMD_NO_UNROLL(encode);
9487 #ifdef stbir__decode_swizzle
9488 #ifdef STBIR_SIMD8
9489 {
9490 stbir__simdf8 of;
9491 stbir__simdf8_load( of, encode );
9492 stbir__encode_simdf8_unflip( of );
9493 stbir__float_to_half_SIMD( output, (float*)&of );
9494 }
9495 #else
9496 {
9497 stbir__simdf of[2];
9498 stbir__simdf_load( of[0], encode );
9499 stbir__simdf_load( of[1], encode+4 );
9500 stbir__encode_simdf4_unflip( of[0] );
9501 stbir__encode_simdf4_unflip( of[1] );
9502 stbir__float_to_half_SIMD( output, (float*)of );
9503 }
9504 #endif
9505 #else
9506 stbir__float_to_half_SIMD( output, encode );
9507 #endif
9508 encode += 8;
9509 output += 8;
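      if ( output <= end_output )
        continue;                        // another full block of 8 still fits
      if ( output == ( end_output + 8 ) )
        break;                           // stopped exactly at the end of the buffer
      output = end_output;               // back up so the last block ends at the end of the buffer,
      encode = end_encode_m8;            //   re-encoding a few values that were already written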
9516 }
9517 return;
9518 }
9519 #endif
9521 // try to do blocks of 4 when you can
9522 #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
9523 output += 4;
9524 STBIR_SIMD_NO_UNROLL_LOOP_START
9525 while( output <= end_output )
9526 {
9527 STBIR_SIMD_NO_UNROLL(output);
9528 output[0-4] = stbir__float_to_half(encode[stbir__encode_order0]);
9529 output[1-4] = stbir__float_to_half(encode[stbir__encode_order1]);
9530 output[2-4] = stbir__float_to_half(encode[stbir__encode_order2]);
9531 output[3-4] = stbir__float_to_half(encode[stbir__encode_order3]);
9532 output += 4;
9533 encode += 4;
9534 }
9535 output -= 4;
9536 #endif
9538 // do the remnants
9539 #if stbir__coder_min_num < 4
9540 STBIR_NO_UNROLL_LOOP_START
9541 while( output < end_output )
9542 {
9543 STBIR_NO_UNROLL(output);
9544 output[0] = stbir__float_to_half(encode[stbir__encode_order0]);
9545 #if stbir__coder_min_num >= 2
9546 output[1] = stbir__float_to_half(encode[stbir__encode_order1]);
9547 #endif
9548 #if stbir__coder_min_num >= 3
9549 output[2] = stbir__float_to_half(encode[stbir__encode_order2]);
9550 #endif
9551 output += stbir__coder_min_num;
9552 encode += stbir__coder_min_num;
9553 }
9554 #endif
9555}
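
/*
  Illustrative note, not part of the library. The SIMD loop in the encoder
  above handles widths that are not a multiple of 8 by backing up and
  re-encoding one final block of 8 that ends exactly at the end of the row,
  instead of dropping to a scalar tail. The scalar sketch below mirrors only
  that control flow. convert8 and encode_blocks_of_8 are hypothetical names
  made up for this sketch, and convert8 merely stands in for an 8-wide kernel
  such as stbir__float_to_half_SIMD. The overlap is safe here because the
  input and output buffers are distinct, so rewriting an element gives the
  same value.

```c
#include <assert.h>

// Hypothetical stand-in for an 8-wide SIMD kernel: converts 8 floats to ints.
static void convert8( int * out, float const * in )
{
  int i;
  for ( i = 0 ; i < 8 ; i++ )
    out[i] = (int) in[i];
}

// Converts n (>= 8) elements in blocks of 8. If n is not a multiple of 8,
// the final block is moved back so that it ends exactly at out + n and
// overlaps elements that were already written.
// Returns the number of blocks processed.
static int encode_blocks_of_8( int * out, float const * in, int n )
{
  int * end_out;
  float const * end_in_m8;
  int blocks = 0;

  assert( n >= 8 );
  end_out = out + n - 8;      // start of the last block that still fits
  end_in_m8 = in + n - 8;

  for(;;)
  {
    convert8( out, in );
    ++blocks;
    in += 8;
    out += 8;
    if ( out <= end_out )
      continue;               // another whole block fits
    if ( out == ( end_out + 8 ) )
      break;                  // landed exactly on the end, all done
    out = end_out;            // back up and redo the last few elements
    in = end_in_m8;
  }
  return blocks;
}
```

  With n = 11 this runs two blocks: elements 0..7, then elements 3..10.
*/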

static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int width_times_channels, void const * inputp )
{
  #ifdef stbir__decode_swizzle
  float STBIR_STREAMOUT_PTR( * ) decode = decodep;
  float * decode_end = (float*) decode + width_times_channels;
  float const * input = (float const *)inputp;

  #ifdef STBIR_SIMD
  if ( width_times_channels >= 16 )
  {
    float const * end_input_m16 = input + width_times_channels - 16;
    decode_end -= 16;
    STBIR_NO_UNROLL_LOOP_START_INF_FOR
    for(;;)
    {
      STBIR_NO_UNROLL(decode);
      #ifdef stbir__decode_swizzle
      #ifdef STBIR_SIMD8
      {
        stbir__simdf8 of0,of1;
        stbir__simdf8_load( of0, input );
        stbir__simdf8_load( of1, input+8 );
        stbir__decode_simdf8_flip( of0 );
        stbir__decode_simdf8_flip( of1 );
        stbir__simdf8_store( decode, of0 );
        stbir__simdf8_store( decode+8, of1 );
      }
      #else
      {
        stbir__simdf of0,of1,of2,of3;
        stbir__simdf_load( of0, input );
        stbir__simdf_load( of1, input+4 );
        stbir__simdf_load( of2, input+8 );
        stbir__simdf_load( of3, input+12 );
        stbir__decode_simdf4_flip( of0 );
        stbir__decode_simdf4_flip( of1 );
        stbir__decode_simdf4_flip( of2 );
        stbir__decode_simdf4_flip( of3 );
        stbir__simdf_store( decode, of0 );
        stbir__simdf_store( decode+4, of1 );
        stbir__simdf_store( decode+8, of2 );
        stbir__simdf_store( decode+12, of3 );
      }
      #endif
      #endif
      decode += 16;
      input += 16;
      if ( decode <= decode_end )
        continue;
      if ( decode == ( decode_end + 16 ) )
        break;
      decode = decode_end; // backup and do last couple
      input = end_input_m16;
    }
    return;
  }
  #endif

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  decode += 4;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( decode <= decode_end )
  {
    STBIR_SIMD_NO_UNROLL(decode);
    decode[0-4] = input[stbir__decode_order0];
    decode[1-4] = input[stbir__decode_order1];
    decode[2-4] = input[stbir__decode_order2];
    decode[3-4] = input[stbir__decode_order3];
    decode += 4;
    input += 4;
  }
  decode -= 4;
  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( decode < decode_end )
  {
    STBIR_NO_UNROLL(decode);
    decode[0] = input[stbir__decode_order0];
    #if stbir__coder_min_num >= 2
    decode[1] = input[stbir__decode_order1];
    #endif
    #if stbir__coder_min_num >= 3
    decode[2] = input[stbir__decode_order2];
    #endif
    decode += stbir__coder_min_num;
    input += stbir__coder_min_num;
  }
  #endif

  #else

  if ( (void*)decodep != inputp )
    STBIR_MEMCPY( decodep, inputp, width_times_channels * sizeof( float ) );

  #endif
}

static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int width_times_channels, float const * encode )
{
  // the plain memcpy path is only valid when no clamp and no swizzle is requested
  #if !defined( STBIR_FLOAT_HIGH_CLAMP ) && !defined(STBIR_FLOAT_LOW_CLAMP) && !defined(stbir__decode_swizzle)

  if ( (void*)outputp != (void*) encode )
    STBIR_MEMCPY( outputp, encode, width_times_channels * sizeof( float ) );

  #else

  float STBIR_SIMD_STREAMOUT_PTR( * ) output = (float*) outputp;
  float * end_output = ( (float*) output ) + width_times_channels;

  #ifdef STBIR_FLOAT_HIGH_CLAMP
  #define stbir_scalar_hi_clamp( v ) if ( v > STBIR_FLOAT_HIGH_CLAMP ) v = STBIR_FLOAT_HIGH_CLAMP;
  #else
  #define stbir_scalar_hi_clamp( v )
  #endif
  #ifdef STBIR_FLOAT_LOW_CLAMP
  #define stbir_scalar_lo_clamp( v ) if ( v < STBIR_FLOAT_LOW_CLAMP ) v = STBIR_FLOAT_LOW_CLAMP;
  #else
  #define stbir_scalar_lo_clamp( v )
  #endif

  #ifdef STBIR_SIMD

  #ifdef STBIR_FLOAT_HIGH_CLAMP
  const stbir__simdfX high_clamp = stbir__simdf_frepX(STBIR_FLOAT_HIGH_CLAMP);
  #endif
  #ifdef STBIR_FLOAT_LOW_CLAMP
  const stbir__simdfX low_clamp = stbir__simdf_frepX(STBIR_FLOAT_LOW_CLAMP);
  #endif

  if ( width_times_channels >= ( stbir__simdfX_float_count * 2 ) )
  {
    float const * end_encode_m8 = encode + width_times_channels - ( stbir__simdfX_float_count * 2 );
    end_output -= ( stbir__simdfX_float_count * 2 );
    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
    for(;;)
    {
      stbir__simdfX e0, e1;
      STBIR_SIMD_NO_UNROLL(encode);
      stbir__simdfX_load( e0, encode );
      stbir__simdfX_load( e1, encode+stbir__simdfX_float_count );
#ifdef STBIR_FLOAT_HIGH_CLAMP
      stbir__simdfX_min( e0, e0, high_clamp );
      stbir__simdfX_min( e1, e1, high_clamp );
#endif
#ifdef STBIR_FLOAT_LOW_CLAMP
      stbir__simdfX_max( e0, e0, low_clamp );
      stbir__simdfX_max( e1, e1, low_clamp );
#endif
      stbir__encode_simdfX_unflip( e0 );
      stbir__encode_simdfX_unflip( e1 );
      stbir__simdfX_store( output, e0 );
      stbir__simdfX_store( output+stbir__simdfX_float_count, e1 );
      encode += stbir__simdfX_float_count * 2;
      output += stbir__simdfX_float_count * 2;
      if ( output < end_output )
        continue;
      if ( output == ( end_output + ( stbir__simdfX_float_count * 2 ) ) )
        break;
      output = end_output; // backup and do last couple
      encode = end_encode_m8;
    }
    return;
  }

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  output += 4;
  STBIR_NO_UNROLL_LOOP_START
  while( output <= end_output )
  {
    stbir__simdf e0;
    STBIR_NO_UNROLL(encode);
    stbir__simdf_load( e0, encode );
#ifdef STBIR_FLOAT_HIGH_CLAMP
    stbir__simdf_min( e0, e0, high_clamp );
#endif
#ifdef STBIR_FLOAT_LOW_CLAMP
    stbir__simdf_max( e0, e0, low_clamp );
#endif
    stbir__encode_simdf4_unflip( e0 );
    stbir__simdf_store( output-4, e0 );
    output += 4;
    encode += 4;
  }
  output -= 4;
  #endif

  #else

  // try to do blocks of 4 when you can
  #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
  output += 4;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  while( output <= end_output )
  {
    float e;
    STBIR_SIMD_NO_UNROLL(encode);
    e = encode[ stbir__encode_order0 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[0-4] = e;
    e = encode[ stbir__encode_order1 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[1-4] = e;
    e = encode[ stbir__encode_order2 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[2-4] = e;
    e = encode[ stbir__encode_order3 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[3-4] = e;
    output += 4;
    encode += 4;
  }
  output -= 4;

  #endif

  #endif

  // do the remnants
  #if stbir__coder_min_num < 4
  STBIR_NO_UNROLL_LOOP_START
  while( output < end_output )
  {
    float e;
    STBIR_NO_UNROLL(encode);
    e = encode[ stbir__encode_order0 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[0] = e;
    #if stbir__coder_min_num >= 2
    e = encode[ stbir__encode_order1 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[1] = e;
    #endif
    #if stbir__coder_min_num >= 3
    e = encode[ stbir__encode_order2 ]; stbir_scalar_hi_clamp( e ); stbir_scalar_lo_clamp( e ); output[2] = e;
    #endif
    output += stbir__coder_min_num;
    encode += stbir__coder_min_num;
  }
  #endif

  #endif
}
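
/*
  Illustrative note, not part of the library. The float encoder above builds
  its scalar clamps from macros that expand to nothing when the matching
  limit is not defined, so an unclamped build pays no per-pixel cost. A
  minimal sketch of that pattern follows. EXAMPLE_HIGH_CLAMP,
  EXAMPLE_LOW_CLAMP, example_hi_clamp, example_lo_clamp and
  example_encode_clamped are hypothetical names for this sketch only, and the
  limits 0 and 1 are assumed values, not library defaults.

```c
#include <assert.h>

// Assumed limits for this sketch. Remove either define and the matching
// clamp below compiles away entirely.
#define EXAMPLE_HIGH_CLAMP 1.0f
#define EXAMPLE_LOW_CLAMP  0.0f

#ifdef EXAMPLE_HIGH_CLAMP
#define example_hi_clamp( v ) if ( v > EXAMPLE_HIGH_CLAMP ) v = EXAMPLE_HIGH_CLAMP;
#else
#define example_hi_clamp( v )
#endif

#ifdef EXAMPLE_LOW_CLAMP
#define example_lo_clamp( v ) if ( v < EXAMPLE_LOW_CLAMP ) v = EXAMPLE_LOW_CLAMP;
#else
#define example_lo_clamp( v )
#endif

// Copies count floats from encode to output, applying whichever clamps
// were enabled at compile time.
static void example_encode_clamped( float * output, int count, float const * encode )
{
  int i;
  assert( count >= 0 );
  for ( i = 0 ; i < count ; i++ )
  {
    float e = encode[i];
    example_hi_clamp( e );
    example_lo_clamp( e );
    output[i] = e;
  }
}
```

  With both limits defined as above, the inputs -0.5, 0.25 and 1.5 come out
  as 0, 0.25 and 1.
*/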

#undef stbir__decode_suffix
#undef stbir__decode_simdf8_flip
#undef stbir__decode_simdf4_flip
#undef stbir__decode_order0
#undef stbir__decode_order1
#undef stbir__decode_order2
#undef stbir__decode_order3
#undef stbir__encode_order0
#undef stbir__encode_order1
#undef stbir__encode_order2
#undef stbir__encode_order3
#undef stbir__encode_simdf8_unflip
#undef stbir__encode_simdf4_unflip
#undef stbir__encode_simdfX_unflip
#undef STBIR__CODER_NAME
#undef stbir__coder_min_num
#undef stbir__decode_swizzle
#undef stbir_scalar_hi_clamp
#undef stbir_scalar_lo_clamp
#undef STB_IMAGE_RESIZE_DO_CODERS

#elif defined( STB_IMAGE_RESIZE_DO_VERTICALS)

#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#define STBIR_chans( start, end ) STBIR_strs_join14(start,STBIR__vertical_channels,end,_cont)
#else
#define STBIR_chans( start, end ) STBIR_strs_join1(start,STBIR__vertical_channels,end)
#endif
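
/*
  Illustrative note, not part of the library. The vertical routines named by
  STBIR_chans come in two shapes. A gather builds one output row as the
  weighted sum of several input rows. A scatter takes one input row and
  weights it into several output rows, either overwriting them or, in the
  variants built with STB_IMAGE_RESIZE_VERTICAL_CONTINUE, adding to what is
  already there. The plain-C sketch below shows both shapes under those
  assumptions. The function names and the runtime num_rows and accumulate
  parameters are hypothetical; the library fixes those choices at compile
  time instead.

```c
#include <assert.h>

// Gather: output[x] = sum over k of inputs[k][x] * coeffs[k].
static void example_vertical_gather( float * output, float const * coeffs, float const ** inputs, int num_rows, int width )
{
  int x, k;
  assert( num_rows >= 1 );
  for ( x = 0 ; x < width ; x++ )
  {
    float sum = 0.0f;
    for ( k = 0 ; k < num_rows ; k++ )
      sum += inputs[k][x] * coeffs[k];
    output[x] = sum;
  }
}

// Scatter: outputs[k][x] receives input[x] * coeffs[k]. When accumulate is
// nonzero the product is added to the existing value, otherwise it
// replaces it.
static void example_vertical_scatter( float ** outputs, float const * coeffs, float const * input, int num_rows, int width, int accumulate )
{
  int x, k;
  assert( num_rows >= 1 );
  for ( x = 0 ; x < width ; x++ )
  {
    for ( k = 0 ; k < num_rows ; k++ )
    {
      float v = input[x] * coeffs[k];
      if ( accumulate )
        outputs[k][x] += v;
      else
        outputs[k][x] = v;
    }
  }
}
```

  A first scatter call would pass accumulate = 0 to initialize the output
  rows, and later calls for further input rows would pass accumulate = 1.
*/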

#if STBIR__vertical_channels >= 1
#define stbIF0( code ) code
#else
#define stbIF0( code )
#endif
#if STBIR__vertical_channels >= 2
#define stbIF1( code ) code
#else
#define stbIF1( code )
#endif
#if STBIR__vertical_channels >= 3
#define stbIF2( code ) code
#else
#define stbIF2( code )
#endif
#if STBIR__vertical_channels >= 4
#define stbIF3( code ) code
#else
#define stbIF3( code )
#endif
#if STBIR__vertical_channels >= 5
#define stbIF4( code ) code
#else
#define stbIF4( code )
#endif
#if STBIR__vertical_channels >= 6
#define stbIF5( code ) code
#else
#define stbIF5( code )
#endif
#if STBIR__vertical_channels >= 7
#define stbIF6( code ) code
#else
#define stbIF6( code )
#endif
#if STBIR__vertical_channels >= 8
#define stbIF7( code ) code
#else
#define stbIF7( code )
#endif
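
/*
  Illustrative note, not part of the library. Each stbIFn macro above either
  passes its argument through or drops it, depending on
  STBIR__vertical_channels, so one function body can be compiled once per
  row count with the unused rows removed by the preprocessor. A minimal
  sketch of the same idea follows, assuming a hypothetical EXAMPLE_CHANNELS
  count of 3 and hypothetical exIF0..exIF3 and example_weighted_sum names.
  Note that the wrapped code must not contain a top-level comma, or the
  preprocessor would split it into several macro arguments.

```c
#include <assert.h>

#define EXAMPLE_CHANNELS 3   // hypothetical stand-in for STBIR__vertical_channels

#if EXAMPLE_CHANNELS >= 1
#define exIF0( code ) code
#else
#define exIF0( code )
#endif
#if EXAMPLE_CHANNELS >= 2
#define exIF1( code ) code
#else
#define exIF1( code )
#endif
#if EXAMPLE_CHANNELS >= 3
#define exIF2( code ) code
#else
#define exIF2( code )
#endif
#if EXAMPLE_CHANNELS >= 4
#define exIF3( code ) code
#else
#define exIF3( code )
#endif

// Weighted sum of column x over the first EXAMPLE_CHANNELS rows. Rows and
// coefficients past that count are never read, because the lines that
// would touch them expand to nothing.
static float example_weighted_sum( float const ** rows, float const * coeffs, int x )
{
  float sum = 0.0f;
  assert( x >= 0 );
  exIF0( sum += rows[0][x] * coeffs[0]; )
  exIF1( sum += rows[1][x] * coeffs[1]; )
  exIF2( sum += rows[2][x] * coeffs[2]; )
  exIF3( sum += rows[3][x] * coeffs[3]; )
  return sum;
}
```

  With EXAMPLE_CHANNELS set to 3, the exIF3 line vanishes and only three
  rows need to be supplied.
*/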

static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** outputs, float const * vertical_coefficients, float const * input, float const * input_end )
{
  stbIF0( float STBIR_SIMD_STREAMOUT_PTR( * ) output0 = outputs[0]; float c0s = vertical_coefficients[0]; )
  stbIF1( float STBIR_SIMD_STREAMOUT_PTR( * ) output1 = outputs[1]; float c1s = vertical_coefficients[1]; )
  stbIF2( float STBIR_SIMD_STREAMOUT_PTR( * ) output2 = outputs[2]; float c2s = vertical_coefficients[2]; )
  stbIF3( float STBIR_SIMD_STREAMOUT_PTR( * ) output3 = outputs[3]; float c3s = vertical_coefficients[3]; )
  stbIF4( float STBIR_SIMD_STREAMOUT_PTR( * ) output4 = outputs[4]; float c4s = vertical_coefficients[4]; )
  stbIF5( float STBIR_SIMD_STREAMOUT_PTR( * ) output5 = outputs[5]; float c5s = vertical_coefficients[5]; )
  stbIF6( float STBIR_SIMD_STREAMOUT_PTR( * ) output6 = outputs[6]; float c6s = vertical_coefficients[6]; )
  stbIF7( float STBIR_SIMD_STREAMOUT_PTR( * ) output7 = outputs[7]; float c7s = vertical_coefficients[7]; )

  #ifdef STBIR_SIMD
  {
    stbIF0(stbir__simdfX c0 = stbir__simdf_frepX( c0s ); )
    stbIF1(stbir__simdfX c1 = stbir__simdf_frepX( c1s ); )
    stbIF2(stbir__simdfX c2 = stbir__simdf_frepX( c2s ); )
    stbIF3(stbir__simdfX c3 = stbir__simdf_frepX( c3s ); )
    stbIF4(stbir__simdfX c4 = stbir__simdf_frepX( c4s ); )
    stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); )
    stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); )
    stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); )
    STBIR_SIMD_NO_UNROLL_LOOP_START
    while ( ( (char*)input_end - (char*) input ) >= (16*stbir__simdfX_float_count) )
    {
      stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
      STBIR_SIMD_NO_UNROLL(output0);

      stbir__simdfX_load( r0, input ); stbir__simdfX_load( r1, input+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input+(3*stbir__simdfX_float_count) );

      #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
      stbIF0( stbir__simdfX_load( o0, output0 ); stbir__simdfX_load( o1, output0+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output0+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output0+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c0 ); stbir__simdfX_madd( o1, o1, r1, c0 ); stbir__simdfX_madd( o2, o2, r2, c0 ); stbir__simdfX_madd( o3, o3, r3, c0 );
              stbir__simdfX_store( output0, o0 ); stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); )
      stbIF1( stbir__simdfX_load( o0, output1 ); stbir__simdfX_load( o1, output1+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output1+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output1+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c1 ); stbir__simdfX_madd( o1, o1, r1, c1 ); stbir__simdfX_madd( o2, o2, r2, c1 ); stbir__simdfX_madd( o3, o3, r3, c1 );
              stbir__simdfX_store( output1, o0 ); stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); )
      stbIF2( stbir__simdfX_load( o0, output2 ); stbir__simdfX_load( o1, output2+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output2+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output2+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c2 ); stbir__simdfX_madd( o1, o1, r1, c2 ); stbir__simdfX_madd( o2, o2, r2, c2 ); stbir__simdfX_madd( o3, o3, r3, c2 );
              stbir__simdfX_store( output2, o0 ); stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); )
      stbIF3( stbir__simdfX_load( o0, output3 ); stbir__simdfX_load( o1, output3+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output3+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output3+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c3 ); stbir__simdfX_madd( o1, o1, r1, c3 ); stbir__simdfX_madd( o2, o2, r2, c3 ); stbir__simdfX_madd( o3, o3, r3, c3 );
              stbir__simdfX_store( output3, o0 ); stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); )
      stbIF4( stbir__simdfX_load( o0, output4 ); stbir__simdfX_load( o1, output4+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output4+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output4+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c4 ); stbir__simdfX_madd( o1, o1, r1, c4 ); stbir__simdfX_madd( o2, o2, r2, c4 ); stbir__simdfX_madd( o3, o3, r3, c4 );
              stbir__simdfX_store( output4, o0 ); stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); )
      stbIF5( stbir__simdfX_load( o0, output5 ); stbir__simdfX_load( o1, output5+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output5+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output5+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c5 ); stbir__simdfX_madd( o1, o1, r1, c5 ); stbir__simdfX_madd( o2, o2, r2, c5 ); stbir__simdfX_madd( o3, o3, r3, c5 );
              stbir__simdfX_store( output5, o0 ); stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); )
      stbIF6( stbir__simdfX_load( o0, output6 ); stbir__simdfX_load( o1, output6+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output6+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output6+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c6 ); stbir__simdfX_madd( o1, o1, r1, c6 ); stbir__simdfX_madd( o2, o2, r2, c6 ); stbir__simdfX_madd( o3, o3, r3, c6 );
              stbir__simdfX_store( output6, o0 ); stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); )
      stbIF7( stbir__simdfX_load( o0, output7 ); stbir__simdfX_load( o1, output7+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output7+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output7+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c7 ); stbir__simdfX_madd( o1, o1, r1, c7 ); stbir__simdfX_madd( o2, o2, r2, c7 ); stbir__simdfX_madd( o3, o3, r3, c7 );
              stbir__simdfX_store( output7, o0 ); stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); )
      #else
      stbIF0( stbir__simdfX_mult( o0, r0, c0 ); stbir__simdfX_mult( o1, r1, c0 ); stbir__simdfX_mult( o2, r2, c0 ); stbir__simdfX_mult( o3, r3, c0 );
              stbir__simdfX_store( output0, o0 ); stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); )
      stbIF1( stbir__simdfX_mult( o0, r0, c1 ); stbir__simdfX_mult( o1, r1, c1 ); stbir__simdfX_mult( o2, r2, c1 ); stbir__simdfX_mult( o3, r3, c1 );
              stbir__simdfX_store( output1, o0 ); stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); )
      stbIF2( stbir__simdfX_mult( o0, r0, c2 ); stbir__simdfX_mult( o1, r1, c2 ); stbir__simdfX_mult( o2, r2, c2 ); stbir__simdfX_mult( o3, r3, c2 );
              stbir__simdfX_store( output2, o0 ); stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); )
      stbIF3( stbir__simdfX_mult( o0, r0, c3 ); stbir__simdfX_mult( o1, r1, c3 ); stbir__simdfX_mult( o2, r2, c3 ); stbir__simdfX_mult( o3, r3, c3 );
              stbir__simdfX_store( output3, o0 ); stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); )
      stbIF4( stbir__simdfX_mult( o0, r0, c4 ); stbir__simdfX_mult( o1, r1, c4 ); stbir__simdfX_mult( o2, r2, c4 ); stbir__simdfX_mult( o3, r3, c4 );
              stbir__simdfX_store( output4, o0 ); stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); )
      stbIF5( stbir__simdfX_mult( o0, r0, c5 ); stbir__simdfX_mult( o1, r1, c5 ); stbir__simdfX_mult( o2, r2, c5 ); stbir__simdfX_mult( o3, r3, c5 );
              stbir__simdfX_store( output5, o0 ); stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); )
      stbIF6( stbir__simdfX_mult( o0, r0, c6 ); stbir__simdfX_mult( o1, r1, c6 ); stbir__simdfX_mult( o2, r2, c6 ); stbir__simdfX_mult( o3, r3, c6 );
              stbir__simdfX_store( output6, o0 ); stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); )
      stbIF7( stbir__simdfX_mult( o0, r0, c7 ); stbir__simdfX_mult( o1, r1, c7 ); stbir__simdfX_mult( o2, r2, c7 ); stbir__simdfX_mult( o3, r3, c7 );
              stbir__simdfX_store( output7, o0 ); stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); )
      #endif

      input += (4*stbir__simdfX_float_count);
      stbIF0( output0 += (4*stbir__simdfX_float_count); ) stbIF1( output1 += (4*stbir__simdfX_float_count); ) stbIF2( output2 += (4*stbir__simdfX_float_count); ) stbIF3( output3 += (4*stbir__simdfX_float_count); ) stbIF4( output4 += (4*stbir__simdfX_float_count); ) stbIF5( output5 += (4*stbir__simdfX_float_count); ) stbIF6( output6 += (4*stbir__simdfX_float_count); ) stbIF7( output7 += (4*stbir__simdfX_float_count); )
    }
    STBIR_SIMD_NO_UNROLL_LOOP_START
    while ( ( (char*)input_end - (char*) input ) >= 16 )
    {
      stbir__simdf o0, r0;
      STBIR_SIMD_NO_UNROLL(output0);

      stbir__simdf_load( r0, input );

      #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
      stbIF0( stbir__simdf_load( o0, output0 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); stbir__simdf_store( output0, o0 ); )
      stbIF1( stbir__simdf_load( o0, output1 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) ); stbir__simdf_store( output1, o0 ); )
      stbIF2( stbir__simdf_load( o0, output2 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) ); stbir__simdf_store( output2, o0 ); )
      stbIF3( stbir__simdf_load( o0, output3 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) ); stbir__simdf_store( output3, o0 ); )
      stbIF4( stbir__simdf_load( o0, output4 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) ); stbir__simdf_store( output4, o0 ); )
      stbIF5( stbir__simdf_load( o0, output5 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) ); stbir__simdf_store( output5, o0 ); )
      stbIF6( stbir__simdf_load( o0, output6 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) ); stbir__simdf_store( output6, o0 ); )
      stbIF7( stbir__simdf_load( o0, output7 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) ); stbir__simdf_store( output7, o0 ); )
      #else
      stbIF0( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); stbir__simdf_store( output0, o0 ); )
      stbIF1( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) ); stbir__simdf_store( output1, o0 ); )
      stbIF2( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) ); stbir__simdf_store( output2, o0 ); )
      stbIF3( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) ); stbir__simdf_store( output3, o0 ); )
      stbIF4( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) ); stbir__simdf_store( output4, o0 ); )
      stbIF5( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) ); stbir__simdf_store( output5, o0 ); )
      stbIF6( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) ); stbir__simdf_store( output6, o0 ); )
      stbIF7( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) ); stbir__simdf_store( output7, o0 ); )
      #endif

      input += 4;
      stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; )
    }
  }
  #else
  STBIR_NO_UNROLL_LOOP_START
  while ( ( (char*)input_end - (char*) input ) >= 16 )
  {
    float r0, r1, r2, r3;
    STBIR_NO_UNROLL(input);

    r0 = input[0], r1 = input[1], r2 = input[2], r3 = input[3];

    #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
    stbIF0( output0[0] += ( r0 * c0s ); output0[1] += ( r1 * c0s ); output0[2] += ( r2 * c0s ); output0[3] += ( r3 * c0s ); )
    stbIF1( output1[0] += ( r0 * c1s ); output1[1] += ( r1 * c1s ); output1[2] += ( r2 * c1s ); output1[3] += ( r3 * c1s ); )
    stbIF2( output2[0] += ( r0 * c2s ); output2[1] += ( r1 * c2s ); output2[2] += ( r2 * c2s ); output2[3] += ( r3 * c2s ); )
    stbIF3( output3[0] += ( r0 * c3s ); output3[1] += ( r1 * c3s ); output3[2] += ( r2 * c3s ); output3[3] += ( r3 * c3s ); )
    stbIF4( output4[0] += ( r0 * c4s ); output4[1] += ( r1 * c4s ); output4[2] += ( r2 * c4s ); output4[3] += ( r3 * c4s ); )
    stbIF5( output5[0] += ( r0 * c5s ); output5[1] += ( r1 * c5s ); output5[2] += ( r2 * c5s ); output5[3] += ( r3 * c5s ); )
    stbIF6( output6[0] += ( r0 * c6s ); output6[1] += ( r1 * c6s ); output6[2] += ( r2 * c6s ); output6[3] += ( r3 * c6s ); )
    stbIF7( output7[0] += ( r0 * c7s ); output7[1] += ( r1 * c7s ); output7[2] += ( r2 * c7s ); output7[3] += ( r3 * c7s ); )
    #else
    stbIF0( output0[0] = ( r0 * c0s ); output0[1] = ( r1 * c0s ); output0[2] = ( r2 * c0s ); output0[3] = ( r3 * c0s ); )
    stbIF1( output1[0] = ( r0 * c1s ); output1[1] = ( r1 * c1s ); output1[2] = ( r2 * c1s ); output1[3] = ( r3 * c1s ); )
    stbIF2( output2[0] = ( r0 * c2s ); output2[1] = ( r1 * c2s ); output2[2] = ( r2 * c2s ); output2[3] = ( r3 * c2s ); )
    stbIF3( output3[0] = ( r0 * c3s ); output3[1] = ( r1 * c3s ); output3[2] = ( r2 * c3s ); output3[3] = ( r3 * c3s ); )
    stbIF4( output4[0] = ( r0 * c4s ); output4[1] = ( r1 * c4s ); output4[2] = ( r2 * c4s ); output4[3] = ( r3 * c4s ); )
    stbIF5( output5[0] = ( r0 * c5s ); output5[1] = ( r1 * c5s ); output5[2] = ( r2 * c5s ); output5[3] = ( r3 * c5s ); )
    stbIF6( output6[0] = ( r0 * c6s ); output6[1] = ( r1 * c6s ); output6[2] = ( r2 * c6s ); output6[3] = ( r3 * c6s ); )
    stbIF7( output7[0] = ( r0 * c7s ); output7[1] = ( r1 * c7s ); output7[2] = ( r2 * c7s ); output7[3] = ( r3 * c7s ); )
    #endif

    input += 4;
    stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; )
  }
  #endif
  STBIR_NO_UNROLL_LOOP_START
  while ( input < input_end )
  {
    float r = input[0];
    STBIR_NO_UNROLL(output0);

    #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
    stbIF0( output0[0] += ( r * c0s ); )
    stbIF1( output1[0] += ( r * c1s ); )
    stbIF2( output2[0] += ( r * c2s ); )
    stbIF3( output3[0] += ( r * c3s ); )
    stbIF4( output4[0] += ( r * c4s ); )
    stbIF5( output5[0] += ( r * c5s ); )
    stbIF6( output6[0] += ( r * c6s ); )
    stbIF7( output7[0] += ( r * c7s ); )
    #else
    stbIF0( output0[0] = ( r * c0s ); )
    stbIF1( output1[0] = ( r * c1s ); )
    stbIF2( output2[0] = ( r * c2s ); )
    stbIF3( output3[0] = ( r * c3s ); )
    stbIF4( output4[0] = ( r * c4s ); )
    stbIF5( output5[0] = ( r * c5s ); )
    stbIF6( output6[0] = ( r * c6s ); )
    stbIF7( output7[0] = ( r * c7s ); )
    #endif

    ++input;
    stbIF0( ++output0; ) stbIF1( ++output1; ) stbIF2( ++output2; ) stbIF3( ++output3; ) stbIF4( ++output4; ) stbIF5( ++output5; ) stbIF6( ++output6; ) stbIF7( ++output7; )
  }
}

static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, float const * vertical_coefficients, float const ** inputs, float const * input0_end )
{
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = outputp;

  stbIF0( float const * input0 = inputs[0]; float c0s = vertical_coefficients[0]; )
  stbIF1( float const * input1 = inputs[1]; float c1s = vertical_coefficients[1]; )
  stbIF2( float const * input2 = inputs[2]; float c2s = vertical_coefficients[2]; )
  stbIF3( float const * input3 = inputs[3]; float c3s = vertical_coefficients[3]; )
  stbIF4( float const * input4 = inputs[4]; float c4s = vertical_coefficients[4]; )
  stbIF5( float const * input5 = inputs[5]; float c5s = vertical_coefficients[5]; )
  stbIF6( float const * input6 = inputs[6]; float c6s = vertical_coefficients[6]; )
  stbIF7( float const * input7 = inputs[7]; float c7s = vertical_coefficients[7]; )

#if ( STBIR__vertical_channels == 1 ) && !defined(STB_IMAGE_RESIZE_VERTICAL_CONTINUE)
  // check single channel one weight
  if ( ( c0s >= (1.0f-0.000001f) ) && ( c0s <= (1.0f+0.000001f) ) )
  {
    STBIR_MEMCPY( output, input0, (char*)input0_end - (char*)input0 );
    return;
  }
#endif

  #ifdef STBIR_SIMD
  {
    stbIF0(stbir__simdfX c0 = stbir__simdf_frepX( c0s ); )
    stbIF1(stbir__simdfX c1 = stbir__simdf_frepX( c1s ); )
    stbIF2(stbir__simdfX c2 = stbir__simdf_frepX( c2s ); )
    stbIF3(stbir__simdfX c3 = stbir__simdf_frepX( c3s ); )
    stbIF4(stbir__simdfX c4 = stbir__simdf_frepX( c4s ); )
    stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); )
    stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); )
    stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); )

    STBIR_SIMD_NO_UNROLL_LOOP_START
    while ( ( (char*)input0_end - (char*) input0 ) >= (16*stbir__simdfX_float_count) )
    {
      stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
      STBIR_SIMD_NO_UNROLL(output);

      // prefetch four loop iterations ahead (doesn't affect much for small resizes, but helps with big ones)
      stbIF0( stbir__prefetch( input0 + (16*stbir__simdfX_float_count) ); )
      stbIF1( stbir__prefetch( input1 + (16*stbir__simdfX_float_count) ); )
      stbIF2( stbir__prefetch( input2 + (16*stbir__simdfX_float_count) ); )
      stbIF3( stbir__prefetch( input3 + (16*stbir__simdfX_float_count) ); )
      stbIF4( stbir__prefetch( input4 + (16*stbir__simdfX_float_count) ); )
      stbIF5( stbir__prefetch( input5 + (16*stbir__simdfX_float_count) ); )
      stbIF6( stbir__prefetch( input6 + (16*stbir__simdfX_float_count) ); )
      stbIF7( stbir__prefetch( input7 + (16*stbir__simdfX_float_count) ); )

      #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
      stbIF0( stbir__simdfX_load( o0, output ); stbir__simdfX_load( o1, output+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output+(3*stbir__simdfX_float_count) );
              stbir__simdfX_load( r0, input0 ); stbir__simdfX_load( r1, input0+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input0+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input0+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c0 ); stbir__simdfX_madd( o1, o1, r1, c0 ); stbir__simdfX_madd( o2, o2, r2, c0 ); stbir__simdfX_madd( o3, o3, r3, c0 ); )
      #else
      stbIF0( stbir__simdfX_load( r0, input0 ); stbir__simdfX_load( r1, input0+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input0+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input0+(3*stbir__simdfX_float_count) );
              stbir__simdfX_mult( o0, r0, c0 ); stbir__simdfX_mult( o1, r1, c0 ); stbir__simdfX_mult( o2, r2, c0 ); stbir__simdfX_mult( o3, r3, c0 ); )
      #endif

      stbIF1( stbir__simdfX_load( r0, input1 ); stbir__simdfX_load( r1, input1+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input1+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input1+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c1 ); stbir__simdfX_madd( o1, o1, r1, c1 ); stbir__simdfX_madd( o2, o2, r2, c1 ); stbir__simdfX_madd( o3, o3, r3, c1 ); )
      stbIF2( stbir__simdfX_load( r0, input2 ); stbir__simdfX_load( r1, input2+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input2+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input2+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c2 ); stbir__simdfX_madd( o1, o1, r1, c2 ); stbir__simdfX_madd( o2, o2, r2, c2 ); stbir__simdfX_madd( o3, o3, r3, c2 ); )
      stbIF3( stbir__simdfX_load( r0, input3 ); stbir__simdfX_load( r1, input3+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input3+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input3+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c3 ); stbir__simdfX_madd( o1, o1, r1, c3 ); stbir__simdfX_madd( o2, o2, r2, c3 ); stbir__simdfX_madd( o3, o3, r3, c3 ); )
      stbIF4( stbir__simdfX_load( r0, input4 ); stbir__simdfX_load( r1, input4+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input4+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input4+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c4 ); stbir__simdfX_madd( o1, o1, r1, c4 ); stbir__simdfX_madd( o2, o2, r2, c4 ); stbir__simdfX_madd( o3, o3, r3, c4 ); )
      stbIF5( stbir__simdfX_load( r0, input5 ); stbir__simdfX_load( r1, input5+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input5+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input5+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c5 ); stbir__simdfX_madd( o1, o1, r1, c5 ); stbir__simdfX_madd( o2, o2, r2, c5 ); stbir__simdfX_madd( o3, o3, r3, c5 ); )
      stbIF6( stbir__simdfX_load( r0, input6 ); stbir__simdfX_load( r1, input6+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input6+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input6+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c6 ); stbir__simdfX_madd( o1, o1, r1, c6 ); stbir__simdfX_madd( o2, o2, r2, c6 ); stbir__simdfX_madd( o3, o3, r3, c6 ); )
      stbIF7( stbir__simdfX_load( r0, input7 ); stbir__simdfX_load( r1, input7+stbir__simdfX_float_count ); stbir__simdfX_load( r2, input7+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( r3, input7+(3*stbir__simdfX_float_count) );
              stbir__simdfX_madd( o0, o0, r0, c7 ); stbir__simdfX_madd( o1, o1, r1, c7 ); stbir__simdfX_madd( o2, o2, r2, c7 ); stbir__simdfX_madd( o3, o3, r3, c7 ); )

      stbir__simdfX_store( output, o0 ); stbir__simdfX_store( output+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output+(3*stbir__simdfX_float_count), o3 );
      output += (4*stbir__simdfX_float_count);
10110 stbIF0( input0 += (4*stbir__simdfX_float_count); ) stbIF1( input1 += (4*stbir__simdfX_float_count); ) stbIF2( input2 += (4*stbir__simdfX_float_count); ) stbIF3( input3 += (4*stbir__simdfX_float_count); ) stbIF4( input4 += (4*stbir__simdfX_float_count); ) stbIF5( input5 += (4*stbir__simdfX_float_count); ) stbIF6( input6 += (4*stbir__simdfX_float_count); ) stbIF7( input7 += (4*stbir__simdfX_float_count); )
10111 }

    // Handle remaining groups of 4 floats (16 bytes) with a single 4-wide SIMD register.
    STBIR_SIMD_NO_UNROLL_LOOP_START
    while ( ( (char*)input0_end - (char*) input0 ) >= 16 )
    {
      stbir__simdf o0, r0;
      STBIR_SIMD_NO_UNROLL(output);

      #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
      stbIF0( stbir__simdf_load( o0, output ); stbir__simdf_load( r0, input0 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); )
      #else
      stbIF0( stbir__simdf_load( r0, input0 ); stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c0 ) ); )
      #endif
      stbIF1( stbir__simdf_load( r0, input1 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c1 ) ); )
      stbIF2( stbir__simdf_load( r0, input2 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c2 ) ); )
      stbIF3( stbir__simdf_load( r0, input3 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c3 ) ); )
      stbIF4( stbir__simdf_load( r0, input4 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c4 ) ); )
      stbIF5( stbir__simdf_load( r0, input5 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c5 ) ); )
      stbIF6( stbir__simdf_load( r0, input6 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) ); )
      stbIF7( stbir__simdf_load( r0, input7 ); stbir__simdf_madd( o0, o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) ); )

      stbir__simdf_store( output, o0 );
      output += 4;
      stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; )
    }
  }
  #else
  // Scalar fallback: process 4 floats (16 bytes) per iteration.
  STBIR_NO_UNROLL_LOOP_START
  while ( ( (char*)input0_end - (char*) input0 ) >= 16 )
  {
    float o0, o1, o2, o3;
    STBIR_NO_UNROLL(output);
    #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
    stbIF0( o0 = output[0] + input0[0] * c0s; o1 = output[1] + input0[1] * c0s; o2 = output[2] + input0[2] * c0s; o3 = output[3] + input0[3] * c0s; )
    #else
    stbIF0( o0 = input0[0] * c0s; o1 = input0[1] * c0s; o2 = input0[2] * c0s; o3 = input0[3] * c0s; )
    #endif
    stbIF1( o0 += input1[0] * c1s; o1 += input1[1] * c1s; o2 += input1[2] * c1s; o3 += input1[3] * c1s; )
    stbIF2( o0 += input2[0] * c2s; o1 += input2[1] * c2s; o2 += input2[2] * c2s; o3 += input2[3] * c2s; )
    stbIF3( o0 += input3[0] * c3s; o1 += input3[1] * c3s; o2 += input3[2] * c3s; o3 += input3[3] * c3s; )
    stbIF4( o0 += input4[0] * c4s; o1 += input4[1] * c4s; o2 += input4[2] * c4s; o3 += input4[3] * c4s; )
    stbIF5( o0 += input5[0] * c5s; o1 += input5[1] * c5s; o2 += input5[2] * c5s; o3 += input5[3] * c5s; )
    stbIF6( o0 += input6[0] * c6s; o1 += input6[1] * c6s; o2 += input6[2] * c6s; o3 += input6[3] * c6s; )
    stbIF7( o0 += input7[0] * c7s; o1 += input7[1] * c7s; o2 += input7[2] * c7s; o3 += input7[3] * c7s; )
    output[0] = o0; output[1] = o1; output[2] = o2; output[3] = o3;
    output += 4;
    stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; )
  }
  #endif
  // Tail: process any remaining floats one at a time.
  STBIR_NO_UNROLL_LOOP_START
  while ( input0 < input0_end )
  {
    float o0;
    STBIR_NO_UNROLL(output);
    #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
    stbIF0( o0 = output[0] + input0[0] * c0s; )
    #else
    stbIF0( o0 = input0[0] * c0s; )
    #endif
    stbIF1( o0 += input1[0] * c1s; )
    stbIF2( o0 += input2[0] * c2s; )
    stbIF3( o0 += input3[0] * c3s; )
    stbIF4( o0 += input4[0] * c4s; )
    stbIF5( o0 += input5[0] * c5s; )
    stbIF6( o0 += input6[0] * c6s; )
    stbIF7( o0 += input7[0] * c7s; )
    output[0] = o0;
    ++output;
    stbIF0( ++input0; ) stbIF1( ++input1; ) stbIF2( ++input2; ) stbIF3( ++input3; ) stbIF4( ++input4; ) stbIF5( ++input5; ) stbIF6( ++input6; ) stbIF7( ++input7; )
  }
}

#undef stbIF0
#undef stbIF1
#undef stbIF2
#undef stbIF3
#undef stbIF4
#undef stbIF5
#undef stbIF6
#undef stbIF7
#undef STB_IMAGE_RESIZE_DO_VERTICALS
#undef STBIR__vertical_channels
#undef STB_IMAGE_RESIZE_DO_HORIZONTALS
#undef STBIR_strs_join24
#undef STBIR_strs_join14
#undef STBIR_chans
#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#undef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
#endif

#else // !STB_IMAGE_RESIZE_DO_VERTICALS

#define STBIR_chans( start, end ) STBIR_strs_join1(start,STBIR__horizontal_channels,end)

#ifndef stbir__2_coeff_only
#define stbir__2_coeff_only() \
    stbir__1_coeff_only(); \
    stbir__1_coeff_remnant(1);
#endif

#ifndef stbir__2_coeff_remnant
#define stbir__2_coeff_remnant( ofs ) \
    stbir__1_coeff_remnant(ofs); \
    stbir__1_coeff_remnant((ofs)+1);
#endif

#ifndef stbir__3_coeff_only
#define stbir__3_coeff_only() \
    stbir__2_coeff_only(); \
    stbir__1_coeff_remnant(2);
#endif

#ifndef stbir__3_coeff_remnant
#define stbir__3_coeff_remnant( ofs ) \
    stbir__2_coeff_remnant(ofs); \
    stbir__1_coeff_remnant((ofs)+2);
#endif

#ifndef stbir__3_coeff_setup
#define stbir__3_coeff_setup()
#endif

#ifndef stbir__4_coeff_start
#define stbir__4_coeff_start() \
    stbir__2_coeff_only(); \
    stbir__2_coeff_remnant(2);
#endif

#ifndef stbir__4_coeff_continue_from_4
#define stbir__4_coeff_continue_from_4( ofs ) \
    stbir__2_coeff_remnant(ofs); \
    stbir__2_coeff_remnant((ofs)+2);
#endif

#ifndef stbir__store_output_tiny
#define stbir__store_output_tiny stbir__store_output
#endif

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_1_coeff)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__1_coeff_only();
    stbir__store_output_tiny();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_2_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__2_coeff_only();
    stbir__store_output_tiny();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_3_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__3_coeff_only();
    stbir__store_output_tiny();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_4_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_5_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__1_coeff_remnant(4);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_6_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__2_coeff_remnant(4);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_7_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  stbir__3_coeff_setup();
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    stbir__3_coeff_remnant(4);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_8_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__4_coeff_continue_from_4(4);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_9_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__4_coeff_continue_from_4(4);
    stbir__1_coeff_remnant(8);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_10_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__4_coeff_continue_from_4(4);
    stbir__2_coeff_remnant(8);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_11_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  stbir__3_coeff_setup();
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__4_coeff_continue_from_4(4);
    stbir__3_coeff_remnant(8);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_12_coeffs)( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    float const * hc = horizontal_coefficients;
    stbir__4_coeff_start();
    stbir__4_coeff_continue_from_4(4);
    stbir__4_coeff_continue_from_4(8);
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod0 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    // n = number of 4-coefficient blocks after the first four, rounded up
    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 4 + 3 ) >> 2;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    STBIR_SIMD_NO_UNROLL_LOOP_START
    do {
      hc += 4;
      decode += STBIR__horizontal_channels * 4;
      stbir__4_coeff_continue_from_4( 0 );
      --n;
    } while ( n > 0 );
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod1 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    // n = number of 4-coefficient blocks between the first four and the trailing 1 coefficient, rounded up
    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 5 + 3 ) >> 2;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    STBIR_SIMD_NO_UNROLL_LOOP_START
    do {
      hc += 4;
      decode += STBIR__horizontal_channels * 4;
      stbir__4_coeff_continue_from_4( 0 );
      --n;
    } while ( n > 0 );
    stbir__1_coeff_remnant( 4 );
    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod2 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    // n = number of 4-coefficient blocks between the first four and the trailing 2 coefficients, rounded up
    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 6 + 3 ) >> 2;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    STBIR_SIMD_NO_UNROLL_LOOP_START
    do {
      hc += 4;
      decode += STBIR__horizontal_channels * 4;
      stbir__4_coeff_continue_from_4( 0 );
      --n;
    } while ( n > 0 );
    stbir__2_coeff_remnant( 4 );

    stbir__store_output();
  } while ( output < output_end );
}

static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod3 )( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width )
{
  float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
  float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
  stbir__3_coeff_setup();
  STBIR_SIMD_NO_UNROLL_LOOP_START
  do {
    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
    // n = number of 4-coefficient blocks between the first four and the trailing 3 coefficients, rounded up
    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 7 + 3 ) >> 2;
    float const * hc = horizontal_coefficients;

    stbir__4_coeff_start();
    STBIR_SIMD_NO_UNROLL_LOOP_START
    do {
      hc += 4;
      decode += STBIR__horizontal_channels * 4;
      stbir__4_coeff_continue_from_4( 0 );
      --n;
    } while ( n > 0 );
    stbir__3_coeff_remnant( 4 );

    stbir__store_output();
  } while ( output < output_end );
}

static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_funcs)[4]=
{
  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod0),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod1),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod2),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod3),
};

static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_funcs)[12]=
{
  STBIR_chans(stbir__horizontal_gather_,_channels_with_1_coeff),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_2_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_3_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_4_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_5_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_6_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_7_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_8_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_9_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_10_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_11_coeffs),
  STBIR_chans(stbir__horizontal_gather_,_channels_with_12_coeffs),
};

#undef STBIR__horizontal_channels
#undef STB_IMAGE_RESIZE_DO_HORIZONTALS
#undef stbir__1_coeff_only
#undef stbir__1_coeff_remnant
#undef stbir__2_coeff_only
#undef stbir__2_coeff_remnant
#undef stbir__3_coeff_only
#undef stbir__3_coeff_remnant
#undef stbir__3_coeff_setup
#undef stbir__4_coeff_start
#undef stbir__4_coeff_continue_from_4
#undef stbir__store_output
#undef stbir__store_output_tiny
#undef STBIR_chans

#endif // HORIZONTALS

#undef STBIR_strs_join2
#undef STBIR_strs_join1

#endif // STB_IMAGE_RESIZE_DO_HORIZONTALS/VERTICALS/CODERS

/*
------------------------------------------------------------------------------
This software is available under 2 licenses -- choose whichever you prefer.
------------------------------------------------------------------------------
ALTERNATIVE A - MIT License
Copyright (c) 2017 Sean Barrett
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
------------------------------------------------------------------------------
ALTERNATIVE B - Public Domain (www.unlicense.org)
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or distribute this
software, either in source code form or as a compiled binary, for any purpose,
commercial or non-commercial, and by any means.
In jurisdictions that recognize copyright laws, the author or authors of this
software dedicate any and all copyright interest in the software to the public
domain. We make this dedication for the benefit of the public at large and to
the detriment of our heirs and successors. We intend this dedication to be an
overt act of relinquishment in perpetuity of all present and future rights to
this software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
------------------------------------------------------------------------------
*/